Task offloading in hybrid-decision-based multi-cloud computing network: a cooperative multi-agent deep reinforcement learning
Journal of Cloud Computing volume 11, Article number: 90 (2022)
Abstract
Multi-cloud computing is becoming a promising paradigm to provide abundant computation resources for Internet-of-Things (IoT) devices. In a multi-device multi-cloud network, real-time computing requirements, frequently varying wireless channel gains and a changeable network scale make the system highly dynamic. It is critical to accommodate this dynamic nature under the different constraints of IoT devices in a multi-cloud environment. In this paper, we establish a continuous-discrete hybrid decision offloading model in which each device should learn to make coordinated actions, including cloud server selection, offloading ratio and local computation capacity. Both the continuous-discrete hybrid decision and the coordination among IoT devices are challenging. To this end, we first develop a probabilistic method to relax the discrete action (e.g., cloud server selection) to a continuous set. Then, by leveraging a centralized training and distributed execution strategy, we design a cooperative multi-agent deep reinforcement learning (CMADRL) based framework to minimize the total system cost in terms of the energy consumption of IoT devices and the renting charge of cloud servers. Each IoT device acts as an agent, which not only learns efficient decentralized policies but also relieves the IoT devices' computing pressure. Experimental results demonstrate that the proposed CMADRL can efficiently learn dynamic offloading policies at each IoT device and significantly outperforms four state-of-the-art DRL-based agents and two heuristic algorithms, achieving lower system cost.
Introduction
With the development of mobile communication networks, the number of Internet-of-Things (IoT) devices, such as smartphones, wearables, and sensors, has grown rapidly. Moreover, new advanced applications with computation-intensive tasks are emerging [1,2,3]. However, IoT devices usually have limited computation, battery and communication capacity [4, 5]. To address the conflict between computation-intensive tasks and resource-limited IoT devices, cloud computing has been considered as an emerging paradigm [6, 7], which allows IoT devices to offload some computation tasks to cloud servers with sufficient computation capability.
Nevertheless, it is still challenging for IoT devices to acquire satisfactory computation services [8]. On one hand, a large number of IoT devices may require computation-intensive services simultaneously. With limited storage and computation resources, it is hard for a single cloud server to serve them all, especially in hotspot scenarios [9]. On the other hand, an IoT application that relies on only a single cloud server faces an increased risk of cloud server lock-in. Thus, it is more promising to study the scenario with multi-cloud collaboration. Although multi-cloud computing can maintain satisfactory service for IoT applications, it is still challenging to utilize computation resources efficiently in a service-charging multi-cloud system. Hence, investigating the task offloading mechanism in multi-cloud networks is non-trivial.
So far, many researchers have dedicated themselves to designing computation offloading policies. For the case of static optimization, some strategies are proposed in [10,11,12]. In [10], the authors studied fault-tolerant workflow scheduling in multi-cloud systems, and proposed a fault-tolerant cost-efficient workflow scheduling algorithm based on a mathematical method to improve the execution reliability of scientific applications and reduce their execution cost. Besides, an ant-colony-based optimization technique was applied in [11] to derive the optimal coalition of virtual machines (VMs), and then a first-price sealed-bid auction game was used to allocate and migrate the VMs in multi-cloud environments, where federation profit was improved at the expense of increased latency. Further, Alnoman et al. [12] applied dynamic programming and exhaustive search to jointly optimize the power consumption, cloud response time and user energy in heterogeneous cloud radio access cloud-edge networks. These traditional offloading strategies often require complete and accurate network information, which is difficult to obtain in real, highly dynamic multi-cloud networks. Besides, in large-scale dynamic environments, some traditional approaches normally need a considerable number of iterations to reach a satisfactory local optimum. Meanwhile, the computation complexity of the above-mentioned traditional solutions grows significantly in such settings, which makes them poorly suited to dynamic environments.
Deep reinforcement learning (DRL) based methods can make intelligent decisions with no prior knowledge by exploring the dynamic network environment [13, 14]. Recently, some researchers have therefore applied DRL methods to decision optimization problems in multi-cloud networks [15,16,17,18,19,20,21]. In [15], a scheduler based on asynchronous advantage actor-critic (A3C) and a residual recurrent neural network (R2N2) was investigated for heterogeneous edge-cloud environments to optimize energy consumption, response time, Service-Level-Agreement and running cost. Zhang et al. [16] considered a three-layer distributed multi-cloud multi-access edge network, and proposed a multi-agent reinforcement learning method for task offloading and resource allocation. Zhao et al. [17] studied a DRL-based scheduling policy in a hybrid multi-cloud environment to maximize renewable energy utilization. In [18], a deep Q-network (DQN) based collaborative task placement algorithm was proposed to optimize system utility. In [19], by combining multiple parallel deep neural networks (DNNs) with Q-learning, a deep meta reinforcement learning-based offloading (DMRO) algorithm was applied to migrate complex tasks from IoT devices to edge-cloud servers. Chen et al. [20] proposed a multiple buffer deep deterministic policy gradient (MBDDPG) to learn a preferable microservice-based service deployment strategy and improve the average waiting time. Chen et al. [21] investigated the long-term dynamic task allocation and service migration (DTASM) problem in edge-cloud IoT systems, where a twin-delayed deep deterministic policy gradient (DDPG) was proposed to minimize the total computing load forwarded to the cloud server while satisfying the seamless service migration constraint.
However, the above methods are modeled in either a discrete or a continuous action space, which restricts the optimization of offloading decisions to a limited action space. In reality, the action space of the offloading problem is generally continuous-discrete hybrid [22]. The agent should decide both continuous actions (e.g., offloading ratio or local computation capacity) and discrete actions (e.g., whether to offload or which cloud server to select) to execute computation offloading. Thus, these methods may not perform well when the action space becomes large. On the other hand, if the number of IoT devices or cloud servers is large, the state and action spaces may grow exponentially, which seriously degrades convergence and generalizability.
To tackle these problems, this paper investigates a hybrid-decision-based collaborative multi-cloud system, where multiple cloud servers are deployed to process the computation tasks offloaded by IoT devices under time-varying wireless channels and task arrivals. The task offloading optimization problem is formulated to minimize the total system cost in terms of the energy consumption of IoT devices and the renting charge of cloud servers. In particular, the decisions of the IoT devices are interdependent in hybrid-decision-based multi-cloud environments. We address the hybrid decision and the collaboration among different devices in two steps. Specifically, we first relax the discrete action (e.g., cloud server selection) into a continuous set by designing a probabilistic method. Then, a cooperative multi-agent DRL (CMADRL) [23] based framework, which employs centralized training and distributed execution, is designed to obtain the optimal cloud server selection, offloading ratio and local computation capacity. The major contributions of our work are the following:

We establish a computation offloading framework for multiple IoT devices with multi-cloud, where the task arrivals, channel gains and computation capacity of cloud servers are time-varying. The dynamic computation offloading problem is formulated to minimize the total system cost of energy consumption and renting charge by jointly designing the cloud server selection, offloading ratio and local computation capacity.

We relax the discrete decision (i.e., cloud server selection) into a continuous set by designing a probabilistic method. Thus, the continuous-discrete hybrid decision is transformed into a continuous decision. Then, we design a novel CMADRL framework with each IoT device acting as an agent to stabilize the training and alleviate the on-device computational burden. That is, we use the global state information collected at the proxy server to train a locally observable policy function for each IoT device.

We conduct extensive simulations to evaluate the performance of the proposed CMADRL. The results demonstrate the superiority of the proposed algorithm over four state-of-the-art DRL-based frameworks and two heuristic algorithms, especially in terms of flexibility to changes in the currently processed task, adaptability to variations in communication resources, and generalizability to the extension of network scale.
The remainder of this paper is organized as follows. The system model and problem formulation are provided in the “System model and problem formulation” section. The proposed CMADRL is introduced in “The proposed CMADRL” section. The “Simulation results” section analyzes and discusses the experimental results, and the “Conclusions and future work” section concludes this paper.
System model and problem formulation
In this section, we consider a multi-cloud computing system consisting of M cloud servers, a base station (BS), a proxy server, and N IoT devices. Each IoT device communicates with the BS over a wireless link, whereas the BS and cloud servers are connected by a wired link. The proxy server deployed near the BS serves as training equipment, which assists the BS in centralized training and will be explained in detail in the “Multi-agent DRL framework” section. As shown in Fig. 1, a set of cloud servers \(\mathcal {M}\) = \(\{1,\ldots ,M\}\) can provide offloading computing services for a set of IoT devices \(\mathcal {N}\) = \(\{1,\ldots ,N\}\). Without loss of generality, we assume each IoT device n \(\in\) \(\mathcal {N}\) maintains a computation-intensive task to be processed during each time slot t \(\in\) \(\mathcal {T}\), where \(\mathcal {T}\) = \(\{1,\ldots ,T\}\). We assume the data of each task are fine-grained and can be partitioned into subsets of any size [24]: one part is executed on IoT device n, and the other is offloaded to one of the cloud servers m \(\in\) \(\mathcal {M}\) for remote processing.
Let \(\alpha _n^t\) denote the offloading ratio, which can be considered as the percentage of the task's data size (in bits) to be offloaded to the cloud server, satisfying \(\alpha _n^t\) \(\in\) [0,1]. Let \(f_n^t\) and \(F_n^{max}\) denote the local and maximum computation capacity, which can be viewed as the CPU-cycle frequency used to process the task data. We assume the local computation capacity \(f_n^t\) \(\in\) [0,\(F_n^{max}\)] is flexibly controlled via chip voltage adjustment using the dynamic voltage and frequency scaling (DVFS) technique [25].
At each time slot t, IoT device n needs to decide which cloud server \(m_n^t\) to offload the task to, and then offloads a fraction \(\alpha _n^t\) of the task data to cloud server \(m_n^t\) for remote computing. Meanwhile, IoT device n executes the remaining fraction \(1-\alpha _n^t\) of the task data locally. In other words, task offloading involves three decisions: the cloud server selection \(m_n^t\), the offloading ratio \(\alpha _n^t\), and the local computation capacity \(f_n^t\).
Since both energy consumption and renting charge play a significant role in the performance evaluation of computation offloading for IoT devices, we consider these two objectives as the total system cost. The following subsections describe the task queue, local computing, offloading computing and problem formulation, respectively.
Task queue model
We adopt a task queue to represent the dynamic nature of the multi-cloud system. Because a task may fail to be completed with the limited computation resources of IoT devices, the task execution result in the current time slot affects the task load in the next time slot. Specifically, the computation task of IoT device n in time slot t is denoted as \(Tas_n^t\)={\(Ld_n^t\),\(d_n^t\),\(\bar{D}_n^t\),\(c_n^t\)}, in which \(Ld_n^t\), \(d_n^t\), \(\bar{D}_n^t\) and \(c_n^t\) indicate the size of computation data in the task queue (in bits), the currently processed task data size (in bits), the maximum tolerable delay, and the computational resources required to complete the task (in CPU cycles/bit), respectively [26]. In addition, we update \(Ld_n^{t+1}\) with the residual task in the current time slot and the newly arrived task in the next time slot, which is given as
where \([x]^+\)=max(x,0), and \(\breve{d}_n^{t+1}\) is the newly arrived task generated in the next time slot \(t+1\). \(drop_n^t\) is a Boolean variable, and \(drop_n^t\)=False indicates that the task of IoT device n in time slot t is processed successfully.
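As a minimal sketch of the queue update described above, the following function combines the residual backlog with the new arrival. It rests on an assumption that is not stated explicitly in the text: a successfully processed task leaves the queue, while a failed task (drop = True) stays in the backlog. The function and argument names are hypothetical.

```python
def update_queue(ld_bits: float, d_bits: float, new_bits: float, dropped: bool) -> float:
    """Return the queue backlog Ld for slot t+1, in bits.

    Assumption: a task processed successfully (dropped=False) leaves the queue,
    whereas a failed task (dropped=True) remains in the backlog.
    """
    served = 0.0 if dropped else d_bits
    residual = max(ld_bits - served, 0.0)  # [x]^+ = max(x, 0)
    return residual + new_bits
```

For example, with a 5 Mbit backlog, a 2 Mbit processed task and 0.3 Mbit of new arrivals, the backlog becomes 3.3 Mbit on success and 5.3 Mbit on failure under this assumption.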
Local computing
For the partial task data \((1-\alpha _n^t )\cdot d_n^t\), the processing delay \(D_n^{loc} (t)\) and energy consumption \(E_n^{loc}(t)\) incurred on IoT device n are given as [27]
where \(\kappa\) is the effective switched capacitance constant.
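A minimal sketch of the local computing model, assuming the widely used DVFS forms: delay equals required CPU cycles divided by the CPU frequency, and energy equals \(\kappa\) times the squared frequency times the required cycles. These forms and the default \(\kappa = 10^{-27}\) are assumptions consistent with the definitions above; the function names are hypothetical.

```python
import math

KAPPA = 1e-27  # assumed value of the switched capacitance constant


def local_cycles(alpha: float, d_bits: float, c_cycles_per_bit: float) -> float:
    """CPU cycles needed for the locally executed share (1 - alpha) of the task."""
    return (1.0 - alpha) * d_bits * c_cycles_per_bit


def local_delay(alpha: float, d_bits: float, c_cycles_per_bit: float, f_hz: float) -> float:
    """Local processing delay in seconds (assumed form: cycles / frequency)."""
    cycles = local_cycles(alpha, d_bits, c_cycles_per_bit)
    if cycles == 0.0:
        return 0.0
    return math.inf if f_hz <= 0.0 else cycles / f_hz


def local_energy(alpha: float, d_bits: float, c_cycles_per_bit: float,
                 f_hz: float, kappa: float = KAPPA) -> float:
    """Local energy in joules (assumed form: kappa * f^2 * cycles)."""
    return kappa * f_hz ** 2 * local_cycles(alpha, d_bits, c_cycles_per_bit)
```

With \(\alpha = 0.5\), a 2 Mbit task, 300 cycles/bit and a 0.5 GHz CPU, this sketch gives a delay of 0.6 s and an energy of 0.075 J.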
Offloading computing
To take advantage of the rich computation resources of the multi-cloud servers, computation offloading involves three steps. First, IoT device n offloads the partial task data to an appropriate cloud server m for remote execution. Then, the cloud server handles the task offloaded from the IoT device. Finally, the cloud server returns the task execution results to the device. Specifically, computation offloading incurs both transmission delay and energy consumption between the IoT device and the selected cloud server. In this paper, we simplify the system model and assume that the size of the task execution results obtained from the cloud server is small [28]; therefore, the delay and energy consumption of the feedback transmission are negligible compared with those of local computing and offloading. Moreover, we assume the cloud servers are connected to the BS via optical fiber or copper wires, so we ignore the transmission delay between the BS and the selected cloud server.
In order to eliminate wireless channel interference among IoT devices, similar to [29], orthogonal frequency division multiple access (OFDMA) is adopted as the multiple access technology. Thus, the system bandwidth W is divided into equal sub-bands, one allocated to each IoT device. According to the Shannon formula, the uplink transmission rate from IoT device n to cloud server m, \(v_{n,m}^{tra} (t)\), is
where \(P_{n,m}^t\) , \(g_{n,m}^t\) and \(\varrho ^2\) are the transmission power, channel gain and noise power in time slot t, respectively.
The transmission delay and energy consumption incurred for offloading the partial input data \(\alpha _{n}^t \cdot d_{n}^t\), from IoT device n to the cloud server m, \(D_{n,m}^{tra} (t)\) and \(E_{n,m}^{tra} (t)\), are
According to Eqs. (3) and (6), the total energy consumption of IoT device n for processing the task data \(d_n^t\), \(E_{n}^t\), is formulated as
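A minimal sketch of the offloading transmission model, under stated assumptions: each device gets an equal sub-band W/N, the noise argument is the total noise power in that sub-band (in watts), the transmission delay is the offloaded bits divided by the rate, the transmission energy is transmit power times delay, and the total device energy is the sum of the local and transmission parts. All function names and the example bandwidth are hypothetical.

```python
import math


def uplink_rate(bandwidth_hz: float, n_devices: int, tx_power_w: float,
                gain: float, noise_w: float) -> float:
    """Shannon rate in bits/s on an equal OFDMA sub-band of width W / N."""
    sub_band = bandwidth_hz / n_devices
    return sub_band * math.log2(1.0 + tx_power_w * gain / noise_w)


def transmission_delay(alpha: float, d_bits: float, rate_bps: float) -> float:
    """Time in seconds to upload the offloaded share alpha * d."""
    return alpha * d_bits / rate_bps


def transmission_energy(tx_power_w: float, delay_s: float) -> float:
    """Energy in joules spent transmitting for delay_s seconds."""
    return tx_power_w * delay_s


def total_device_energy(e_local_j: float, e_tx_j: float) -> float:
    """Total device energy: local computing plus uplink transmission."""
    return e_local_j + e_tx_j
```

For instance, a hypothetical 10 MHz system shared by 5 devices with a signal-to-noise ratio of 3 yields 4 Mbit/s, so uploading half of a 2 Mbit task takes 0.25 s and costs 0.5 J at 2 W.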
The processing delay of a computing task in a cloud server depends on the task data size and the cloud server's computation capacity. Because a cloud server may also serve computation requests from other IoT devices, its computation capability is time-varying. Therefore, we assume the cloud server's remaining computation capacity varies randomly between time slots but stays fixed within each time slot. Let \(f_{occup}^t = \mathbf {Pr} (\varepsilon ) \cdot f_{ser}^{unit}\) be the occupied computation resource, which is modeled as an i.i.d. Poisson process with parameter \(\varepsilon\), where \(f_{ser}^{unit}\) is the occupied computation resource for each unit [22]. The computation capacity of server m, \(f_m^t\), can be defined as \(f_m^t=F_{ser}^{max}-f_{occup}^t\), where \(F_{ser}^{max}\) is the maximum computation capacity of cloud server m \(\in\) \(\mathcal {M}\).
When processing the partial task data \(\alpha _n^t \cdot d_n^t\) on cloud server m, the incurred execution delay can be expressed as
The execution delay \(D_n^m (t)\) incurs a service charge for the IoT device from the cloud service provider. In other words, because the IoT device rents the computing resources of the cloud server to execute the task, the cloud computing provider charges the IoT device. Let \(c(f_m^t)=e^{(-\eta )}\cdot (f_m^t-1) \cdot \beta\) denote the price per time unit at computing capability \(f_m^t\) [30], where \(\eta\) and \(\beta\) are two coefficients. Therefore, the service charge \(C_n^m (t)\) required to execute IoT device n's partial task on cloud server m is obtained by
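A minimal sketch of the cloud-side cost, under several assumptions: the execution delay is the offloaded cycles divided by the server frequency, the charge is the price per time unit multiplied by that delay, the price rule is read as \(e^{-\eta}(f_m^t-1)\beta\), and the server frequency enters the price rule in GHz. The function names and the example coefficients are hypothetical.

```python
import math


def remaining_capacity(f_max_ghz: float, occupied_units: int, unit_ghz: float) -> float:
    """Remaining server capacity: F_max minus the occupied resource."""
    return f_max_ghz - occupied_units * unit_ghz


def execution_delay(alpha: float, d_bits: float, c_cycles_per_bit: float,
                    f_server_hz: float) -> float:
    """Assumed form: offloaded cycles divided by server frequency (seconds)."""
    return alpha * d_bits * c_cycles_per_bit / f_server_hz


def price_per_time(f_server_ghz: float, eta: float, beta: float) -> float:
    """Assumed reading of the pricing rule: exp(-eta) * (f - 1) * beta, f in GHz."""
    return math.exp(-eta) * (f_server_ghz - 1.0) * beta


def service_charge(alpha: float, d_bits: float, c_cycles_per_bit: float,
                   f_server_ghz: float, eta: float, beta: float) -> float:
    """Assumed form: price per time unit times execution delay."""
    delay = execution_delay(alpha, d_bits, c_cycles_per_bit, f_server_ghz * 1e9)
    return price_per_time(f_server_ghz, eta, beta) * delay
```

As a quick check, offloading half of a 4 Mbit task at 500 cycles/bit to a 2 GHz server takes 0.5 s; with the hypothetical coefficients \(\eta = 0\) and \(\beta = 3\), the charge is 1.5.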
Problem formulation
The dynamic computing offloading problem concerned is to minimize the total system cost of the energy consumption \(E_n^t\) and the renting charge \(C_n^m (t)\) in the long term, as formulated in Eq. (10).
s.t.
where \(\omega _1\) and \(\omega _2\) are the trade-off weights. Constraint (10a) requires that the actual completion time of any task cannot exceed its maximum tolerable delay. Constraint (10b) states that each IoT device can offload its task to only one cloud server. Constraint (10c) specifies that the offloading ratio of each task is a variable between 0 and 1. Constraint (10d) states that the local computation capacity of each IoT device cannot exceed its maximum computation capacity. Note that \(m_n^t\), \(\alpha _n^t\) and \(f_n^t\) are the continuous-discrete hybrid decision variables associated with IoT device n, where \(\alpha _n^t\) and \(f_n^t\) are continuous variables and \(m_n^t\) is a discrete variable.
Generally, the objective function and constraints in Eq. (10) are non-convex, and the challenges of this dynamic computation offloading problem lie in three aspects: (1) the decision process contains both continuous and discrete decisions; (2) the decisions of IoT devices are highly dynamic with a large solution space; (3) the optimal offloading strategy requires coordination among IoT devices. Therefore, it is intractable to find optimal policies through traditional optimization-based schemes.
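One way to write a problem that matches the description of the objective and of constraints (10a) to (10d) is sketched below. It is an illustrative form rather than a verbatim statement: the summation over all slots and devices and the expression of the completion time as the maximum of the local branch and the offloading branch (assuming both run in parallel) are assumptions.

```latex
\begin{aligned}
\min_{\{m_n^t,\;\alpha_n^t,\;f_n^t\}} \quad
  & \sum_{t\in\mathcal{T}} \sum_{n\in\mathcal{N}}
    \Big( \omega_1 \cdot E_n^t + \omega_2 \cdot C_n^m(t) \Big) \\
\text{s.t.} \quad
  & \max\Big\{ D_n^{loc}(t),\; D_{n,m}^{tra}(t) + D_n^m(t) \Big\} \le \bar{D}_n^t,
    && \text{(10a)} \\
  & m_n^t \in \mathcal{M},           && \text{(10b)} \\
  & \alpha_n^t \in [0,1],            && \text{(10c)} \\
  & 0 \le f_n^t \le F_n^{max}.       && \text{(10d)}
\end{aligned}
```

Under this reading, a decision is feasible for device n in slot t only if both the local part and the offloaded part finish within \(\bar{D}_n^t\).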
The proposed CMADRL
In this section, we first relax the continuous-discrete decision variables to continuous decision variables. Then, we model the task offloading optimization problem as a multi-agent Markov Decision Process (MDP). Finally, the procedure of the cooperative multi-agent twin delayed DDPG (CMATD3) is introduced in detail.
Discrete decision variable relaxation
To address the challenges in Eq. (10), we adopt a probabilistic method to convert the discrete decision variable \(m_n^t\) into a continuous variable. In particular, let \(Pro(m_n^t ) \in [\frac{m-1}{M} ,\frac{m}{M}]\) be the probability that the task is offloaded to cloud server m. In other words, if IoT device n chooses a continuous decision variable \(Pro(m_n^t )\) in this interval, it is regarded as selecting cloud server m to offload its task at time slot t. Taking the number of cloud servers \(M = 5\) as an example, if \(Pro(m_n^t ) \in \left[\frac{1}{5} ,\frac{2}{5}\right]\), we choose server \(m = 2\) to offload the task. Therefore, the total system cost minimization problem can be reformulated as follows,
s.t.
Based on this setting, a DRL agent is introduced to solve this dynamic offloading problem. Centralized decision-making DRL requires the BS to collect the environment state from all IoT devices, which increases the communication overhead. Therefore, this paper aims at obtaining promising computation offloading solutions with multi-agent settings.
Since the network environment is non-stationary and other agents change their policies during training, the performance of traditional multi-agent DRL becomes unstable. In order to guarantee convergence, we design a cooperative multi-agent deep reinforcement learning based framework, which leverages centralized training and distributed execution by using locally executable actor networks and fully observable critic networks [31].
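The probabilistic relaxation of the server index can be sketched as a mapping from a continuous value in [0, 1] back to a server index. The tie-breaking at interval endpoints is an assumption (the intervals in the text share their boundaries): here a value in \(((m-1)/M,\, m/M]\) maps to server m and the value 0 maps to server 1. The function name is hypothetical.

```python
import math


def select_server(pro: float, num_servers: int) -> int:
    """Map a continuous value pro in [0, 1] to a server index in {1, ..., M}."""
    if not 0.0 <= pro <= 1.0:
        raise ValueError("pro must lie in [0, 1]")
    # Assumed tie-breaking: ((m-1)/M, m/M] -> m, and pro == 0 -> server 1.
    return max(1, math.ceil(pro * num_servers))
```

With \(M = 5\), a value of 0.3 falls in the second interval, so `select_server(0.3, 5)` returns 2, which agrees with the example given for \(M = 5\).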
MDP Formulation
We model the task offloading optimization problem as a multi-agent Markov decision process (MDP). The multi-agent MDP can be denoted by a 4-tuple \((\mathcal {N} , S_n, A_n, r_n)\), where \(\mathcal {N}\) is the agent space, \(S_n\) is the state space of agent n, \(A_n\) is the action space of agent n, and \(r_n\) is the reward function of agent n.

Agent space \(\mathcal {N}:\mathcal {N} = \{1,\ldots ,N\}\), where N is the number of IoT devices, each IoT device acts as an agent. For agent n, \(n = 1, 2, \ldots , N\), by determining the server selection \(Pro(m_n^t)\), offloading ratio \(\alpha _n^t\), and local computation capacity \(f_n^t\), the agent can obtain the minimum total system cost.

State space \(S_n\): For IoT device n, the state \(\mathbf {s}_n^t\) is composed of the computation task, the channel gain between IoT device n and the BS in time slot t, and the computation capacity of all cloud servers in the multi-cloud system.
$$\begin{aligned} \mathbf {s}_n^t=\Big \{Tas_n^t;g_{n}^t;f_1^t,\cdots ,f_M^t\Big \} \end{aligned}$$(12)
Action space \(A_n\): Since each IoT device is required to determine probability of its selected cloud server \(Pro(m_n^t)\), offloading ratio \(\alpha _n^t\) and local computing capability \(f_n^t\), the action space can be given by
$$\begin{aligned} \mathbf {a}_n^t=\Big \{Pro(m_n^t),\alpha _n^t,f_n^t\Big \} \end{aligned}$$(13) 
Reward function \(r_n\): To obtain the near-optimal policy for the task offloading optimization problem in Eq. (11), the agents should cooperate to minimize the total system cost. In other words, the reward function \(r_n^t\) guides the agent working at IoT device n to learn decisions that satisfy the constraints. An action is successful if its decision variables do not violate any of the constraints defined in Eqs. (11a) to (11d); in that case, the reward is defined as the product of a constant \(C_1\) and the reciprocal of the weighted sum \((\omega _1 \cdot E_n^t+\omega _2 \cdot C_n^m (t))\). Otherwise, the reward is a negative value \(-C_2\), which represents a punishment. The immediate reward obtained at each time slot t is expressed as
$$\begin{aligned} r_n(t)= \left\{ \begin{array}{ll} \frac{C_1}{\omega _1 \cdot E_n^t+\omega _2 \cdot C_n^m (t)}&{} \mathbf {a}_n^t\;is\;successful \\ -C_2&{} otherwise \end{array}\right. \end{aligned}$$(14)
where \(C_1\) and \(C_2\) are positive constants. Note that, because the agent maximizes the discounted cumulative reward, a successful multi-cloud offloading decision with lower total system cost yields a higher immediate reward.
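The immediate reward can be sketched as a small function. The penalty is taken as \(-C_2\) with \(C_2 > 0\), in line with the statement that the penalty is a negative value; the default weights and constants below are hypothetical placeholders, not values from the experiments.

```python
def reward(energy_j: float, charge: float, feasible: bool,
           w1: float = 0.5, w2: float = 0.5,
           c1: float = 1.0, c2: float = 1.0) -> float:
    """Immediate reward of one agent in one time slot.

    feasible is True when the action violates none of the constraints.
    The defaults for w1, w2, c1 and c2 are illustrative placeholders.
    """
    if not feasible:
        return -c2  # punishment for violating a constraint
    cost = w1 * energy_j + w2 * charge
    if cost <= 0.0:
        raise ValueError("weighted cost must be positive")
    return c1 / cost  # lower cost gives a higher reward
```

For example, an energy of 0.2 and a charge of 0.6 give a weighted cost of 0.4 and a reward of 2.5, while any infeasible action receives \(-1\) under these placeholder constants.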
Multi-agent DRL framework
Centralized decision-making DRL requires the BS to collect the environment state from all IoT devices and cloud servers. However, as the number of IoT devices or cloud servers increases, the communication overhead increases and the state-action space may grow exponentially, resulting in poor convergence efficiency. To deal with these challenges, we aim at obtaining promising computation offloading solutions with multi-agent DRL settings. However, traditional multi-agent DRL methods still suffer from overestimation and high variance in the high-dimensional discrete-continuous action space. Therefore, the TD3 algorithm is adopted to find an efficient probability of the selected cloud server \(Pro(m_n^t)\), offloading ratio \(\alpha _n^t\) and local computing capability \(f_n^t\) in dynamic multi-cloud environments.
In the multi-cloud offloading system, each IoT device n uses its local observations to determine its server selection \(Pro(m_n^t)\), task offloading ratio \(\alpha _n^t\) and local computation capacity \(f_n^t\). Thus, a twin delayed DDPG (TD3) agent is employed to learn distributed computation offloading policies by jointly optimizing the above three variables for each IoT device. This is referred to as the cooperative multi-agent TD3 (CMATD3) framework [32]. Figure 2 shows the framework of CMATD3 in the multi-cloud system. Following the centralized training and distributed execution strategy, each agent's actor network makes offloading decisions according to its local observation of the network state, and is trained in the proxy server located near the BS. The trained parameters are then periodically synchronized to each agent's actor network. On the other hand, each agent's two-critic network, which takes the global observation (i.e., the states and actions of all agents) as input, is deployed in the proxy server. Therefore, from the perspective of each agent, the learning environment is stationary, regardless of changes in any agent's policy.
The training stages of the CMATD3 agent are as follows. In each time slot t, for each agent n, the globally observable two-critic network in the proxy server is exploited to train the actor network, so as to obtain the computation offloading strategy. In addition, to stabilize the training process and improve training effectiveness, the local experience transition \(\left(\mathbf {s}_n^t,\mathbf {a}_n^t,r_n^t,\mathbf {s}_n^{t+1}\right)\) of each agent n is stored in the experience replay buffer deployed in the proxy server, which concatenates the local experience transitions of all agents into a global experience replay buffer \(\mathcal {B}\), expressed as \((\mathbf {s}_t,\mathbf {a}_t,r_t,\mathbf {s}_{t+1})=\left(\mathbf {s}_1^t,\mathbf {a}_1^t,r_1^t,\mathbf {s}_1^{t+1};\cdots ; \mathbf {s}_n^t,\mathbf {a}_n^t,r_n^t,\mathbf {s}_n^{t+1};\cdots ;\mathbf {s}_N^t,\mathbf {a}_N^t,r_N^t,\mathbf {s}_N^{t+1}\right)\).
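The global experience replay buffer can be sketched as a bounded queue of joint transitions, where each entry concatenates the local transitions of all N agents for one time slot. The class and method names are hypothetical, and uniform sampling without replacement is an assumption about how mini-batches are drawn.

```python
import random
from collections import deque


class GlobalReplayBuffer:
    """Bounded buffer of joint transitions held at the proxy server (sketch)."""

    def __init__(self, capacity: int):
        # The oldest joint transition is evicted once capacity is reached.
        self._buffer = deque(maxlen=capacity)

    def store(self, local_transitions):
        """Store one joint transition: a list of N tuples (s_n, a_n, r_n, s_n_next)."""
        self._buffer.append(tuple(local_transitions))

    def sample(self, batch_size: int, rng=random):
        """Draw a mini-batch uniformly without replacement (assumed scheme)."""
        return rng.sample(list(self._buffer), batch_size)

    def __len__(self) -> int:
        return len(self._buffer)
```

In use, the proxy server would call `store` once per time slot with the transitions of all agents and `sample` once per training step.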
Then, for each agent \(n, n = 1, 2, \cdots , N\), the actor function is approximated by a DNN with parameters \(\theta ^{\mu _n}\) as \(\pi _n^\mu (\mathbf {s}_n)\), which takes the state \(\mathbf {s}_n\) as input. Besides, the Q-function of the two-critic network is approximated by two DNNs with parameters \(\theta ^{Q_n^i}\) as \(Q_n^{\theta _i}(\mathbf {s},\mathbf {a}\mid \theta ^{Q_n^i})\), \(i = 1, 2\), which take the global state \(\mathbf {s}=(\mathbf {s}_1,\cdots ,\mathbf {s}_N)\) and action set \(\mathbf {a}=(\mathbf {a}_1,\cdots ,\mathbf {a}_N)\) as input. During the training process, each agent randomly samples a mini-batch \(\{ \mathbf {s}_j,\mathbf {a}_j,r_j,\mathbf {s}_j^{\prime }\}_{j=1}^{I}\) from the global experience replay buffer \(\mathcal {B}\). The policy gradient of the evaluation actor network can be derived as
In addition, to avoid overfitting to narrow peaks of the Q-values, the target action \(\mathbf {a}_j^{\prime }\) is defined as \(\mathbf {a}_j^{\prime }=\pi _n^{\mu ^{\prime }}(\mathbf {s}_j^{\prime })+\mathbb {N}\), where \(\mathbb {N} \backsim clip(\mathfrak {N}(0,\breve{\sigma }^2),-1,1)\) is clipped noise with mean 0 and standard deviation \(\breve{\sigma }\) added to the output of the target actor network. This noise helps TD3 achieve a smoother state-action value estimate. Based on the target policy smoothing scheme above, the target value \(y_j\) is defined as
Then, as mentioned above, the two Q-functions, \(Q_n^{\theta _1} (\mathbf {s}_j,\mathbf {a}_j)\) and \(Q_n^{\theta _2} (\mathbf {s}_j,\mathbf {a}_j)\), are obtained concurrently from the two-critic network. The weight parameters \(\theta ^{Q_n^i}\) of \(Q_n^{\theta _i} (\mathbf {s}_j,\mathbf {a}_j), i = 1, 2\), are updated by minimizing the loss function \(L(\theta _i)\), given as
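A minimal sketch of target policy smoothing, the target value and the critic loss, assuming the standard TD3 forms: Gaussian noise clipped to \([-1, 1]\) is added to the target action, the target value takes the smaller of the two target critic outputs, and each critic minimizes the mean squared error against that target. Scalars stand in for network outputs, and all function names are hypothetical.

```python
import random


def clipped_noise(sigma: float, rng: random.Random, bound: float = 1.0) -> float:
    """Gaussian noise with standard deviation sigma, clipped to [-bound, bound]."""
    return max(-bound, min(bound, rng.gauss(0.0, sigma)))


def smoothed_target_action(target_action, sigma: float, rng: random.Random):
    """Add clipped noise to each component of the target actor's output."""
    return [a + clipped_noise(sigma, rng) for a in target_action]


def td3_target(reward: float, gamma: float, q1_next: float, q2_next: float) -> float:
    """Assumed standard TD3 target: r + gamma * min(Q1', Q2')."""
    return reward + gamma * min(q1_next, q2_next)


def critic_loss(q_values, targets) -> float:
    """Mean squared error between critic outputs and target values."""
    return sum((y - q) ** 2 for q, y in zip(q_values, targets)) / len(targets)
```

Taking the minimum of the two target critics is what counteracts the overestimation issue, since an inflated estimate from one critic is ignored whenever the other is lower.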
Next, based on Eqs. (15) and (17), with \(\lambda\) denoting the learning rate, the weights of the evaluation actor network and the two evaluation critic networks are updated by
In addition, to reduce the temporal difference (TD) error, each agent updates the evaluation actor network's weights at a lower frequency than the critics. Here, each IoT device updates the evaluation actor network every \(\Gamma\) time slots.
Finally, to stabilize the training process, each agent copies the weights of the corresponding evaluation networks and updates the weights of the target actor network and the target two-critic network. Thus, the weights of the target actor network and the target two-critic network are obtained as
where \(\eta\) is the updating rate.
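The delayed actor update and the target network update can be sketched as follows, assuming the usual soft (Polyak) update in which each target weight moves a fraction \(\eta\) toward the corresponding evaluation weight. Weights are represented as flat lists of floats for illustration, and the function names are hypothetical.

```python
def soft_update(target_weights, eval_weights, eta: float = 0.005):
    """Assumed soft update: theta_target <- eta * theta_eval + (1 - eta) * theta_target."""
    return [eta * w + (1.0 - eta) * w_t
            for w_t, w in zip(target_weights, eval_weights)]


def actor_update_due(step: int, period: int) -> bool:
    """True when the delayed actor update should run (every `period` slots)."""
    return step % period == 0
```

With \(\eta = 0.005\), a target weight of 0 paired with an evaluation weight of 1 moves to 0.005 after one update, so the target networks track the evaluation networks slowly.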
The time complexity of Algorithm 1 mainly depends on the number of IoT devices, as well as the structure of the neural networks implementing the actor and two-critic network of each TD3 agent. For each TD3 agent, we assume that the number of fully connected layers of the actor network and the two-critic network is J and 2L, respectively. Thus, the time complexity can be calculated as
where N is the number of agents in the multi-device, multi-cloud environment, \(u_{A,l}\) stands for the number of units in layer l of the actor network, and \(u_{C,j}\) represents the number of units in layer j of the two-critic network. Note that \(u_{A,0}\) and \(u_{C,0}\) equal the input sizes of the actor network and the two-critic network, respectively.
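To make the dependence on network structure concrete, the sketch below counts multiply-accumulate operations for one forward pass of all agents, assuming the cost of a fully connected layer is the product of its input and output sizes and that each agent has one actor and two critics. The layer sizes in the example are assumptions: a local state of dimension 8 (four task entries, one channel gain and M = 3 server capacities), three action components, hidden layers of 400 and 300 units, and a critic input that concatenates the states and actions of N = 3 agents.

```python
def dense_macs(layer_sizes) -> int:
    """Multiply-accumulates of one forward pass through fully connected layers."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


def per_step_macs(n_agents: int, actor_sizes, critic_sizes) -> int:
    """Total for all agents: one actor plus two critics per agent."""
    return n_agents * (dense_macs(actor_sizes) + 2 * dense_macs(critic_sizes))


# Assumed sizes for N = 3 agents and M = 3 servers.
ACTOR = [8, 400, 300, 3]
CRITIC = [3 * (8 + 3), 400, 300, 1]
TOTAL = per_step_macs(3, ACTOR, CRITIC)
```

Under these assumed sizes, the actor needs 124,100 operations, each critic 133,500, and all three agents together 1,173,300 per forward pass, which grows linearly with N apart from the wider critic input.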
Simulation results
Experiment setup
In this experiment, IoT devices interact with multi-cloud servers to show in detail how the offloading policy changes with the environment. The task arrival \(\breve{d}_n^t\) follows a Poisson process with a mean data arrival rate of 300 kbps. The \(d_n^t\) is uniformly distributed in [1, 7] Mbits, the \(\bar{D}_n^t\) is uniformly generated in [2, 5] s, and the \(c_n^t\) is uniformly generated in [200, 500] cycles/bit. Each time slot lasts 1 s.
Besides, the system parameters are set as follows: maximum computation capacity \(F_n^{max}=0.5\) GHz, noise power \(\varrho ^2=-174\) dBm/Hz [33], transmission power \({P}_n=2\) W, and effective switched capacitance constant \(\kappa =10^{-27}\). The channel gain \(g_n^t\) is exponentially distributed with mean \(g_0 \cdot (rad_0/rad_n)^e\), where \(g_0=-30\) dB is the path-loss constant, \(rad_0=1\) m is the reference distance, \(rad_n\) is the distance between the BS and IoT device n, and \(e=3\) is the path-loss exponent. The computing capability of each cloud server \(f_m^t\) is uniformly generated in [2, 6] GHz.
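The channel model can be sketched as follows, reading the path-loss constant as \(-30\) dB and the noise density as \(-174\) dBm/Hz (the signs are an assumption) and converting both to linear units. The function names and the 100 m example distance are hypothetical.

```python
import random


def db_to_linear(value_db: float) -> float:
    """Convert a decibel value to a linear power ratio."""
    return 10.0 ** (value_db / 10.0)


def mean_channel_gain(distance_m: float, g0_db: float = -30.0,
                      ref_distance_m: float = 1.0, exponent: float = 3.0) -> float:
    """Mean gain g0 * (rad0 / rad_n)^e in linear units."""
    return db_to_linear(g0_db) * (ref_distance_m / distance_m) ** exponent


def sample_channel_gain(distance_m: float, rng: random.Random) -> float:
    """Draw an exponentially distributed gain with the mean above."""
    return rng.expovariate(1.0 / mean_channel_gain(distance_m))


def noise_power_watt(bandwidth_hz: float, density_dbm_per_hz: float = -174.0) -> float:
    """Noise power in watts over the given bandwidth (dBm/Hz converted via 1 mW)."""
    return db_to_linear(density_dbm_per_hz) * 1e-3 * bandwidth_hz
```

At a distance of 100 m, the mean gain under this reading is about \(10^{-9}\), and the noise power over a 1 MHz sub-band is roughly \(4\times 10^{-15}\) W.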
For the proposed CMATD3 framework, both the actor and two-critic networks are four-layer fully connected neural networks with two hidden layers, where the numbers of neurons in the two hidden layers are 400 and 300, respectively. The learning rate of the actor network is initialized as 0.0001. We set the maximum experience replay buffer size \(\mathcal {B} = 2.5\times 10^5\), the target network update rate \(\eta = 0.005\), and the discount factor \(\gamma = 0.99\). In the training stage, the total number of episodes is \(K_{max} = 2000\), and the maximum number of time slots in each episode is \(T = 200\). The Adam optimizer is used to optimize the loss function during training. In the testing stage, the results obtained in 100 runs are averaged.
We run all experiments on a workstation with two Intel Xeon E5-2667 v4 8-core CPUs @ 3.2 GHz, 128 GB RAM, and four NVIDIA GTX Titan V 12 GB GPUs. It takes around 130 s to run an episode on average.
Parameter study of CMATD3 agent
To verify the training efficiency, we study the impact of the learning rate and the batch size on the performance of the proposed CMATD3 agent, as shown in Fig. 3(a) and (b). The training process of the CMATD3 agent is usually conducted offline. The number of cloud servers M is set to 3, and the number of IoT devices N is set to 3. Figure 3(a) shows the normalized average reward of CMATD3 with different learning rates in the two-critic networks. With a small learning rate, i.e., 0.0001, the CMATD3 agent cannot reach high reward values, since each update of the DNN's parameters is too small. On the contrary, a large learning rate, i.e., 0.01, may lead to overly rapid changes in the weight parameters of the DNN. A learning rate of 0.001 performs better than both 0.01 and 0.0001. Thus, we hereafter fix the learning rate to 0.001. Figure 3(b) depicts the normalized average rewards of CMATD3 with different batch sizes. As shown in Fig. 3(b), batch sizes of both 32 and 128 lead to deteriorated training performance, and the cumulative reward curve oscillates at low values. This is because a small batch size cannot efficiently cover the majority of transitions stored in the experience replay buffer, while a large batch size may cause previously non-effective transitions to be frequently sampled from the experience replay buffer and trained on. Hence, we hereafter set the batch size to 64.
To investigate the scalability of the proposed CMATD3 agent, we evaluate its performance with different numbers of cloud servers and IoT devices, as shown in Fig. 4. As the number of IoT devices increases, more computation tasks are waiting to be offloaded, which results in a higher total system cost. On the other hand, as the number of cloud servers M increases, more cloud servers participate in computation offloading. When the number of IoT devices and tasks is constant, the more cloud servers participate, the lower the total system cost. Nevertheless, it is unnecessary for more cloud servers to participate in computation offloading when there are only a few IoT devices and tasks. Taking \(N = 3\) as an example, the performance with \(M = 2\) is almost the same as that with \(M = 3\). Moreover, when the number of cloud servers M is fixed to 3 and the number of IoT devices N increases to 8, the CMATD3 agent is still competent for the multicloud computation offloading problem. These results verify the high scalability of the proposed CMATD3 agent with regard to the number of cloud servers and the sizes of the state and action spaces.
Figure 5 displays how the energy consumption and the renting charge vary with the weight parameter \(\omega _2\). Note that the weight parameter \(\omega _1\) is set to 1. Specifically, \(\omega _1\) and \(\omega _2\) indicate the relative importance of energy consumption and renting charge, respectively. For example, a small \(\omega _2\) puts more weight on the energy consumption. In Fig. 5, as the weight \(\omega _2\) increases from 0.2 to 1.8, the renting charge is emphasized more and fewer tasks are offloaded to the cloud servers, which results in a lower renting charge and higher energy consumption. Nevertheless, when \(\omega _2\) increases to 1.6 and 1.8, the decrease of the renting charge curve slows down. This is because the computation capacity offered by the cloud servers is limited, and less offloaded task data leads to higher energy consumption of the IoT devices.
Figure 6 shows the performance gap between the proposed CMATD3 and the theoretical optimal result. We obtain the theoretical optimal result at each time slot and plot it as the black line. The experimental results are obtained by running CMATD3 under the experimental setup described above. It can be observed that the theoretical optimal results are close to 0.9, while the normalized system costs oscillate around 0.8. The average gap between the optimal result and the experimental result is less than 0.1, which indicates that the proposed CMATD3 achieves near-optimal results.
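The batch-size discussion refers to uniform sampling from a bounded experience replay buffer. A minimal sketch of such a buffer is given below, using the buffer size \(2.5\times 10^5\) and batch size 64 as defaults. The class name and the transition layout are assumptions for illustration, not the authors' implementation.

```python
import random
from collections import deque
from typing import Any, List, Optional, Tuple

Transition = Tuple[Any, Any, float, Any]  # (state, action, reward, next_state)


class ReplayBuffer:
    """Fixed-capacity buffer that discards the oldest transitions first."""

    def __init__(self, capacity: int = 250_000, seed: Optional[int] = None) -> None:
        self._storage: deque = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def __len__(self) -> int:
        return len(self._storage)

    def add(self, state: Any, action: Any, reward: float, next_state: Any) -> None:
        """Store one transition, evicting the oldest one when the buffer is full."""
        self._storage.append((state, action, reward, next_state))

    def sample(self, batch_size: int = 64) -> List[Transition]:
        """Draw a mini-batch uniformly at random, without replacement."""
        if batch_size > len(self._storage):
            raise ValueError("not enough transitions stored for this batch size")
        indices = self._rng.sample(range(len(self._storage)), batch_size)
        return [self._storage[i] for i in indices]
```

A small batch touches only a tiny fraction of the stored transitions per update, whereas a large batch revisits old transitions more often, which matches the trade-off described for Fig. 3(b).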
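The effect of \(\omega _2\) can be illustrated with a toy calculation. The sketch assumes that the total system cost is the weighted sum \(\omega _1 \cdot\) energy \(+\ \omega _2 \cdot\) charge, and it uses two purely hypothetical normalized curves (energy falling and charge rising with the offloading ratio). Neither curve is taken from the paper.

```python
from typing import Callable, Sequence


def total_cost(energy: float, charge: float, w1: float = 1.0, w2: float = 1.0) -> float:
    """Weighted sum of energy consumption and renting charge (assumed form)."""
    return w1 * energy + w2 * charge


def toy_energy(ratio: float) -> float:
    """Hypothetical normalized device energy; falls as more data is offloaded."""
    return (1.0 - ratio) ** 2


def toy_charge(ratio: float) -> float:
    """Hypothetical normalized renting charge; rises as more data is offloaded."""
    return ratio ** 2


def best_offloading_ratio(
    w2: float,
    ratios: Sequence[float],
    energy_of: Callable[[float], float] = toy_energy,
    charge_of: Callable[[float], float] = toy_charge,
    w1: float = 1.0,
) -> float:
    """Pick the candidate ratio with the smallest weighted cost."""
    return min(ratios, key=lambda a: total_cost(energy_of(a), charge_of(a), w1, w2))


grid = [i / 100 for i in range(101)]
low_weight_choice = best_offloading_ratio(0.2, grid)   # charge matters little
high_weight_choice = best_offloading_ratio(1.8, grid)  # charge matters a lot
```

Under these toy curves the preferred offloading ratio drops as \(\omega _2\) grows, which mirrors the qualitative trend reported for Fig. 5: less data is offloaded, so the charge falls and the device energy rises.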
Performance evaluation and analysis
To validate the effectiveness and advantage of the proposed CMATD3 algorithm for multicloud task offloading, we conduct extensive comparative experiments with varying system parameters. On the one hand, the performance of four DRL-based algorithms (i.e., MADQN(5), MADQN(10), MDHybridAC [22], and MADDPG [34]) is assessed. On the other hand, two heuristic algorithms (i.e., ACO [11] and SPSOGA [12]) are also evaluated. The compared agents are as follows.

MADQN: When coping with dynamic multicloud offloading problems with a continuous-discrete hybrid action space, the action values are quantized first. We develop two multiagent DQN variants that differ in the number of discretization levels. MADQN(5): For an agent located at an IoT device, the ranges of both continuous decision variables, i.e., the offloading ratio \(\alpha _n^t\) and the computation capacity \(f_n^t\), are equally divided into 5 levels. In addition, the cloud server selection \(m_n^t\) takes 3 values. Thus, the action dimension of each agent is 13. MADQN(10): For each agent, the ranges of both continuous decision variables, i.e., the offloading ratio \(\alpha _n^t\) and the computation capacity \(f_n^t\), are equally divided into 10 levels.

MDHybridAC [22]: An improved actor-critic architecture that tackles the continuous-discrete hybrid-decision-based computation offloading problem, adopting a centralized training and decentralized execution framework.

MADDPG [34]: A cooperative multiagent DDPG framework, which is employed to learn decentralized dynamic computation offloading policies.

CMATD3: The proposed agent in this paper.
In MADQN(5), MADQN(10), MDHybridAC, and MADDPG, the hyperparameters of the DNNs are exactly the same as those of CMATD3.
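The quantization used by the MADQN baselines can be sketched as follows. The helper names are hypothetical, and the sketch assumes that the levels are evenly spaced and include both endpoints of each range. The dimension count reproduces the figure of 13 stated for MADQN(5), namely 5 + 5 + 3.

```python
from typing import List


def uniform_levels(low: float, high: float, levels: int) -> List[float]:
    """Evenly spaced grid over [low, high], including both endpoints (assumed)."""
    if levels < 2:
        raise ValueError("at least two levels are required")
    step = (high - low) / (levels - 1)
    return [low + i * step for i in range(levels)]


def quantize(value: float, grid: List[float]) -> float:
    """Map a continuous value to the nearest grid level."""
    return min(grid, key=lambda g: abs(g - value))


def action_dimension(levels: int, num_servers: int) -> int:
    """Levels for the offloading ratio, levels for the capacity, plus server choices."""
    return 2 * levels + num_servers


def joint_action_count(levels: int, num_servers: int, num_agents: int) -> int:
    """Number of joint discrete actions when every agent combines all three choices."""
    return (levels * levels * num_servers) ** num_agents


ratio_grid = uniform_levels(0.0, 1.0, 5)   # offloading ratio levels
nearest = quantize(0.3, ratio_grid)        # a ratio of 0.3 snaps to 0.25
```

The gap between 0.3 and 0.25 in the usage lines is an instance of quantization noise, and `joint_action_count` shows how quickly the combined discrete space grows with the number of agents.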
Convergence of the five algorithms
Figure 7 shows the convergence of the five agents during training. The normalized reward grows steadily as the number of training episodes increases, i.e., more training episodes lead to a higher normalized reward. One can clearly see that CMATD3 achieves the best convergence among all algorithms in terms of the normalized reward. This is because the two independent critic networks in TD3 can efficiently alleviate the overestimation issue, improving the training stability and effectiveness.
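The way two critics curb overestimation in TD3 is the clipped double-Q target, which bootstraps from the smaller of the two target-critic estimates. The sketch below shows that rule for a single transition with \(\gamma = 0.99\). It follows the standard TD3 formulation and is not claimed to match the exact multiagent target used in CMATD3.

```python
def td3_target(
    reward: float,
    q1_next: float,
    q2_next: float,
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """Clipped double-Q target: reward + gamma * min(Q1', Q2') for non-terminal steps."""
    bootstrap = 0.0 if done else gamma * min(q1_next, q2_next)
    return reward + bootstrap


# Usage: an optimistic estimate from one critic is ignored.
target_value = td3_target(reward=1.0, q1_next=10.0, q2_next=8.0)
```

Because the minimum is taken, a single critic that overestimates the next-state value cannot inflate the target on its own.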
Performance comparison against currently processed task data size
Figure 8(a) and (b) display the influence of the currently processed task data size, \(d_n^t\), on the total system cost for \(M = 2\) and \(M = 3\) cloud servers. As the task data size increases, the total energy consumption grows steadily, and the performance of the five DRL agents deteriorates. Furthermore, as more cloud servers join in task offloading, a lower total system cost can be achieved, as shown in Fig. 8(a) and (b). Figure 8 also shows that MADQN(5) and MADQN(10) consistently have high system costs because of their inflexible and naive behaviors. The MDHybridAC agent performs comparably to MADDPG when the task data size is small, but its performance deteriorates further as the task data size increases. Besides, the CMATD3 agent achieves a lower system cost than MDHybridAC across the range of \(d_n^t\), which means that CMATD3 adapts better to new learning tasks owing to the coordination among IoT device agents.
Performance comparison against system bandwidth
We evaluate the total system cost of the five DRL-based agents with different system bandwidths for \(M = 2\) and \(M = 3\) cloud servers. In Fig. 9, it can be easily observed that as the system bandwidth W increases, the total system cost of the DRL-based optimization methods goes down. This is because the transmission rate from IoT device n to cloud server m gradually goes up, which results in lower transmission energy consumption. As a result, the total system cost of each DRL-based agent decreases for both numbers of cloud servers. Compared with the case of \(M = 2\) in Fig. 9(a), more cloud servers participate in computation offloading when \(M = 3\) in Fig. 9(b), which contributes to a lower total system cost for all five DRL-based optimization methods. The results also show that the system cost of CMATD3 decreases gradually and remains the lowest among all schemes as the system bandwidth increases. This is because CMATD3 makes better decisions on server selection, offloading ratio and local computation capacity than the MADQN(5), MADQN(10), MADDPG, and MDHybridAC agents.
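The link between bandwidth and transmission energy can be checked numerically. The sketch assumes a Shannon-capacity uplink rate \(r = W \log _2(1 + P g / (N_0 W))\) with a noise density of -174 dBm/Hz and computes the transmission energy as \(P \cdot\) bits \(/ r\). The bandwidth values, the channel gain and the function names are illustrative assumptions.

```python
import math


def dbm_to_watt(dbm: float) -> float:
    """Convert a power level in dBm to watts."""
    return 10.0 ** ((dbm - 30.0) / 10.0)


def uplink_rate(
    bandwidth_hz: float,
    tx_power_w: float,
    gain: float,
    noise_psd_dbm_hz: float = -174.0,
) -> float:
    """Shannon rate in bit/s over a noise-limited channel (assumed model)."""
    noise_w = dbm_to_watt(noise_psd_dbm_hz) * bandwidth_hz
    return bandwidth_hz * math.log2(1.0 + tx_power_w * gain / noise_w)


def tx_energy(bits: float, bandwidth_hz: float, tx_power_w: float, gain: float) -> float:
    """Energy in joules spent to upload `bits` at constant transmit power."""
    return tx_power_w * bits / uplink_rate(bandwidth_hz, tx_power_w, gain)


# Usage: upload 1 Mbit at 2 W with a hypothetical gain of 8e-12.
bandwidths = [1e6, 2e6, 5e6, 1e7]
energies = [tx_energy(1e6, w, 2.0, 8e-12) for w in bandwidths]
```

Under this model the energy values shrink as the bandwidth grows, in line with the explanation given for Fig. 9.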
Performance comparison against number of IoT devices
In multicloud environments, coordination among IoT devices is more challenging, since the number of IoT devices may change as devices leave or arrive. Therefore, to further analyze the scalability of the five DRL-based agents, we discuss the impact of N on the total system cost. For the sake of simplicity, all IoT devices are assumed to be randomly scattered between 500 m and 1000 m from the BS, and the number of cloud servers is set to 3.
Figure 10 shows the performance of the five DRL-based computation offloading schemes with different N. In Fig. 10, as the number of IoT devices increases, the average cost of each agent gradually grows. The reason is as follows. A larger N leads to a higher probability that more IoT devices communicate with the cloud servers at the same time, resulting in more severe interference among IoT devices. In this case, more energy is consumed to transmit a given amount of data, which leads to a higher average normalized system cost during the uplink data transmission process.
One can clearly see that the proposed CMATD3 agent performs significantly better than the other four DRL-based agents. In the case of \(N < 6\), the total system cost of the MADDPG agent is close to those of both the CMATD3 and MDHybridAC agents, because the task data incurred on each IoT device is uniformly distributed. MADQN(5) and MADQN(10) do not perform well, since their search space becomes extremely large as the number of IoT devices increases, resulting in a serious performance degradation. Compared with MADQN(5), MADQN(10) improves the performance slightly with the increase of quantization levels, but it remains far worse than the proposed CMATD3. The reason is that the quantization process induces quantization noise, which loses many features of the action and impedes MADQN from finding the optimal policy. Besides, MDHybridAC suffers more performance degradation than CMATD3 under different N, since MDHybridAC cannot efficiently adapt to changes in the network scale.
Table 1 compares the performance of CMATD3 with two heuristic algorithms, ACO [11] and SPSOGA [12], under different numbers of IoT devices. CMATD3 outperforms both ACO and SPSOGA, as it always obtains the smallest normalized system cost. For instance, when the number of IoT devices N = 8, the normalized system cost of the proposed CMATD3 is 0.67, as against 0.998 and 0.821 for ACO and SPSOGA, respectively. The reason is as follows. The CMATD3 algorithm takes advantage of centralized training and distributed execution. In contrast, ACO and SPSOGA normally need a considerable number of iterations to reach a near optimum. As the number of IoT devices increases, the search space grows exponentially, and they may easily fall into a local optimum during optimization.
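The interference argument can be made concrete with a small sketch. It assumes that the other active devices transmit at the same power and add to the noise in the signal-to-interference-plus-noise ratio (SINR). This is a common modelling choice and not necessarily the exact model of the paper, and all numeric values are hypothetical.

```python
import math
from typing import Sequence


def sinr(own_gain: float, interferer_gains: Sequence[float], tx_power_w: float, noise_w: float) -> float:
    """Ratio of received signal power to noise plus co-channel interference."""
    interference_w = tx_power_w * sum(interferer_gains)
    return tx_power_w * own_gain / (noise_w + interference_w)


def energy_per_bit(
    own_gain: float,
    interferer_gains: Sequence[float],
    tx_power_w: float,
    noise_w: float,
    bandwidth_hz: float,
) -> float:
    """Joules needed to send one bit at the Shannon rate under interference."""
    rate = bandwidth_hz * math.log2(1.0 + sinr(own_gain, interferer_gains, tx_power_w, noise_w))
    return tx_power_w / rate


# Usage: 0, 2 and 5 interfering devices, each with a weaker hypothetical gain.
own = 8e-12
costs = [
    energy_per_bit(own, [1e-12] * k, tx_power_w=2.0, noise_w=4e-15, bandwidth_hz=1e6)
    for k in (0, 2, 5)
]
```

Each additional interferer lowers the SINR and the achievable rate, so the energy needed per transmitted bit rises with the number of simultaneously active devices.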
Conclusions and future work
This paper investigated the dynamic computation offloading problem in a hybrid-decision-based collaborative multicloud computing network, in which the time-varying computing requirements, wireless channel gains and network scale are comprehensively considered. The optimization problem was formulated to minimize the long-term average total system cost, which consists of the energy consumption of IoT devices and the renting charge of cloud servers. We addressed the issues of hybrid decisions and collaboration among different IoT devices in two steps. Specifically, we first relaxed the discrete action (e.g., cloud server selection) into a continuous set by designing a probabilistic method. Then, a cooperative multiagent DRL (CMADRL) based framework, with each IoT device acting as an agent, was developed to obtain the optimal cloud server selection, offloading ratio and local computation capacity. Experiments were performed to verify the effectiveness and superiority of the proposed CMADRL-based framework over four other state-of-the-art DRL-based frameworks.
In future work, we will consider establishing an edge-cloud computing network system to execute computing tasks collaboratively. Moreover, we will study how to reasonably reduce the computation complexity and communication overhead of the training process. In particular, we will try to take advantage of federated-learning-based DRL, which only requires BS agents to share their model parameters instead of local training data.
Availability of data and materials
The data used during the current study are available from the corresponding author on reasonable request.
Abbreviations
DRL: Deep reinforcement learning
IoT: Internet-of-Things
BS: Base station
CMADRL: Cooperative multiagent deep reinforcement learning
DDPG: Deep deterministic policy gradient
TD3: Twin delayed DDPG
MATD3: Multiagent TD3
References
Gai K, Guo J, Zhu L, Yu S (2020) Blockchain meets cloud computing: a survey. IEEE Commun Surv Tutorials 22(3):2009–2030
Li K, Zhao J, Hu J et al (2022) Dynamic energy efficient task offloading and resource allocation for nomaenabled iot in smart buildings and environment. Build Environ. https://doi.org/10.1016/j.buildenv.2022.109513
Chen C, Zeng Y, Li H, Liu Y, Wan S (2022) A multihop task offloading decision model in mecenabled internet of vehicles. IEEE Internet Things J: 1. https://doi.org/10.1109/JIOT.2022.3143529
Chen Y, Zhao F, Lu Y, Chen X () Dynamic task offloading for mobile edge computing with hybrid energy supply. Tsinghua Sci Technol https://doi.org/10.26599/TST.2021.9010050
Chen Y, Xing H, Ma Z, et al (2022) Costefficient edge caching for nomaenabled iot services. China Communications
Armbrust M, Fox A, Griffith R, Joseph AD, Katz R, Konwinski A, Lee G, Patterson D, Rabkin A, Stoica I et al (2010) A view of cloud computing. Commun ACM 53(4):50–58
Dinh HT, Lee C, Niyato D, Wang P (2013) A survey of mobile cloud computing: architecture, applications, and approaches. Wirel Commun Mob Comput 13(18):1587–1611
Huang J, Tong Z, Feng Z (2022) Geographical poi recommendation for internet of things: A federated learning approach using matrix factorization. Int J Commun Syst e5161 https://doi.org/10.1002/dac.5161
Apostolopoulos PA, Fragkos G, Tsiropoulou EE, Papavassiliou S (2021) Data offloading in uavassisted multiaccess edge computing systems under resource uncertainty. IEEE Trans Mob Comput: 1. https://doi.org/10.1109/TMC.2021.3069911
Tang X (2021) Reliabilityaware costefficient scientific workflows scheduling strategy on multicloud systems. IEEE Trans Cloud Comput: 1. https://doi.org/10.1109/TCC.2021.3057422
Addya SK, Satpathy A, Ghosh BC, Chakraborty S, Ghosh SK, Das SK (2021) CoMCLOUD: Virtual machine coalition for multitier applications over multicloud environments. IEEE Trans Cloud Comput: 1. https://doi.org/10.1109/TCC.2021.3122445
Chen X, Zhang J, Lin B, Chen Z, Wolter K, Min G (2022) Energyefficient offloading for dnnbased smart iot systems in cloudedge environments. IEEE Trans Parallel Distrib Syst 33(3):683–697. https://doi.org/10.1109/TPDS.2021.3100298
Chen Y, Zhao J, Wu Y (2022) QoEaware decentralized task offloading and resource allocation for endedgecloud systems: A gametheoretical approach. IEEE Trans Mob Comput. https://doi.org/10.1109/TMC.2022.3223119
Xu J, Li D, Gu W et al (2022) Uavassisted task offloading for iot in smart buildings and environment via deep reinforcement learning. Build Environ. https://doi.org/10.1016/j.buildenv.2022.109218
Tuli S, Ilager S, Ramamohanarao K, Buyya R (2020) Dynamic scheduling for stochastic edgecloud computing environments using a3c learning and residual recurrent neural networks. IEEE Trans Mob Comput 21(3):940–954. https://doi.org/10.1109/TMC.2020.3017079
Zhang Y, Di B, Zheng Z, Lin J, Song L (2020) Distributed multicloud multiaccess edge computing by multiagent reinforcement learning. IEEE Trans Wirel Commun 20(4):2565–2578
Chen Z, Hu J, Min G, Luo C, ElGhazawi T (2022) Adaptive and efficient resource allocation in cloud datacenters using actorcritic deep reinforcement learning. IEEE Trans Parallel Distrib Syst 33(8):1911–1923. https://doi.org/10.1109/TPDS.2021.3132422
Zhou P, Wu G, Alzahrani B, Barnawi A, Alhindi A, Chen M (2021) Reinforcement learning for task placement in collaborative cloudedge computing. In: 2021 IEEE Global Communications Conference (GLOBECOM). IEEE, Madrid, pp 1–6
Qu G, Wu H, Li R, Jiao P (2021) Dmro: A deep meta reinforcement learningbased task offloading framework for edgecloud computing. IEEE Trans Netw Serv Manag 18(3):3448–3459
Chen L, Xu Y, Lu Z, Wu J, Gai K, Hung PC, Qiu M (2020) Iot microservice deployment in edgecloud hybrid environment using reinforcement learning. IEEE Internet Things J 8(16):12610–12622
Chen Y, Sun Y, Wang C, Taleb T (2022) Dynamic task allocation and service migration in edgecloud iot system based on deep reinforcement learning. IEEE Internet Things J 9(18):16742–16757. https://doi.org/10.1109/JIOT.2022.3164441
Zhang J, Du J, Shen Y, Wang J (2020) Dynamic computation offloading with energy harvesting devices: A hybriddecisionbased deep reinforcement learning approach. IEEE Internet Things J 7(10):9303–9317
Oroojlooyjadid A, Hajinezhad D (2019) A review of cooperative multiagent deep reinforcement learning. https://doi.org/10.48550/arXiv.1908.03963
Muñoz O, PascualIserte A, Vidal J (2015) Optimization of radio and computational resources for energy efficiency in latencyconstrained application offloading. IEEE Trans Veh Technol 64(10):4738–4755. https://doi.org/10.1109/TVT.2014.2372852
Chen Y, Zhao F, Chen X, Wu Y (2022) Efficient multivehicle task offloading for mobile edge computing in 6g networks. IEEE Trans Veh Technol 71(5):4584–4595. https://doi.org/10.1109/TVT.2021.3133586
Chen J, Wu Z (2021) Dynamic computation offloading with energy harvesting devices: A graphbased deep reinforcement learning approach. IEEE Commun Lett 25(9):2968–2972. https://doi.org/10.1109/LCOMM.2021.3094842
Chen J, Xing H, Xiao Z, Xu L, Tao T (2021) A drl agent for jointly optimizing computation offloading and resource allocation in mec. IEEE Internet Things J 8(24):17508–17524. https://doi.org/10.1109/JIOT.2021.3081694
Chen C, Jiang J, Zhou Y, Lv N, Liang X, Wan S (2022) An edge intelligence empowered flooding process prediction using internet of things in smart city. J Parallel Distrib Comput 165:66–78
Huang J, Gao H, Wan S et al (2023) Aoiaware energy control and computation offloading for industrial iot. Futur Gener Comput Syst 139:29–37
Chen C, Li H, Li H, Fu R, Liu Y, Wan S (2022) Efficiency and fairness oriented dynamic task offloading in internet of vehicles. IEEE Trans Green Commun Netw
Lowe R, Wu Y, Tamar A, Harb J (2017) Multiagent actorcritic for mixed cooperativecompetitive environments. https://doi.org/10.48550/arXiv.1706.02275
Fujimoto S, Hoof HV, Meger D (2018) Addressing function approximation error in actorcritic methods. https://doi.org/10.48550/arXiv.1802.09477
Chen Y, Gu W, Xu J, et al (2022a) Dynamic task offloading for digital twinempowered mobile edge computing via deep reinforcement learning. China Commun
Chen Z, Zhang L, Pei Y, Jiang C, Yin L (2022) Nomabased multiuser mobile edge computation offloading via cooperative multiagent deep reinforcement learning. IEEE Trans Cogn Commun Netw 8(1):350–364. https://doi.org/10.1109/TCCN.2021.3093436
Acknowledgements
The authors would like to thank all the staff and students of the School of Computer and Software Engineering at Xihua University for their contributions during this research.
Funding
The work of this paper is supported by the National Science Foundation of China (No. 62171387).
Author information
Contributions
Problem formulation: Juan Chen, Peng Chen. The proposed algorithm: Peng Chen, Xianhua Niu. Computer simulations: Ling Xiong, Canghong Shi. Article preparation: Juan Chen, Zongling Wu. The authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Chen, J., Chen, P., Niu, X. et al. Task offloading in hybrid-decision-based multicloud computing network: a cooperative multiagent deep reinforcement learning. J Cloud Comp 11, 90 (2022). https://doi.org/10.1186/s13677-022-00372-9
DOI: https://doi.org/10.1186/s13677-022-00372-9