In this section, we first relax the discrete decision variable into a continuous one. Then, we model the task offloading optimization problem as a multi-agent Markov Decision Process (MDP). Finally, the procedure of the cooperative multi-agent twin delayed DDPG (CMATD3) algorithm is introduced in detail.
Discrete decision variable relaxation
To address these challenges in Eq. (10), we adopt a probabilistic method to convert the discrete decision variable \(m_n^t\) into a continuous one. In particular, let \(Pro(m_n^t) \in \left[\frac{m-1}{M},\frac{m}{M}\right]\) denote the probability that the task is offloaded to cloud server m. In other words, if IoT device n chooses a continuous decision variable \(Pro(m_n^t)\) falling in this interval, it is regarded as selecting cloud server m to offload its task at time slot t. Taking \(M = 5\) cloud servers as an example, if \(Pro(m_n^t) \in \left[\frac{1}{5},\frac{2}{5}\right]\), server \(m = 2\) is chosen to offload the task. Therefore, the total system cost minimization problem can be reformulated as follows,
$$\begin{aligned} {\underset{Pro(m_n^t),\alpha _n^t,f_n^t}{\min}}\left({\underset{T \rightarrow \infty }{\lim}} \frac{1}{T} \sum \limits _{t=1}^T\left(\sum \limits _{n=1}^{N}\left(\omega _1 \cdot E_n^t + \omega _2 \cdot C_n^m(t)\right)\right)\right) \end{aligned}$$
(11)
s.t.
$$\begin{aligned} C1&:&max(D_n^{loc}(t),D_{n,m}^{tra}(t)+D_n^m(t)) \le \bar{D}_n^t,\ \forall n \in \mathcal {N} ,\forall t \in \mathcal {T} \end{aligned}$$
(11a)
$$\begin{aligned} C2&:&0\le Pro(m_n^t)\le 1,\forall m \in \mathcal {M} ,\forall n \in \mathcal {N} ,\forall t \in \mathcal {T} \end{aligned}$$
(11b)
$$\begin{aligned} C3&:&0\le \alpha _n^t \le 1 ,\forall n \in \mathcal {N} ,\forall t \in \mathcal {T} \end{aligned}$$
(11c)
$$\begin{aligned} C4&:&0\le f_n^t \le F_n^{max},\forall n \in \mathcal {N} ,\forall t \in \mathcal {T} \end{aligned}$$
(11d)
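The probability-to-server mapping and the constraints above can be sketched in a few lines of Python; the function names and the flattened `delay`/`deadline` interface below are our own illustrative choices, not from the paper:

```python
import math

def server_from_prob(pro: float, M: int) -> int:
    """Map the continuous variable Pro(m_n^t) in [0, 1] to a server index
    m in {1, ..., M}: a value in [(m-1)/M, m/M] selects cloud server m."""
    if not 0.0 <= pro <= 1.0:
        raise ValueError("Pro(m_n^t) must lie in [0, 1]")
    return max(1, math.ceil(pro * M))  # ceil picks the enclosing sub-interval

def action_feasible(pro: float, alpha: float, f: float,
                    f_max: float, delay: float, deadline: float) -> bool:
    """Check constraints C1-C4 of Eq. (11) for one device at one time slot;
    `delay` stands for max(D_loc, D_tra + D_m) and `deadline` for the bound."""
    return (delay <= deadline          # C1: latency requirement
            and 0.0 <= pro <= 1.0      # C2: server-selection probability
            and 0.0 <= alpha <= 1.0    # C3: offloading ratio
            and 0.0 <= f <= f_max)     # C4: local CPU frequency
```

With \(M = 5\), `server_from_prob(0.3, 5)` returns server 2, matching the example above.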
Based on this setting, DRL agents are introduced to solve this dynamic offloading problem. Centralized decision-making DRL requires the BS to collect the environment state from all IoT devices, which would increase the communication overhead. Therefore, this paper aims at obtaining promising computation offloading solutions in a multi-agent setting.
Since the network environment is non-stationary, i.e., other agents change their policies during the training process, the performance of traditional multi-agent DRL becomes unstable. To guarantee convergence, we design a cooperative multi-agent deep reinforcement learning framework, which leverages the strategy of centralized training and distributed execution by using locally executable actor networks and fully observable critic networks [31].
MDP Formulation
We model the task offloading optimization problem as a multi-agent Markov decision process (MDP). The multi-agent MDP can be denoted by a 4-tuple \((\mathcal {N}, S_n, A_n, r_n)\), where \(\mathcal {N}\) is the agent space, \(S_n\) is the state space of agent n, \(A_n\) is the action space of agent n, and \(r_n\) is the reward function of agent n.

Agent space \(\mathcal {N}\): \(\mathcal {N} = \{1,\ldots ,N\}\), where N is the number of IoT devices; each IoT device acts as an agent. For agent n, \(n = 1, 2, \ldots , N\), by determining the server selection probability \(Pro(m_n^t)\), the offloading ratio \(\alpha _n^t\), and the local computation capacity \(f_n^t\), the agent contributes to minimizing the total system cost.

State space \(S_n\): For IoT device n, the state \(\mathbf {s}_n^t\) is composed of the computation task, the channel gain between IoT device n and the BS at time slot t, and the computation capacities of all cloud servers in the multi-cloud system:
$$\begin{aligned} \mathbf {s}_n^t=\Big \{Tas_n^t;g_{n}^t;f_1^t,\cdots ,f_M^t\Big \} \end{aligned}$$
(12)

Action space \(A_n\): Since each IoT device is required to determine the probability of its selected cloud server \(Pro(m_n^t)\), the offloading ratio \(\alpha _n^t\), and the local computing capability \(f_n^t\), the action can be given by
$$\begin{aligned} \mathbf {a}_n^t=\Big \{Pro(m_n^t),\alpha _n^t,f_n^t\Big \} \end{aligned}$$
(13)
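As a small illustration, the observation of Eq. (12) and the action of Eq. (13) are plain concatenations of the quantities above; the helper names below are hypothetical:

```python
def build_state(task, channel_gain, cloud_freqs):
    """Eq. (12): the local observation of IoT device n at slot t, i.e. its
    task, the channel gain to the BS, and the capacities f_1^t, ..., f_M^t
    of all cloud servers, flattened into one vector."""
    return [task, channel_gain, *cloud_freqs]

def build_action(pro, alpha, f):
    """Eq. (13): the three continuous decision variables of device n."""
    return [pro, alpha, f]
```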

Reward function \(r_n\): To obtain a near-optimal policy for the task offloading optimization problem in Eq. (11), the agents should cooperate to minimize the total system cost. In other words, the reward function \(r_n^t\) is set to instruct the agent working at IoT device n to learn to make decisions that satisfy the constraints. The action is successful if the decision variables corresponding to the action do not violate any of the constraints defined in Eqs. (11a)-(11d); in this case, the reward is defined as the product of a constant \(C_1\) and the reciprocal of the weighted sum \(\omega _1 \cdot E_n^t+\omega _2 \cdot C_n^m (t)\). Otherwise, the reward is defined as a negative value, which represents a punishment, denoted by \(C_2\). The immediate reward obtained at each time slot t is expressed as
$$\begin{aligned} r_n(t)= \left\{ \begin{array}{ll} \frac{C_1}{\omega _1 \cdot E_n^t+\omega _2 \cdot C_n^m (t)} & \text {if } \mathbf {a}_n^t \text { is successful} \\ C_2 & \text {otherwise} \end{array}\right. \end{aligned}$$
(14)
where \(C_1\) is a positive constant and \(C_2\) is a negative constant. It is noted that, to maximize the expected discounted cumulative reward, a successful action whose multi-cloud offloading decision yields a lower total system cost leads to a higher immediate reward.
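A minimal sketch of the reward of Eq. (14); the constants `c1 = 10.0` and `c2 = -1.0` are illustrative placeholders, since the paper does not fix their values:

```python
def immediate_reward(cost: float, feasible: bool,
                     c1: float = 10.0, c2: float = -1.0) -> float:
    """Eq. (14): reciprocal-cost reward for a successful action, fixed
    punishment otherwise. `cost` is the weighted sum w1*E_n^t + w2*C_n^m(t)."""
    return c1 / cost if feasible else c2
```

Among successful actions, a lower total cost yields a strictly higher reward, which matches the remark above.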
Multi-agent DRL framework
Centralized decision-making DRL requires the BS to collect the environment state from all IoT devices and cloud servers. However, as the number of IoT devices or cloud servers increases, the communication overhead grows, and the state-action space may grow exponentially, resulting in poor convergence efficiency. To deal with these challenges, we aim at obtaining promising computation offloading solutions with a multi-agent DRL setting. However, traditional multi-agent DRL still hits the bottlenecks of overestimation and high variance. Considering the high-dimensional discrete-continuous action space, the TD3 algorithm is adopted to find an efficient selected-server probability \(Pro(m_n^t)\), offloading ratio \(\alpha _n^t\), and local computing capability \(f_n^t\) in dynamic multi-cloud environments.
In the multi-cloud offloading system, taking advantage of its local observations, each IoT device n determines its server selection \(Pro(m_n^t)\), task offloading ratio \(\alpha _n^t\), and local computation capacity \(f_n^t\). Thus, a twin delayed DDPG (TD3) agent is employed to learn distributed computation offloading policies by jointly optimizing the above three variables for each IoT device. This is referred to as the cooperative multi-agent TD3 (CMATD3) framework [32]. Figure 2 shows the framework of CMATD3 in the multi-cloud system. Following the centralized training and distributed execution strategy, each agent's actor network makes offloading decisions according to its local observation of the network state, and is trained in a proxy server located near the BS. The trained parameters are periodically synchronized to each agent's actor network. On the other hand, each agent's two-critic network, which has a global observation, i.e., the states and actions of all agents, is deployed in the proxy server. Therefore, from the perspective of each agent, the learning environment is stationary, regardless of any agent's policy changes.
The training stage of the CMATD3 agents is illustrated as follows. In each time slot t, for each agent n, the globally observable two-critic network in the proxy server is exploited to train the actor network, so as to obtain the computation offloading strategy. In addition, to stabilize the training process and improve the training effectiveness, the local experience transition \(\left(\mathbf {s}_n^t,\mathbf {a}_n^t,r_n^t,\mathbf {s}_n^{t+1}\right)\) of each agent n is stored in the experience replay buffer deployed in the proxy server, which concatenates the local experience transitions of all agents into a global experience replay buffer \(\mathcal {B}\), expressed as \((\mathbf {s}_t,\mathbf {a}_t,r_t,\mathbf {s}_{t+1})=\left(\mathbf {s}_1^t,\mathbf {a}_1^t,r_1^t,\mathbf {s}_1^{t+1};\cdots ;\mathbf {s}_n^t,\mathbf {a}_n^t,r_n^t,\mathbf {s}_n^{t+1};\cdots ;\mathbf {s}_N^t,\mathbf {a}_N^t,r_N^t,\mathbf {s}_N^{t+1}\right)\).
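The global buffer \(\mathcal {B}\) can be sketched as follows (pure Python; the class name and capacity are illustrative choices, not from the paper):

```python
import random
from collections import deque

class GlobalReplayBuffer:
    """Proxy-server buffer B: at each slot, the local transitions
    (s_n, a_n, r_n, s_n') of all N agents are concatenated into one
    global tuple, which is later sampled for centralized training."""

    def __init__(self, capacity: int = 100000):
        self.buf = deque(maxlen=capacity)  # drops oldest entries when full

    def store(self, per_agent_transitions):
        # per_agent_transitions: one (s_n, a_n, r_n, s_next_n) tuple per agent
        self.buf.append(tuple(per_agent_transitions))

    def sample(self, batch_size: int):
        # uniform mini-batch of global transitions
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))
```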
Then, for each agent \(n, n = 1, 2, \cdots , N\), the actor function is approximated by a DNN with parameters \(\theta ^{\mu _n}\) as \(\pi _n^\mu (\mathbf {s}_n)\), which takes the state \(\mathbf {s}_n\) as input. Besides, the Q-function of the two-critic network is approximated by two DNNs with parameters \(\theta ^{Q_n^i}\) as \(Q_n^{\theta _i}(\mathbf {s},\mathbf {a}\mid \theta ^{Q_n^i})\), \(i = 1, 2\), which take the global state \(\mathbf {s}=(\mathbf {s}_1,\cdots ,\mathbf {s}_N)\) and the action set \(\mathbf {a}=(\mathbf {a}_1,\cdots ,\mathbf {a}_N)\) as input. During the training process, each agent randomly samples a mini-batch \(\{\mathbf {s}_j,\mathbf {a}_j,r_j,\mathbf {s}_j^{\prime }\}_{j=1}^{I}\) from the global experience replay buffer \(\mathcal {B}\). The policy gradient of the evaluation actor network can be derived as
$$\begin{aligned} \nabla _{\theta ^{\mu _n}} J(\theta ^{\mu _n}) \approx \mathbb {E} \left[\nabla _{\theta ^{\mu _n}}\pi _n^{\mu } (\mathbf {s}_j)\cdot \nabla _{a_n}Q_n^{\theta _1}(\mathbf {s},\mathbf {a}\mid \theta ^{Q_n^1})\big |_{a_n=\pi _n^\mu (\mathbf {s}_n)}\right] \end{aligned}$$
(15)
In addition, to avoid overfitting to narrow peaks of the Q-value, the target action \(\mathbf {a}_j^{\prime }\) is defined as \(\mathbf {a}_j^{\prime }=\pi _n^{\mu ^{\prime }}(\mathbf {s}_j^{\prime })+\mathbb {N}\), where \(\mathbb {N} \sim clip(\mathfrak {N}(0,\breve{\sigma }^2),-1,1)\) is clipped noise with mean 0 and standard deviation \(\breve{\sigma }\) added to the output of the target actor network. This noise helps TD3 achieve smoother state-action estimation. Based on the target policy smoothing scheme above, the target value \(y_j\) is defined as
$$\begin{aligned} y_j = r_j + \gamma\ \underset{i=1,2}{\min }\ Q_n^{\theta _i^{\prime }}\left(\mathbf {s}_j^{\prime },\mathbf {a}_j^{\prime }\mid \theta ^{Q_n^{i^{\prime }}}\right) \end{aligned}$$
(16)
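A sketch of the target computation, combining the smoothing scheme with Eq. (16); the clipping range [-1, 1] follows the text, while the callable interfaces and the defaults `gamma = 0.99`, `sigma = 0.2` are illustrative assumptions:

```python
import random

def target_value(r_j, s_next, target_actor, target_critics,
                 gamma=0.99, sigma=0.2):
    """Target policy smoothing plus Eq. (16): a' = pi'(s') + clipped
    Gaussian noise, then y_j = r_j + gamma * min(Q1', Q2')."""
    noise = max(-1.0, min(1.0, random.gauss(0.0, sigma)))  # clip(N(0, s^2), -1, 1)
    a_next = [a + noise for a in target_actor(s_next)]
    return r_j + gamma * min(q(s_next, a_next) for q in target_critics)
```

Setting `sigma=0.0` disables the smoothing noise and makes the computation deterministic.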
Then, as mentioned above, the two Q-functions \(Q_n^{\theta _1} (\mathbf {s}_j,\mathbf {a}_j)\) and \(Q_n^{\theta _2} (\mathbf {s}_j,\mathbf {a}_j)\) are concurrently obtained from the two-critic network. The weight parameters \(\theta ^{Q_n^i}\) of \(Q_n^{\theta _i} (\mathbf {s}_j,\mathbf {a}_j)\), \(i = 1, 2\), are updated by minimizing the loss function \(L(\theta ^{Q_n^i})\), given as
$$\begin{aligned} L(\theta ^{Q_n^i}) \approx \mathbb {E} \left[y_j - Q_n^{\theta _i}(\mathbf {s}_j,\mathbf {a}_j)\right]^2,\ i=1,2. \end{aligned}$$
(17)
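The loss of Eq. (17), with the expectation replaced by a mini-batch average, can be sketched as follows (the argument layout is our own assumption):

```python
def critic_loss(batch, q_i, targets):
    """Eq. (17): mean squared TD error of one critic over a mini-batch;
    `batch` holds the sampled (s_j, a_j) pairs and `targets` the y_j values."""
    return sum((y - q_i(s, a)) ** 2
               for (s, a), y in zip(batch, targets)) / len(batch)
```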
Next, based on Eq. (15) and Eq. (17), letting \(\lambda\) be the learning rate, the weights of the evaluation actor network and the two evaluation critic networks are updated by
$$\begin{aligned} \theta ^{\mu _n}\leftarrow & {} \theta ^{\mu _n} - \lambda \nabla _{\theta ^{\mu _n}} J(\theta ^{\mu _n}) \nonumber \\ \theta ^{Q_n^i}\leftarrow & {} \theta ^{Q_n^i} - \lambda \nabla _{\theta ^{Q_n^i}} L(\theta ^{Q_n^i}),\ i=1,2. \end{aligned}$$
(18)
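Element-wise, the update of Eq. (18) is plain gradient descent; a minimal sketch (in practice the gradients come from backpropagation through the networks):

```python
def gradient_step(weights, grads, lr):
    """Eq. (18): theta <- theta - lambda * grad, applied element-wise;
    `lr` plays the role of the learning rate lambda."""
    return [w - lr * g for w, g in zip(weights, grads)]
```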
In addition, to reduce the temporal difference (TD) error, each agent updates the evaluation actor network's weights at a lower frequency: each IoT device updates the evaluation actor network every \(\Gamma\) time slots.
Finally, aiming at stabilizing the training process, each agent copies the weights of the corresponding evaluation networks to update the weights of the target actor network and the target two-critic network. Thus, the weights of the target actor network and the target two-critic network are obtained as
$$\begin{aligned} \theta ^{\mu _n^{\prime }}= & {} \eta \theta ^{\mu _n} + (1-\eta ) \theta ^{\mu _n^{\prime }} \nonumber \\ {\theta ^{Q_n^{i^{\prime }}}}= & {} \eta \theta ^{Q_n^i} + (1-\eta ) \theta ^{Q_n^{i^{\prime }}},\ i=1,2. \end{aligned}$$
(19)
where \(\eta\) is the updating rate.
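The soft update of Eq. (19) can be sketched per weight; `eta = 0.005` is an illustrative updating rate, not a value from the paper:

```python
def soft_update(eval_weights, target_weights, eta=0.005):
    """Eq. (19): theta' <- eta * theta + (1 - eta) * theta', applied
    element-wise to actor or critic parameters."""
    return [eta * w + (1.0 - eta) * wt
            for w, wt in zip(eval_weights, target_weights)]
```

A small `eta` keeps the target networks close to their previous values, which stabilizes the bootstrapped targets of Eq. (16).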
The time complexity of Algorithm 1 mainly depends on the number of IoT devices, as well as the structure of the neural networks executing the actor and two-critic network of each TD3 agent. For each TD3 agent, we assume that the numbers of fully connected layers of the actor network and the two-critic network are L and 2J, respectively. Thus, the time complexity can be calculated as
$$\begin{aligned}&N\cdot \left(2\sum \limits _{l=0}^{L} u_{A,l} \cdot u_{A,l+1} + 4\sum \limits _{j=0}^{J} u_{C,j} \cdot u_{C,j+1}\right) \nonumber \\&\quad =\mathcal {O} \left(N \cdot \left(\sum \limits _{l=0}^{L} u_{A,l} \cdot u_{A,l+1} + \sum \limits _{j=0}^{J} u_{C,j} \cdot u_{C,j+1}\right)\right) \end{aligned}$$
(20)
where N is the number of agents in the multi-device, multi-cloud environment, \(u_{A,l}\) stands for the number of units in layer l of the actor network, and \(u_{C,j}\) represents the number of units in layer j of the two-critic network. Note that \(u_{A,0}\) and \(u_{C,0}\) equal the input sizes of the actor network and the two-critic network, respectively.
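The left-hand side of Eq. (20) can be evaluated directly from the layer widths; a small sketch (the function name and example widths are illustrative):

```python
def td3_step_cost(N, actor_units, critic_units):
    """Multiply count of the left-hand side of Eq. (20) for N agents:
    `actor_units` lists the widths u_{A,l} of consecutive actor layers and
    `critic_units` the widths u_{C,j} of one critic branch."""
    actor = sum(a * b for a, b in zip(actor_units, actor_units[1:]))
    critic = sum(a * b for a, b in zip(critic_units, critic_units[1:]))
    return N * (2 * actor + 4 * critic)
```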