Robust-PAC time-critical workflow offloading in edge-to-cloud continuum among heterogeneous resources
Journal of Cloud Computing volume 12, Article number: 58 (2023)
Abstract
The edge-to-cloud continuum connects and extends computation from the edge side, via the network, to cloud platforms, where diverse workflows move back and forth and are executed on scheduled computation resources. To better utilize the computation resources on all sides, workflow offloading problems have been investigated lately. Most works focus on optimizing under constraints such as latency requirements, resource utilization rate limits, and energy consumption bounds. However, the dynamics of the offloading environment have hardly been researched, which easily results in uncertain Quality of Service (QoS) on the user side. A change in workload, resource availability, or network latency can each introduce dynamics into an offloading environment. In this work, we propose a robust PAC (probably approximately correct) offloading algorithm to address this dynamic issue together with optimization. We train an LSTM-based sequence-to-sequence neural network to learn how to offload workflows in the edge-to-cloud continuum. Comprehensive implementations and corresponding comparisons against state-of-the-art methods demonstrate the robustness of our proposed algorithm. More specifically, our algorithm achieves better offloading performance in dynamic heterogeneous offloading environments and faster adaptation to newly changed environments than fine-tuned state-of-the-art RL-based offloading methods.
Introduction
Wide use of the edge-to-cloud continuum promotes a novel paradigm that empowers intelligent and diverse applications in daily life: intelligent transportation, intelligent homes, and E-Healthcare. However, such a paradigm also brings new challenges: growing computation requirements on the user side, increasing data transmission, and continuous interactive computation and communication. With this trend, task offloading is a widely used approach to better utilize the diverse computation resources on both the edge side and the cloud side, which together form an extended computation pipeline, the edge-to-cloud continuum. With the popularity of the edge-to-cloud continuum, how to offload workflows properly matters in many contexts: energy consumption, latency control, and QoS. Moreover, with the evolution of the cellular network [1], the overall number of end-users is increasing dramatically [2, 3]. With the rapid development on the sides of both users and service suppliers, offloading gains importance in a more heterogeneous environment where nodes have diverse capacities. Execution becomes more complicated with more resource options. Optimization on the edge side must take many aspects into account at the same time, such as execution capability and execution time, which often conflict with each other.
To address this NP-hard problem, many works have been proposed [4,5,6]. Among them, machine-learning-based approaches, especially Reinforcement-Learning (RL)-based approaches, have been investigated extensively. Liu, et al. [6] proposed a robust scheduling framework for independent tasks. Liu, et al. [7] proposed a multi-objective optimization framework for time-critical task scheduling. There have also been many works addressing heterogeneity in the offloading environment [8, 9]. Chen, et al. [10] propose an end-edge-cloud architecture of vehicles for task computation offloading, which considers three task computing methods. For the dynamically changing environment in the Internet of Vehicles (IoV), they adopt an Asynchronous Advantage Actor-Critic (A3C) based computation offloading algorithm to solve the problem and seek optimal offloading decisions. Chen, et al. [11] develop a distributed multi-hop task offloading decision model for task execution efficiency, which consists of two parts: 1) a candidate vehicle selection mechanism for screening the neighboring vehicles that can participate in offloading and 2) a task offloading decision algorithm for obtaining the task offloading solution. Wei, et al. [12] improve the non-dominated sorting genetic algorithm II (NSGA-II) by modifying the initial population according to the matching factor, dynamic crossover probability, and mutation probability to promote excellent individuals and increase population diversity. As workflows consist of tasks and their dependencies, when the tasks come with time-critical constraints the workflows also need to take these constraints into account. Therefore, when we optimize offloading policies, we also need to meet the time-critical or latency requirements of those workflows [13, 14].
However, after reviewing recent related work, we find that the robustness of offloading performance has rarely been addressed in a dynamic edge-to-cloud continuum environment with heterogeneous resources. The robustness of offloading performance refers to the stability of the performance measurements in a dynamic environment. The absence of robustness results in offloading performance deviation, which brings uncertainty to latency. Furthermore, the uncertain latency degrades the QoS and can even end in a violation of the Service Level Agreement (SLA). In our work, we propose a Meta-PAC (probably approximately correct) Reinforcement-Learning-based robust offloading algorithm (MLRLCDRLO) to address this issue in a heterogeneous environment. The main contributions of this paper include:

1. Workflow offloading in the heterogeneous environment: we build a heterogeneous environment to investigate workflow offloading.

2. Time-critical workflow offloading: we design a PAC Reinforcement-Learning scheme to learn the offloading policy. The learning process has a maximum exploration limit, which is based on workflow latency. In this way it offers an offloading latency guarantee and makes the learning process more efficient.

3. Robust workflow offloading: we propose a Meta-Learning-based offloading algorithm, achieving more robust offloading performance compared with typical RL-based offloading approaches.
In the remainder of this paper, we first give the general formulation of offloading in the Problem formulation section, followed by the Related work section, where we go through the related work. Then we propose the detailed framework and algorithm MLRLCDRLO in the Methodology section. Next, we evaluate the robustness and optimization performance with comprehensive implementations in the Evaluation section. We further discuss the implementation results and future work plans in the Discussion section. Finally, the Conclusion section summarizes the whole paper.
Problem formulation
We formulate offloading in a typical use case, as shown in Fig. 2. The workflow, including the requests and corresponding dependencies, first goes to the local scheduler. The local scheduler then decides whether to compute each request locally or offload it to the MEC host. Between the MEC host and the end users, the MEC network connects the two parts through the uplink and the downlink. If the decision is to offload the request to the MEC host, the request is transmitted to the MEC host, where the offloading orchestrator allocates it to one of the VMs through gateways. In this work, the resource composition of the VMs on the MEC host side is heterogeneous.
After presenting the typical offloading pipeline, we formulate each part of the pipeline step by step, starting with the workflow model. Workflows consist of tasks and their dependencies. Here we define the workflow model as \(\mathcal {D}=(TA,\overrightarrow{ED})\), where TA represents the task set and the vector \(\overrightarrow{ED}\) represents the dependencies, which are described as directed edges connecting the tasks. We take \(\overrightarrow{ed}=(ta_i,ta_j)\) as an example, where \(\overrightarrow{ed}\) denotes the dependency between task \(ta_i\) and task \(ta_j\), meaning that \(ta_j\) is an immediate successor task of \(ta_i\). We also formulate several principles of workflow models as follows:

1. As shown in Fig. 1, for two connected tasks, for example A and B, the one that starts its execution earlier (A) is the leading task and the other one (B) is the successor task.

2. The execution of a successor task starts only after the execution of all of its leading tasks has ended.

3. The tasks that have no successor tasks are the exit tasks.
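The workflow model and its three principles can be illustrated with a small sketch. The class name `Workflow`, its method names, and the four-task example are hypothetical and serve only as an illustration; tasks are plain identifiers and each dependency is a `(leading, successor)` pair.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """Workflow D = (TA, ED): a task set plus directed dependency edges."""
    tasks: list                                  # task identifiers
    edges: list = field(default_factory=list)    # (leading, successor) pairs

    def successors(self, task):
        return [j for (i, j) in self.edges if i == task]

    def predecessors(self, task):
        return [i for (i, j) in self.edges if j == task]

    def exit_tasks(self):
        # Principle 3: tasks without any successor are exit tasks.
        return [t for t in self.tasks if not self.successors(t)]

    def topological_order(self):
        # Kahn's algorithm: a task is released only after all of its
        # leading tasks have been emitted (principle 2).
        indegree = {t: len(self.predecessors(t)) for t in self.tasks}
        ready = [t for t in self.tasks if indegree[t] == 0]
        order = []
        while ready:
            task = ready.pop(0)
            order.append(task)
            for succ in self.successors(task):
                indegree[succ] -= 1
                if indegree[succ] == 0:
                    ready.append(succ)
        if len(order) != len(self.tasks):
            raise ValueError("dependencies contain a cycle")
        return order


# Hypothetical diamond-shaped workflow: A leads B and C, which both lead D.
wf = Workflow(tasks=["A", "B", "C", "D"],
              edges=[("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")])
print(wf.exit_tasks(), wf.topological_order())
```

Any order returned by `topological_order` is a valid sequence in which offloading decisions can be taken without violating a dependency.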
For different use cases and applications, VMs and containers are becoming more diverse, which is where the offloading heterogeneity comes from. Based on the formulation of the workload, the heterogeneity of the environment comes from the heterogeneous resource composition of each VM. Here we define \(\xi\) types of VMs, whose computation capacities are represented as \(Cap_{l}, l\in [1,2,3,...,\xi ]\). Each task \(ta_i\) carries the following information: the resource requirement for running the task, \(Cp_i\), the sent data size, \(Da_i^s\), and the received result data size, \(Da_i^r\). After formulating the task model and the VMs, we turn to the MEC model, which consists of the wireless uplink channel transmission rate, UT, and the downlink channel transmission rate, DT. Based on this formulation, the latency of task \(ta_i\) sending data, \(Lat_i^{U}\), is calculated as:
$$\begin{aligned} Lat_i^{U}=\frac{Da_i^s}{UT}, \end{aligned}$$(1)
the latency of getting executed on the MEC host, \(Ex_i^s\), on a VM of type l, is calculated as:
$$\begin{aligned} Ex_i^s=\frac{Cp_i}{Cap_l}, \end{aligned}$$(2)
and the latency of receiving the result data, \(Lat_i^{D}\), is calculated as:
$$\begin{aligned} Lat_i^{D}=\frac{Da_i^r}{DT}. \end{aligned}$$(3)
When a task \(ta_i\) gets scheduled to be executed locally, the latency is just the time spent on local execution on the end-user side, which is calculated as \(Cp_i/Cap_{Lo}\),
where \(Cap_{Lo}\) represents the computational capacity of the end-user.
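As a minimal sketch of these latency terms, the code below assumes that each latency is the ratio of a workload to the corresponding rate or capacity. The function names and all numbers are hypothetical and chosen only for illustration.

```python
def uplink_latency(sent_size, uplink_rate):
    """Lat_i^U: time to send the task's input data over the uplink (rate UT)."""
    return sent_size / uplink_rate


def remote_exec_latency(requirement, vm_capacity):
    """Ex_i^s: execution time on a VM whose capacity is Cap_l."""
    return requirement / vm_capacity


def downlink_latency(result_size, downlink_rate):
    """Lat_i^D: time to receive the result data over the downlink (rate DT)."""
    return result_size / downlink_rate


def local_exec_latency(requirement, local_capacity):
    """Local execution time on the end-user device (capacity Cap_Lo)."""
    return requirement / local_capacity


# Hypothetical task and environment (illustrative values, arbitrary units).
Cp, Da_s, Da_r = 8.0, 4.0, 1.0       # requirement, sent size, result size
UT, DT = 2.0, 4.0                    # uplink / downlink rates
vm_caps = {1: 4.0, 2: 16.0}          # two heterogeneous VM types
cap_local = 1.0

local = local_exec_latency(Cp, cap_local)
offloaded = {
    l: uplink_latency(Da_s, UT) + remote_exec_latency(Cp, cap) + downlink_latency(Da_r, DT)
    for l, cap in vm_caps.items()
}
print(local, offloaded)
```

With these illustrative numbers, offloading to either VM type is faster than local execution, and the gap between the two VM types comes only from their capacities.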
Once a task \(ta_i\) gets offloaded to the MEC host, the total latency is the sum of the latency from all parts, which includes local processing, uplink transmission, remote processing, and result transmission latency, as shown in Fig. 2. Based on the aforementioned model, we further formulate the offloading policy as \(Pol_{1:n}=\{a_1,a_2,...,a_n\}\), where \(a_i\) represents the offloading decision for task \(ta_i\).
The finishing time of the process on the uplink channel, \(\mathcal {T}_i^{U}\), is defined as:
The finishing time of \(ta_i\)'s execution on the MEC host, \(FT_i^s\), and the finishing time of its process on the downlink channel, \(FT_i^{D}\), are defined as:
The completion time of task \(ta_i\) on the end user side, \(FT_i^{UE}\), is defined as:
Overall, given an offloading policy \(Pol_{1:n}\), the total latency of a DAG, \(Lat_{A_{1:n}}^c\), is defined as:
where \(\mathcal {K}\) denotes the exit task set, which consists of the tasks that have no successor tasks. In the Methodology section, we propose the detailed offloading algorithm based on this model formulation (Table 1).
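The role of the exit task set can be illustrated with a simplified sketch. It assumes that a task starts as soon as its last leading task finishes, and it ignores contention on the channels and VMs, so it is a critical-path approximation rather than the full finishing-time model. The function `total_latency`, the action encoding (0 for local, l for VM type l), and the numbers are hypothetical.

```python
def total_latency(tasks, edges, policy, vm_caps, cap_local, UT, DT):
    """Latest finish time over the exit tasks of a workflow.

    tasks:   {task: (Cp, Da_s, Da_r)}
    edges:   list of (leading, successor) pairs
    policy:  {task: action}, 0 = local, l >= 1 = offload to VM type l
    """
    preds = {t: [] for t in tasks}
    has_successor = set()
    for lead, succ in edges:
        preds[succ].append(lead)
        has_successor.add(lead)

    def task_latency(t):
        cp, da_s, da_r = tasks[t]
        action = policy[t]
        if action == 0:
            return cp / cap_local
        # uplink + remote execution + downlink
        return da_s / UT + cp / vm_caps[action] + da_r / DT

    finish = {}

    def finish_time(t):
        # A task starts once its last leading task has finished.
        if t not in finish:
            start = max((finish_time(p) for p in preds[t]), default=0.0)
            finish[t] = start + task_latency(t)
        return finish[t]

    exit_tasks = [t for t in tasks if t not in has_successor]
    return max(finish_time(t) for t in exit_tasks)


# Hypothetical diamond workflow and environment.
tasks = {"A": (4.0, 2.0, 1.0), "B": (8.0, 4.0, 1.0),
         "C": (2.0, 2.0, 2.0), "D": (4.0, 4.0, 2.0)}
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
env = dict(vm_caps={1: 4.0, 2: 16.0}, cap_local=2.0, UT=2.0, DT=4.0)

mixed = total_latency(tasks, edges, {"A": 0, "B": 2, "C": 0, "D": 1}, **env)
all_local = total_latency(tasks, edges, {t: 0 for t in tasks}, **env)
print(mixed, all_local)
```

In this toy setting the all-local plan is slightly faster than the mixed plan, which shows that transmission cost can outweigh a faster VM and that the offloading decision has to be made per task.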
Related work
Learning-Based Offloading Han, et al. [15] propose a deep reinforcement learning-based approach to offloading decision-making in mobile edge computing. Min, et al. [16] proposed a deep RL-based offloading scheme enabling the IoT device to optimize the offloading policy without knowledge of the MEC model, the energy consumption model, and the computation latency model. Dinh, et al. [17] proposed a model-free reinforcement learning offloading mechanism which helps mobile users (MUs) learn their long-term offloading strategies to maximize their long-term utilities. Cheng, et al. [18] propose a deep reinforcement learning-based computing offloading approach to learn the optimal offloading policy on-the-fly, adopting the policy gradient method to handle the large action space and the actor-critic method to accelerate the learning process. Some works also adopted LSTM networks to predict the environment state [19]. Meta-Learning has also been investigated [20] to offer a fast adaptive offloading method, MRLCO. Cao et al. proposed a novel multi-agent DRL-based approach [21], which adopts actor-critic neural networks to calculate the Q-value based on the corresponding reward function. The DPM framework proposed by [22] applied the long short-term memory (LSTM) neural network to investigate the prediction and strategies of resource allocation with the objective of reducing energy consumption in the cloud-edge continuum.
Shan et al. integrated DRL and Federated Learning to optimize resource allocation problems, which accelerates the training of DRL agents. Lolos et al. proposed a novel full-model-based RL approach [23] for elastic resource management that employs adaptive state space partitioning.
Resource Heterogeneity Guan, et al. [24] propose a novel hybrid offloading model to solve the heterogeneous resource-constrained offloading issues in the Cloudlet, concerning the offloading energy and execution efficiency. Li, et al. [25] propose a task offloading strategy in the MEC system with a heterogeneous edge and present an architecture for the MEC system that considers the execution and transmission of tasks under the task offloading strategy. Xiong, et al. [26] propose an intelligent task offloading framework in heterogeneous vehicular networks with three Vehicle-to-Everything (V2X) communication technologies, namely Dedicated Short Range Communication (DSRC), cellular-based V2X (C-V2X) communication, and millimeter wave (mmWave) communication.
However, despite the growing attention paid to offloading, several issues are still missing: an accurate robust solution for when dynamics occur in the offloading environment, and a robust recovery solution after the performance deviation brought by the dynamics. During the past ten to twenty years, the cloud-edge continuum has been further investigated and many new topics have attracted attention. Among those topics, offloading, as an essential part of the cloud-edge continuum, has been studied [27]. Many offloading solutions have been investigated and proposed from different perspectives: hierarchical methods [28], collaborative optimization methods [29], and energy-efficient methods [30]. The optimization performance of the conventional approaches often comes from explicit models of different resources or workflows and the corresponding offloading policy models, sometimes even of a very specific system. With their increasing popularity, Machine-Learning-based optimization solutions have also attracted research attention [4, 5] in the context of offloading.
Among Machine-Learning-based approaches, Reinforcement-Learning-based approaches [5, 6, 17, 31] optimize offloading interactively without requiring data labelling. However, the performance of the aforementioned approaches depends on, and is easily influenced by, the dynamics of each component of the MEC pipeline: the resource availability, the request pattern, and the data transmission latency. Thus, any change in those parts could lead to performance deviation for those approaches, which requires repeating the pruning process, or the training process in the case of learning-based solutions. From the robustness perspective, a higher deviation means lower robustness of the offloading performance. Some works address this issue from the robustness perspective: adaptive optimization [20], connection stability [32], and robust network contention [33]. However, compared with throughput or energy consumption, offloading robustness in heterogeneous resource environments has not been well addressed. In the next section, we will formulate our approach step by step.
PAC-RL: Fiechter [34] first proposed the PAC RL framework, and algorithms with sample complexity \(O((SAH^3/\varepsilon ^2)\log (1/\delta ))\) have been developed [35, 36], which are minimax-optimal in time-inhomogeneous MDPs [37]. These algorithms combine a well-chosen halting rule with an optimistic sampling rule. Most optimistic sampling strategies have been presented for regret minimization, where the policy \(\pi ^t\) is the greedy policy with respect to an upper confidence bound on the optimal Q function. Specifically, for episodic MDPs this is achieved by the UCBVI method of Azar et al. [38] (with Bernstein bonuses). Instance-dependent upper bounds on the regret of optimistic algorithms have been presented in recent publications [39,40,41]. An instance-dependent bound contains a complexity term that depends on the MDP instance, generally through the notion of a sub-optimality gap. In particular, Wagenmaker et al. [42] showed that optimistic no-regret sampling procedures cannot attain the instance-optimal rate for PAC identification. The basic idea is that an ideal PAC RL algorithm must visit each state-action pair at least a specific number of times, necessitating the use of exploration strategies that cover the whole MDP in the fewest possible episodes. A regret-minimizer, on the other hand, concentrates on playing high-reward policies that, depending on the MDP instance, may be arbitrarily inefficient at reaching remote states.
Methodology
In this section we elaborate on the approach we propose, MLRLCDRLO, in detail. We start with the formulation of latency-critical PAC-RL.
Latency-critical Probably Approximately Correct (PAC) reinforcement learning
With the conventional Reinforcement Learning setup, there is rarely an upper bound on the exploration process, which can lead the optimization in undesired directions and waste training time. So here we first formulate this upper bound to limit the training time and control the accuracy more precisely. When there exist dynamics in the training environment, after every dynamic disturbance the learning process needs to optimize the offloading policies from scratch again. During this process, a specific upper bound on the exploration saves training time and offers better accuracy, and guidance during the transition process also saves retraining time. To this end, we propose a probably approximately correct RL-based offloading algorithm, which offers an upper bound on the exploration process:
$$\begin{aligned} \tilde{O}\left( \frac{n_{S}^2\times n_{A}}{{\varepsilon }^3(1-\gamma )^{6}}\right) , \end{aligned}$$(10)
where \(n_{S}\) denotes the number of states, \(n_{A}\) represents the number of actions, \(\varepsilon\) is the accuracy parameter, and \(\gamma\) is the discount factor. The proof follows. In our latency-critical PAC reinforcement-learning formulation, we take \(\mathcal {M}\) as a finite Markov Decision Process (MDP), denoted as a tuple \((\mathcal {S}, \mathcal {A}, \mathcal {T}, \mathcal {R}, \gamma )\). Within \(\mathcal {M}\), \(\mathcal {S}\) is the state set, \(\mathcal {A}\) is the set of actions available in each state, \(\mathcal {T}: \mathcal {S}\times \mathcal {A} \xrightarrow []{}\Lambda _{\mathcal {S}}\) is the transition distribution, \(\mathcal {R}\) is the reward distribution, and \(\gamma \in [0, 1)\) is the reward discount factor. \(\mathcal {T}(s^{'}\mid s,a)\) indicates the probability of the transition from state s to state \(s^{'}\) under the distribution \(\mathcal {T}(s,a)\). A timestep is defined as a single interaction between the learner and the environment. Each interaction is described as a state-action pair (s, a), meaning that the learner takes the specific action a from the state s. We use R(s, a) to denote the expected reward under the reward distribution \(\mathcal {R}(s,a)\). During the learning process, the learner receives a reward \(r\sim \mathcal {R}(s,a)\) when it takes action a at state s and then transitions to the next state \(s^{'}\sim \mathcal {T}(s,a)\). By repeating this process, the learner tries to accumulate as much reward as possible within as few attempts as possible. A policy is any strategy the learner follows for choosing actions. A stationary policy is a policy that produces an action based only on the current state, without considering previous interaction experiences.
For policy \(\pi\), the discounted, infinite-horizon value function from state s is formulated as follows:
where H, a positive integer, represents the number of steps, and \(V_{\mathcal {M}}^{\pi }(s,H)\) indicates the value accumulated over H steps under policy \(\pi\), starting from state s. Specifically, let \(s_t\) and \(r_t\) be the \(t^{th}\) encountered state and received reward, respectively, resulting from execution of policy \(\pi\) in MDP \(\mathcal {M}\). Here we define the policy model \(\pi\) as non-stationary, considering the dependencies among tasks. We define \(c=(s_1, a_1, r_1, s_2, a_2, r_2, ...)\) as a learning path of the learning algorithm \(\mathcal {A}\). In this manner, at time t the state \(s_t\) is described by the sequence of state-action experiences \(c_t= (s_1, a_1, r_1,...,s_t)\). Then we derive the expected value functions as follows:
where the expectations are taken over all possible paths the learner might follow. The optimal policy is denoted as \(\pi ^{*}\) and has value functions \(V_{\mathcal {M}}^{*}(s)\) and \(Q_{\mathcal {M}}^*(s,a)\).
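For orientation, the standard textbook forms of the discounted infinite-horizon value and the H-step value can be written as below. This is a sketch under the usual conventions (indexing rewards from the current timestep t), not necessarily the exact notation intended above.

```latex
V_{\mathcal{M}}^{\pi}(s)
  = \mathbb{E}\left[\sum_{j=0}^{\infty}\gamma^{j} r_{t+j}\;\middle|\; s_t=s,\ \pi\right],
\qquad
V_{\mathcal{M}}^{\pi}(s,H)
  = \mathbb{E}\left[\sum_{j=0}^{H-1}\gamma^{j} r_{t+j}\;\middle|\; s_t=s,\ \pi\right]
```

If rewards lie in \([0,1]\), the two differ by at most \(\gamma ^{H}/(1-\gamma )\), which is why a finite horizon H is enough to approximate the infinite-horizon value to any desired accuracy.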
Based on the primary definitions, we further define several properties used in the PAC-MDP setup:
Definition of Sample Complexity of Exploration (Kakade 2003) Given an MDP \(\mathcal {M}\), a learning algorithm \(\mathcal {A}\) within \(\mathcal {M}\), and any fixed \(\varepsilon >0\), the sample complexity of exploration of \(\mathcal {A}\) is the number of timesteps t such that the policy at time t, \(\mathcal {A}_t\), satisfies:
$$\begin{aligned} V_{\mathcal {M}}^{\mathcal {A}_t}(s_t)<V_{\mathcal {M}}^{*}(s_t)-\varepsilon . \end{aligned}$$
Definition of Efficient PAC-MDP Given an MDP \(\mathcal {M}\) (here we refer to the MDP formulated above) and a learning algorithm \(\mathcal {A}\) within \(\mathcal {M}\), \(\mathcal {A}\) is an efficient PAC-MDP (Probably Approximately Correct in Markov Decision Processes) algorithm when, given \(\varepsilon >0\) and \(0<\sigma <1\), the per-timestep computational complexity, space complexity, and sample complexity of \(\mathcal {A}\) are less than some polynomial of \((S,A, 1/\varepsilon , 1/\sigma , 1/(1-\gamma ))\), with probability greater than \(1-\sigma\). \(\mathcal {A}\) is PAC-MDP when the definition is relaxed to drop the computational complexity requirement.
Definition of Admissible Heuristics Given an MDP \(\mathcal {M}\) and a learning algorithm \(\mathcal {A}\) within \(\mathcal {M}\), we define a function:
$$\begin{aligned} U:\mathcal {S}\times \mathcal {A}\rightarrow \mathbb {R}; \end{aligned}$$
it is an admissible heuristic when it satisfies:
$$\begin{aligned} U(s,a)\ge Q_{\mathcal {M}}^{*}(s,a) \end{aligned}$$
for all \(s\in \mathcal {S}\) and \(a\in \mathcal {A}\).
We also assume that \(U(s,a)\le V_{max}\) for all \((s,a)\in \mathcal {S}\times \mathcal {A}\) and some quantity \(V_{max}\). We set:
$$\begin{aligned} U(s,a)=V_{max}=\frac{1}{1-\gamma }, \end{aligned}$$
since we have \(V^{*}(s)=\max _{a\in \mathcal {A}}Q^*(s,a)\), which is at most \(1/(1-\gamma )\). Therefore, without loss of generality, we assume
$$\begin{aligned} 0\le U(s,a)\le V_{max}\le \frac{1}{1-\gamma } \end{aligned}$$
for all \((s,a)\in \mathcal {S}\times \mathcal {A}\).
We assume that after each disturbance caused by the dynamics, and before training converges again, the offloading policy is an admissible heuristic. Suppose that the learner has experienced some state-action pair (s, a). We define n(s, a) as the number of times the learner has taken action a from state s. Let the rewards received for taking action a at state s be \(r[1],r[2],...,r[n(s,a)]\). Then, the empirical mean reward is:
$$\begin{aligned} \hat{R}(s,a):=\frac{1}{n(s,a)}\sum _{i=1}^{n(s,a)}r[i]. \end{aligned}$$
After taking an action, the learner changes the environment through this interaction. Let \(n(s,a,s^{'})\) be the number of times the learner has taken action a from state s and immediately transitioned to the state \(s^{'}\). The empirical transition distribution \(\hat{T}(s,a)\) then satisfies:
$$\begin{aligned} \hat{T}(s^{'}\mid s,a):=\frac{n(s,a,s^{'})}{n(s,a)}\quad \text {for each } s^{'}\in \mathcal {S}. \end{aligned}$$
The objective of the learner through the learning process is to maximize the current action value \(Q(s,\cdot )\) by choosing specific actions, here offloading strategies, and applying them to the environment. The update step is to solve the following set of Bellman equations:
$$\begin{aligned} Q(s,a)=\left\{ \begin{array}{ll} \hat{R}(s,a)+\gamma \sum _{s^{'}}\hat{T}(s^{'}\mid s,a)\max _{a^{'}}Q(s^{'},a^{'}), &{} \text {if } n(s,a)\ge m,\\ U(s,a), &{} \text {otherwise,} \end{array}\right. \end{aligned}$$(22)
where \(\hat{R}(s,a)\) denotes the maximum-likelihood estimate of the reward and \(\hat{T}(\cdot \mid s,a)\) that of the transition distribution of the state-action pair (s, a). The computation of \(\hat{R}(s,a)\) and \(\hat{T}(s^{'}\mid s,a)\) in Eq. 22 uses only the first \(n(s,a)=m\) samples; that is, they are built from the first m observations of (s, a). So during the transition process, for state-action pairs with fewer than m samples, instead of modeling them we assert their value to be U(s, a), which is guaranteed to be an upper bound on the true value function as formulated above. To simplify the notation, we redefine n(s, a) to be the minimum of m and the number of times the state-action pair (s, a) has been experienced.
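The model estimates and the optimistic Bellman update can be sketched with NumPy as follows. This is an illustrative reading in which a pair with at least m samples is backed up through its empirical model and every other pair keeps the value U(s, a); the function name, the array layout, and the toy numbers are assumptions, and the counts are assumed to be already truncated to the first m samples.

```python
import numpy as np


def optimistic_value_iteration(counts, reward_sums, U, m, gamma, iters):
    """Value iteration on the empirical model with optimistic defaults.

    counts[s, a, s2]  : number of observed transitions s --a--> s2
    reward_sums[s, a] : sum of the rewards observed for (s, a)
    U[s, a]           : admissible heuristic (upper bound on Q*)
    """
    n_sa = counts.sum(axis=2)                   # n(s, a)
    known = n_sa >= m                           # pairs with enough samples
    safe_n = np.maximum(n_sa, 1)                # avoid division by zero
    R_hat = reward_sums / safe_n                # empirical mean reward
    T_hat = counts / safe_n[:, :, None]         # empirical transition probs
    Q = U.astype(float).copy()
    for _ in range(iters):
        V = Q.max(axis=1)                       # V(s) = max_a Q(s, a)
        backup = R_hat + gamma * (T_hat @ V)    # Bellman backup, shape (S, A)
        Q = np.where(known, backup, U)          # unknown pairs stay at U
    return Q


# Toy problem: 2 states, 2 actions; only (s=0, a=0) has been sampled m times.
gamma, m = 0.5, 2
counts = np.zeros((2, 2, 2))
counts[0, 0, 0] = 2                             # (0, 0) led back to state 0 twice
reward_sums = np.zeros((2, 2))
reward_sums[0, 0] = 1.0                         # two rewards of 0.5
U = np.full((2, 2), 1.0 / (1.0 - gamma))        # uninformed heuristic 1/(1-gamma)

Q = optimistic_value_iteration(counts, reward_sums, U, m, gamma, iters=50)
print(Q)
```

Because the untried action in state 0 keeps its optimistic value, a greedy learner is driven to try it next, which is how the upper bound U(s, a) steers exploration.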
Proof
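Let \(Q_i(s,a)\) denote the action-value estimates after the \(i^{th}\) iteration of value iteration, and let \(V_i(s)=\max _{a}Q_i(s,a)\). We also have:
$$\begin{aligned} \xi _i:=\max _{(s,a)}\left| Q^{*}(s,a)-Q_i(s,a)\right| . \end{aligned}$$(23)
Then we have:
$$\begin{aligned} \xi _i&=\max _{(s,a)}\left| \Big (R(s,a)+\gamma \sum _{s^{'}}T(s^{'}\mid s,a)V^{*}(s^{'})\Big )-\Big (R(s,a)+\gamma \sum _{s^{'}}T(s^{'}\mid s,a)V_{i-1}(s^{'})\Big )\right| \nonumber \\&=\max _{(s,a)}\left| \gamma \sum _{s^{'}}T(s^{'}\mid s,a)\big (V^{*}(s^{'})-V_{i-1}(s^{'})\big )\right| \le \gamma \xi _{i-1}. \end{aligned}$$(24)
From this bound and the fact that \(\xi _0 \le 1/(1-\gamma )\) we get \(\xi _i \le \gamma ^{i}/(1-\gamma )\). Setting this value to be at most \(\beta\) and solving for i yields \(i\ge \frac{\ln (\beta (1-\gamma ))}{\ln \gamma }\). We claim that:
$$\begin{aligned} \frac{\ln \frac{1}{\beta (1-\gamma )}}{1-\gamma }\ge \frac{\ln (\beta (1-\gamma ))}{\ln \gamma }. \end{aligned}$$(25)
Note that (25) is equivalent to the statement \(1-\gamma \le -\ln \gamma\), which follows from the identity \(e^{x}\ge 1+x\). \(\square\)
Given the previous setup and assumptions, for an efficient PAC-RL algorithm to obtain action values within \(\beta\) of the optimal ones, it is sufficient to run value iteration for the following number of iterations:
$$\begin{aligned} \left\lceil \frac{\ln \frac{1}{\beta (1-\gamma )}}{1-\gamma }\right\rceil . \end{aligned}$$(26)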
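A quick numeric check of this argument is sketched below. It assumes the iteration count takes the usual closed form \(\lceil \ln (1/(\beta (1-\gamma )))/(1-\gamma )\rceil\) and compares it with the exact threshold obtained from \(\gamma ^{i}/(1-\gamma )\le \beta\); the function names and the values of \(\beta\) and \(\gamma\) are illustrative.

```python
import math


def iterations_needed(beta, gamma):
    """Closed-form iteration count that guarantees an error of at most beta."""
    return math.ceil(math.log(1.0 / (beta * (1.0 - gamma))) / (1.0 - gamma))


def exact_threshold(beta, gamma):
    """Smallest real i with gamma**i / (1 - gamma) <= beta."""
    return math.log(beta * (1.0 - gamma)) / math.log(gamma)


def error_bound(i, gamma):
    """Upper bound on the value-iteration error after i iterations."""
    return gamma ** i / (1.0 - gamma)


beta, gamma = 0.01, 0.9
i = iterations_needed(beta, gamma)
print(i, exact_threshold(beta, gamma), error_bound(i, gamma))
```

The closed form is slightly conservative: it never undershoots the exact threshold, and it avoids the logarithm of \(\gamma\) in the denominator, which makes the dependence on \(1/(1-\gamma )\) explicit.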
The real-valued parameter \(\varepsilon _1\) specifies the desired closeness to optimality of the policies produced by value iteration. Based on this, we derive m and \(\varepsilon _1\) in terms of the other parameters, \(\varepsilon , \sigma , S, A, \gamma\), in the context of the theoretical guarantees on learning efficiency.
First we give explicit definitions of m and \(\varepsilon _1\) and of some internal parameters used during the learning process:

1. \(\varepsilon _1\in (0,1)\) is a constant added to the value estimate as an exploration bonus.

2. m is the number of experiences of a state-action pair collected before performing an update.

3. l(s, a) denotes the number of samples collected for (s, a).

4. AU(s, a) represents the running sum of target values used to update Q(s, a) once the learning agent has collected enough samples.

5. b(s, a) denotes the timestep at which the first experience of (s, a) was collected for the latest or ongoing update attempt.

6. \(FLG(s,a)\in \{0,1\}\) is the binary sampling flag: 1, collect samples for (s, a); 0, do not collect samples for (s, a).
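Update Rules Formulation At time t, suppose the learning agent has collected the latest m experiences of (s, a), with next states \((s_{k_{1}},s_{k_{2}},...,s_{k_{m}})\) observed at times \(k_{1}<k_{2}<...<k_{m}\), where \(k_{m}=t\). The \(i^{th}\) received reward is denoted as \(r_i\). Thus, we can describe the update rule for the learning agent taking action a from state s as follows:
$$\begin{aligned} Q_{t+1}(s,a)=\frac{1}{m}\sum _{i=1}^{m}\left( r_i+\gamma V_{k_i}(s_{k_i})\right) +\varepsilon _1, \end{aligned}$$(27)
where \(V_{k_i}(s)=\max _{a}Q_{k_i}(s,a)\). An update is performed only if the following condition holds:
$$\begin{aligned} Q_{t}(s,a)-\frac{1}{m}\sum _{i=1}^{m}\left( r_i+\gamma V_{k_i}(s_{k_i})\right) \ge 2\varepsilon _1. \end{aligned}$$(28)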
To simplify the calculation, the learning agent only attempts updates of (s, a) when FLG(s, a) is 1 (true), which limits the number of update attempts to a finite number. FLG(s, a) is set to true in two cases: first, at initialization; second, whenever any state-action pair is updated. FLG(s, a) is turned from true to false when no updates are made during a period in which (s, a) is experienced m times and the next attempted update of (s, a) fails. In this way, no more attempted updates of (s, a) are allowed until another action-value estimate is updated.
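The bookkeeping with l, AU, b, and FLG can be sketched as a small tabular learner. The concrete update target (the mean of m targets plus \(\varepsilon _1\)), the acceptance threshold \(2\varepsilon _1\), and the optimistic initial value \(1/(1-\gamma )\) are assumptions taken from the standard delayed Q-learning scheme; the class `DelayedQ` and the one-state toy problem are hypothetical.

```python
from collections import defaultdict


class DelayedQ:
    def __init__(self, actions, m, eps1, gamma):
        self.actions, self.m, self.eps1, self.gamma = actions, m, eps1, gamma
        q0 = 1.0 / (1.0 - gamma)               # optimistic initial value (assumed)
        self.Q = defaultdict(lambda: q0)       # Q(s, a)
        self.AU = defaultdict(float)           # running sum of update targets
        self.l = defaultdict(int)              # samples collected so far
        self.b = defaultdict(int)              # start time of the current attempt
        self.FLG = defaultdict(lambda: True)   # whether samples are collected
        self.t_star = 0                        # time of the most recent update
        self.t = 0

    def value(self, s):
        return max(self.Q[(s, a)] for a in self.actions)

    def act(self, s):
        return max(self.actions, key=lambda a: self.Q[(s, a)])

    def observe(self, s, a, r, s_next):
        """Process one experience; return True if Q(s, a) was updated."""
        self.t += 1
        key = (s, a)
        if self.b[key] <= self.t_star:         # some estimate changed since then
            self.FLG[key] = True
        if not self.FLG[key]:
            return False
        if self.l[key] == 0:
            self.b[key] = self.t               # first sample of a new attempt
        self.l[key] += 1
        self.AU[key] += r + self.gamma * self.value(s_next)
        updated = False
        if self.l[key] == self.m:              # enough samples: attempt update
            target = self.AU[key] / self.m
            if self.Q[key] - target >= 2 * self.eps1:
                self.Q[key] = target + self.eps1
                self.t_star = self.t
                updated = True
            elif self.b[key] > self.t_star:    # failed, and nothing else changed
                self.FLG[key] = False
            self.AU[key], self.l[key] = 0.0, 0
        return updated


# Toy check: one state, one action, reward always 0.
agent = DelayedQ(actions=["a"], m=2, eps1=0.1, gamma=0.5)
updates = sum(agent.observe("s", "a", 0.0, "s") for _ in range(20))
print(updates, agent.Q[("s", "a")], agent.FLG[("s", "a")])
```

In the toy run the estimate decreases in a few successful updates and then the flag switches off, so no further attempts are made until some other estimate changes; this is the mechanism that keeps the number of update attempts finite.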
As shown in Fig. 3, we describe the overall learning process step by step. In general, the learning agent samples m steps in an environment for exploration and then turns to the exploitation process. After finishing the learning process within one environment, the learning agent turns to another environment and repeats the same learning period. Once dynamics appear, the learning agent again samples just the first m samples in the new environment, performing exploration and exploitation within the upper bound \(\tilde{O}(n_{S}^2\times n_{A}/({\varepsilon }^3(1-\gamma )^{6}))\). In this way, the learning agent always keeps the learning process within the upper bound. Especially right after the dynamics occur, the fixed sample complexity and exploration upper bound help counter the influence of the newly changed environment.
Formulation of latency-critical PAC-RL
In this section, we take the formulation of the learning process one step further, to the formulation of the Reinforcement Learning and the Meta Learning parts. Based on the aforementioned MDP \(\mathcal {M}\), we formulate the RL part as follows:

1. State: The state information needed for a task \(ta_i\) during the offloading process includes the encoded DAG dependencies and the corresponding offloading plans. The detailed state definition is as follows:
$$\begin{aligned} \mathcal {S}:=\{s_i\mid s_i=(\mathcal {D}=(\overrightarrow{TA},\overrightarrow{ED}),Pol_{1:i})\},\ i\in [1,|\overrightarrow{TA}|], \end{aligned}$$(29)
where \(\mathcal {D}=(\overrightarrow{TA},\overrightarrow{ED})\) is a sequence of task embeddings and \(Pol_{1:i}\) is the offloading policy of the tasks scheduled before \(ta_i\). Based on the above definition and formulation, we define the offloading policy in terms of the per-task policies \(Pol(a_i\mid \mathcal {D}=(\overrightarrow{TA},\overrightarrow{ED}),A_{1:{i-1}})\) as follows:
$$\begin{aligned}&Pol(A_{1:n}\mid \mathcal {D}=(\overrightarrow{TA},\overrightarrow{ED}))\nonumber \\&=\prod \limits _{i=1}^{n} Pol(a_i\mid \mathcal {D}=(\overrightarrow{TA},\overrightarrow{ED}),A_{1:{i-1}}) \end{aligned}$$(30)

2. Action: The offloading choice for each task is a discrete value, which indicates execution locally or execution on one of the VMs with different resources. Combining the actions of all the tasks gives the action space \(\mathcal {A}\).

3. Reward: Throughout the learning process of offloading, minimizing the latency \(Lat_{A_{1:n}}^c\), defined in Eq. 9, is the primary objective. To achieve this, we formulate the reward as the estimated negative increment of the latency, calculated each time an offloading decision is taken for a task. The increment is defined as follows:
$$\begin{aligned} \Delta Lat_i^c=Lat_{A_{1:i}}^c - Lat_{A_{1:i-1}}^c, \end{aligned}$$(31)
so the reward for the decision on \(ta_i\) is \(-\Delta Lat_i^c\).
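A short sketch shows why the negative latency increment is a convenient per-step reward: the undiscounted rewards telescope to the negative total latency. The function name is hypothetical, the latency of the empty plan is assumed to be 0, and the partial latencies are illustrative numbers.

```python
def rewards_from_latencies(partial_latencies):
    """Per-decision rewards from the latencies of growing partial plans.

    partial_latencies[i - 1] is the latency after the first i decisions;
    the latency of the empty plan is assumed to be 0.
    """
    rewards, previous = [], 0.0
    for latency in partial_latencies:
        rewards.append(-(latency - previous))   # negative latency increment
        previous = latency
    return rewards


# Hypothetical latencies after deciding tasks 1..4 of a workflow.
partial = [2.0, 4.75, 4.75, 8.25]
rewards = rewards_from_latencies(partial)
print(rewards, sum(rewards))
```

A decision that does not lengthen the schedule, such as the third one here, earns a reward of zero, so maximizing the summed reward is the same as minimizing the final latency.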
A more detailed view of the offloading policy learning paradigm with the aforementioned three parts is shown in Fig. 4. In our proposed training paradigm, we build both the encoder and the decoder on recurrent neural networks (RNN) [43] to learn the dependencies among tasks. First we apply the task embedding, which is the input of the encoder. We define \(\mathcal {F}_{en}\) as the encoding function; the output of the encoder at each step, \(e_i\), is correspondingly formulated as:
To make sure the decoder learns from different parts of the source sequence without information loss, we apply the attention mechanism [44]. The output of the encoder is the input of the decoder, whose decoding function we define as \(\mathcal {F}_{de}\). The decoder outputs \(d_j\), from which we get the offloading policies for the workflows. The decoding process is as follows:
where \(c_j\) is the context vector at decoding step j and is computed as a weighted sum of the encoder outputs as follows:
The weight \(\alpha _{ji}\) of each encoder output \(e_i\) is computed by
where \(f(d_{j-1},e_i)\) is a score that measures how well the input at position i matches the output at position j. Regarding the structure of the neural network (NN), we adopt the sequence-to-sequence neural network [45], which is good at learning context information. The policy learned by the NN is formulated as \(Pol(a_j\mid s_j)\). The value function is formulated as \(v_{Pol}(s_j)\). The action \(a_j\) is determined based on the following calculation:
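The attention step can be sketched with NumPy as follows. The sketch assumes that the score \(f\) is a plain dot product and that the weights are obtained by a softmax over the scores; both are common choices and are assumptions here, as are the function name and the toy vectors.

```python
import numpy as np


def attention_context(d_prev, encoder_outputs):
    """Attention weights over the encoder outputs and the context vector c_j.

    d_prev          : previous decoder output d_{j-1}, shape (h,)
    encoder_outputs : encoder outputs e_1..e_n stacked row-wise, shape (n, h)
    """
    scores = encoder_outputs @ d_prev                # assumed score: dot product
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax, sums to 1
    context = weights @ encoder_outputs              # weighted sum of the e_i
    return weights, context


# Toy example: three encoder outputs of size 2.
encoder_outputs = np.array([[1.0, 0.0],
                            [0.0, 1.0],
                            [1.0, 1.0]])
d_prev = np.array([2.0, 0.0])

weights, context = attention_context(d_prev, encoder_outputs)
print(weights, context)
```

Encoder outputs that align with the previous decoder output receive larger weights, so the context vector emphasizes the tasks most relevant to the current decision.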
Formulation of MLRLCDRLO
Based on the aforementioned PAC-RL formulation, we then address robustness by integrating a meta-learning optimization part [46]. The meta-learning optimization has two training loops, an inner loop and an outer loop, which we elaborate on below. Overall, we define the objective function based on Proximal Policy Optimization (PPO) [47]:
where \(\pi _{\theta _i^o}\) is the sample policy, \(\theta _i^o\) is the parameter vector of the sample policy network, and \(\pi _{\theta _i}\) is the target policy, with \(\theta _i\) equal to \(\theta _i^o\) at the initial epoch. \(Pr_t\) is the probability ratio between the sample policy and the target policy, which is defined as
We also define a function \(slice_{1-\epsilon }^{1+\epsilon }(Pr_t)\) that limits the value of \(Pr_t\), removing the incentive for moving \(Pr_t\) outside the interval \([1-\epsilon ,1+\epsilon ]\).
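The clipping idea can be sketched in a few lines. The snippet assumes the standard clipped surrogate of PPO [47], with the ratio taken as target-policy probability over sample-policy probability; `slice_ratio`, `ppo_clipped_objective` and the default \(\epsilon = 0.2\) are illustrative choices, not values taken from the paper.

```python
from typing import Sequence


def slice_ratio(ratio: float, epsilon: float) -> float:
    """Limit the probability ratio Pr_t to [1 - epsilon, 1 + epsilon]."""
    return max(1.0 - epsilon, min(1.0 + epsilon, ratio))


def ppo_clipped_objective(target_probs: Sequence[float],
                          sample_probs: Sequence[float],
                          advantages: Sequence[float],
                          epsilon: float = 0.2) -> float:
    """Average of min(Pr_t * A_t, slice(Pr_t) * A_t) over all time steps.

    Assumption: Pr_t = pi_target(a_t | s_t) / pi_sample(a_t | s_t).
    """
    total = 0.0
    for p_target, p_sample, adv in zip(target_probs, sample_probs, advantages):
        ratio = p_target / p_sample
        total += min(ratio * adv, slice_ratio(ratio, epsilon) * adv)
    return total / len(advantages)


if __name__ == "__main__":
    # Step 1: ratio 1.5 with positive advantage is capped at 1.2.
    # Step 2: ratio 0.5 with negative advantage is penalized at 0.8.
    print(ppo_clipped_objective([0.6, 0.2], [0.4, 0.4], [1.0, -1.0]))
```

Taking the minimum of the clipped and unclipped terms makes the objective a pessimistic bound, so a large policy change is never rewarded beyond what the interval \([1-\epsilon ,1+\epsilon ]\) allows.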
We formulate our advantage function based on the generalized advantage estimator (GAE) [48]. The detailed formulation is as follows:
where \(\hat{A}_t\) denotes the advantage function value at time step t, and \(\lambda \in [0,1]\) controls the tradeoff between bias and variance.
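For readers unfamiliar with GAE, the following sketch computes \(\hat{A}_t\) with the usual backward recursion from [48]. The discount factor `gamma`, the bootstrap convention (one extra value estimate for the state after the last step) and the function name are assumptions for illustration.

```python
from typing import List, Sequence


def generalized_advantages(rewards: Sequence[float],
                           values: Sequence[float],
                           gamma: float = 0.99,
                           lam: float = 0.95) -> List[float]:
    """Compute A_t = sum_k (gamma * lam)^k * delta_{t+k}.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) is the TD residual.
    `values` must hold len(rewards) + 1 entries; the last one is the
    value estimate of the state reached after the final step.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values needs one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    # Rewards are negative latency increments, as in the reward definition.
    print(generalized_advantages([-1.0, -2.0, -3.0], [0.0, 0.0, 0.0, 0.0]))
```

With \(\lambda = 0\) the estimator reduces to the one-step TD residual (low variance, higher bias); with \(\lambda = 1\) it becomes the discounted return minus the value baseline (low bias, higher variance).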
Overall, we define the objective function for each inner layer task learning as:
where \(c_1\) is the coefficient of value function loss. The outer layer objective is expressed as:
where \(\theta _i^{'}=Up_{\tau \sim P_{\mathcal {T}_i}(\tau ,\theta _i)}(\theta _i,\mathcal {T}_i), \theta _i=\theta\). To save computation, we adopt a first-order approximation of the second-order derivatives, which is defined as follows:
where n is the number of learning tasks sampled in the outer loop, \(\alpha\) is the learning rate of the inner-loop training, and m is the number of gradient steps conducted in the inner-loop training.
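One common first-order form that uses exactly these three quantities (n, \(\alpha\), m) averages the normalized parameter displacement of each adapted task model. The sketch below assumes that form, \(g = \frac{1}{n}\sum_i (\theta _i' - \theta )/(\alpha m)\), and an outer step \(\theta \leftarrow \theta + \beta g\); both the formula and the outer learning rate `beta` are assumptions, since the defining equation is not reproduced in this text.

```python
from typing import List, Sequence


def first_order_meta_gradient(theta: Sequence[float],
                              adapted_thetas: Sequence[Sequence[float]],
                              alpha: float,
                              m: int) -> List[float]:
    """Average of (theta_i' - theta) / (alpha * m) over the n sampled tasks."""
    n = len(adapted_thetas)
    grad = [0.0] * len(theta)
    for theta_i in adapted_thetas:
        for k in range(len(theta)):
            grad[k] += (theta_i[k] - theta[k]) / (alpha * m)
    return [g / n for g in grad]


def meta_update(theta: Sequence[float],
                adapted_thetas: Sequence[Sequence[float]],
                alpha: float,
                m: int,
                beta: float) -> List[float]:
    """Move the meta parameters toward the adapted parameters (assumed sign)."""
    grad = first_order_meta_gradient(theta, adapted_thetas, alpha, m)
    return [t + beta * g for t, g in zip(theta, grad)]


if __name__ == "__main__":
    theta = [0.0, 0.0]
    adapted = [[1.0, 2.0], [3.0, 2.0]]  # theta_i' after inner-loop training
    print(meta_update(theta, adapted, alpha=0.5, m=2, beta=0.1))
```

No second derivatives appear: only the start and end points of each inner loop are needed, which is what makes the approximation cheap.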
Algorithm
This section describes the detailed process of the MLRLCDRLO algorithm, going through each part of the methodology formulated previously. As shown in Algorithm 1, the input includes the distribution over tasks and the learning rates of the outer and inner loops. The meta-policy neural network parameters are denoted as \(\theta\). We first sample a batch of learning tasks \(\mathcal {T}\) with batch size n and conduct inner-loop training for each sampled learning task. The inner-loop training is based on the PAC reinforcement learning formulated above. The first step is initialization: setting the initial parameters of the policy model and resetting the data set \(\mathcal {D}\). Next is the sampling step: based on the number of environments, N data trajectories are sampled from the distribution \(\Lambda\) according to the current policy model and added to the data set \(\mathcal {D}\). The subsequent inner-layer learning loop consists of PAC-RL-based learning processes, which sample data sets \({\tau }_H\) from \(\mathcal {D}\) to calculate the updated \(\theta '_H\) based on each loss function with PAC-RL. When the PAC-RL converges or reaches the upper bound of the exploration, the overall policy model is not updated by the inner-layer learning agent, unlike in conventional RL or other learning methods. After obtaining the updated \(\theta '_H\), the RL agent uses the \(\theta '_H\) model to sample new data samples \(\tau '_H\) from \(\mathcal {D}\). The algorithm then turns to the outer learning layer, where the meta learner uses \(\theta '_H\) to calculate the loss function based on \(\tau '_H\) and updates the overall policy model. In the next section we evaluate MLRLCDRLO's performance.
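The two-level control flow described above can be sketched on a toy problem. In this sketch the PAC-RL inner loop is replaced by plain gradient descent on a one-dimensional quadratic loss, and each "environment" is just a target value; these substitutions, the first-order outer step, and every name and hyperparameter are assumptions meant only to show how the inner loop leaves the meta parameters untouched while the outer loop updates them.

```python
import random
from typing import Sequence


def inner_loop(theta: float, center: float, alpha: float, m: int) -> float:
    """Adapt a copy of theta with m gradient steps on (theta - center)**2.

    Stand-in for the PAC-RL inner training; the caller's theta is unchanged.
    """
    for _ in range(m):
        theta = theta - alpha * 2.0 * (theta - center)
    return theta


def meta_train(task_centers: Sequence[float],
               theta: float = 0.0,
               alpha: float = 0.1,   # inner-loop learning rate
               beta: float = 0.1,    # outer-loop learning rate
               m: int = 3,           # inner gradient steps
               n: int = 4,           # tasks per batch
               iterations: int = 200,
               seed: int = 0) -> float:
    """Outer loop: sample n tasks, adapt on each, then update meta theta."""
    rng = random.Random(seed)
    for _ in range(iterations):
        batch = [rng.choice(task_centers) for _ in range(n)]
        adapted = [inner_loop(theta, c, alpha, m) for c in batch]
        # First-order meta gradient: averaged, normalized displacement.
        meta_grad = sum((a - theta) / (alpha * m) for a in adapted) / n
        theta = theta + beta * meta_grad
    return theta


if __name__ == "__main__":
    # Two toy environments whose optima are 1.0 and 3.0.
    print(meta_train([1.0, 3.0]))
```

The meta parameter settles between the optima of the sampled tasks, which is the property that lets a few inner steps reach any one of them quickly after an environment change.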
Evaluation
Evaluation Measurements
We define the measurements as follows:
Offloading Latency-Critical Measurements
We define several measurements to compare the experimental results and to investigate specific metrics. One group is related to the latency missing rate and offloading performance:
QoS-Latency-Critical Rate (QLCR) [6]: total percentage of executed tasks that meet the latency required by QoS.
Expected-Latency-Critical Rate (ELCR) [6]: total percentage of executed tasks that meet the expected latency. ELCR indicates how latency-critical each method is.
Necessary Training Iterations (NTI): the training iterations needed for the policy model to converge in an environment.
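Both rates are the share of executed tasks whose latency stays within a bound, so they can be computed by one helper, as in the sketch below. Treating "meet" as less than or equal to the bound is an assumption, and the 1000 ms QoS bound in the usage example is purely hypothetical; the 720 ms expected latency is the value used in the Results section.

```python
from typing import Sequence


def rate_within(latencies_ms: Sequence[float], bound_ms: float) -> float:
    """Percentage of executed tasks whose latency is at most bound_ms."""
    if not latencies_ms:
        raise ValueError("at least one executed task is required")
    met = sum(1 for latency in latencies_ms if latency <= bound_ms)
    return 100.0 * met / len(latencies_ms)


if __name__ == "__main__":
    latencies = [500.0, 700.0, 730.0, 900.0, 1100.0]  # illustrative values
    elcr = rate_within(latencies, 720.0)    # expected latency
    qlcr = rate_within(latencies, 1000.0)   # hypothetical QoS requirement
    print(elcr, qlcr)
```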
Offloading Robustness Measurements
Robustness measurements include the Dynamic Pressure Index (DPI), the Offloading Performance Deviation (OPD), and the Adaptation Steps and Data Usage for Performance Recovery (ASDUPR). They are formulated as follows. Dynamic Pressure Index (DPI): the indicator of the dynamic level of the current environment, including the portion of workload change and latency change. It is defined as follows:
where \(WOR_{before}\) and \(WOR_{after}\) denote the instant workload before and after the dynamics, respectively. DPI shows the pressure level that the dynamics currently place on the system. OPD is defined as:
where \(PER_{after}\) denotes the instant average offloading latency after the influence of the dynamics, and \(PER_{before}\) denotes the previously converged average offloading latency. Besides the instant performance deviation, ASDUPR is proposed to describe adaptation; it includes the time and data iterations needed for adaptation after a performance deviation incurred by dynamics:
where ITER denotes the number of iterations and \(t^o\) the time spent on each iteration.
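Since the defining equations are not reproduced in this text, the sketch below should be read only as one plausible instantiation of the three robustness measurements: DPI and OPD as relative changes with respect to the value before the dynamics, and ASDUPR as the number of iterations multiplied by the time per iteration. All three formulas and all function names are assumptions.

```python
def relative_change(before: float, after: float) -> float:
    """|after - before| / before, used for both DPI and OPD (assumed form)."""
    if before <= 0:
        raise ValueError("the reference value must be positive")
    return abs(after - before) / before


def dpi(wor_before: float, wor_after: float) -> float:
    """Dynamic Pressure Index from the instant workload (assumed form)."""
    return relative_change(wor_before, wor_after)


def opd(per_before: float, per_after: float) -> float:
    """Offloading Performance Deviation from average latency (assumed form)."""
    return relative_change(per_before, per_after)


def asdupr(iterations: int, time_per_iteration: float) -> float:
    """Adaptation cost as ITER * t^o (assumed form)."""
    return iterations * time_per_iteration


if __name__ == "__main__":
    print(dpi(100.0, 130.0))      # workload grows by 30 percent
    print(opd(600.0, 750.0))      # latency deviates by 25 percent
    print(asdupr(50, 0.2))        # 50 iterations of 0.2 time units each
```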
Based on the metrics defined above, we carry out a comprehensive evaluation to validate the robustness of MLRLCDRLO, as described in the following sections.
Set up
The configuration of the implementation consists of two parts: the configuration of the platform, shown in Table 2, and the configuration of the simulation model, shown in Table 3.
Simulation Environment: We consider a cellular network in which the data transmission rate varies with the UE position. The CPU clock speed of the UE, \(f_{UE}\), is set to 1 GHz. Each VM of the MEC host has four cores with a CPU clock speed of 2.5 GHz per core, so the CPU clock speed of a VM, \(f_{VM}\), is \(4\times 2.5=10\) GHz. We implement a synthetic DAG generator according to [20] based on four parameters: n, fat, density, and ccr, where n is the number of tasks, fat controls the width and height of the DAG, density determines the number of edges between two levels of the DAG, and ccr denotes the ratio between the communication and computation costs of tasks (Tables 4 and 5).
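To make the roles of the four parameters concrete, here is a small layered DAG generator. It is not the generator of [20]: the width rule (`fat * sqrt(n)`), the edge sampling, the cost ranges and all names are assumptions that merely respect the parameter descriptions given above.

```python
import math
import random
from typing import Dict, List, Tuple


def generate_dag(n: int, fat: float, density: float, ccr: float,
                 seed: int = 0) -> Tuple[List[List[int]],
                                         Dict[int, float],
                                         Dict[Tuple[int, int], float]]:
    """Return (levels, computation_cost, communication_cost).

    fat: larger values give wider, shallower graphs (assumed width rule).
    density: probability of an edge between nodes of adjacent levels.
    ccr: communication cost of an edge relative to its source's computation.
    """
    rng = random.Random(seed)
    max_width = max(1, round(fat * math.sqrt(n)))
    levels: List[List[int]] = []
    next_id = 0
    while next_id < n:
        size = min(n - next_id, rng.randint(1, max_width))
        levels.append(list(range(next_id, next_id + size)))
        next_id += size
    computation = {task: rng.uniform(1.0, 10.0) for task in range(n)}
    communication: Dict[Tuple[int, int], float] = {}
    for upper, lower in zip(levels, levels[1:]):
        for child in lower:
            parents = [p for p in upper if rng.random() < density]
            if not parents:  # keep every non-entry task reachable
                parents = [rng.choice(upper)]
            for parent in parents:
                communication[(parent, child)] = ccr * computation[parent]
    return levels, computation, communication


if __name__ == "__main__":
    levels, comp, comm = generate_dag(n=20, fat=0.5, density=0.6, ccr=0.4)
    print(len(levels), len(comm))
```

Because edges only go from one level to the next and task identifiers grow with the level, the result is acyclic by construction, and the identifier order is already a valid topological order for feeding tasks to a sequence model.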
Results
As shown in Table 6, we change the workload by the same share for every method to compare the latency-critical offloading performance of our MLRLCDRLO algorithm against the fine-tuned DQN, Double-DQN and CEM approaches on the same DAG data. More specifically, we add dynamics to the scheduling by increasing the workload for each method while keeping the same resource availability setup. We then assess the average latency rates of the scheduled tasks to compare the performance robustness of the proposed MLRLCDRLO offloading against the other methods. The best-performing item in each row is shown in bold. Overall, compared with the fine-tuned DQN, Double-DQN and CEM methods, MLRLCDRLO offers more stable latency-critical offloading performance after each dynamic change in the environment. More specifically, MLRLCDRLO outperforms the fine-tuned DQN, Double-DQN and CEM approaches in both latency rates and necessary training iterations. In terms of latency rate, on average more than \(95.33\%\pm 0.34\%\) of the tasks offloaded by MLRLCDRLO finish their execution with a latency shorter than the expected latency of 720 ms [20], whereas tasks offloaded by the other RL-based methods violate the latency requirement at an average rate ranging from \(8.5\%\) to \(20\%\). Moreover, when given heavier workflows (topology 2, n=30, UT=DT=5.5 Mbps), also shown in Table 6, MLRLCDRLO still offers more stable latency-critical offloading performance: a higher percentage of tasks finish execution with a latency lower than the expected one under dynamics from the environment. Overall, MLRLCDRLO outperforms the fine-tuned DQN, Double-DQN and CEM in performance stability.
Figs. 5 and 6 show the change of OPD and ASDUPR under different dynamics, illustrating the robustness of MLRLCDRLO in terms of offloading performance stability and the cost of recovering from a performance deviation. First, regarding offloading performance stability (Fig. 5), we increase the workload by the same percentage for all the offloading approaches. The performance deviation of our proposed MLRLCDRLO always remains within 55% across the different workload environments, and in some environments it is even under 25%. In contrast, the performance deviation of the fine-tuned DQN, Double-DQN and CEM approaches spans a much broader range, from 50% to beyond 300%, with the same portion of increased workload. Therefore, MLRLCDRLO outperforms the conventional RL-based offloading approaches in offloading performance stability. Second, regarding recovery from performance deviation, Fig. 6 compares the adaptation speed (that is, the performance recovery speed) discounted by the performance deviation portion, which balances adaptation speed and robustness. As shown, the adaptation of MLRLCDRLO is on average more than five times faster than that of the fine-tuned DQN, Double-DQN and CEM after each increase of workload, and at some points even more, which demonstrates its robustness to the dynamics of the environment.
Discussion
As shown in the Results section, compared with conventional RL-based approaches, our proposed offloading approach MLRLCDRLO shows advantages in offloading performance robustness and in recovery speed after dynamics in heterogeneous environments. More specifically, the offloading performance deviation and adaptation speed of MLRLCDRLO follow a stable pattern as the workload increases. When the dynamic change of the DPI is below 30%, both the offloading performance deviation and the adaptation speed increase with DPI; when it is in the range of 30%-50%, both decrease with DPI; when it is beyond 50%, both increase again. Overall, when the DPI is within 50%, MLRLCDRLO stays robust, with a performance deviation lower than 30%. When the DPI goes beyond 50%, the robustness starts to decrease, but the performance deviation of MLRLCDRLO still stays within 50%, much lower than that of the fine-tuned RL methods (more than 300%). One of our future work directions is to expand the robustness range against the dynamics, that is, to keep a low performance deviation over a wider range of DPI change. Another direction is to reduce the instant offloading performance deviation right after the DPI changes. Furthermore, by investigating exploration strategies of the RL framework, we could better control the training time and accuracy. We are currently investigating the exploration and exploitation accuracy of RL-based approaches.
Two main factors explain the advantages of our proposed method. The first is that meta-learning can leverage prior knowledge from previous tasks to improve learning on new tasks. In this way, prior offloading knowledge can be accumulated and transferred to the following phase. By analyzing patterns and relationships across multiple environments, a meta-learning model can identify commonalities and transfer knowledge from one environment to another. This transfer helps a model learn new offloading patterns more efficiently and effectively. Meta-learning can also help avoid overfitting to specific training data by learning a more generalizable learning strategy: by training on multiple tasks, the meta-learning model learns to generalize across tasks rather than fit specific examples. This leads to a more adaptive model that performs well on a wide range of tasks and data.
The second factor is the Probably Approximately Correct (PAC) framework, a framework in machine learning that aims to balance the accuracy of a model with the amount of data needed to achieve that accuracy. The PAC framework provides a way to measure the sample complexity of a learning algorithm, which is the number of training examples needed to achieve a certain level of accuracy. One advantage of the PAC framework is that it can lead to faster convergence of learning algorithms. The PAC framework is designed to ensure that a learning algorithm generalizes well from the training data to new, unseen data. To achieve this, it requires that the algorithm reach a certain level of accuracy with high probability, meaning that the algorithm should correctly classify most of the test examples with high confidence. This requirement ensures that the algorithm performs well on new data, even if it has not seen those examples during training.
In addition, the PAC framework provides a way to measure the sample complexity of a learning algorithm. This measure is based on the required level of accuracy and the confidence level, and provides a way to estimate the number of training examples needed to achieve the desired level of accuracy. This allows researchers to compare different learning algorithms and choose the one with the lowest sample complexity, which can lead to faster convergence and more efficient learning.
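A worked example helps show how accuracy and confidence translate into a number of samples. The sketch below uses the classic bound for a finite hypothesis class in the realizable supervised setting, \(m \ge \frac{1}{\epsilon }\left( \ln |H| + \ln \frac{1}{\delta }\right)\); this is a textbook PAC result chosen for illustration and is not the PAC-RL bound used by the algorithm in this paper. The function name is hypothetical.

```python
import math


def pac_sample_bound(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Samples sufficient for error <= epsilon with probability >= 1 - delta.

    Valid for a finite hypothesis class when some hypothesis fits the data
    perfectly (realizable case).
    """
    if hypothesis_count < 1 or not 0 < epsilon < 1 or not 0 < delta < 1:
        raise ValueError("need |H| >= 1 and epsilon, delta in (0, 1)")
    bound = (math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon
    return math.ceil(bound)


if __name__ == "__main__":
    # 1000 hypotheses, at most 10% error, 95% confidence.
    print(pac_sample_bound(1000, epsilon=0.1, delta=0.05))
```

The bound grows only logarithmically with the confidence requirement but linearly with the inverse of the accuracy requirement, which is why tightening \(\epsilon\) is much more expensive than tightening \(\delta\).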
As for the limitations of our work, one is the real-time adaptation efficiency. Currently, the algorithm is trained offline and then adapts to a new environment. We plan to integrate an online-offline switching scheme in the future to improve the real-time adaptation efficiency. Evaluating on more real-world data is also part of our future work.
Conclusion
In this work, we present MLRLCDRLO, a robust task scheduling framework that offers latency-guaranteed scheduling for time-critical tasks while improving the robustness of the schedule. We propose a meta-gradient robust reinforcement learning framework to quickly adapt a scheduling policy model to a newly changed environment, while using a PAC-based latency-critical RL scheme to maintain the latency guarantee. Experimental results show that our approach can provide the latency guarantee, outperforming fine-tuned RL methods. Furthermore, our MLRLCDRLO approach finishes adaptation in new environments using fewer training iterations, 2\(\times\) to 5\(\times\) faster than the fine-tuned RL approach, achieving better robustness while offering latency guarantees.
Availability of data and materials
Not applicable.
References
Wu L, Liu M, Wang XM, Chen Gh, Hg Gong (2011) Mobile distribution-aware data dissemination for vehicular ad hoc networks. Ruanjian Xuebao/J Softw 22(7):1580–1596
Pham QV, Fang F, Ha VN, Piran MJ, Le M, Le LB, Hwang WJ, Ding Z (2020) A survey of multi-access edge computing in 5g and beyond: Fundamentals, technology integration, and state-of-the-art. IEEE Access 8:116974–117017
Song C, Liu M, Cao J, Zheng Y, Gong H, Chen G (2009) Maximizing network lifetime based on transmission range adjustment in wireless sensor networks. Comput Commun 32(11):1316–1325
Yu S, Wang X, Langar R (2017) Computation offloading for mobile edge computing: A deep learning approach. In: 2017 IEEE 28th Annual International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC). IEEE, New York, pp 1–6
Wang J, Hu J, Min G, Zhan W, Ni Q, Georgalas N (2019) Computation offloading in multi-access edge computing using a deep sequential model based on reinforcement learning. IEEE Commun Mag 57(5):64–69
Liu H, Chen P, Zhao Z (2021) Towards a robust meta-reinforcement learning-based scheduling framework for time critical tasks in cloud environments. In: 2021 IEEE 14th International Conference on Cloud Computing (CLOUD). IEEE, pp 637–647
Liu H, Xin R, Chen P, Zhao Z (2022) Multi-objective robust workflow offloading in edge-to-cloud continuum. In: 2022 IEEE 15th International Conference on Cloud Computing (CLOUD). IEEE, pp 469–478
Singh S, Dhillon HS, Andrews JG (2013) Offloading in heterogeneous networks: Modeling, analysis, and design insights. IEEE Trans Wirel Commun 12(5):2484–2497
Zhang K, Mao Y, Leng S, Zhao Q, Li L, Peng X, Pan L, Maharjan S, Zhang Y (2016) Energy-efficient offloading for mobile edge computing in 5g heterogeneous networks. IEEE Access 4:5896–5907
Chen C, Li H, Li H, Fu R, Liu Y, Wan S (2022) Efficiency and fairness oriented dynamic task offloading in internet of vehicles. IEEE Trans Green Commun Netw 6(3):1481–1493. https://doi.org/10.1109/TGCN.2022.3167643
Chen C, Zeng Y, Li H, Liu Y, Wan S (2023) A multi-hop task offloading decision model in mec-enabled internet of vehicles. IEEE Internet Things J 10(4):3215–3230. https://doi.org/10.1109/JIOT.2022.3143529
Wei W, Yang R, Gu H, Zhao W, Chen C, Wan S (2022) Multi-objective optimization for resource allocation in vehicular cloud computing networks. IEEE Trans Intell Transp Syst 23(12):25536–25545. https://doi.org/10.1109/TITS.2021.3091321
Ye Y, Hu RQ, Lu G, Shi L (2020) Enhance latency-constrained computation in mec networks using uplink noma. IEEE Trans Commun 68(4):2409–2425
Feng J, Pei Q, Yu FR, Chu X, Shang B (2019) Computation offloading and resource allocation for wireless powered mobile edge computing with latency constraint. IEEE Wirel Commun Lett 8(5):1320–1323
Meng H, Chao D, Guo Q (2019) Deep reinforcement learning based task offloading algorithm for mobile-edge computing systems. In: Proceedings of the 2019 4th International Conference on Mathematics and Artificial Intelligence. Association for Computing Machinery, New York, pp 90–94
Min M, Xiao L, Chen Y, Cheng P, Wu D, Zhuang W (2019) Learning-based computation offloading for iot devices with energy harvesting. IEEE Trans Veh Technol 68(2):1930–1941
Dinh TQ, La QD, Quek TQ, Shin H (2018) Learning for computation offloading in mobile edge computing. IEEE Trans Commun 66(12):6353–6367
Cheng N, Lyu F, Quan W, Zhou C, He H, Shi W, Shen X (2019) Space/aerial-assisted computing offloading for iot applications: A learning-based approach. IEEE J Sel Areas Commun 37(5):1117–1129
Li M, Yu FR, Si P, Wu W, Zhang Y (2020) Resource optimization for delay-tolerant data in blockchain-enabled iot with edge computing: A deep reinforcement learning approach. IEEE Internet Things J 7(10):9399–9412
Wang J, Hu J, Min G, Zomaya AY, Georgalas N (2020) Fast adaptive task offloading in edge computing based on meta reinforcement learning. IEEE Trans Parallel Distrib Syst 32(1):242–253
Cao Z, Zhou P, Li R, Huang S, Wu D (2020) Multiagent deep reinforcement learning for joint multichannel access and task offloading of mobile-edge computing in industry 4.0. IEEE Internet Things J 7(7):6201–6213
Lu H, Gu C, Luo F, Ding W, Liu X (2020) Optimization of lightweight task offloading strategy for mobile edge computing based on deep reinforcement learning. Futur Gener Comput Syst 102:847–861
Lolos K, Konstantinou I, Kantere V, Koziris N (2017) Elastic management of cloud applications using adaptive reinforcement learning. In: 2017 IEEE International Conference on Big Data (Big Data). IEEE, pp 203–212
Guan S, Boukerche A, Loureiro A (2020) Novel sustainable and heterogeneous offloading management techniques in proactive cloudlets. IEEE Trans Sustain Comput 6(2):334–346
Li W, Jin S (2021) Performance evaluation and optimization of a task offloading strategy on the mobile edge computing with edge heterogeneity. J Supercomput 77(11):12486–12507
Xiong K, Leng S, Huang C, Yuen C, Guan YL (2020) Intelligent task offloading for heterogeneous v2x communications. IEEE Trans Intell Transp Syst 22(4):2226–2238
Mach P, Becvar Z (2017) Mobile edge computing: A survey on architecture and computation offloading. IEEE Commun Surv Tutor 19(3):1628–1656
Zhao Z, Zhao R, Xia J, Lei X, Li D, Yuen C, Fan L (2019) A novel framework of three-hierarchical offloading optimization for mec in industrial iot networks. IEEE Trans Ind Inform 16(8):5424–5434
Huang M, Liu W, Wang T, Liu A, Zhang S (2019) A cloud-mec collaborative task offloading scheme with service orchestration. IEEE Internet Things J 7(7):5792–5805
Yang X, Yu X, Huang H, Zhu H (2019) Energy efficiency based joint computation offloading and resource allocation in multi-access mec systems. IEEE Access 7:117054–117062
Chen X, Zhang H, Wu C, Mao S, Ji Y, Bennis M (2018) Optimized computation offloading performance in virtual edge computing systems via deep reinforcement learning. IEEE Internet Things J 6(3):4005–4018
Chen M, Guo S, Liu K, Liao X, Xiao B (2020) Robust computation offloading and resource scheduling in cloudlet-based mobile cloud computing. IEEE Trans Mob Comput 20(5):2025–2040
Hyytiä E, Spyropoulos T, Ott J (2015) Offload (only) the right jobs: Robust offloading using the markov decision processes. In: 2015 IEEE 16th international symposium on a world of wireless, mobile and multimedia networks (WoWMoM). IEEE, pp 1–9
Fiechter CN (1994) Efficient reinforcement learning. In: Proceedings of the seventh annual conference on Computational learning theory. pp 88–97
Dann C, Li L, Wei W, Brunskill E (2019) Policy certificates: Towards accountable reinforcement learning. In: International Conference on Machine Learning. PMLR, pp 1507–1516
Ménard P, Domingues OD, Jonsson A, Kaufmann E, Leurent E, Valko M (2021) Fast active learning for pure exploration in reinforcement learning. In: International Conference on Machine Learning. PMLR, pp 7599–7608
Domingues OD, Ménard P, Kaufmann E, Valko M (2021) Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited. In: Algorithmic Learning Theory. PMLR, pp 578–598
Azar MG, Osband I, Munos R (2017) Minimax regret bounds for reinforcement learning. In: International Conference on Machine Learning. PMLR, pp 263–272
Simchowitz M, Jamieson KG (2019) Non-asymptotic gap-dependent regret bounds for tabular mdps. Adv Neural Inf Process Syst 32:1153–1162
Xu H, Ma T, Du S (2021) Fine-grained gap-dependent bounds for tabular mdps via adaptive multi-step bootstrap. In: Conference on Learning Theory. PMLR, pp 4438–4472
Dann C, Marinov TV, Mohri M, Zimmert J (2021) Beyond value-function gaps: Improved instance-dependent regret bounds for episodic reinforcement learning. Adv Neural Inf Process Syst 34:1–12
Wagenmaker AJ, Simchowitz M, Jamieson K (2022) Beyond no regret: Instance-dependent pac reinforcement learning. In: Conference on Learning Theory. PMLR, pp 358–418
Zaremba W, Sutskever I, Vinyals O (2014) Recurrent neural network regularization. arXiv preprint arXiv:1409.2329
Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473
Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. Adv Neural Inf Process Syst 27
Finn C, Abbeel P, Levine S (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In: International conference on machine learning. PMLR, pp 1126–1135
Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347
Schulman J, Moritz P, Levine S, Jordan M, Abbeel P (2015) High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438
Arabnejad H, Barbosa JG (2013) List scheduling algorithm for heterogeneous systems by an optimistic cost table. IEEE Trans Parallel Distrib Syst 25(3):682–694
Funding
This work is funded by the European Union's Horizon 2020 projects: ARTICONF (Grant No. 825134), ENVRIFAIR project (Grant No. 824068), BLUECLOUD (Grant No. 862409), Bluecloud2026 (Grant No. 101094227) and LifeWatch ERIC, the Natural Science Foundation of Shaanxi (Grant No. 2022JQ651), China Scholarship Council, Science and Technology Program of Sichuan Province (Grant No. 2020YFG0326), and Talent Program of Xihua University (Grant No. Z202047).
Contributions
Hongyun Liu: Conceptualization, Methodology, Software, Writing - Original draft preparation. Ruyue Xin: Conceptualization, Methodology, Software, Writing - Original draft preparation. Peng Chen: Conceptualization, Methodology, Software, Writing, Data curation, Writing - Original draft preparation. Hui Gao: Methodology, Software, Writing, Data curation, Writing. Paola Grosso: Supervision, Writing - Reviewing and Editing. Zhiming Zhao: Supervision, Writing - Reviewing and Editing. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Liu, H., Xin, R., Chen, P. et al. Robust-PAC time-critical workflow offloading in edge-to-cloud continuum among heterogeneous resources. J Cloud Comp 12, 58 (2023). https://doi.org/10.1186/s13677-023-00434-6
DOI: https://doi.org/10.1186/s13677-023-00434-6