UvA-DARE (Digital Academic Repository) Robust-PAC time-critical workflow offloading in edge-to-cloud continuum among heterogeneous resources

The edge-to-cloud continuum connects and extends computation from the edge side via the network to cloud platforms, where diverse workflows move back and forth and are executed on scheduled computation resources. To better utilize the computation resources on all sides, workflow offloading problems have been investigated lately. Most works focus on optimizing under constraints such as latency requirements, resource utilization rate limits, and energy consumption bounds. However, the dynamics of the offloading environment have hardly been researched, which easily results in uncertain Quality of Service (QoS) on the user side. Any change in workload, resource availability, or network latency can introduce dynamics into an offloading environment. In this work, we propose a robust PAC (probably approximately correct) offloading algorithm to address this dynamic issue together with optimization. We train an LSTM-based sequence-to-sequence neural network to learn how to offload workflows in the edge-to-cloud continuum. Comprehensive implementations and corresponding comparisons against state-of-the-art methods demonstrate the robustness of our proposed algorithm. More specifically, our algorithm achieves better offloading performance in dynamic heterogeneous offloading environments and faster adaptation to newly changed environments than fine-tuned state-of-the-art RL-based offloading methods.


Introduction
Wide use of the edge-to-cloud continuum promotes a novel paradigm that empowers intelligent and diverse applications in our daily life: intelligent transportation, intelligent home, and E-Healthcare. However, such a paradigm also brings new challenges: growing computation requirements on the user side, increasing data transmission, and continuous interactive computation and communication. With this trend, task offloading is a widely used approach to better utilize the diverse computation resources on both the edge side and the cloud side, which together form an extended computation pipeline: the edge-to-cloud continuum. With the popularity of the edge-to-cloud continuum, how to offload workflows properly matters in many contexts: energy consumption, latency control, and QoS. Moreover, with the evolution of the cellular network [1], the overall number of end users is increasing dramatically [2,3]. With the rapid development of both users and service suppliers, offloading gains importance in a more heterogeneous environment where nodes have diverse capacities. Execution becomes more complicated as more resource options become available. Optimization on the edge side must take many aspects into account at the same time, such as execution capability and execution time, which often conflict with each other.
To address this NP-hard problem, many works have been proposed [4][5][6]. Among them, machine learning-based approaches, especially Reinforcement Learning (RL)-based approaches, have been investigated extensively: Liu, et al. [6] proposed a robust scheduling framework for independent tasks. Liu, et al. [7] proposed a multi-objective optimization framework for time-critical task scheduling. There have also been many works addressing heterogeneity in the offloading environment [8,9]. Chen, et al. [10] propose an end-edge-cloud architecture of vehicles for task computation offloading, which considers three task computing methods. For the dynamically changing environment in the IoV, they adopt an Asynchronous Advantage Actor-Critic (A3C)-based computation offloading algorithm to solve the problem and seek optimal offloading decisions. Chen, et al. [11] develop a distributed multihop task offloading decision model for task execution efficiency, which consists of two parts: 1) a candidate vehicle selection mechanism for screening the neighboring vehicles that can participate in offloading and 2) a task offloading decision algorithm for obtaining the task offloading solution. Wei, et al. [12] improve the nondominated sorting genetic algorithm II (NSGA-II) by modifying the initial population according to the matching factor, dynamic crossover probability, and mutation probability to promote excellent individuals and increase population diversity. As workflows consist of tasks and their dependencies, when the tasks come with time-critical constraints the workflows also need to take these constraints into account. Therefore, when we optimize offloading policies, we also need to meet the time-critical or latency requirements of those workflows [13,14].
However, after reviewing recent related work, we find that the robustness of offloading performance has rarely been addressed in a dynamic edge-to-cloud continuum environment with heterogeneous resources. The robustness of offloading performance refers to the stability of the performance measurements in a dynamic environment. The absence of robustness results in offloading performance deviation, which brings uncertainty to latency. Furthermore, uncertain latency degrades the QoS and can even end in violation of the Service Level Agreement (SLA). In our work, we propose a Meta-PAC (probably approximately correct)-Reinforcement-Learning-based robust offloading algorithm (MLR-LC-DRLO) to address this issue in a heterogeneous environment. The main contributions of this paper include: (1) Workflow offloading in the heterogeneous environment: we build a heterogeneous environment to investigate workflow offloading. (2) Time-critical workflow offloading: we design a PAC Reinforcement-Learning scheme to learn the offloading policy. The learning process has a maximum exploration limit, which is based on workflow latency. In this way it offers an offloading latency guarantee and makes the learning process more efficient. (3) Robust workflow offloading: we propose a Meta-Learning-based offloading algorithm that achieves more robust offloading performance than typical RL-based offloading approaches.
In the remainder of this paper, we first give the general formulation of offloading in the Problem formulation section. The Related work section then reviews related work. We present the detailed framework and the MLR-LC-DRLO algorithm in the Methodology section, and evaluate its robustness and optimization performance with comprehensive implementations in the Evaluation section. We further discuss the implementation results and plans for future work in the Discussion section. Finally, the Conclusion section summarizes the paper.

Problem formulation
We formulate offloading in a typical use case. As shown in Fig. 2, the workflow, including the requests and their dependencies, first goes to the local scheduler, which decides whether to compute each request locally or offload it to the MEC host.
Between the MEC host and the end users, the MEC network connects the two sides and includes the up-link and the down-link. If the decision is to offload a request, the request is transmitted to the MEC host, where the offloading orchestrator allocates it to one of the VMs through gateways. In this work, the resource composition of the VMs on the MEC host side is heterogeneous.
After presenting the typical offloading pipeline, we formulate each part of the pipeline step by step, starting with the workflow model. Workflows consist of tasks and their dependencies. We define the workflow model as $D = (TA, \overrightarrow{ED})$, where $TA$ is the task set and the vector $\overrightarrow{ED}$ represents the dependencies, described as directed edges between tasks. For example, $\overrightarrow{ed} = (ta_i, ta_j)$ denotes the dependency between task $ta_i$ and task $ta_j$, meaning that $ta_j$ is an immediate successor task of $ta_i$. We also formulate several principles of workflow models: (1) As shown in Fig. 1, of two connected tasks, for example A and B, the one that starts its execution earlier (A) is the leading task and the other one (B) is the successor task. (2) A successor task starts its execution only after the last of its leading tasks has finished. (3) Tasks that have no successor tasks are the exit tasks.
Across use cases and applications, VMs and containers are becoming more diverse, which is where offloading heterogeneity comes from. Given the workload formulation, the heterogeneity of the environment comes from the heterogeneous resource composition of the VMs. We define $\xi$ types of VMs, whose computation capacities are represented as $Cap_l$, $l \in \{1, 2, 3, \dots, \xi\}$. Each task $ta_i$ carries the following information: the resource requirement for running the task, $Cp_i$; the sent data size, $Da^s_i$; and the received result data size, $Da^r_i$. Having formulated the task model and the VMs, we turn to the MEC model, which consists of the wireless up-link channel transmission rate, $UT$, and the down-link channel transmission rate, $DT$. Based on this formulation, the latency of task $ta_i$ for sending data, $Lat^U_i$, for being executed on the MEC host, $Ex^s_i$, and for receiving the result data, $Lat^D_i$, are calculated as:

$$Lat^U_i = \frac{Da^s_i}{UT}, \qquad Ex^s_i = \frac{Cp_i}{Cap_l}, \qquad Lat^D_i = \frac{Da^r_i}{DT}. \tag{1}$$

When a task $ta_i$ is scheduled to be executed locally, the latency is just the time spent on local execution on the end-user side, which is calculated as $Cp_i / Cap_{Lo}$, where $Cap_{Lo}$ represents the computational capacity of the end user.
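The latency model above can be sketched in a few lines of Python. This is a minimal illustration, assuming each latency is a data size or workload divided by the corresponding rate or capacity; the class name `Task`, the function names, and the numeric values are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Task:
    cp: float    # Cp_i: computation required by the task (e.g. CPU cycles)
    da_s: float  # Da^s_i: size of the data sent to the MEC host (e.g. bits)
    da_r: float  # Da^r_i: size of the result data received back (e.g. bits)


def offload_latency(task: Task, ut: float, dt: float, cap_l: float) -> float:
    """Up-link + remote execution + down-link latency of one offloaded task."""
    lat_u = task.da_s / ut    # up-link transmission at rate UT
    ex_s = task.cp / cap_l    # execution on a VM with capacity Cap_l
    lat_d = task.da_r / dt    # down-link transmission at rate DT
    return lat_u + ex_s + lat_d


def local_latency(task: Task, cap_lo: float) -> float:
    """Latency when the task runs on the end-user device (capacity Cap_Lo)."""
    return task.cp / cap_lo


# Hypothetical task: 2e9 cycles, 8 Mbit sent, 4 Mbit received.
example = Task(cp=2e9, da_s=8e6, da_r=4e6)
remote = offload_latency(example, ut=8e6, dt=8e6, cap_l=10e9)
local = local_latency(example, cap_lo=1e9)
```

With these illustrative numbers, offloading is faster than local execution because the VM is ten times faster than the end-user device and the transfers are short.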
Once a task $ta_i$ is offloaded to the MEC host, its total latency is the sum of the latencies of all parts, which include local processing, up-link transmission, remote processing, and result transmission, as shown in Fig. 2. Based on the aforementioned model, we formulate the offloading policy as $Pol_{1:n} = (a_1, a_2, \dots, a_n)$, where $a_i$ represents the offloading decision for task $ta_i$.
For each task $ta_i$ we define four finishing times: the finishing time of the process on the up-link channel, $T^U_i$; the finishing time of $ta_i$'s execution on the MEC host, $FT^s_i$; the finishing time of its process on the down-link channel, $FT^D_i$; and the completion time of task $ta_i$ on the end-user side, $FT^{UE}_i$. Overall, given an offloading policy $Pol_{1:n}$, the total latency of a DAG, $Lat^c_{A_{1:n}}$, is defined as:

$$Lat^c_{A_{1:n}} = \max_{ta_k \in K} \max\left(FT^{UE}_k, FT^D_k\right), \tag{4}$$

where $K$ denotes the exit task set, which consists of the tasks that have no successor tasks. In the next section, we will propose the detailed offloading algorithm based on the model formulation (Table 1).
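To make the role of the exit tasks concrete, the following Python sketch computes the total latency of a small DAG under a given offloading policy. It is a simplified illustration rather than the paper's model: it assumes every resource is immediately available (no queueing on the channels or VMs), collapses the up-link, execution, and down-link stages of an offloaded task into one cost, and uses hypothetical names (`dag_latency`, `preds`, `policy`).

```python
from typing import Dict, List, Tuple


def dag_latency(
    order: List[str],
    preds: Dict[str, List[str]],
    local: Dict[str, float],
    remote: Dict[str, float],
    policy: Dict[str, int],
) -> Tuple[float, Dict[str, float]]:
    """Return (total latency, per-task finishing times).

    order  : tasks in a topological order of the DAG
    preds  : immediate leading tasks of each task
    local  : latency of each task when executed locally
    remote : latency of each task when offloaded (transfer + execution)
    policy : a_i per task, 0 = execute locally, 1 = offload
    """
    finish: Dict[str, float] = {}
    for task in order:
        # A successor starts only after all of its leading tasks finished.
        ready = max((finish[p] for p in preds.get(task, [])), default=0.0)
        cost = remote[task] if policy[task] == 1 else local[task]
        finish[task] = ready + cost
    # Exit tasks are the tasks that never appear as a leading task.
    has_successor = {p for task in order for p in preds.get(task, [])}
    exits = [task for task in order if task not in has_successor]
    return max(finish[task] for task in exits), finish


total, finishing = dag_latency(
    order=["A", "B", "C", "D"],
    preds={"B": ["A"], "C": ["A"], "D": ["B", "C"]},
    local={"A": 1.0, "B": 2.0, "C": 3.0, "D": 1.0},
    remote={"A": 0.5, "B": 1.0, "C": 1.0, "D": 2.0},
    policy={"A": 0, "B": 1, "C": 0, "D": 0},
)
```

In this toy DAG the exit task D must wait for the slower branch through C, so the total latency is determined by the critical path A, C, D.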

Related work
Learning-Based Offloading: Han, et al. [15] propose a deep reinforcement learning-based approach to offloading decision-making in mobile edge computing. Min, et al. [16] proposed a deep RL-based offloading scheme that enables an IoT device to optimize its offloading policy without knowledge of the MEC model, the energy consumption model, or the computation latency model. Dinh, et al. [17] proposed a model-free reinforcement learning offloading mechanism that helps MUs learn their long-term offloading strategies to maximize their long-term utilities. Cheng, et al. [18] propose a deep reinforcement learning-based computation offloading approach that learns the optimal offloading policy on the fly, adopting the policy gradient method to handle the large action space and the actor-critic method to accelerate the learning process. Some work also adopted LSTM networks to predict the environment state [19]. Meta-Learning has also been investigated [20] to offer a fast adaptive offloading method, MRLCO. Cao et al. proposed a novel multi-agent DRL-based approach [21], which adopts actor-critic neural networks to calculate the Q-value based on the corresponding reward function. The DPM framework proposed in [22] applied a long short-term memory (LSTM) neural network to investigate the prediction and strategies of resource allocation with the objective of reducing energy consumption in the cloud-edge continuum.
Resource Heterogeneity: Guan, et al. [24] propose a novel hybrid offloading model to solve heterogeneous resource-constrained offloading issues in the Cloudlet, concerning offloading energy and execution efficiency. Li, et al. [25] propose a task offloading strategy for MEC systems with heterogeneous edges that considers the execution and transmission of tasks, and present an architecture for the MEC system. Xiong, et al. [26] propose an intelligent task offloading framework in heterogeneous vehicular networks with three Vehicle-to-Everything (V2X) communication technologies, namely Dedicated Short Range Communication (DSRC), cellular-based V2X (C-V2X) communication, and millimeter wave (mmWave) communication. However, despite the growing attention paid to offloading, several issues remain open: the absence of an accurate robust solution when dynamics occur in the offloading environment, and the absence of a robust recovery solution after the performance deviation caused by the dynamics. During the past ten to twenty years, the cloud-edge continuum has been further investigated and many new topics have attracted attention. Among those topics, offloading, as an essential part of the cloud-edge continuum, has been studied [27]. Many offloading solutions have been investigated and proposed from different perspectives: hierarchical methods [28], collaborative optimization methods [29], and energy-efficient methods [30]. The optimization performance of conventional approaches often comes from explicit models of specific resources or workflows and the corresponding offloading policy models, sometimes even of a very specific system. With their increasing popularity, Machine Learning-based optimization solutions have also attracted research attention [4,5] in the context of offloading. Among Machine Learning-based approaches, Reinforcement Learning-based approaches [5,6,17,31] optimize offloading interactively without requiring data labelling. However, the performance of the aforementioned approaches depends on, and is easily influenced by, the dynamics of each component of the MEC pipeline: the resource availability, the request pattern, and the data transmission latency. Thus, any change in those parts can lead to performance deviation, which requires repeating the pruning process, or the training process in the case of learning-based solutions. From the robustness perspective, higher deviation means lower robustness of the offloading performance. Some work addresses this issue from the robustness perspective: adaptive optimization [20], connection stability [32], and robust network contention [33]. However, compared with throughput or energy consumption, offloading robustness in heterogeneous resource environments has not been well addressed. In the next section, we will formulate our approach step by step.
PAC-RL: Fiechter [34] first proposed the PAC RL framework, and algorithms with sample complexity $O((SAH^3/\varepsilon^2)\log(1/\delta))$ have been developed [35,36], which are minimax-optimal in time-inhomogeneous MDPs [37]. These algorithms combine a well-chosen halting rule with an optimistic sampling rule. Most optimistic sampling strategies have been presented for regret minimization, where the policy $\pi^t$ played at episode $t$ is the greedy policy with regard to an upper confidence bound on the optimal Q function. A specific example for episodic MDPs is the UCBVI method of Azar et al. [38] (with Bernstein bonuses). Instance-dependent upper bounds on the regret of optimistic algorithms have been presented in recent publications [39][40][41]. An instance-dependent bound contains a complexity term that depends on the MDP instance, generally through the notion of a sub-optimality gap. In particular, Wagenmaker et al. [42] showed that optimistic no-regret sampling procedures cannot attain the instance-optimal rate for PAC identification. The basic idea is that an ideal PAC RL algorithm must visit each state-action pair at least a specific number of times, which requires playing strategies that cover the whole MDP in the fewest possible episodes. A regret-minimizer, on the other hand, concentrates on high-reward strategies that, depending on the MDP instance, may be arbitrarily ineffective at reaching remote states.

Methodology
In this section we describe our proposed approach, MLR-LC-DRLO, in detail. We start with the formulation of latency-critical PAC-RL.

Latency-critical Probably Approximately Correct (PAC) reinforcement learning
In the conventional Reinforcement Learning setup, there is rarely an upper bound on the exploration process, so the optimization can drift in undesired directions and waste training time. We therefore first formulate such an upper bound to limit training time and control accuracy more precisely. When dynamics exist in the training environment, after every disturbance the learning process has to optimize the offloading policy from scratch again. During this process, a specific upper bound on exploration saves training time and offers better accuracy, and guidance during the transition can further reduce retraining time. To this end, we propose a probably approximately correct RL-based offloading algorithm, which offers an upper bound on the exploration process:

$$\tilde{O}\left(\frac{n_S^2 \, n_A}{\varepsilon^3 (1-\gamma)^6}\right),$$

where $n_S$ denotes the number of states, $n_A$ represents the number of actions, $\varepsilon$ is the accuracy parameter, and $\gamma$ is the discount factor. The proof follows. In our latency-critical PAC reinforcement-learning formulation, we take $M$ as a finite Markov Decision Process (MDP), denoted as a tuple $(S, A, T, \mathcal{R}, \gamma)$. Within $M$, $S$ is the state set, $A$ is the set of actions available in each state, $T$ is the transition distribution, represented as $S \times A \to S$, $\mathcal{R}$ is the reward distribution, and $\gamma \in [0, 1)$ is the reward discount factor. $T(s'|s, a)$ indicates the probability of the transition from state $s$ to state $s'$ under the distribution $T(s, a)$. A time-step is defined as a single interaction between the learner and the environment. Each interaction is described by a state-action pair $(s, a)$, meaning that the learner takes action $a$ in state $s$. We use $R(s, a)$ to denote the expected reward under the reward distribution $\mathcal{R}(s, a)$. During the learning process, the learner receives a reward $r \sim \mathcal{R}(s, a)$ when it takes action $a$ in state $s$ and then transitions to the next state $s' \sim T(s, a)$. By repeating this process, the learner tries to accumulate as much reward as possible in as few attempts as possible. A policy is any strategy the learner follows to choose actions. A stationary policy produces an action based only on the current state, without considering previous interaction experience. For a policy $\pi$, the discounted, infinite-horizon value function from state $s$ and its $H$-step counterpart are formulated as follows:

$$V^{\pi}_M(s) = E\left[\sum_{j=0}^{\infty} \gamma^{j} r_{t+j} \,\middle|\, s_t = s\right], \qquad V^{\pi}_M(s, H) = E\left[\sum_{j=0}^{H-1} \gamma^{j} r_{t+j} \,\middle|\, s_t = s\right],$$

where $H$ is a positive integer representing the number of steps and $V^{\pi}_M(s, H)$ is the value accumulated over $H$ steps under policy $\pi$, starting from state $s$. Specifically, $s_t$ and $r_t$ are the $t$-th encountered state and received reward, respectively, resulting from the execution of policy $\pi$ in MDP $M$. Here we define the policy model $\pi$ as non-stationary, considering the dependencies among tasks. We define $c = (s_1, a_1, r_1, s_2, a_2, r_2, \dots)$ as a learning path of the learning algorithm $\mathcal{A}$. In this manner, the state at time $t$ is described by the sequence of state-action experiences $c_t = (s_1, a_1, r_1, \dots, s_t)$. The expected value functions of $\mathcal{A}$ are then defined analogously, conditioned on $c_t$, with the expectation taken over all paths the learner may follow. The optimal policy is denoted as $\pi^*$ and has value functions $V^*_M(s)$ and $Q^*_M(s, a)$. Based on these definitions, we further define several properties used in the PAC-MDP setup. Definition of Sample Complexity of Exploration (Kakade 2003): Given an MDP $M$ and a learning algorithm $\mathcal{A}$ within $M$, for any fixed $\varepsilon > 0$, the sample complexity of exploration of $\mathcal{A}$ is the number of timesteps $t$ such that the policy at time $t$, $\mathcal{A}_t$, satisfies $V^{\mathcal{A}_t}(s_t) < V^*(s_t) - \varepsilon$.

Definition of Efficient PAC-MDP: Given an MDP $M$ (here, the MDP formulated above) and a learning algorithm $\mathcal{A}$ within $M$, $\mathcal{A}$ is an efficient PAC-MDP (Probably Approximately Correct in Markov Decision Processes) algorithm if, for any $\varepsilon > 0$ and $0 < \sigma < 1$, the per-timestep computational complexity, space complexity, and sample complexity of $\mathcal{A}$ are less than some polynomial in $(S, A, 1/\varepsilon, 1/\sigma, 1/(1 - \gamma))$, with probability greater than $1 - \sigma$. $\mathcal{A}$ is PAC-MDP when the definition is relaxed to drop the computational complexity requirement.
Definition of Admissible Heuristics: Given an MDP $M$ and a learning algorithm $\mathcal{A}$ within $M$, we define a function $U: S \times A \to \mathbb{R}$; it is an admissible heuristic when it satisfies $U(s, a) \ge Q^*(s, a)$ for all $s \in S$ and $a \in A$.
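We also assume that $U(s, a) \le V_{\max}$ for all $(s, a) \in S \times A$ and some quantity $V_{\max}$. We set $U(s, a) = V_{\max} = 1/(1 - \gamma)$, since we have $V^*(s) = \max_{a \in A} Q^*(s, a)$, which is at most $1/(1 - \gamma)$. Therefore, without loss of generality, we assume $0 \le U(s, a) \le V_{\max} \le 1/(1 - \gamma)$ for all $(s, a) \in S \times A$.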
We assume that after each disturbance caused by the dynamics, and before training converges again, the offloading policy is an admissible heuristic. Consider a state-action pair $(s, a)$ that the learner has experienced. We define $n(s, a)$ as the number of times the learner has taken action $a$ from state $s$, and denote the rewards received for doing so as $r[1], r[2], \dots, r[n(s, a)]$. Then, the empirical mean reward is:

$$\hat{R}(s, a) = \frac{1}{n(s, a)} \sum_{i=1}^{n(s, a)} r[i].$$

By taking an action, the learner changes the environment through this interaction. We let $n(s, a, s')$ denote the number of times the learner has taken action $a$ from state $s$ and immediately transitioned to state $s'$. The empirical transition distribution $\hat{T}(s, a)$ then satisfies:

$$\hat{T}(s' \mid s, a) = \frac{n(s, a, s')}{n(s, a)}.$$

The objective of the learner throughout the learning process is to maximize the current action value, $Q(s, \cdot)$, by choosing specific actions (here, offloading strategies) and applying them to the environment. The update step solves the set of Bellman equations induced by the empirical model, where we redefine $n(s, a)$ to be the minimum of $m$ and the number of times the state-action pair $(s, a)$ has been experienced.
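The empirical model and the update step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it keeps only the first m samples of each state-action pair, assigns the optimistic value 1/(1 - γ) to pairs seen fewer than m times (in the style of the R-MAX family of PAC-MDP algorithms, which the source does not name explicitly), and solves the Bellman equations by plain value iteration. The names `EmpiricalModel` and `value_iteration` are hypothetical.

```python
from collections import defaultdict


class EmpiricalModel:
    """Counts and reward sums built from the first m samples of each (s, a)."""

    def __init__(self, m: int) -> None:
        self.m = m
        self.n_sa = defaultdict(int)      # n(s, a), capped at m
        self.n_sas = defaultdict(int)     # n(s, a, s')
        self.r_sum = defaultdict(float)   # r[1] + ... + r[n(s, a)]

    def observe(self, s, a, r: float, s_next) -> None:
        if self.n_sa[(s, a)] >= self.m:
            return  # only the first m experiences of (s, a) are used
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1
        self.r_sum[(s, a)] += r

    def known(self, s, a) -> bool:
        return self.n_sa[(s, a)] >= self.m

    def mean_reward(self, s, a) -> float:
        return self.r_sum[(s, a)] / self.n_sa[(s, a)]

    def transition(self, s, a, s_next) -> float:
        return self.n_sas[(s, a, s_next)] / self.n_sa[(s, a)]


def value_iteration(model, states, actions, gamma: float, iters: int):
    """Solve the empirical Bellman equations; unknown pairs stay optimistic."""
    v_max = 1.0 / (1.0 - gamma)
    q = {(s, a): v_max for s in states for a in actions}
    for _ in range(iters):
        v = {s: max(q[(s, a)] for a in actions) for s in states}
        new_q = {}
        for s in states:
            for a in actions:
                if not model.known(s, a):
                    new_q[(s, a)] = v_max  # assumed optimistic heuristic value
                else:
                    expected_next = sum(
                        model.transition(s, a, s2) * v[s2] for s2 in states
                    )
                    new_q[(s, a)] = model.mean_reward(s, a) + gamma * expected_next
        q = new_q
    return q
```

Keeping unknown pairs at the optimistic value is what drives exploration: the greedy policy is drawn toward pairs that have not yet been sampled m times.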

Proof
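Let $Q_i(s, a)$ denote the action-value estimates after the $i$-th iteration of value iteration, and let $\xi_i = \max_{(s, a)} |Q^*(s, a) - Q_i(s, a)|$. Then we have $\xi_i \le \gamma \, \xi_{i-1}$. From the fact that $\xi_0 \le 1/(1 - \gamma)$ we get that:

$$\xi_i \le \frac{\gamma^{i}}{1 - \gamma}.$$

Setting this value to be at most $\beta$ and solving for $i$ yields $i \ge \frac{\ln(\beta(1-\gamma))}{\ln \gamma}$. We claim that:

$$\frac{1}{1 - \gamma} \ln \frac{1}{\beta(1 - \gamma)} \ge \frac{\ln(\beta(1 - \gamma))}{\ln \gamma}. \tag{25}$$

Note that (25) is equivalent to the statement $1 - \gamma \le -\ln \gamma$, which follows from the inequality $e^x \ge 1 + x$. Given the previous setup and assumption, for an efficient PAC-RL algorithm to achieve an $\varepsilon_1$-optimal policy it is sufficient to run value iteration for $\left\lceil \frac{1}{1 - \gamma} \ln \frac{1}{\varepsilon_1 (1 - \gamma)} \right\rceil$ iterations, where the real-valued parameter $\varepsilon_1$ specifies the desired closeness to optimality of the policies produced by value iteration. Based on this, we derive $m$ and $\varepsilon_1$ in terms of the other parameters, $\varepsilon, \sigma, S, A, \gamma$, in the context of the theoretical guarantees on learning efficiency.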
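The iteration bound in the proof can be checked numerically. The sketch below assumes the error after i iterations is bounded by γ^i / (1 - γ), which is what the stated solution for i implies; the function names are hypothetical.

```python
import math


def iterations_needed(beta: float, gamma: float) -> int:
    """Smallest integer i with gamma**i / (1 - gamma) <= beta."""
    exact = math.log(beta * (1.0 - gamma)) / math.log(gamma)
    return max(0, math.ceil(exact))


def iterations_sufficient(beta: float, gamma: float) -> int:
    """Simpler sufficient count: ceil(ln(1 / (beta (1 - gamma))) / (1 - gamma))."""
    return math.ceil(math.log(1.0 / (beta * (1.0 - gamma))) / (1.0 - gamma))


beta, gamma = 0.1, 0.9
tight = iterations_needed(beta, gamma)
loose = iterations_sufficient(beta, gamma)
```

The second count is never smaller than the first because 1 - γ ≤ -ln γ for γ in (0, 1), so it is a safe, easier-to-state number of value-iteration sweeps.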
First, we give explicit definitions of $m$ and $\varepsilon_1$ used during the learning process, together with some internal parameters. An update is performed only when the update condition holds:

Update Rules Formulation
To simplify the calculation, the learning agent only attempts updates of $(s, a)$ when FLG(s, a) is 1 (true), which limits the number of update attempts to a finite number. FLG(s, a) is set to true in two cases: first, at initialization; second, whenever any state-action pair is updated. FLG(s, a) turns from true to false when no updates are made during a period in which $(s, a)$ is experienced $m$ times and the next attempted update of $(s, a)$ fails. In this case, no more attempted updates of $(s, a)$ are allowed until another action-value estimate is updated.
As shown in Fig. 3, we describe the overall learning process step by step. In general, the learning agent samples $m$ steps in each environment for exploration and then turns to the exploitation process. After finishing the learning process within one environment, the learning agent moves to another environment and repeats the same learning period. Once dynamics appear, the learning agent likewise samples just the first $m$ samples in the new environment, performing exploration and exploitation within the upper bound $\tilde{O}(n_S^2 \, n_A / (\varepsilon^3 (1 - \gamma)^6))$. In this way, the learning agent keeps the learning process within the upper bound at all times. Especially right after the dynamics occur, the fixed sample complexity and the exploration upper bound help counter the influence of the newly changed environment.

Formulation of latency-critical PAC-RL
In this section, we take the formulation of the learning process one step further, to the Reinforcement Learning and Meta-Learning parts. Based on the MDP $M$ above, we formulate the RL part as follows: (1) State: The state information needed for a task $ta_i$ during the offloading process includes the encoded DAG dependencies and the corresponding offloading plans. The detailed state definition is given in Eq. (29). (2) Action: The offloading choice for each task is a discrete value, which indicates either local execution or execution on one of the VMs with different resources. Combining the actions of all the tasks gives the action space $A$. (3) Reward: Throughout the learning process of offloading, minimizing the latency $Lat^c_{A_{1:n}}$, defined in Eq. 9, is the primary objective. To achieve this, we formulate the reward function as an estimated negative increment of the latency, calculated every time an offloading decision is executed for a task. The detailed definition is given in Eq. (30). The offloading policy learning paradigm built from these three parts is shown in more detail in Fig. 4.
In our proposed training paradigm, we build both the encoder and the decoder from recurrent neural networks (RNN) [43] to learn the dependencies among tasks. First we apply task embedding, whose result is the input of the encoder. We define $F_{en}$ as the encoding function, which produces the encoder output $e_i$ at each step. To make sure the decoder learns from different parts of the source sequence without information loss, we apply the attention mechanism [44]. The output of the encoder is the input of the decoder, whose decoding function is $F_{de}$. The decoder outputs $d_j$, from which we obtain the offloading policies for the workflows. In the decoding process, $c_j$ is the context vector at decoding step $j$, computed as a weighted sum of the encoder outputs:

$$c_j = \sum_{i} \alpha_{ji} \, e_i. \tag{32}$$

The weight $\alpha_{ji}$ of each encoder output $e_i$ is computed from a score function $f(d_{j-1}, e_i)$, which measures how well the input at position $i$ matches the output at position $j$. Regarding the structure of the neural network (NN), we adopt the sequence-to-sequence neural network [45], which is good at learning context information. The policy learned by the NN is formulated as $Pol(a_j | s_j)$ and the value function as $v_{Pol}(s_j)$. The action $a_j$ is then determined from $Pol(a_j | s_j)$.
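The attention step can be illustrated with a small pure-Python sketch. It assumes a dot-product score for f and a softmax normalization of the scores, which is one common choice for sequence-to-sequence attention; the source does not state which score function or normalization is used, and all function names here are hypothetical.

```python
import math
from typing import List, Sequence


def dot_score(d_prev: Sequence[float], e_i: Sequence[float]) -> float:
    """Assumed score f(d_{j-1}, e_i): a plain dot product."""
    return sum(x * y for x, y in zip(d_prev, e_i))


def attention_weights(
    d_prev: Sequence[float], encoder_outputs: Sequence[Sequence[float]]
) -> List[float]:
    """Softmax over the scores, giving one weight alpha_ji per encoder output."""
    scores = [dot_score(d_prev, e_i) for e_i in encoder_outputs]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def context_vector(
    d_prev: Sequence[float], encoder_outputs: Sequence[Sequence[float]]
) -> List[float]:
    """c_j as the weighted sum of the encoder outputs."""
    weights = attention_weights(d_prev, encoder_outputs)
    dim = len(encoder_outputs[0])
    return [
        sum(w * e_i[k] for w, e_i in zip(weights, encoder_outputs))
        for k in range(dim)
    ]


encoder_outputs = [[1.0, 2.0], [3.0, 4.0]]
uniform_context = context_vector([0.0, 0.0], encoder_outputs)
```

With a zero decoder state every score is equal, so the weights are uniform and the context vector is the plain average of the encoder outputs.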

Formulation of MLR-LC-DRLO
Based on the PAC-RL formulation above, we then address the robustness concern by integrating a Meta-Learning optimization part [46]. The Meta-Learning part has two training loops, an inner loop and an outer loop, which we elaborate below. Overall, we define the objective function based on Proximal Policy Optimization (PPO) [47], where $\pi_{\theta^o_i}$ is the sample policy, $\theta^o_i$ is the parameter vector of the sample policy network, and $\pi_{\theta_i}$ is the target policy, with $\theta_i$ equal to $\theta^o_i$ at the initial epoch. $Pr_t$ is the probability ratio between the sample policy and the target policy. We also define a function $\mathrm{slice}_{1-\epsilon}^{1+\epsilon}(Pr_t)$ that removes the incentive for moving $Pr_t$ outside the interval $[1 - \epsilon, 1 + \epsilon]$, which limits the value of $Pr_t$.
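As an illustration of the clipped objective, the sketch below implements the standard PPO clipped surrogate. It assumes that the slice function is the usual clipping of the probability ratio to [1 - ε, 1 + ε] and that the ratio is the target-policy probability divided by the sample-policy probability; the function names and the default ε = 0.2 are illustrative choices, not values from the paper.

```python
from typing import Sequence


def clip(x: float, low: float, high: float) -> float:
    """Limit x to the interval [low, high]."""
    return max(low, min(high, x))


def ppo_clip_objective(
    ratios: Sequence[float], advantages: Sequence[float], eps: float = 0.2
) -> float:
    """Mean over time steps of min(Pr_t * A_t, clip(Pr_t, 1-eps, 1+eps) * A_t)."""
    terms = [
        min(pr * adv, clip(pr, 1.0 - eps, 1.0 + eps) * adv)
        for pr, adv in zip(ratios, advantages)
    ]
    return sum(terms) / len(terms)


# A ratio above 1 + eps with a positive advantage is capped at (1 + eps) * A_t;
# a ratio below 1 - eps with a negative advantage keeps the more pessimistic term.
objective = ppo_clip_objective(ratios=[1.5, 0.5], advantages=[1.0, -1.0])
```

Taking the minimum of the clipped and unclipped terms makes the surrogate a pessimistic bound, so a single update cannot profit from moving the policy far from the sample policy.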
We formulate our advantage function based on the generalized advantage estimator (GAE) [48]. The detailed formulation is given in Eq. (35), where $\hat{A}_t$ denotes the advantage function value at time step $t$ and $\lambda \in [0, 1]$ controls the trade-off between bias and variance. Overall, we define the objective function for each inner-layer learning task with a value function loss term, where $c_1$ is the coefficient of the value function loss. For the outer-layer objective, we adopt a first-order approximation of the second-order derivatives to save computation, where $n$ is the number of learning tasks sampled in the outer loop, $\alpha$ is the learning rate of inner-loop training, and $m$ is the number of gradient steps conducted in inner-loop training.
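The advantage computation can be sketched with the standard GAE recursion. This assumes the usual form, in which the advantage is a discounted sum of temporal-difference residuals weighted by (γλ)^k; the value estimates and rewards below are made-up numbers, and `gae` is a hypothetical helper name.

```python
from typing import List, Sequence


def gae(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float,
    lam: float,
) -> List[float]:
    """Generalized advantage estimates for one trajectory.

    `values` must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the state reached after the final reward.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Temporal-difference residual at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


high_variance = gae([1.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
low_variance = gae([1.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5, lam=0.0)
```

With λ = 1 the estimate sums all future residuals (low bias, high variance), while λ = 0 keeps only the one-step residual (higher bias, lower variance).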

Algorithm
Algorithm 2: PAC-RL algorithm
This section describes the detailed process of the MLR-LC-DRLO algorithm, integrating and walking through each part of the methodology formulated previously. As shown in Algorithm 1, the input includes the distribution over tasks and the learning rates of the outer and inner loops. The meta policy neural network parameters are denoted as $\theta$.
We first sample a batch of learning tasks $T$ with batch size $n$ and conduct inner-loop training for each sampled learning task. The inner-loop training is based on the PAC Reinforcement Learning formulated above. The first step is the initialization of the algorithm: setting the initial parameters of the policy model and resetting the data set $D$. Next comes the sampling step: based on the number of environments, $N$ data trajectories are sampled from the distribution according to the current policy model and added to the data set $D$.
The inner-layer learning loops that follow are PAC-RL-based learning processes: data sets $\tau_H$ are sampled from $D$ to calculate the updated parameters $\theta'_H$ with the PAC-RL loss function. When PAC-RL converges or reaches the upper bound of the exploration, unlike in conventional RL or other learning methods, the overall policy model is not updated by the inner-layer learning agent. After obtaining the updated $\theta'_H$, the RL agent uses the $\theta'_H$ model to sample new data $\tau'_H$ from $D$. The algorithm then turns to the outer learning layer, where the meta learner uses $\theta'_H$ to calculate the loss function on $\tau'_H$ and update the overall policy model. In the next section we evaluate the performance of MLR-LC-DRLO.
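The interplay between the inner and outer loops can be illustrated with a toy first-order meta-update. This sketch is not the MLR-LC-DRLO algorithm: it replaces the PAC-RL inner training with gradient steps on a scalar quadratic loss, and it assumes a first-order meta-gradient of the form (θ' - θ) / (α m), averaged over the n sampled tasks. The names `inner_adapt` and `meta_step` and the outer learning rate `beta` are hypothetical.

```python
from typing import Sequence


def inner_adapt(theta: float, target: float, alpha: float, m: int) -> float:
    """Inner loop: m gradient steps on the toy loss 0.5 * (theta - target)**2."""
    for _ in range(m):
        theta = theta - alpha * (theta - target)
    return theta


def meta_step(
    theta: float, targets: Sequence[float], alpha: float, m: int, beta: float
) -> float:
    """Outer loop: move theta along the averaged first-order meta-gradient."""
    adapted = [inner_adapt(theta, target, alpha, m) for target in targets]
    # First-order approximation: no second-order derivatives are computed.
    direction = sum((t_adapted - theta) / (alpha * m) for t_adapted in adapted)
    direction /= len(adapted)
    return theta + beta * direction


theta = 0.0
for _ in range(300):
    # Each "learning task" is represented only by its optimum (1.0 and 3.0).
    theta = meta_step(theta, targets=[1.0, 3.0], alpha=0.5, m=1, beta=0.1)
```

In this toy setting the meta-parameter settles between the task optima, which is the kind of initialization that adapts quickly to any one of the sampled tasks. Note that the inner loop never overwrites the meta-parameter directly; only the outer step does, mirroring the description above.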

Evaluation Measurements
Offloading Latency-Critical Measurements: We define several measurements to compare experimental results and to investigate specific metrics. One group relates to the latency missing rate and offloading performance. QoS-Latency-Critical Rate (QLCR) [6]: the total percentage of executed tasks that meet the latency required by QoS.
Expected-Latency-Critical Rate (ELCR) [6]: the total percentage of executed tasks that meet the expected latency. ELCR indicates how latency-critical each method is.
Necessary Training Iterations (NTI): the number of training iterations needed for the policy model to converge in an environment.
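The two rate metrics share the same computation and differ only in the latency bound they are checked against. A minimal sketch, assuming that a task "meets" a bound when its latency is less than or equal to it; the function name and the sample latencies are hypothetical, while the 720 ms bound is the expected latency used later in the evaluation.

```python
from typing import Sequence


def latency_critical_rate(latencies_ms: Sequence[float], bound_ms: float) -> float:
    """Percentage of executed tasks whose latency is within the bound."""
    if not latencies_ms:
        raise ValueError("at least one executed task is required")
    met = sum(1 for latency in latencies_ms if latency <= bound_ms)
    return 100.0 * met / len(latencies_ms)


# ELCR-style check against the expected latency of 720 ms.
elcr = latency_critical_rate([500.0, 700.0, 720.0, 900.0], bound_ms=720.0)
```

QLCR would be obtained by passing the QoS-required latency as `bound_ms` instead of the expected latency.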
Offloading Robustness Measurements: The robustness measurements include the Dynamic Pressure Index (DPI), the Offloading Performance Deviation (OPD), and the Adaptation Steps and Data Usage for Performance Recovery (ASDUPR). Dynamic Pressure Index (DPI): an indicator of the dynamic level of the current environment, covering the portion of workload change and latency change. In its definition, $WOR_{before}$ and $WOR_{after}$ denote the instant workload before and after the dynamics, respectively. DPI shows the level of pressure that the dynamics currently put on the system. In the definition of OPD, $PER_{after}$ denotes the instant average offloading latency after the influence of the dynamics and $PER_{before}$ denotes the previously converged average offloading latency. Besides the instant performance deviation, ASDUPR is proposed to describe adaptation; it covers the time and the data iterations needed for adaptation after a performance deviation incurred by dynamics, where $ITER$ is the number of iterations and $t_o$ is the time spent on each iteration.
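The formulas for these robustness measurements are not reproduced in the text above, so the following sketch is only one plausible reading. It assumes that DPI and OPD are relative changes between the values before and after the dynamics, and that ASDUPR is the number of iterations multiplied by the time per iteration; all three assumptions, the function names, and the sample values are hypothetical.

```python
def relative_change(before: float, after: float) -> float:
    """Absolute change from `before` to `after`, relative to `before`."""
    if before == 0:
        raise ValueError("the value before the dynamics must be non-zero")
    return abs(after - before) / abs(before)


def dpi(wor_before: float, wor_after: float) -> float:
    """Assumed Dynamic Pressure Index: relative workload change."""
    return relative_change(wor_before, wor_after)


def opd(per_before: float, per_after: float) -> float:
    """Assumed Offloading Performance Deviation: relative latency change."""
    return relative_change(per_before, per_after)


def asdupr(iterations: int, t_o: float) -> float:
    """Assumed adaptation cost: iteration count times time per iteration."""
    return iterations * t_o


pressure = dpi(wor_before=100.0, wor_after=130.0)    # workload grows by 30%
deviation = opd(per_before=700.0, per_after=770.0)   # latency rises by 10%
recovery_time = asdupr(iterations=50, t_o=0.2)
```

Under this reading, a robust method shows a small deviation and a small recovery cost even when the pressure index is large.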
Based on the metrics defined above, we carry out a comprehensive evaluation to validate the robustness of MLR-LC-DRLO. The evaluation of our proposed MLR-LC-DRLO is presented in the next section.

Set up
The configuration of the implementation consists of two parts: the configuration of the platform, shown in Table 2, and the configuration of the simulation model, shown in Table 3. Simulation Environment: We consider a cellular network, where the data transmission rate varies with the UE position. The CPU clock speed of the UE, f_UE, is set to 1 GHz. There are four cores in each VM of the MEC host, with a CPU clock speed of 2.5 GHz per core, so the CPU clock speed of a VM, f_VM, is 4 × 2.5 = 10 GHz. We implement a synthetic DAG generator according to [20] based on four parameters: n, fat, density, and ccr, where n represents the task number, fat controls the width and height of the DAG, density decides the number of edges between two levels of the DAG, and ccr denotes the ratio between the communication and computation cost of tasks (Tables 4 and 5).
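To make the roles of the four generator parameters concrete, the following is a minimal sketch of a layered random DAG generator. It is not the generator of [20]: the way `fat` sets the mean level width, the use of `density` as an edge probability between adjacent levels, the cost ranges, and the function name `generate_dag` are all assumptions made for illustration.

```python
import math
import random

def generate_dag(n, fat, density, ccr, seed=0):
    """Return (levels, edges, comp_cost, comm_cost) for a layered random DAG.
    Assumptions: larger `fat` gives wider, shallower graphs; `density` is the
    probability of an edge between tasks in adjacent levels; `ccr` scales
    communication cost relative to the mean computation cost."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    mean_width = max(1, round(fat * math.sqrt(n)))
    levels, next_id = [], 0
    while next_id < n:
        # Level width is random with mean `mean_width`, capped by remaining tasks.
        width = min(n - next_id, rng.randint(1, 2 * mean_width - 1))
        levels.append(list(range(next_id, next_id + width)))
        next_id += width
    edges = set()
    for upper, lower in zip(levels, levels[1:]):
        for child in lower:
            # Guarantee one parent so every non-entry task has a predecessor.
            edges.add((rng.choice(upper), child))
            for parent in upper:
                if rng.random() < density:
                    edges.add((parent, child))
    edge_list = sorted(edges)  # edges always go from a lower to a higher task id
    comp_cost = {task: rng.uniform(1.0, 10.0) for task in range(n)}
    mean_comp = sum(comp_cost.values()) / n
    comm_cost = {e: ccr * mean_comp * rng.uniform(0.5, 1.5) for e in edge_list}
    return levels, edge_list, comp_cost, comm_cost

if __name__ == "__main__":
    levels, edges, comp, comm = generate_dag(n=20, fat=0.5, density=0.4, ccr=0.5)
    print(len(levels), "levels,", len(edges), "edges")
```

Because every edge points from an earlier level to the next one, the result is acyclic by construction.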

Results
As shown in Table 6, we change the same share of workload for every method to compare the latency-critical offloading performance of our MLR-LC-DRLO algorithm against the fine-tuned DQN, Double-DQN and CEM approaches on the same DAG data. More specifically, we add dynamics to scheduling by increasing the workload for each method while keeping the same resource availability setup. We then assess the average latency rates of the scheduled tasks to compare the performance robustness of MLR-LC-DRLO offloading against the other methods. The items that perform best in each row are shown in bold. Compared with the fine-tuned DQN, Double-DQN and CEM methods, MLR-LC-DRLO offers more stable latency-critical offloading performance after each dynamic influence in the environment. More specifically, MLR-LC-DRLO outperforms the fine-tuned DQN, Double-DQN and CEM approaches in latency rates and necessary training iterations. In terms of latency rate, on average more than 95.33% ± 0.34% of the tasks offloaded by MLR-LC-DRLO finish execution with a latency shorter than the expected latency of 720 ms [20], whereas tasks offloaded by the other RL-based methods violate the latency requirement at an average rate of 8.5% to 20%. Moreover, when given heavier workflows (topology 2, n=30, UT=DT=5.5 Mbps), also shown in Table 6, MLR-LC-DRLO still offers more stable latency-critical offloading performance; that is, a higher percentage of tasks finish execution with a latency lower than the expected one under dynamics from the environment. Overall, MLR-LC-DRLO outperforms the fine-tuned DQN, Double-DQN and CEM in performance stability.
Figs. 5 and 6 show how OPD and ASDUPR change under different dynamics, illustrating the robustness of MLR-LC-DRLO in terms of offloading performance stability and the cost of recovering from a performance deviation. First, regarding offloading performance stability (Fig. 5), we increase the workload by the same percentage for all offloading approaches. The performance deviation of MLR-LC-DRLO always remains within 55% across the different workload environments, and in some environments it is even under 25%. In contrast, the performance deviation of the fine-tuned DQN, Double-DQN and CEM approaches spans a much broader range, from 50% to beyond 300%, for the same portion of increased workload. The offloading performance stability of MLR-LC-DRLO therefore exceeds that of the conventional RL-based offloading approaches. Second, regarding recovery from performance deviation, Fig. 6 compares the adaptation speed, or performance recovery speed, discounted by the performance deviation portion, which balances adaptation speed against robustness. As shown, the adaptation speed of MLR-LC-DRLO is on average more than five times faster than that of fine-tuned DQN, Double-DQN and CEM after each workload increase, and at some points even more, demonstrating its robustness to the dynamics of the environment.

Discussion
As shown in the Results section, compared with conventional RL-based approaches, our proposed offloading approach MLR-LC-DRLO shows advantages in offloading performance robustness and in recovery speed after the influence of dynamics in heterogeneous environments. More specifically, the offloading performance deviation and adaptation speed of MLR-LC-DRLO follow a stable pattern of change as the workload increases. When the dynamic change of the DPI is below 30%, both the offloading performance deviation and the adaptation speed increase with DPI; when it is in the range of 30%-50%, both decrease with DPI; and when it is beyond 50%, both increase again. Overall, when the DPI is within 50%, MLR-LC-DRLO remains robust, with a performance deviation lower than 30%. The robustness starts to decrease when the DPI goes beyond 50%, but the performance deviation still stays within 50%, much lower than that of the fine-tuned RL methods (more than 300%). One direction of our future work is to expand the robustness range against the dynamics, that is, to keep the performance deviation low over a wider range of DPI change. Another direction is to reduce the instant offloading performance deviation right after the DPI changes. Furthermore, by investigating the exploration strategies of the RL framework, we could better control the training time and accuracy. We are currently investigating the exploration and exploitation accuracy of RL-based approaches.
Two main aspects explain the advantage achieved by our proposed method. The first is that meta-learning can leverage prior knowledge from previous tasks to improve learning on new tasks. In this way, prior offloading knowledge can be accumulated and transferred to the following phase. By analyzing patterns and relationships across multiple environments, a meta-learning model can identify commonalities and transfer knowledge from one environment to another. This transfer helps a model learn new offloading patterns more efficiently and effectively. Meta-learning can also help avoid overfitting to specific training data by learning a more generalizable learning strategy: by training on multiple tasks, the meta-learning model learns to generalize across tasks rather than overfit to specific examples. This can lead to a more adaptive model that performs well on a wide range of tasks and data.
The other aspect is Probably Approximately Correct (PAC) learning, a framework in machine learning that aims to balance the accuracy of a model with the amount of data needed to achieve that accuracy. One advantage of the PAC framework is that it can lead to faster convergence of learning algorithms. The PAC framework is designed to ensure that a learning algorithm generalizes well from the training data to new, unseen data. To achieve this, it requires that the algorithm reach a certain level of accuracy with high probability, meaning that the algorithm should correctly classify most of the test examples with high confidence. This requirement ensures that the algorithm will perform well on new data, even if it has not seen those examples during training.
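The trade-off between accuracy, confidence and the amount of data can be illustrated with the textbook PAC bound for a finite hypothesis class in the realizable supervised setting: m ≥ (1/ε)(ln|H| + ln(1/δ)) samples suffice for error at most ε with probability at least 1 − δ. This is a general illustration, not the PAC-RL bound used by MLR-LC-DRLO; the function name `pac_sample_bound` and the example values are hypothetical.

```python
import math

def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Number of samples sufficient for a consistent learner over a finite
    hypothesis class to reach error <= epsilon with probability >= 1 - delta
    (realizable case). Illustration only; not the bound used in the paper."""
    if hypothesis_count < 1:
        raise ValueError("hypothesis_count must be at least 1")
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    bound = (math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon
    return math.ceil(bound)

if __name__ == "__main__":
    # Tighter accuracy (smaller epsilon) requires more samples.
    for eps in (0.1, 0.05, 0.01):
        print(eps, pac_sample_bound(1000, eps, 0.05))
```

The bound grows linearly in 1/ε but only logarithmically in 1/δ and in the size of the hypothesis class, which is why demanding higher confidence is comparatively cheap.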
In addition, the PAC framework provides a way to measure the sample complexity of a learning algorithm. This measure is based on the required level of accuracy and the confidence level, and gives an estimate of the number of training examples needed to achieve the desired accuracy. This allows researchers to compare different learning algorithms and choose among them.
As for the limitations of our work, one is the real-time adaptation efficiency. Currently, the algorithm is trained offline and then adapts to a new environment. We plan to integrate an online-offline switching scheme in the future to improve the real-time adaptation efficiency. Further experiments on real-world data are also part of our future work.

Conclusion
In this work, we present MLR-LC-DRLO, a robust task scheduling framework that offers latency-guaranteed scheduling for time-critical tasks while improving the robustness of the schedule. We propose a meta-gradient robust reinforcement learning framework that quickly adapts a scheduling policy model to a newly changed environment, and we use a PAC-based latency-critical RL scheme to maintain the latency guarantee. Experimental results show that our approach provides the latency guarantee and outperforms fine-tuned RL methods. Furthermore, MLR-LC-DRLO finishes adaptation in new environments using fewer training iterations, 2× to 5× faster than the fine-tuned RL approaches, achieving better robustness while offering latency guarantees.

Fig. 1
Fig. 1 Workflow Model
where R(s, a) denotes the maximum-likelihood estimate of the reward and T(•|s, a) indicates the transition distribution of the state-action pair (s, a). That is, the computation of R(s, a) and T(s′|s, a) in Eq. 22 uses only the first n(s, a) = m samples; R(s, a) and T(•|s, a) here are based on the first m observations of (s, a). So during the transition process, instead of modeling each state-action pair, we assert their value to be U(s, a), which is guaranteed to be an upper bound on the true value function, as formulated above. To simplify the notation, we
is a constant added to the value estimate as a bonus value of exploration.
2 m is the number of experiences of a state-action pair before performing an update.
3 l(s, a) denotes the number of samples collected for (s, a).
4 AU(s, a) represents the running sum of target values used to update Q(s, a) once the learning agent collects enough samples.
5 b(s, a) denotes the first timestep at which the first experience of (s, a) is collected for the latest ongoing update attempt.
6 FLG(s, a) ∈ {0, 1} indicates the binary value of the sampling action: 1, collect a sample for (s, a); 0, do not collect a sample for (s, a).

Fig. 5 Fig. 6
Fig. 5 Performance deviation of MLR-LC-DRLO and 3 other fine-tuned RL-based approaches: with the same amount of DPI, MLR-LC-DRLO experiences lower performance deviation

Table 1
[23]tion Summary multi-agent DRL-based approach [21], which adopts actor-critic neural networks to calculate the Q-value based on the corresponding reward function. Shan et al. integrated DRL and Federated Learning to optimize resource allocation problems, which accelerates the training of DRL agents. Lolos et al. proposed a novel full-model-based RL [23] for elastic resource management, which employs adaptive state space partitioning.

Table 3
Simulation set up: We generate synthetic DAGs according to [49], whose model is characterised by n, fat, density, and ccr, where n represents the task number, fat controls the width and height of the DAG, density decides the number of edges between two levels of the DAG, and ccr denotes the ratio between the communication and computation cost of tasks

Table 4
Fine-tuned baseline approaches: we train DQN-, Double-DQN- and CEM-based approaches as baselines for our proposed MLR-LC-DRLO

Table 6
Offloading Performance Comparison: we compare MLR-LC-DRLO with fine-tuned DQN, Double-DQN and CEM to show that MLR-LC-DRLO achieves better latency-critical offloading performance