Dynamic deployment method based on double deep Q-network in UAV-assisted MEC systems

The unmanned aerial vehicle (UAV) assisted mobile edge computing (MEC) system leverages the high maneuverability of UAVs to provide efficient computing services to terminals. A dynamic deployment algorithm based on double deep Q-networks (DDQN) is proposed to address energy limitation and obstacle avoidance when a UAV provides edge services to terminals. First, the energy consumption of the UAV and the fairness of the terminals' geographic locations are jointly optimized for a scenario with multiple obstacles and multiple terminals on the ground, subject to the UAV avoiding obstacles. Furthermore, a double deep Q-network is introduced to address the slow convergence and the risk of falling into local optima during training, and a pseudo count exploration strategy is included in the learning process. Finally, experimental results show that the improved DDQN algorithm achieves faster convergence and a higher average system reward. Regarding the fairness of the geographic locations of terminals, the improved DDQN algorithm outperforms the Q-learning, DQN, and DDQN algorithms by 50%, 20%, and 15.38%, respectively, and the stability of the improved algorithm is also validated.


Introduction
Edge computing offers computing, communication, network, and storage resources at the edge of the network near the terminal by sinking computing resources to the network edge. Terminals can reduce their own energy consumption and task processing delay by transferring their computing tasks to the edge [1]. Mobile edge computing (MEC) is a significant 5G technology that has undergone extensive research. Related theoretical and applied research has grown rapidly since 2015 [2].
Despite the many benefits of edge computing, traditional base stations are constrained by their fixed locations and high implementation costs. In addition, infrastructure may occasionally be damaged by natural disasters. In such scenarios, the edge server is unable to fully serve the terminals. Due to its advantages in mobility, flexibility, and cost effectiveness, the unmanned aerial vehicle (UAV) has been widely used in both civil and military contexts for tasks like traffic management, disaster detection, emergency rescue, and target tracking [3]. Taking into account the flexible mobility of UAVs, some progress has been made in applied research on UAV-assisted MEC [4,5]. Motlagh et al. were among the first to suggest UAV-assisted MEC [6]. By offloading to the edge, mobile terminals in MEC can significantly reduce energy consumption and latency. The fusion of an edge computing architecture and a UAV platform is referred to as UAV-assisted MEC. In addition to offloading to the edge computing server as a user node, the UAV can function as an aerial edge server for the terminals on the ground [7]. The UAV's position can be adjusted to provide better service in accordance with the needs of the terminals. This architecture addresses the drawbacks of fixed base stations. Offloading the workload to the UAV-carried edge server also helps to reduce communication congestion caused by frequent communications between multiple terminals and the cloud.
The limited endurance and storage capacity of the UAV have become the key issues in its application, which motivates the efficient and dynamic deployment of UAV-assisted MEC. In MEC, the energy required for flight and hovering propulsion and the energy required for computation offloading are the two main determinants of the UAV's energy consumption [8]. To fully utilize UAVs' potential in the MEC system, it is critical to conduct research on their trajectory design, hovering height, and dynamic deployment [9][10][11]. Under constraints like energy consumption and UAV mobility, the dynamic deployment of UAVs for MEC involves planning the deployment trajectory of UAVs to satisfy the terminals' individual service requirements.
In UAV-assisted MEC, traditional optimization algorithms (such as heuristic algorithms [12], clustering algorithms [13], convex optimization algorithms [14], etc.) have disadvantages such as a large amount of computation, slow convergence, and poor performance in dynamic environments. Compared with traditional optimization algorithms, Deep Reinforcement Learning (DRL), which combines deep learning models with reinforcement learning algorithms, has powerful autonomous learning and decision-making capabilities and can adapt to unstable environments. Through techniques that adaptively adjust the policy, such as policy gradient, DRL can realize optimal control in a dynamic environment. The environment considered here is complex and changeable, with high real-time requirements that cannot be met by traditional optimization algorithms. This paper therefore uses the DRL method to implement the dynamic deployment of the UAV in MEC scenarios.
To meet the terminals' needs for computation offloading while achieving obstacle avoidance, this paper proposes a DRL algorithm that dynamically deploys the position of the UAV edge server when the terminals are moving and obstacles are present. The following are this paper's main contributions:
• First, it considers the time-varying channel parameters caused by the terminals' movement and the complex scene with obstacles on the ground, which is more realistic. We formulate the optimization problem and create a mathematical model for the optimization objective. To enable DRL training, the problem is transformed into a Markov decision model.
• Second, on the basis of the double deep Q-network algorithm, an ε-pseudo count based exploration strategy is proposed, which is a hybrid of the ε-greedy and pseudo count exploration strategies. The goal is to encourage the agent to investigate more states and actions, maximizing fairness. The Jain fairness factor $f_n(t) \in (0, 1)$ is used to measure the fairness with which terminals are provided with computation offloading services by the UAV. When $f_n(t)$ is 1, the fairness is the greatest.
• Third, simulation experiments verify the effectiveness of the proposed algorithm. The simulation results demonstrate that the improved algorithm is superior to the traditional DDQN, DQN, and Q-learning algorithms in terms of convergence speed and average reward value, and that it also performs better than these algorithms in terms of the fairness factor. In addition, its stability and universality are verified by changing the position, number, and size of obstacles.

Related work
The UAV's energy use during flight is a key factor in determining the UAV's flight time, so the flight path of the UAV must be planned. The algorithms for dynamically managing UAV deployment fall into two main types: traditional optimization algorithms, including clustering algorithms, successive convex approximation algorithms, greedy algorithms, etc., and DRL algorithms.
To minimize the system's energy consumption, Huang et al. planned the trajectory of the UAV in three stages [12]. The three stages used a differential evolution algorithm with variable population size, a mean clustering algorithm, and a greedy algorithm, respectively. The authors of [13] implemented 3D UAV localization using the K-means clustering algorithm and grouped terminals into adjacent cluster heads. To extend the UAV flight time with charging stations, Muhammad et al. proposed a three-stage joint routing and charging strategy, which uses optimization methods to design customer distribution area networks, charging station distribution area networks, and distribution routes [15]. In addition, in UAV-assisted MEC the deployment of the UAV is generally optimized jointly with other indicators. Dai et al. proposed a generalized propulsion energy consumption model for rotary-wing UAVs and jointly optimized user scheduling and UAV trajectory to maximize UAV energy efficiency [16]. Jeong et al. suggested an approach based on successive convex approximation for concurrently optimizing the UAV's trajectory and the bit allocation, so as to reduce the overall energy consumption of the terminals while meeting the QoS criteria [14]. To break down the joint optimization of terminal scheduling strategy, UAV trajectory, and transmit power into a series of feasible sub-problems, Qi et al. used successive convex approximation, a penalty function, and the Dinkelbach approach [17]. On the joint optimization of UAV trajectory and computation offloading, Hu et al. suggested a low-complexity offloading and trajectory scheduling method relying on Lyapunov optimization theory to minimize long-term energy efficiency [18]. Aiming at the communication security problem of the UAV, Xu et al. suggested an approach based on penalized block coordinate descent that concurrently optimizes communication resources, UAV trajectory, and computing resources to optimize the minimal safe computing capacity [19]. To increase the operating period of the UAV and the life of the related network, Wang et al. decomposed region division and trajectory planning of the UAV into two independent sub-problems, which were modeled as a semi-discrete optimal transportation problem and a traveling salesman problem, respectively [20].
In dynamic wireless environments, it might not be possible to make quick decisions using traditional optimization algorithms such as those in [12][13][14][15][16][17][18][19][20]. DRL, which combines deep neural networks and reinforcement learning, has become a popular research topic in light of the rapid development of artificial intelligence. Additionally, several articles demonstrate how powerful DRL can be for solving complex control problems [21,22]. Therefore, the DRL method is widely employed to solve the trajectory planning problem of UAV-assisted MEC. Under limited UAV energy and the QoS constraints of each terminal, the DRL method is utilized to improve the UAV's trajectory in order to maximize the system's long-term return. To maximize the system return and meet the QoS constraint, Liu Qian et al. suggested a QoS behavior selection strategy based on a double deep Q-network algorithm to plan the UAV's flight path with limited energy [23]. Wang Liang et al. adopted a multi-agent deep reinforcement learning algorithm to realize the dynamic deployment of multi-UAV assisted MEC, so as to achieve load balancing among UAV clusters [24]. Yin et al. used a multi-agent reinforcement learning approach to represent trajectory planning and resource allocation as a decentralized partially observable Markov decision process, with the goal of optimizing overall throughput and fair throughput [25]. UAVs can be utilized as temporary base stations to offer edge services to road vehicles in heavy traffic, because a general mobile edge computing scheme with fixed base stations cannot sufficiently manage the urgent communication needs in vehicle networks. A UAV-assisted vehicle communication network system was designed, and a traffic situational awareness-based algorithm for the best UAV flight trajectory was put forth to reduce the cost of UAVs [26]. Bor Yalinzi et al. considered both the energy efficiency and the coverage rate of terminals to optimize the UAV base station layout [27]. Hu et al. investigated task offloading and trajectory design in tandem to minimize the weighted sum of system energy consumption [28].
Furthermore, previous papers on UAV-assisted MEC rarely take collisions into account, which does not reflect real-world conditions. Therefore, Chang Huan et al. proposed a DRL method to realize the dynamic deployment of a UAV edge computing platform in a complex environment with obstacles on the ground [29]. However, in reality the positions of the terminals change with time. To ensure that the UAV can offer computing services to the mobile terminals while avoiding obstacles, the UAV needs to adjust its trajectory in time. Moreover, in most existing studies, the deployment of UAVs is a one-time deployment. When the location of a terminal on the ground changes, the original deployment location may no longer provide the terminal with the optimal edge computing service. To sum up, when there are obstacles on the ground, the terminals move, and the UAV position must be continuously redeployed according to the needs of the terminals, designing the deployment trajectory of the UAV remains challenging, which is the main motivation for the research work in this paper.

System model
In mobile edge computing scenarios where a single UAV carrying an edge server provides edge computing services for multiple mobile terminals on the ground, the UAV must satisfy obstacle avoidance, its own constraints, and fairness with respect to the geographic locations of the mobile terminals. The system model is described in this section.
As shown in Fig. 1, there are a single UAV, N mobile terminals, and K obstacles in a rectangular map with a length of ℓ and a width of w. The sets of terminals and obstacles are denoted as $\mathcal{N} = \{1, 2, ..., N\}$ and $\mathcal{K} = \{1, 2, ..., K\}$, respectively. The geometric center of an obstacle is used to represent its location, denoted as $b_k = (x_k, y_k, h), \forall k \in \mathcal{K}$, where h is the height of the center point of the obstacle. This paper considers a discrete-time system and divides time into T slots, denoted as $t \in \mathcal{T} = \{0, 1, 2, ..., T\}$. The UAV flies over the target area at a fixed height H to offer edge computing services for terminals. The UAV's location at the tth time slot is denoted as $u_{uav,t} = (x_t, y_t, H), \forall t \in \mathcal{T}$. The initial positions of the N terminals are randomly distributed on the ground. Due to the mobility of terminals, the location coordinates of terminal n at the tth time slot are $u_{n,t} = (x_{n,t}, y_{n,t}, 0), \forall n \in \mathcal{N}, t \in \mathcal{T}$. Assuming that each terminal has a task of random size to offload in each time slot, we can derive the following constraints.
The main parameters used in this paper and their meanings are listed in Table 1.

Movement model of terminals
Considering the mobility of terminals in the scenario, we assume that the position of a terminal does not change during the interval between the (t−1)th and tth time slots. The Random Gauss-Markov Mobility (RGMM) model [30] is used to represent the mobility of the terminals. The RGMM is a model based on the Gauss-Markov process, which is widely utilized in signal estimation and other fields. At fixed intervals, the model updates the speed and direction of a moving terminal so that each value is correlated with its previous value. The speed $v_t$ and direction angle $\theta_t$ of a moving terminal on the ground at the tth time slot are defined by Eqs. (1) and (2), where $0 < \phi < 1$ is the memory level; $\mu_v$ and $\mu_\theta$ are the mean values of speed and angle, respectively; and $\sigma_v$ and $\sigma_\theta$ are the standard deviations of speed and angle, respectively. $\omega^{v}_{t-1}$ is a Gaussian process that is uncorrelated with $v_{t-1}$ and has zero mean and unit variance. $\omega^{\theta}_{t-1}$ is a Gaussian process that is uncorrelated with $\theta_{t-1}$ and has zero mean and unit variance. The position coordinates of moving terminal n then follow from its speed and direction and are given by Eqs. (3) and (4). Since the constant movement of terminals would cause a dimensional explosion of the action space, the action space needs to be preprocessed. The map is divided into multiple subdomains, and the location coordinates of a terminal are updated only when the terminal moves outside its subdomain; otherwise they are not updated. The channel transmission parameters, however, remain time-varying.
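As a minimal sketch of the terminal mobility update, the following Python snippet uses a commonly cited Gauss-Markov form, $v_t = \phi v_{t-1} + (1-\phi)\mu_v + \sqrt{1-\phi^2}\,\sigma_v\,\omega^{v}_{t-1}$, with the analogous update for $\theta_t$. This exact form, the use of the updated speed and angle in the position step, the slot length `dt`, and all function and parameter names are assumptions for illustration, since only the symbols are defined in the text above.

```python
import math
import random


def gauss_markov_step(prev, mean, std, phi, rng):
    """One Gauss-Markov update (assumed form): blend the previous value,
    the long-run mean, and zero-mean unit-variance Gaussian noise."""
    noise = rng.gauss(0.0, 1.0)
    return phi * prev + (1.0 - phi) * mean + math.sqrt(1.0 - phi ** 2) * std * noise


def move_terminal(x, y, v, theta, dt, params, rng):
    """Advance one terminal by one slot of length dt (hypothetical helper)."""
    v_new = gauss_markov_step(v, params["mu_v"], params["sigma_v"], params["phi"], rng)
    theta_new = gauss_markov_step(
        theta, params["mu_theta"], params["sigma_theta"], params["phi"], rng
    )
    # Assumption: the new speed and heading are applied over the whole slot.
    x_new = x + v_new * dt * math.cos(theta_new)
    y_new = y + v_new * dt * math.sin(theta_new)
    return x_new, y_new, v_new, theta_new
```

With a memory level close to 1 the terminal keeps its previous speed and heading almost unchanged, while a value close to 0 makes each slot nearly independent of the last.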

Obstacle avoidance model
Fig. 2 shows the top view of the relationship between the flight path of the UAV and the position of obstacle k. The outline of each obstacle on the ground is simplified into a cylinder with radius R. The distance between the straight line containing the flight path of the UAV and the center of obstacle k is then computed. If this distance is greater than the radius $R_k$ of the obstacle, the UAV will not hit the obstacle. Otherwise, it is necessary to evaluate cos A and cos B: if cos A > 0 and cos B > 0, the UAV collides with the obstacle. cos A and cos B are expressed in Eq. (5). Define the obstacle avoidance variable as $O_{k,t} \in \{0, 1\}$: $O_{k,t} = 0, \forall k \in \mathcal{K}, t \in \mathcal{T}$ indicates that the UAV collides with obstacle k at the tth time slot, and $O_{k,t} = 1, \forall k \in \mathcal{K}, t \in \mathcal{T}$ indicates that the UAV successfully avoids obstacle k.
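The two-step check above can be sketched in Python as follows. Because Fig. 2 is not reproduced here, the sketch assumes that A and B are the angles at the start and end points of the flight segment between the segment and the line to the obstacle center; the function name and argument layout are hypothetical.

```python
import math


def avoids_obstacle(p_start, p_end, center, radius):
    """Return O_{k,t}: 1 if the straight path from p_start to p_end avoids the
    circular obstacle outline, 0 if it collides (top view, 2D points)."""
    (x1, y1), (x2, y2), (cx, cy) = p_start, p_end, center
    seg_len = math.hypot(x2 - x1, y2 - y1)
    if seg_len == 0.0:
        # UAV does not move: compare its position with the outline directly.
        return 0 if math.hypot(cx - x1, cy - y1) <= radius else 1

    # Distance from the obstacle center to the infinite line through the path.
    line_dist = abs((x2 - x1) * (cy - y1) - (y2 - y1) * (cx - x1)) / seg_len
    if line_dist > radius:
        return 1

    # Signs of cos A and cos B via dot products (positive denominators dropped).
    cos_a_sign = (x2 - x1) * (cx - x1) + (y2 - y1) * (cy - y1)
    cos_b_sign = (x1 - x2) * (cx - x2) + (y1 - y2) * (cy - y2)
    return 0 if (cos_a_sign > 0 and cos_b_sign > 0) else 1
```

Both cosines being positive means the foot of the perpendicular from the obstacle center falls between the two endpoints, so the segment itself, not just its extension, passes within the radius. A stricter variant could additionally test whether either endpoint lies inside the outline; that case is not covered by the rule as stated.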

Energy consumption model
This paper assumes that the entire deployment process is divided into T time slots of unequal length, and each time slot t is divided into three parts. The first part is the scheduling decision: the agent obtains the next deployment position by analyzing the current positions of the terminals and obstacles and other information. The second part is the flight time, that is, the time required for the UAV to fly to the new deployment location. The third part is the computation offloading time: after the UAV arrives at the new deployment location, it provides edge services for terminals. The agent interacts with the environment and makes decisions quickly, so the decision time is negligible compared to the flight time and the computation offloading time. The division of the tth time slot is shown in Fig. 3.
The flying distance of the UAV at the tth time slot is $d_{uav,t}$. Assuming that the UAV flies at a fixed speed $v_{uav}$, the flight time at the tth time slot is $d_{uav,t}/v_{uav}$ (7). In the MEC scenario, it is assumed that the flying height of the UAV is always lower than the height of the obstacles, and the UAV needs to avoid the obstacles before it can provide services for the terminals on the ground. Therefore, the wireless communication link between the UAV and the terminals is modeled as a line-of-sight link. At the tth time slot, the uplink data transmission rate between terminal n and the UAV is denoted $r_{uav,n,t}$. It depends on $d_{uav,n,t}$, the horizontal distance between the UAV and terminal n at the tth time slot, which is expressed as $d_{uav,n,t} = \sqrt{(x_t - x_{n,t})^2 + (y_t - y_{n,t})^2}$. B is the bandwidth of the channel and $P_{tr}$ is the transmission power; $\rho = g_0 G_0/\sigma^2$ with $G_0 \approx 2.2846$, where $g_0$ represents the channel power gain at the reference distance of 1 m and $\sigma^2$ represents the noise power. At the tth time slot, the time for the UAV to perform computation offloading for terminal n is defined as $D_{n,t}/r_{uav,n,t} + F_{n,t}/f_{uav}$. The computation offloading time is divided into two parts: task transmission and task computation. Here, $D_{n,t}/r_{uav,n,t}$ is the time required for terminal n to transmit its task data to the UAV at the tth time slot, where $D_{n,t}$ indicates the amount of task data that terminal n needs to offload at the tth time slot; $F_{n,t}/f_{uav}$ is the time required for the UAV to compute the tasks offloaded from terminal n at the tth time slot, where $F_{n,t} = D_{n,t} C_{n,t}$ is the number of CPU cycles required to compute all tasks, $C_{n,t}$ is the number of CPU cycles required to compute each bit of data, and $f_{uav}$ is the CPU frequency assigned to the tasks by the edge server on the UAV.
The flight energy consumption of the UAV at the tth time slot is then defined as $P_f \, d_{uav,t}/v_{uav}$ (10), where $P_f$ represents the flight power of the UAV. At the tth time slot, the energy consumption required by the UAV to perform computation offloading for terminal n is defined as $P_{tr} D_{n,t}/r_{uav,n,t} + \kappa D_{n,t} C_{n,t} f_{uav}^2$. The energy consumption of computation offloading is divided into two parts: transmission energy consumption and computing energy consumption. $P_{tr} D_{n,t}/r_{uav,n,t}$ is the transmission energy consumption of terminal n offloading its task to the UAV at the tth time slot, where $P_{tr}$ represents the transmission power; $\kappa D_{n,t} C_{n,t} f_{uav}^2$ is the computational energy consumption of the UAV at the tth time slot, where $\kappa = 10^{-26}$ is a hardware-related constant.
To ensure the stability and reliability of the communication link, the UAV needs to hover while providing computation offloading services for terminals. Therefore, the hovering energy consumption of the UAV serving terminal n at the tth time slot is defined as the product of the hovering power $P_h$ and the computation offloading time. The total energy consumption of the UAV at the tth time slot, $E^{total}_{uav,n,t}$, is the sum of the flight, computation offloading, and hovering energy consumption. The UAV must serve terminals within its energy budget, $\sum_{t=0}^{T} E^{total}_{uav,n,t} \le E^{max}_{uav}$. The total power of the UAV is W. When the energy of the UAV is exhausted, it stops serving or returns to charge.
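The per-slot energy bookkeeping described in this subsection can be sketched as below. The rate expression $r_{uav,n,t} = B \log_2\!\big(1 + \rho P_{tr}/(H^2 + d_{uav,n,t}^2)\big)$ is a common line-of-sight form and is an assumption here, since the rate equation itself is not recoverable from the text; the function name and the dictionary keys are likewise hypothetical.

```python
import math


def slot_energy(d_flight, d_horizontal, task_bits, cycles_per_bit, p):
    """Energy used by the UAV in one time slot when serving one terminal."""
    # Flight: fixed speed, so time = distance / speed and energy = power * time.
    t_flight = d_flight / p["v_uav"]
    e_flight = p["P_f"] * t_flight

    # Assumed line-of-sight uplink rate in bits per second.
    snr = p["rho"] * p["P_tr"] / (p["H"] ** 2 + d_horizontal ** 2)
    rate = p["B"] * math.log2(1.0 + snr)

    # Offloading time = transmission time + computation time.
    t_tx = task_bits / rate
    t_cmp = task_bits * cycles_per_bit / p["f_uav"]
    t_offload = t_tx + t_cmp

    # Offloading energy = transmission energy + computation energy.
    e_offload = (p["P_tr"] * t_tx
                 + p["kappa"] * task_bits * cycles_per_bit * p["f_uav"] ** 2)

    # The UAV hovers for the whole offloading period.
    e_hover = p["P_h"] * t_offload
    return {"flight": e_flight, "offload": e_offload, "hover": e_hover,
            "total": e_flight + e_offload + e_hover}
```

Summing the `total` entries over all slots and comparing the result with $E^{max}_{uav}$ gives the energy-budget check stated above.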

Problem formulation
When obstacles are present on the ground, dynamically deploying the UAV to meet the computation offloading requirements of the mobile terminals while avoiding obstacles can lead to unfairness. For example, the UAV may serve only nearby terminals in order to save energy and avoid obstacles, and never provide edge computing services to distant terminals. As a result, some terminals may be ignored and never served by the UAV. We use the Jain fairness factor to address this problem. First, the offloading variable is defined as $Z_{n,t} \in \{0, 1\}, \forall n \in \mathcal{N}, t \in \mathcal{T}$ (19), which indicates whether terminal n is served by the UAV at the tth time slot. The fairness factor $f_n(t)$ is computed from these variables. If all terminals are served by the UAV a similar number of times from the initial time slot t = 0 to T, the value of $f_n(t)$ is closer to 1; otherwise it is closer to 0. The optimization problem is formulated as problem (21). The optimization objective is to minimize the UAV's energy consumption while maximizing the obstacle avoidance variable between the UAV and the obstacles, and to realize fairness among the terminals served by the UAV. (21b) means that from t = 1 to T, the UAV's total energy consumption cannot exceed the maximum energy consumption; (21c) means that the UAV cannot fly faster than its maximum speed; (21d) means that the UAV's flight range is not allowed to go beyond the fixed area; (21e) means that the CPU frequency that the MEC server installed on the UAV assigns to a task cannot exceed the maximum frequency; (21f) means that the UAV can serve only one terminal during time slot t.
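For illustration, Jain's fairness index in its standard form is $\left(\sum_n c_n\right)^2 / \left(N \sum_n c_n^2\right)$. The sketch below assumes that $c_n$ is the number of slots in which terminal n has been served so far, i.e. the running sum of $Z_{n,t}$; that choice of input and the function name are assumptions rather than the paper's exact definition.

```python
def jain_fairness(service_counts):
    """Jain's fairness index over per-terminal service counts."""
    n = len(service_counts)
    total = sum(service_counts)
    squares = sum(c * c for c in service_counts)
    if n == 0 or squares == 0:
        return 0.0  # no terminal has been served yet
    return (total * total) / (n * squares)
```

If every terminal has been served equally often the index is 1, and if a single terminal out of N receives all the service it drops to 1/N, which matches the behavior described above.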

MDP
DRL takes advantage of the agent's interaction with the environment and allows the agent to learn the optimal action through rewards. Based on the system model in the above section, the environment and actions of the system have the Markov property, so this paper uses a Markov decision process (MDP) to describe the system model. This section introduces the MDP and the DRL model for the above optimization problem. A DRL algorithm based on an ε-pseudo count action selection strategy is then used to realize the dynamic deployment of the UAV edge server.
The MDP mainly consists of the state, action, reward, and discount factor γ, and is used to describe the interaction between the agent and the environment as well as the learning process of the agent in DRL. Given the system model in the preceding section, the problem to be solved conforms to an MDP, and the Markov decision model is established as follows.
1) State Space: first, we describe the state space for each episode as the set of states $s_i$, where $s_i$ is the state at the ith time slot. In each state, $u_{n,i}$ is the terminals' location information at the ith time slot; $u_{uav,i}$ is the position information of the UAV at the ith time slot; $b_k$ is the location information of the obstacles on the ground; $D_{n,i}$ is the amount of task data that terminal n needs to offload at the ith time slot; $O_{k,i}$ is the obstacle avoidance variable at the ith time slot; $r_{uav,n,i}$ is the channel transmission rate between the UAV and terminal n at the ith time slot; and the pair $\{u_{n,i}, Z_{n,i}\}$ represents the terminal n served by the UAV at the ith time slot.
2) Action Space: the agent selects the action with the greatest reward based on the current state, i.e. the next position of the UAV. The action space is therefore the set of candidate next positions of the UAV.
3) Reward Function: each time the UAV is deployed to a new position, the reward can be calculated, which is the feedback of the whole system model to the DRL model. Specifically, the reward combines the ratio of the fairness factor of the edge computing service to the total energy consumption, the obstacle avoidance variable, and the additional reward brought by pseudo count exploration after the dynamic deployment of the UAV. In the reward obtained from the current state and action, $\omega_1$ and $\omega_2$ are coefficients that adjust the proportions of the reward terms, and $\psi(N(s))$ represents the additional reward from pseudo count exploration.

Deployment algorithm based on DDQN with ε -pseudo count selection strategy of action
The DDQN algorithm, a DRL algorithm, mitigates the overestimation problem of the DQN by decoupling the selection of the action in the target Q value from the calculation of the target Q value. The algorithm structures of DDQN and DQN are similar, but the way the Q network is updated differs. When computing the target, the DDQN chooses the action with the highest Q value according to the parameters of the main network, whereas the DQN chooses the action according to the parameters of the target network. This alleviates overestimation to a certain extent, making the Q value closer to the true value.
The DDQN consists of two neural networks: the target network and the main network. Specifically, $Q_{main}$ is the output of the main network, a value function used to evaluate the current state-action pair, and $Q_{target}$ is the output of the target network. What is special is that $Q_{target}$ is evaluated at the action that maximizes the Q value in the main network, and the target Q value $Y^{DDQN}$ is defined accordingly. To avoid suboptimal results, common DDQN implementations generally use an ε-greedy strategy to balance exploration and exploitation. However, ε-greedy exploration may repeatedly select previously experienced states and actions, and therefore cannot reliably avoid local optima.
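The difference between the two targets can be illustrated with the standard Double DQN rule, $Y^{DDQN} = R_{i+1} + \gamma\, Q_{target}\big(s_{i+1}, \arg\max_a Q_{main}(s_{i+1}, a)\big)$. The sketch below works on plain lists of Q values; the function names and the `done` flag for terminal transitions are assumptions added for illustration.

```python
def ddqn_target(reward, next_q_main, next_q_target, gamma, done=False):
    """Double DQN target for one transition.

    next_q_main / next_q_target: Q values of every action in the next state,
    from the main and the target network respectively.
    """
    if done:
        return reward
    # Select the action with the main network, evaluate it with the target network.
    best_action = max(range(len(next_q_main)), key=lambda a: next_q_main[a])
    return reward + gamma * next_q_target[best_action]


def dqn_target(reward, next_q_target, gamma, done=False):
    """Standard DQN target, shown for contrast: select and evaluate with the
    target network alone."""
    return reward if done else reward + gamma * max(next_q_target)
```

When the target network overrates an action that the main network does not prefer, the DQN target inherits the inflated value, whereas the DDQN target does not.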
The core idea of pseudo count exploration is to calculate or estimate the frequency of each state-action pair by designing a density model. New state-action pairs are rewarded with higher bonuses, encouraging the agent to try more state-action pairs. In this paper, the DRL algorithm is used to train the dynamic deployment of the UAV so that terminals are served fairly. Given the principle of pseudo count exploration, a DRL algorithm based on ε-pseudo count exploration is well suited to this problem.
The probability density of the occurrence of $s_i$ is defined as $\rho(s_i) = \rho(s_i \mid s_{1:t})$, and the probability density after $s_i$ is observed one more time is $\rho'(s_i) = \rho(s_i \mid s_{1:t} s_i)$. To better understand the density model, two concepts are introduced: the pseudo count function $N(s, a)$ and the pseudo count total $\hat{t}$. The probability density of the state $s_i$ and the probability density after $s_i$ is observed one more time are then $\rho(s_i) = N(s_i)/\hat{t}$ and $\rho'(s_i) = (N(s_i) + 1)/(\hat{t} + 1)$, respectively. The density model is required to be learning-positive, i.e. $\rho'(s_i) \ge \rho(s_i)$. Solving the two expressions for the pseudo count gives $N(s_i) = \rho(s_i)\,(1 - \rho'(s_i)) / (\rho'(s_i) - \rho(s_i))$ (27). In addition, the reward defined by pseudo count exploration is $\psi(N(s)) = N(s)^{-1/2}$. The pseudocode of the ε-pseudo count action selection strategy is shown in Algorithm 1.
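The relations above can be sketched in Python as follows. `pseudo_count` and `exploration_bonus` follow the expressions in the text; `select_action` is only one plausible way to combine ε-greedy with the count bonus, using per-action counts in place of a density model, and is an assumption rather than the paper's Algorithm 1. All names and the small `eps` guard are hypothetical.

```python
import random


def pseudo_count(rho, rho_next):
    """Recover N(s) from the density before and after one more observation of s."""
    if rho_next <= rho:
        raise ValueError("density model must be learning-positive (rho_next > rho)")
    return rho * (1.0 - rho_next) / (rho_next - rho)


def exploration_bonus(count, eps=1e-8):
    """psi(N(s)) = N(s)^(-1/2); eps guards against a zero count (assumption)."""
    return (count + eps) ** -0.5


def select_action(q_values, counts, epsilon, rng):
    """Hypothetical epsilon-pseudo-count rule: explore uniformly with probability
    epsilon, otherwise pick the action maximizing Q plus the count bonus."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)),
               key=lambda a: q_values[a] + exploration_bonus(counts[a]))
```

Because the bonus shrinks as the count grows, rarely tried actions are favored early on and the rule approaches a purely greedy choice once all actions have been visited often.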

Algorithm 1 The action selection strategy based on ε-pseudo count
The schematic diagram of the DDQN algorithm based on the ε-pseudo count exploration strategy is shown in Fig. 4.
It can be seen from Fig. 4 that: ① the environment includes information such as the positions of the UAV and the terminals, the channel transmission rate, etc.; ② ε-pseudo count exploration is used to explore the environment and choose the action $a_i$ according to the current state $s_i$. Negative rewards are given when the UAV collides with obstacles or consumes too much energy. The ε-pseudo count strategy, which combines the ε-greedy exploration strategy with the pseudo count exploration strategy, can speed up convergence and avoid local optima, achieving a better balance between exploration and exploitation. Next, the reward $R_{i+1}$ and the next state $s_{i+1}$ are obtained, and then ③ the information obtained by the exploration is stored in the experience pool (Memory M) as a tuple $(s_i, a_i, R_{i+1}, s_{i+1})$. The experience pool has a fixed size; when it is full, the oldest stored tuples are discarded. ④ During neural network training, a fixed-size batch of experiences, called a Mini-Batch, is sampled from Memory M. ⑤ and ⑥ are the two neural networks of the DDQN algorithm, the main network and the target network, respectively. The purpose of the neural networks is to obtain the maximum target Q value. To ensure the stability of the algorithm, the main network copies its parameters θ to the target network after every fixed interval of C steps. ⑦ and ⑧ are the Q values output by the two neural networks, respectively, where ⑨ indicates that the Q value of ⑧ is updated according to the action with the maximum value of ⑦. ⑩ The network parameters are updated by minimizing the loss function. The pseudocode of the deployment algorithm based on DDQN with the ε-pseudo count action selection strategy is shown in Algorithm 2.
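Steps ③ and ④ and the periodic parameter copy can be sketched with a fixed-size buffer. The class and function names are hypothetical, and uniform sampling without replacement is an assumption, since the sampling rule is not specified above.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size experience pool; the oldest tuples are dropped when full."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Draw a mini-batch uniformly at random without replacement."""
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def should_sync_target(step, interval_c):
    """True every interval_c training steps, when the main-network
    parameters are copied to the target network."""
    return step > 0 and step % interval_c == 0
```

Training would typically start only once the pool holds at least one Mini-Batch of tuples, so that `sample` never requests more items than are stored.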

Algorithm 2 Deployment algorithm based on DDQN with ε-pseudo count selection strategy of action

Simulation experiment
In this part, we simulate the performance of the DDQN algorithm with the proposed ε-pseudo count action selection strategy and analyze the results. The simulations are run on a PC with an Intel Core i7-1165G7 CPU at 2.8 GHz and 16 GB of memory, using Python 3.6 and TensorFlow 2.0. In the algorithm, the main network and the target network each adopt four fully connected hidden layers, and each layer has 50 neurons.

Create a simulation environment
On a map with a length of 1000 m and a width of 600 m, the UAV flies at a constant speed and a fixed height. First, the obstacles in the training environment are simplified as cylinders, and the radius R, the height h, and the coordinates of each obstacle are set. Second, the starting positions of 30 terminals are set randomly in the environment. Finally, the UAV is added to the environment, and its relevant attribute parameters are set.
The specific parameter settings are shown in Table 2 below.

Algorithm hyperparameter settings
The hyperparameter settings of a DRL algorithm are very important. Only with appropriate hyperparameters can the neural network perform well and the algorithm converge. A key hyperparameter in the algorithm is the learning rate α. When α is too low, the network takes longer to converge and can easily fall into a local minimum. When α is too high, the gradient oscillates around the minimum.
Fig. 5 shows the convergence of the proposed algorithm under different learning rates α. The algorithm does not converge when α is 0.1 or 0.01. The algorithm converges when α is 0.001 or 0.0001, but when α = 0.001 it converges within 5000 episodes, and the average reward at convergence is higher than when α = 0.0001. The learning rate α is therefore set to 0.001.
Figure 6 shows the convergence of the proposed algorithm under different discount factors γ. The discount factor controls how future rewards affect the current reward value. A higher value of γ means that the agent considers more future steps but is also more difficult to train, while a lower value means that the agent pays more attention to immediate rewards. Therefore, an appropriate discount factor must be determined. It can be seen from Fig. 6 that when γ = 0.99, the agent is overly focused on potential future rewards, so convergence is slower and more difficult. When γ is 0.95 or 0.9, the average reward at convergence is similar, but the algorithm converges at about 7500 episodes when γ = 0.95 and at about 5000 episodes when γ = 0.9, which is clearly faster. It can be seen from Fig. 7 that when the Mini-Batch is 32 or 64, effective training samples cannot be extracted because the number of samples is too small, and the convergence trend is not obvious. When the Mini-Batch is 128, the algorithm converges at about 5000 episodes, with an average reward of about 1300. The Mini-Batch is therefore set to 128. Because the hyperparameters in DRL influence one another, the specific values are determined through a large number of comparative experiments over combinations of the parameters. The remaining algorithm hyperparameter settings are shown in Table 3 below.
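To make the role of γ in the comparison above concrete, recall the standard discounted return. The "effective horizon" $1/(1-\gamma)$ used below is a common rule of thumb and is added here as an illustration, not a quantity taken from the experiments.

```latex
G_i = \sum_{k=0}^{\infty} \gamma^{k} R_{i+k+1}, \qquad
\sum_{k=0}^{\infty} \gamma^{k} = \frac{1}{1-\gamma}
\;\Rightarrow\;
\frac{1}{1-0.9} = 10,\quad
\frac{1}{1-0.95} = 20,\quad
\frac{1}{1-0.99} = 100 .
```

Under this reading, γ = 0.99 weighs roughly ten times as many future steps as γ = 0.9, which is consistent with the slower convergence observed for the larger discount factor.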

Result analysis
Figure 8 depicts the average rewards of the proposed algorithm, the traditional DDQN algorithm, the DQN algorithm and the Q-learning algorithm. First, the algorithm proposed in this paper converges from about 6000 episodes, at an average reward of around 1250. The traditional DDQN algorithm converges after about 10,000 episodes, and its final reward value is 1200. The DQN algorithm converges to 850 at around 8000 episodes. However, the Q-learning algorithm cannot converge because it cannot handle high-dimensional action spaces. It can be seen that the proposed algorithm achieves the fastest convergence speed and the largest average reward value, indicating that the action selection strategy based on ε-pseudo count guides the agent's selection and utilization of actions more effectively.
Fig. 6 The convergence of the proposed algorithm under various discount factors
Fig. 7 The convergence of the proposed algorithm under various Mini-Batch sizes
Figure 9 depicts the fairness factor $f_n(t)$ of the proposed algorithm and the traditional DDQN and DQN algorithms. Under the proposed algorithm, the fairness factor is around 0.9, which means that the UAV can satisfy the fairness of the geographical location of most terminals. Under the traditional DDQN algorithm it is about 0.78, and under the DQN algorithm it is about 0.75, which means that the UAV can satisfy the fairness of the geographical location of only some terminals. The fairness factor $f_n(t)$ under the Q-learning algorithm is always the lowest and decreases as training proceeds.
It can be seen that our proposed algorithm achieves the greatest fairness of geographical location among all compared algorithms.
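As a worked check of the comparison above, the relative improvement of the proposed algorithm over a baseline can be computed from the approximate fairness factors read from Fig. 9. The helper name `relative_improvement` is hypothetical, and the inputs are the rounded values quoted in the text (0.9, 0.78 and 0.75), so the results are only as precise as those readings.

```python
def relative_improvement(new, baseline):
    """Percentage by which `new` exceeds `baseline`."""
    return 100.0 * (new - baseline) / baseline


proposed = 0.90                            # approximate value under the proposed algorithm
baselines = {"DDQN": 0.78, "DQN": 0.75}    # approximate values quoted in the text

for name, value in baselines.items():
    gain = relative_improvement(proposed, value)
    print(f"improvement over {name}: {gain:.2f}%")
```

With these rounded inputs, the gains over DDQN and DQN agree with the 15.38% and 20% stated in the abstract.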
To verify the stability and effectiveness of the proposed algorithm, the fairness factors of the improved DDQN, DDQN, DQN and Q-learning algorithms are compared under different numbers of obstacles, to check whether the algorithm remains effective in different scenarios.
As shown in Fig. 10, when the number of obstacles is 2, 4, 6, 8 or 10, the fairness factor under the improved DDQN algorithm is always maintained at about 0.9. The fairness factors of the DDQN, DQN and Q-learning algorithms are maintained at about 0.78, 0.77 and 0.67, respectively. The fairness factor under the improved DDQN algorithm is always stable and superior to that of the other algorithms, indicating that the proposed algorithm has good stability and effectiveness.
Under the above parameter settings, after the agent has been trained, the dynamic deployment trajectory of the UAV is shown in Fig. 11.

Conclusion
The dynamic deployment trajectory of a single UAV is planned in a scenario with multiple mobile terminals and multiple obstacles on the ground, in order to achieve fairness of the geographical location of the terminals. Finally, the environment for the simulation experiment is established. To achieve the best training effect when using the improved DDQN algorithm to train the UAV, combinations of hyperparameters are tested and their values are adjusted until the best effect is achieved. The average reward value, the convergence speed and the fairness of the geographical location of the ground terminals are used as evaluation indicators, and the improved algorithm in this paper is compared with the DDQN, DQN and Q-learning algorithms. The simulation results show that the improved DDQN algorithm in this paper has always maintained good performance. However, this paper only studies single-UAV-assisted MEC. With a large number of terminals and delay-sensitive tasks, a single UAV obviously cannot meet the demand. Therefore, we will conduct research on multi-UAV-assisted edge computing networks in the future.

Fig. 1 System model of UAV dynamic deployment

Fig. 3 Diagram of the tth time slot division

$\sum_{n=1}^{N} Z_{n,t} = 1,\ \forall n \in N,\ t \in T$ means that the UAV can only serve one terminal in each time slot. The fairness factor $f_n(t)$ is used to represent the fairness of the geographical location of the terminals (the fairness of the UAV's service among the terminals).
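As a small illustration of this constraint, the sketch below checks that an association matrix serves exactly one terminal in every time slot. The nested-list layout `Z[n][t]` (1 if the UAV serves terminal n in slot t, else 0) and the helper name `serves_exactly_one` are assumptions made for this example only.

```python
def serves_exactly_one(Z):
    """Return True if, for every time slot t, sum over n of Z[n][t] equals 1.

    Z is a list of rows, one row per terminal, each row indexed by time slot.
    """
    num_slots = len(Z[0])
    return all(sum(row[t] for row in Z) == 1 for t in range(num_slots))


# Hypothetical schedule: 3 terminals, 4 time slots.
Z = [
    [1, 0, 0, 1],   # terminal 1 is served in slots 1 and 4
    [0, 1, 0, 0],   # terminal 2 is served in slot 2
    [0, 0, 1, 0],   # terminal 3 is served in slot 3
]
print(serves_exactly_one(Z))
```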

Fig. 4 Schematic diagram of deployment algorithm based on DDQN with ε-pseudo count

Fig. 5 The convergence of the proposed algorithm under various learning rates

Figure 7 shows the convergence of the proposed algorithm under different Mini-Batch sizes. Mini-Batch refers to the number of samples drawn from the experience pool and used for each gradient calculation. The size should be set with both training stability and training speed in mind.

In Fig. 11, blue circles represent obstacles, and black lines represent the dynamic deployment trajectory of the UAV. The gray circles indicate the initial positions of the ground terminals, and the red circles indicate the positions of the ground terminals while the UAV is serving them. The two green circles represent the start and end points, respectively. It should be noted that the depletion of the UAV's battery marks the end of the UAV deployment. The end point is the last location deployed before power is exhausted. Therefore, the end point of UAV deployment in this paper is not fixed.

Fig. 8 Average reward under different algorithms

Fig. 9 Fairness factor under different algorithms

Fig. 11 Deployment trajectory diagram of the UAV

Table 1 Main parameters
W total power of the UAV

Fig. 2 Diagram of obstacle avoidance

In Fig. 2, obstacles are represented by blue circles. Assuming that the flying height H of the UAV is lower than the height h of the obstacles, the UAV needs to avoid obstacles in flight to prevent collisions. At the t-th time slot, the coordinate of the UAV is $(x_t, y_t, H)$, and at the (t−1)-th time slot it is $(x_{t-1}, y_{t-1}, H)$. The black dots indicate where the UAV stays at a given moment, and the straight line between two points indicates the flight trajectory of the UAV. The distance between the current position of the UAV and the center of the obstacle alone cannot accurately determine whether the UAV has collided with the obstacle. Therefore, whether the UAV collides with an obstacle during flight is judged by the distance from the straight line containing the UAV's flight track to the center of the obstacle. At the t-th time slot, the horizontal flight path of the UAV is the straight line through $(x_{t-1}, y_{t-1})$ and $(x_t, y_t)$.
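The collision test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the tuple-based coordinates and the obstacle `radius` parameter are hypothetical, and the check uses the standard point-to-line distance between the obstacle center and the line through the UAV positions at slots t−1 and t.

```python
import math


def distance_to_flight_line(p_prev, p_curr, center):
    """Distance from an obstacle center to the line through two UAV positions.

    p_prev, p_curr and center are (x, y) tuples in the horizontal plane.
    """
    (x1, y1), (x2, y2), (cx, cy) = p_prev, p_curr, center
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    if length == 0.0:
        # The UAV hovered in place: fall back to the point-to-point distance.
        return math.hypot(cx - x1, cy - y1)
    # |cross product of (center - p_prev) and the direction vector| / |direction|
    return abs(dy * (cx - x1) - dx * (cy - y1)) / length


def crosses_obstacle(p_prev, p_curr, center, radius):
    """Flag a collision when the flight line passes closer than `radius`."""
    return distance_to_flight_line(p_prev, p_curr, center) < radius


# Hypothetical example: a flight along the x-axis past an obstacle centered at (5, 3).
print(crosses_obstacle((0.0, 0.0), (10.0, 0.0), (5.0, 3.0), radius=4.0))
print(crosses_obstacle((0.0, 0.0), (10.0, 0.0), (5.0, 3.0), radius=2.0))
```

Because the test uses the whole line rather than only the segment between the two positions, it can also flag an obstacle that lies beyond the two waypoints. A segment-based variant would clamp the projection of the center onto the segment before measuring the distance.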

Table 2 Parameters of simulation environment

Table 3 The hyperparameters of the algorithm