Hypergraph convolution mix DDPG for multi-aerial base station deployment

Aerial base stations (AeBSs), as crucial components of air-ground integrated networks, can serve as edge nodes that provide flexible services to ground users. Optimizing the deployment of multiple AeBSs to maximize system energy efficiency is an actively researched topic in AeBS-assisted edge-cloud computing networks. In this paper, we deploy AeBSs using multi-agent deep reinforcement learning (MADRL). We formulate the multi-AeBS deployment problem as a decentralized partially observable Markov decision process (Dec-POMDP), taking into consideration the constrained observation range of AeBSs. The hypergraph convolution mix deep deterministic policy gradient (HCMIX-DDPG) algorithm is designed to maximize the system energy efficiency. The proposed algorithm uses the value decomposition framework to solve the lazy agent problem, and a hypergraph convolutional network (HGCN) is introduced to strengthen the cooperative relationship between agents. Simulation results show that the proposed HCMIX-DDPG algorithm outperforms the baseline algorithms in the multi-AeBS deployment scenario.


Introduction
With the exponential growth of data-driven applications and the increasing demand for high-speed and ubiquitous communication services, the integration of edge-cloud computing [1,2] and aerial wireless networks has become a cornerstone of modern communication systems [3,4]. The aerial base station (AeBS) [5,6] is a mobile base station whose communication equipment is installed on an aerial platform, such as an unmanned aerial vehicle (UAV) [7]. AeBSs can offload computationally intensive tasks to cloud servers, enabling efficient data processing, storage, and analysis. This capability enables AeBSs to provide real-time data services, such as the industrial internet of things [8,9], mobile edge computing networks [10,11], the internet of vehicles [12-14], and health detection [15,16]. With the popularity of mobile AeBSs, AeBS-assisted cloud computing is crucial for delivering diverse services in areas without available infrastructure [17]. To meet the increasing demand for higher communication rates from data-intensive applications and services in future cloud computing, network capacity [18] and system energy consumption [19] have always been the key metrics for network optimization. How to efficiently deploy AeBSs to achieve optimal system energy efficiency is a current research hotspot [20].
In the deployment of AeBSs, traditional heuristic algorithms require repetitive calculations, while deep reinforcement learning (DRL) provides a novel paradigm that accumulates and reuses experience gathered in the environment [21,22]. Recently, the development of computing systems has also led to the widespread application of DRL in wireless communication systems [23]. As the number of AeBSs that need to be controlled grows, researchers are turning to multi-agent deep reinforcement learning (MADRL) for deploying multiple AeBSs. MADRL addresses the large action space that centralized single-agent DRL faces when jointly deploying multiple AeBSs. Additionally, taking into account that an AeBS can observe the states of other AeBSs only within a certain range, recent studies have employed the decentralized partially observable Markov decision process (Dec-POMDP) [22,24] to model the deployment problem of AeBSs.
In cooperative tasks under partially observable reinforcement learning, the reward of a single agent in the MADRL framework is likely to be caused by the behavior of its teammates, which leads to the training of lazy agents with poor performance [25]. Some researchers proposed value decomposition-based multi-agent reinforcement learning to solve this lazy agent problem [26]. Through value decomposition, the contribution of each agent's action to the total value function is identified, which mitigates the lazy agent problem in multi-agent reinforcement learning.
For each agent in MADRL, gathering information from other agents is essential for enhancing the coordination of the multi-agent system. A graph convolutional network (GCN) can aggregate neighborhood information and handle the irregular or non-Euclidean nature of graph-structured data [27]. In recent years, some work has introduced GCN into MADRL to enhance the cooperation ability of agents and improve algorithm performance [28,29]. A simple graph can hardly represent the complex connections among a large number of agents, whereas a hypergraph is a higher-dimensional graph representation of data that makes up for the loss of information in the simple graph and is dedicated to describing systems with pairs of combinatorial relations. The properties of the hypergraph allow a hyperedge to carry more information about the nodes it contains. In addition, by selecting a variety of hyperedges, prior knowledge of the connections between agents can be easily incorporated, thus enhancing the cooperation between agents [30].

Related work
Early works adopt heuristic algorithms to solve the deployment problem of AeBSs. Ref. [18] presents an iterative solution that jointly optimizes the AeBS locations and the partially overlapped channel assignment scheme to maximize the throughput of a multi-AeBS system. However, a heuristic algorithm solves one specific problem instance: when the deployment environment of the AeBSs changes even slightly, the solution must be recalculated. A model trained by deep reinforcement learning selects actions according to the current observation, which avoids the repeated calculation required by heuristic algorithms. Ref. [31] proposes a DRL approach that relies on flow-level models of AeBS networks to learn the optimal traffic-aware AeBS trajectories. Ref. [32] proposes a deep Q-learning algorithm that allows AeBSs to learn the overall network state and account for the joint movement of all AeBSs to adapt their locations.
As the number of AeBSs to be deployed increases, centralized DRL with a single agent suffers from an excessively large action space, and researchers have begun to use MADRL. Ref. [33] proposes a MADRL-based approach to minimize the network computation cost while ensuring the quality of service requirements of IoT devices in the UAV-enabled IoT edge network. Ref. [34] proposes an improved clip and count-based proximal policy optimization (PPO) algorithm to solve the partially observable Markov decision process (POMDP) UAV development model. Ref. [35] uses multi-agent deep deterministic policy gradient (MADDPG) to maximize the secure capacity by jointly optimizing the trajectory of UAVs and power control. In Ref. [36], MADDPG is adopted to decide the location planning of AeBSs, and results show that the MADDPG-based algorithm is more efficient than centralized DRL algorithms in obtaining the solution.
To solve the problem of cooperation between multiple AeBSs, some researchers have introduced GCN into the deployment of AeBSs. Ref. [28] proposes a heterogeneous-graph-based formulation of relations between ground terminals and AeBSs. Ref. [29] proposes a DRL-based control solution to AeBS navigation, which enables AeBSs to fly around an unexplored target area under partial observation to provide optimal communication coverage for the ground users. Ref. [37] proposes a GCN-based trajectory planning algorithm that enables AeBSs to rebuild communication connectivity during the self-healing process. Ref. [38] proposes a GCN-based MADRL method for UAV group control, which exploits mutual interactions among UAVs and results in improved signal coverage, fairness, and reduced overall energy consumption.
Motivated by the above literature, this paper introduces the value decomposition algorithm into deep deterministic policy gradient (DDPG) algorithm and adopts hypergraph convolution to learn the cooperative relationship between AeBSs.

Motivation
In the face of the high data rate requirements of future services, enhancing the energy efficiency of AeBS-based cloud computing networks is crucial. To achieve this goal, the locations of AeBSs can be adjusted to leverage their flexible deployment and adapt to dynamic network requirements. MADRL is regarded as an effective technique for AeBS deployment due to its better performance compared to centralized DRL. However, MADRL has the following two problems in the deployment of multiple AeBSs: 1) in the MADRL framework, the reward of a single agent is likely to be caused by the behavior of other agents, leading to the training of lazy agents; 2) it is difficult to learn the cooperative relationship between AeBSs. To address the problem of lazy agents, value decomposition reinforcement learning is used to clarify the contribution of each agent and improve the performance of the multi-AeBS deployment algorithm. To strengthen the cooperation of AeBSs, hypergraph convolution is introduced into the deployment of AeBSs, and the cooperation of each AeBS is learned through the hypergraph convolution network. Therefore, in this work, we adopt hypergraph convolution mix DDPG to address the deployment optimization problem of multiple AeBSs.

Our contributions
In this paper, we study the deployment problem of AeBSs to maximize the energy efficiency of the multi-AeBS system. We propose the hypergraph convolution mix deep deterministic policy gradient (HCMIX-DDPG) algorithm to solve the optimization problem. The main contributions are listed as follows:
1. The energy efficiency maximization problem of AeBSs is modeled as a Dec-POMDP in light of the constrained observation range of AeBSs. Furthermore, to combat the issue of lazy agents, we incorporate the concept of value decomposition. This approach identifies the individual contribution of each agent's action to the overall value function, which enhances the performance of our multi-AeBS deployment algorithm.
2. To strengthen the cooperation among AeBSs, a hypergraph is used to represent the cooperation relationship of each AeBS. The output of the hypergraph convolutional network (HGCN) is the improved value that accounts for the cooperation of each agent. HGCN enhances the cooperation among agents in the MADRL algorithm and thereby improves the performance of the algorithm.
3. The simulation results show that the HCMIX-DDPG algorithm performs better than the other baseline DRL algorithms, and that value decomposition and HGCN improve the deployment performance of multiple AeBSs.

Organization
The remainder of this paper is structured as follows. In the System model section, the system model and problem formulation are shown. Then, the HCMIX-DDPG algorithm for multi-AeBS deployment is described in the Hypergraph convolution mix algorithm for AeBSs section. The simulation results are discussed in the Simulation results and discussions section, and the paper is concluded in the Conclusion section.

System architecture
We consider a multi-AeBS communication scenario where AeBSs serve ground UEs, as shown in Fig. 1. Each AeBS has an observation range, and it can observe the location information of other AeBSs within this range. In the current scenario, each AeBS is an agent, which makes decisions based on its own state and the states of other AeBSs within its observation range.

Air-to-ground channel model
AeBSs and user equipments (UEs) communicate via an air-to-ground (A2G) channel. The mean A2G path loss between UE $u$ and AeBS $m$ under line-of-sight (LoS) and non-line-of-sight (NLoS) conditions can be represented as:

$$L_{LoS}^{m,u} = L_{FS}^{m,u} + \eta_{LoS}, \tag{1}$$

$$L_{NLoS}^{m,u} = L_{FS}^{m,u} + \eta_{NLoS}, \tag{2}$$

where $L_{FS}^{m,u}$ represents the free space path loss, $L_{FS}^{m,u} = 20\log_{10}(4\pi f_c d_{m,u}/c)$, $c$ is the speed of light, $f_c$ is the carrier frequency, and $d_{m,u}$ is the distance between AeBS $m$ and UE $u$. $\eta_{LoS}$ and $\eta_{NLoS}$ refer to the mean excessive path loss under the LoS and NLoS environment, respectively.
The probability of LoS is related to the communication environment constants $\alpha$, $\beta$ and the elevation angle $\theta_{m,u}$, and can be expressed as [39]:

$$P_{LoS}^{m,u} = \frac{1}{1 + \alpha \exp\left(-\beta\left(\theta_{m,u} - \alpha\right)\right)}. \tag{3}$$

The probability of NLoS is then $P_{NLoS}^{m,u} = 1 - P_{LoS}^{m,u}$. The average path loss between UE $u$ and AeBS $m$ is:

$$L^{m,u} = P_{LoS}^{m,u} L_{LoS}^{m,u} + P_{NLoS}^{m,u} L_{NLoS}^{m,u}. \tag{4}$$

We consider AeBSs that serve ground users through millimeter wave beams. Therefore, the directional mmWave antenna gain also plays a significant role in the AeBS channel in addition to the A2G propagation path loss. In this work, we adopt the 3D mmWave beam scheduling model in Ref. [40].
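As a rough illustration of the channel model above, the following Python sketch evaluates the free space path loss, the LoS probability, and the average path loss for one AeBS-UE pair. It assumes the widely used sigmoid LoS model with the elevation angle in degrees; the function names and all numeric values in the usage note are hypothetical and are not taken from the paper's tables.

```python
import math

C_LIGHT = 3.0e8  # speed of light in m/s


def free_space_path_loss_db(distance_m, carrier_hz):
    """Free space path loss 20*log10(4*pi*f_c*d/c) in dB."""
    return 20.0 * math.log10(4.0 * math.pi * carrier_hz * distance_m / C_LIGHT)


def los_probability(elevation_deg, alpha, beta):
    """Assumed sigmoid LoS model: 1 / (1 + alpha * exp(-beta * (theta - alpha)))."""
    return 1.0 / (1.0 + alpha * math.exp(-beta * (elevation_deg - alpha)))


def mean_path_loss_db(horizontal_m, height_m, carrier_hz,
                      alpha, beta, eta_los_db, eta_nlos_db):
    """Average of the LoS and NLoS path loss, weighted by the LoS probability."""
    distance = math.hypot(horizontal_m, height_m)
    theta = math.degrees(math.atan2(height_m, horizontal_m))
    fspl = free_space_path_loss_db(distance, carrier_hz)
    p_los = los_probability(theta, alpha, beta)
    return p_los * (fspl + eta_los_db) + (1.0 - p_los) * (fspl + eta_nlos_db)
```

For example, `mean_path_loss_db(50.0, 20.0, 28e9, 9.61, 0.16, 1.0, 20.0)` gives the average loss for a UE 50 m away horizontally from an AeBS at 20 m altitude, under illustrative environment constants. Moving the UE farther away lowers the elevation angle, which reduces the LoS probability and raises the average loss.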

Capacity model
Each UE is associated with the unique AeBS with the strongest received signal, and the signal interference of other AeBSs to this UE is considered. If AeBS $m$ and UE $u$ are associated, the SINR $\xi_u$ of the signal received at UE $u$ can be expressed as: where $P_m$ is the transmit power of AeBS $m$, $M$ is the total number of AeBSs in the system, $\sigma^2$ is the thermal noise power, and $G_T$ and $G_R$ are the main lobe gains of the transmitter and receiver. In this paper, each UE selects the AeBS with the largest SINR for the association.
The capacity of UE $u$ can be expressed as: where $\tau$ is the beam alignment time, $B$ represents the channel bandwidth, $T$ is the time slot, and $\eta_u$ is the average ratio of time-frequency resources occupied by UE $u$. Following Eq. (7) in Ref. [40], $\eta_u$ is given by: where $N_b$ is the number of mmWave beams of the AeBS and $N_u$ is the number of UEs served by this AeBS.
Then the system capacity is as follows: where $U_m$ is the set of UEs served by AeBS $m$.
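The association and SINR steps can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's exact expressions: received powers are supplied directly in linear watts (antenna gains and path loss already applied), and the per-UE capacity is assumed to take the form $\eta_u (1 - \tau/T) B \log_2(1 + \xi_u)$, which is only a guess consistent with the symbols listed above. All function and parameter names are hypothetical.

```python
import math


def associate_and_sinr(rx_power_w, noise_w):
    """rx_power_w[m][u] is the power received at UE u from AeBS m (watts).

    Each UE attaches to the AeBS with the strongest received signal, and
    the signals of all other AeBSs are treated as interference.
    """
    n_bs = len(rx_power_w)
    n_ue = len(rx_power_w[0])
    serving, sinr = [], []
    for u in range(n_ue):
        column = [rx_power_w[m][u] for m in range(n_bs)]
        best = max(range(n_bs), key=lambda m: column[m])
        interference = sum(column) - column[best]
        serving.append(best)
        sinr.append(column[best] / (interference + noise_w))
    return serving, sinr


def ue_capacity_bps(sinr, bandwidth_hz, resource_share, align_time_s, slot_s):
    """Assumed form: eta_u * (1 - tau/T) * B * log2(1 + SINR)."""
    usable = 1.0 - align_time_s / slot_s  # fraction of the slot left after beam alignment
    return resource_share * usable * bandwidth_hz * math.log2(1.0 + sinr)
```

Summing `ue_capacity_bps` over the UEs attached to each AeBS, and then over all AeBSs, gives a system capacity in the sense used in this section.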

Energy consumption model
The system's energy consumption is categorized into two components: the first component is the power consumption for propelling the AeBSs, and the second component is the energy consumed for communication by the AeBSs serving UEs.
The propulsion power consumption $P_w$ [41] of the AeBS can be modeled as: where $P_0$ and $P_1$ represent the blade profile power and induced power during AeBS hovering, $v_0$ is the average rotor-induced velocity at hover, $U_t$ is the rotor blade tip speed, $d_0$ is the fuselage drag ratio, $s$ is the rotor solidity, $A$ denotes the rotor disc area, and $\rho$ stands for air density.
The communication power consumption $P_c$ of AeBS $m$ can be modeled as:

$$P_c = \sum_{u \in U_m} P_{m,u}, \tag{10}$$

where $P_{m,u}$ is the transmit power of AeBS $m$ to UE $u$. The total energy consumption of the system is modeled as follows:

$$P_{tot} = \sum_{i=1}^{n} \left[P_{w,i} + P_{c,i}\right], \tag{11}$$

where $P_{w,i}$ and $P_{c,i}$ are the propulsion power consumption and communication power consumption of the $i$-th AeBS, and $n$ is the number of AeBSs.
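The energy model can be sketched in Python as follows. The propulsion term assumes the commonly used rotary-wing UAV power model (blade profile, induced, and parasite power), which matches the symbols listed above but is not confirmed by the text to be the exact expression from Ref. [41]. The function names and any numeric values are illustrative only.

```python
import math


def propulsion_power_w(speed, p0, p1, u_tip, v0, d0, rho, s, area):
    """Assumed rotary-wing propulsion power at forward speed `speed` (m/s)."""
    # Blade profile power grows with the square of the speed.
    blade = p0 * (1.0 + 3.0 * speed ** 2 / u_tip ** 2)
    # Induced power decreases as forward speed increases.
    ratio = speed ** 2 / (2.0 * v0 ** 2)
    induced = p1 * math.sqrt(math.sqrt(1.0 + ratio ** 2) - ratio)
    # Parasite power from fuselage drag grows with the cube of the speed.
    parasite = 0.5 * d0 * rho * s * area * speed ** 3
    return blade + induced + parasite


def total_power_w(per_aebs):
    """Sum of propulsion and communication power over all AeBSs.

    `per_aebs` is a list of (propulsion_w, communication_w) pairs.
    """
    return sum(p_w + p_c for p_w, p_c in per_aebs)
```

At zero speed the sketch reduces to `p0 + p1`, the hovering power, which agrees with the description of $P_0$ and $P_1$ as the blade profile power and induced power during hovering.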

Problem formulation
Our goal in this work is to maximize the energy efficiency of the entire system. The energy efficiency of the system $J_{tot}$ is defined as follows: Consequently, the optimization problem is formulated as follows: where $J_{tot}$ is the energy efficiency in (12). C1-C3 keep the AeBSs inside the considered region, and C4 is the collision restriction. C5 ensures that no AeBS moves faster than the maximum speed $V_{max}$.
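A minimal sketch of how the objective and constraints described above could be checked in a simulator is given below. It assumes that energy efficiency is the ratio of system capacity to total power consumption (the usual definition) and that the speed in C5 is the displacement per time slot. The function names and the slot-based speed check are assumptions of this sketch.

```python
import math


def is_feasible(positions, previous, bounds, v_max, slot_s):
    """Check C1-C5 for a list of (x, y, h) tuples.

    `bounds` is ((x_min, y_min, h_min), (x_max, y_max, h_max)).
    """
    lower, upper = bounds
    for pos in positions:  # C1-C3: stay inside the considered region
        if any(c < lo or c > hi for c, lo, hi in zip(pos, lower, upper)):
            return False
    if len(set(positions)) != len(positions):  # C4: no two AeBSs coincide
        return False
    for pos, prev in zip(positions, previous):  # C5: speed limit
        if math.dist(pos, prev) / slot_s > v_max:
            return False
    return True


def energy_efficiency(total_capacity_bps, total_power_w):
    """Assumed definition: system capacity divided by total power."""
    return total_capacity_bps / total_power_w
```

In a training loop, an infeasible joint action could be rejected or clipped before the reward is computed; which of the two the authors use is not stated here.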

Hypergraph convolution mix algorithm for AeBSs
In this section, we propose the HCMIX-DDPG algorithm to decide the positions of AeBSs to achieve maximum system energy efficiency. In HCMIX-DDPG, each agent applies a deep deterministic policy gradient (DDPG) [42] to learn its individual action value. DDPG consists of two networks, the actor network and the critic network: the actor network generates the actions of the agent, and the critic network evaluates each action and generates its value. The structure of HCMIX-DDPG is presented in Fig. 2. The input of the HGCN network is the observation of each agent $o_m$ and the Q value of its action.
Reward: As this is a fully cooperative reinforcement learning scenario, the reward is the system energy efficiency in Eq. (12).

Hypergraph convolutional network
A hypergraph is a generalized graph structure that can describe the relationships and constraints between nodes more flexibly.
According to Ref. [43], the HGCN formula is defined as follows:

$$x^{(l+1)} = D^{-1/2} H W B^{-1} H^{T} D^{-1/2} x^{(l)} P, \tag{14}$$

where $x^{(l+1)}$ is the output of the hypergraph convolution, $x^{(l)}$ is the input of the hypergraph convolution, $H$ is the adjacency matrix of the hypergraph, $W$ is the weight matrix of hyperedges, $B$ is the hyperedge degree matrix, $D$ is the vertex degree matrix of the hypergraph, and $P$ denotes the weight matrix between the $l$-th and $(l+1)$-th layer.
According to Ref. [30], we build the adjacency matrix $H$ of the hypergraph. As shown in the following equation, $O$ is taken as input to generate the first part of the adjacency matrix of the hypergraph $H_1$: The final hypergraph adjacency matrix $H$ is defined as follows: where $w_{i,k}$ represents the weight of the learned connection between agents, $n$ refers to the number of agents, $m$ refers to the number of hyperedges, and $H_a$ signifies the average value of $H_1$.
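To make the hypergraph convolution concrete, the following pure-Python sketch applies one propagation step $D^{-1/2} H W B^{-1} H^{T} D^{-1/2} x P$ to a small feature matrix. It assumes the standard definitions of vertex degree (weighted sum over incident hyperedges) and hyperedge degree (number of member nodes), takes $H$ as a given input rather than learning it, and omits any nonlinearity. The helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def diag(values):
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def hypergraph_conv(x, h, edge_w, p):
    """One hypergraph convolution step.

    x: n x d node features, h: n x m node-hyperedge matrix,
    edge_w: m hyperedge weights, p: d x d_out layer weights.
    """
    n, m = len(h), len(h[0])
    # Vertex degree: weighted count of hyperedges each node belongs to.
    vertex_deg = [sum(edge_w[e] * h[v][e] for e in range(m)) for v in range(n)]
    # Hyperedge degree: number of nodes inside each hyperedge.
    edge_deg = [sum(h[v][e] for v in range(n)) for e in range(m)]
    d_inv_sqrt = diag([1.0 / math.sqrt(d) if d > 0 else 0.0 for d in vertex_deg])
    b_inv = diag([1.0 / b if b > 0 else 0.0 for b in edge_deg])
    out = d_inv_sqrt
    for mat in (h, diag(edge_w), b_inv, transpose(h), d_inv_sqrt, x, p):
        out = matmul(out, mat)
    return out
```

With two agents joined by a single hyperedge, `hypergraph_conv([[2.0], [4.0]], [[1.0], [1.0]], [1.0], [[1.0]])` averages the two inputs, which shows how the values of agents in the same hyperedge are mixed.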

Mixing network
The adjacency matrix of the hypergraph is constructed from the observations of all agents according to Eqs. (15) and (16), and the independent action value of each agent is substituted into Eq. (14) to obtain the improvement value of the cooperation relationship of all agents: where $V_i$ is the weight matrix in the $i$-th layer and $Q$ denotes the improvement value.
We employ value function decomposition in the multi-agent actor-critic framework [44]. The QMIX-DDPG module uses a deep neural network to aggregate the values of each agent into a total value. $Q_{tot}$ stands for the total value of joint actions, which can be formulated as: where $s$ is the global state and $a_n$ is the action of the $n$-th agent.
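The difference between linear and network-based value aggregation can be sketched as follows. This is an illustration in the spirit of VDN and QMIX [26,44], not the authors' implementation: the mixing weights are passed in directly instead of being produced by state-conditioned hypernetworks, and taking their absolute value is the assumed way of keeping $Q_{tot}$ monotonic in each agent's value. All names are hypothetical.

```python
import math


def elu(v):
    """Exponential linear unit, a monotonically increasing activation."""
    return v if v > 0 else math.exp(v) - 1.0


def vdn_mix(agent_qs):
    """VDN-style aggregation: the total value is the plain sum."""
    return sum(agent_qs)


def monotonic_mix(agent_qs, w1, b1, w2, b2):
    """QMIX-style two-layer mixing with non-negative weights.

    w1: n_agents x hidden, b1: hidden, w2: hidden, b2: scalar.
    """
    hidden = [
        elu(sum(abs(w1[i][j]) * q for i, q in enumerate(agent_qs)) + b1[j])
        for j in range(len(b1))
    ]
    return sum(abs(w) * h for w, h in zip(w2, hidden)) + b2
```

Because every weight is non-negative and the activation is increasing, raising any single agent's value can never lower the mixed total, so each agent's greedy action remains consistent with the joint greedy action.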
The proposed HCMIX-DDPG algorithm for AeBS deployment is shown in Algorithm 1.
Algorithm 1 HCMIX-DDPG algorithm for AeBS deployment
Our proposed HCMIX-DDPG adopts the centralized training decentralized execution (CTDE) deployment mode [45]. HCMIX-DDPG is divided into two phases, the training phase and the execution phase. In the training phase, substantial computing resources are needed to train all models, which can be provided by cloud computing. After training, each actor network is deployed to the corresponding AeBS, and the AeBS makes decisions based on its own observations to complete the deployment. The CTDE framework is especially suitable for AeBS-assisted networks since it brings less communication burden to resource-limited AeBSs [46]. The overhead of the execution phase is analyzed as follows. We assume that an actor network has $N_h^a$ hidden layers with $q_i^a$ neurons each, $i = 1, ..., N_h^a$. The time complexity of HCMIX-DDPG in each iteration can be obtained as $O\left(\sum_{i=0}^{N_h^a - 1} q_i^a q_{i+1}^a\right)$ [47]. After the actor networks are deployed on the AeBSs, the AeBSs can be commanded to move according to real-time observations until the deployment task is completed.
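The execution-phase cost expression can be evaluated with a few lines of Python. The sketch below simply sums the products of consecutive layer widths, assuming that index 0 denotes the input layer; the layer sizes in the usage note are hypothetical and are not taken from the paper's hyperparameter table.

```python
def actor_forward_cost(layer_sizes):
    """Sum of q_i * q_{i+1} over consecutive layers.

    `layer_sizes` lists the input width followed by each hidden layer width.
    """
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
```

For instance, an actor with an 8-dimensional observation and two hidden layers of 64 neurons gives `actor_forward_cost([8, 64, 64])`, that is, 8 x 64 + 64 x 64 multiply-accumulate operations per decision.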

Simulation results and discussions
The simulation is carried out in a 2 km × 2 km environment. The maximum length of each step in the direction $(x_m, y_m)$ is 100 m, the flight altitude of AeBSs is 20 m, and the observation range of AeBSs is 200 m. The simulation parameters of the environment [40,48] are listed in Table 1 and the hyperparameters of the DDPG are listed in Table 2. According to [49], the environmental parameters for the three different channel conditions are listed in Table 3.
The agents of the four deep reinforcement learning algorithms all adopt the DDPG framework, and the details of the four algorithms are as follows:
1. HCMIX-DDPG: In HCMIX-DDPG, the HGCN is introduced into the value decomposition reinforcement learning framework to strengthen the learning of the cooperation relationship of each agent, and the QMIX-DDPG network aggregates the total value.
2. QMIX-DDPG: QMIX [44] is a classical value decomposition reinforcement learning framework, which uses a deep neural network to combine the individual values of each agent into a total value.
3. VDN-DDPG: VDN [26] is a classical value decomposition reinforcement learning framework, which uses a linear function to combine the values of each agent into a total value.
4. IDDPG [50]: IDDPG is a fully distributed algorithm. In IDDPG, each AeBS makes decisions based on its own observation.
Figure 3 illustrates how network performance is affected by varying the number of UEs. Comparing our proposed scheme with the other three DRL algorithms shows that IDDPG has the worst energy efficiency. This is because IDDPG does not introduce value decomposition and therefore suffers from the lazy agent problem, which degrades the multi-AeBS deployment. While QMIX-DDPG uses a deep neural network to aggregate the values of each agent into a total value, VDN-DDPG uses a linear function to do so. HCMIX-DDPG adds a hypergraph convolution module to QMIX-DDPG and uses the HGCN module to generate, for each agent, an improved value that accounts for the cooperation relationship of agents. Simulation results show that HCMIX-DDPG has the best energy efficiency, and that QMIX-DDPG performs better than VDN-DDPG for different numbers of UEs. The comparison also shows that the energy consumption of the compared algorithms is similar, but HCMIX-DDPG improves the capacity and thereby the energy efficiency of the system.
Figure 4 illustrates how network performance is affected by varying the number of AeBSs. As the number of AeBSs increases, the energy efficiency, capacity, and energy consumption of the system all show an increasing trend. Although a larger number of AeBSs increases the total energy consumption of the system, it serves the users within the system better and improves the total system capacity, thus improving the energy efficiency of the system. Figs. 3 and 4 together show that HCMIX-DDPG performs better than the other algorithms: for different numbers of AeBSs and different numbers of UEs, HCMIX-DDPG achieves higher energy efficiency and capacity.
Figure 5 shows the energy efficiency of the system for different numbers of AeBSs under different channel conditions. The system energy efficiency is highest in the suburban environment, followed by urban, and lowest in high-rise urban. This is because suburban areas offer better channel conditions for AeBSs, while the many blockages in the high-rise urban environment reduce the system capacity and thus the energy efficiency. As the number of AeBSs increases, the energy efficiency of the system gradually increases.
Figure 6 shows the energy efficiency for different numbers of mmWave beams of the AeBS, where the number of mmWave beams is $N_b$ in Eq. (7). When the number of beams increases, the system energy efficiency is higher, owing to the increase in communication resources of the system.
Figure 7 shows the energy efficiency under different mean values of transmit power. In the simulation, the transmit power of each AeBS obeys a normal distribution with a standard deviation of 2. The figure shows that the system energy efficiency decreases as the mean value of AeBS transmit power increases. Although increasing the transmit power can increase the system capacity, it also increases the communication component of the energy consumption, which leads to a decrease in energy efficiency.

Conclusion
This paper models the deployment problem of multiple AeBSs as a Dec-POMDP and adopts the MADRL framework to maximize the energy efficiency of the system. An algorithm called HCMIX-DDPG is designed to solve the formulated problem. It combines the value decomposition framework with HGCN to address the lazy agent problem and strengthen the cooperative relationship between agents in the MADRL setting. Simulation results show that the proposed HCMIX-DDPG algorithm outperforms existing algorithms: it increases the energy efficiency of multi-AeBS systems and improves the efficiency of their joint deployment.
$$
\begin{aligned}
\max_{(x_m, y_m, h_m)} \quad & J_{tot} \\
\text{s.t.} \quad & C1: x_m \in [x_{min}, x_{max}], \ \forall m \in M, \\
& C2: y_m \in [y_{min}, y_{max}], \ \forall m \in M, \\
& C3: h_m \in [h_{min}, h_{max}], \ \forall m \in M, \\
& C4: (x_m, y_m, h_m) \neq (x_l, y_l, h_l), \ \forall m, l \in M, \ m \neq l, \\
& C5: V_m \le V_{max}, \ \forall m \in M.
\end{aligned}
$$

Reinforcement learning framework
Given that the multi-AeBS deployment involves a continuous action space, we employ the Deep Deterministic Policy Gradient (DDPG) framework for each agent. The AeBS is the agent in our HCMIX-DDPG algorithm, and its observation, action, and reward are specified as follows:
Observation: Each AeBS has an observation range. The observation of each AeBS is its own current position and the positions of the other AeBSs within its observation range, $o_m = (o_{1xm}, o_{1ym}, o_{2xm}, o_{2ym}, ..., o_{nxm}, o_{nym})$, $o_m \in O$, $m \in \{1, 2, ..., n\}$, where $o_{1xm}$ and $o_{1ym}$ respectively represent the X-axis position and Y-axis position of the AeBS itself. The other observed values represent the X-axis and Y-axis positions of the other AeBSs. When another AeBS is not in the observation range, the corresponding observation value is 0.
Action: Each agent's action is defined as the distance traveled along the X-axis and Y-axis, $a_m = \{(\Delta x_m, \Delta y_m)\}$, $a_m \in A$, $m \in \{1, 2, ..., n\}$.
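The observation and action definitions can be sketched in Python as follows. The ordering of the other AeBSs inside the observation vector, the clipping of each step to a maximum length, and the clipping of positions to the region boundary are assumptions of this sketch, as are the function names.

```python
import math


def build_observation(index, positions, obs_range_m):
    """Observation of AeBS `index`: own (x, y) first, then the others.

    AeBSs outside the observation range contribute zeros.
    """
    own = positions[index]
    obs = [own[0], own[1]]
    for j, other in enumerate(positions):
        if j == index:
            continue
        if math.dist(own, other) <= obs_range_m:
            obs.extend(other)
        else:
            obs.extend((0.0, 0.0))
    return obs


def apply_action(position, action, max_step_m, region):
    """Move by (dx, dy), limiting each step and staying inside the region.

    `region` is (min_coordinate, max_coordinate) for a square area.
    """
    low, high = region
    moved = []
    for coord, delta in zip(position, action):
        step = max(-max_step_m, min(max_step_m, delta))
        moved.append(max(low, min(high, coord + step)))
    return tuple(moved)
```

With three AeBSs at (0, 0), (100, 0) and (500, 500) and an observation range of 200 m, the first AeBS observes its own position and that of the second, while the entries for the third are zero.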

Fig. 3 Comparison of network performance metrics for different numbers of UEs: (a) energy efficiency, (b) capacity, (c) energy consumption

Fig. 5 Energy efficiency of the system for different numbers of AeBSs under different channel conditions

Fig. 7 Energy efficiency under different mean values of transmit power

Table 1
The simulation parameters of the environment

Table 2
The hyperparameters of the HCMIX-DDPG

Table 3
Channel environment parameters