COCAM: a cooperative video edge caching and multicasting approach based on multi-agent deep reinforcement learning in multi-clouds environment

The evolution of the Internet of Things technology (IoT) has boosted the drastic increase in network traffic demand. Caching and multicasting in the multi-clouds scenario are effective approaches to alleviate the backhaul burden of networks and reduce service latency. However, existing works do not jointly exploit the advantages of these two approaches. In this paper, we propose COCAM, a cooperative video edge caching and multicasting approach based on multi-agent deep reinforcement learning to minimize the transmission number in the multi-clouds scenario with limited storage capacity in each edge cloud. Specifically, by integrating a cooperative transmission model with the caching model, we provide a concrete formulation of the joint problem. Then, we cast this decision-making problem as a multi-agent extension of the Markov decision process and propose a multi-agent actor-critic algorithm in which each agent learns a local caching strategy and further encompasses the observations of neighboring agents as constituents of the overall state. Finally, to validate the COCAM algorithm, we conduct extensive experiments on a real-world dataset. The results show that our proposed algorithm outperforms other baseline algorithms in terms of the number of video transmissions.


Introduction
As the Internet of Things (IoT) technology evolves, users are becoming increasingly interconnected with their electronic devices [1]. The advent of new wireless networks such as the fifth-generation (5G) network, the proliferation of smart devices, and users' high usage of diverse applications such as video streaming, online gaming, and virtual reality have resulted in a profound surge in video traffic. According to the Cisco report [2], video traffic was predicted to account for 79% of all Internet traffic worldwide by 2022. The extensive prevalence of video traffic and the stringent quality of experience (QoE) requirements have put tremendous backhaul pressure on networks [3]. Therefore, minimizing the network resource consumption during transmission while simultaneously satisfying user demand has become one of the most critical concerns of network operators [4].
In traditional cloud environments, the service process requires moving data to remote data centers for centralized computing and storage. This leads to high network transmission latency, which can negatively impact the performance of mobile applications. To address this problem and provide reliable services for latency-sensitive applications, researchers have explored deploying small-scale cloud servers at the edge so that these edge cloud servers can provide resources closer to edge IoT devices [5][6][7]. Edge cloud servers are equipped with finite resources and can be utilized to deliver bandwidth-optimized services at the edge, thus enabling the provision of fast and immediate services [8,9]. The multi-clouds architecture, including the remote cloud and edge clouds, is a promising paradigm to improve the QoE of users and reduce energy consumption [10,11]. This potential stems from its ability to facilitate ubiquitous caching and efficient content delivery for end users, as highlighted by several studies.
During the content request phase, the network engages in content searching upon receiving a user's request. To alleviate traffic congestion, edge caching efficiently places popular files on edge cloud servers closer to their requesters. It tackles the problem of which content should be cached in the edge cloud [12]. Recent investigations have substantiated the effectiveness of collaborative caching, which has attracted considerable scholarly attention. Collaborative caching works by allowing edge clouds to collectively distribute content through internal connections. Song et al. [13] presented an adaptive cooperative caching scheme that incorporates an enhanced quantum genetic algorithm to address the energy-delay tradeoff problem. Zhang et al. [14] proposed a spatially cooperative caching strategy for a two-tier heterogeneous network. The objective of this strategy is to minimize the storage used for duplicated content while maximizing the likelihood of successful content retrieval (hit probability).
During the content delivery phase, traditional unicast mechanisms for distributing content from the remote cloud to edge clouds and user ends (UEs) result in inefficient delivery. Multicasting, on the other hand, can leverage the available network bandwidth to deliver the same content to multiple receivers, benefiting from the similarity of users' preferences for content in close geographic locations. This mechanism reduces the traffic generated during delivery by delivering the requested file through a single multicast rather than multiple unicasts [15]. Significant efforts have been devoted to video coding and multicast transmission [16][17][18][19]. For instance, Guo et al. [18] proposed a layer-based multiquality multicast beamforming scheme based on scalable video coding. Wu et al. [19] designed an adaptive video streaming scheme using named data network multicast. However, these algorithms, while addressing video coding and multicast transmission, did not consider the integration of coded multicasting with caching in a cooperative environment.
Intuitively, caching reduces latency and network bandwidth consumption by serving frequently requested content locally at the edge clouds [10,20]. Multicasting further reduces bandwidth usage by efficiently delivering popular content to multiple users simultaneously, especially in scenarios with concurrent requests for the same content. Joint consideration of caching and multicasting can enhance the overall network performance and resource utilization by dynamically allocating caching and multicasting resources based on real-time user demand and network conditions. This adaptive strategy optimizes the content availability and delivery efficiency, leading to an improved user experience. Notably, it facilitates the deployment of various latency-sensitive applications and services [21,22]. In the context of large-scale cache-enabled wireless networks, Jiang et al. [23] applied an iterative numerical algorithm to analyze and optimize caching and multicasting. Various coded multicasting mechanisms have been proposed in different scenarios [24][25][26][27]. Nevertheless, in large-scale cooperative caching scenarios, finding a balance between edge caching and multicasting to improve resource efficiency remains a challenging task.
In this paper, we exploit the benefits of mobile edge caching with multicasting in the multi-clouds environment to reduce network transmission consumption. We investigate collaborative caching among different edge clouds to effectively adapt to dynamic edge environments. We propose a multi-agent DRL-based approach for COoperative video CAching and Multicasting, named COCAM, to minimize the average transmission number, thereby enhancing video delivery efficiency. Our main contributions are summarized as follows:
• We investigate the cooperative video edge caching and multicasting issue to reduce the transmission number in the multi-clouds scenario. Moreover, we formulate the problem as a multi-agent Markov decision process (MDP).

Caching algorithms
Edge caching stores popular content locally on edge clouds, allowing them to deliver the requested content directly to users. It significantly reduces network latency and network consumption. Li et al. [28] [30] proposed a deep actor-critic reinforcement learning algorithm to address the dynamic control of caching decisions by enabling each edge to learn an optimal policy through self-adaptation. However, the existing research primarily focused on content caching policies and did not consider the content delivery process.

Multicasting algorithms
Multicast transmission is extensively utilized in edge networks, demonstrating its efficacy in enhancing network performance by reducing bandwidth, routing, and cost [31]. Damera et al. [32] constructed a new feasible architectural model to transmit the required content to the user using multicell transmission. The signal-to-noise ratio was improved using the multicell transmission, and the optimized MEC scheduling algorithm showed better performance than the existing model. Zahoor et al. [33] proposed an enhanced eMBMS network architecture that uses network function virtualization (NFV) and MEC to address the significant limitations of the standard eMBMS architecture. The proposed architecture allows the multicasting of crowdsourced live streams. Ren et al. [15] considered the fundamental issues of NFV-enabled multicast in mobile edge clouds and designed a heuristic algorithm. Qin et al. [34] studied the multicast traffic for IoT applications in edge networks under the delay-oriented network slicing problem.
Nevertheless, these works focused on the network architecture and multicast protocols without integration with the practical applications of the edge cloud servers.

Joint caching and multicasting algorithms
The utilization of multicast transmission at the base station, enabling concurrent servicing of distinct user requests for the identical file, is recognized as a highly efficacious approach for supporting the delivery of extensive content over wireless networks. This approach is regarded as an effective strategy in wireless communications to meet the constantly increasing demand for content transmission. Maddah-Ali et al. [35] used the joint encoding of multiple files and the multicasting feature of downlink channels to optimize content placement and delivery under encoded multicast. They also evaluated the caching gain and demonstrated that the joint optimization could improve the caching gain. Liao et al. [36] jointly used the benefits of multicast content delivery and collaborative content sharing to develop a compound caching technique (multicast-aware cooperative caching). He et al. [37] designed partial caching bulk transmission and partial caching pipelined transmission to reduce the delivery latency of cache-enabled multi-group multicast networks. Somuyiwa et al. [38] combined active caching and multicast transmission to model the single-user multi-request problem as an MDP and used a DRL approach to solve the problem.
Since traditional approaches are difficult to adapt to this highly diverse and dynamic environment under multi-clouds cooperative caching, we propose a COCAM-based framework to minimize the traffic consumption during the video delivery phase.

System model and problem formulation
In this section, we introduce the cooperative video edge caching and multicasting model and give concrete definitions. Then, we state the corresponding cache decision-making problem. For convenience, we summarize some key modeling parameters and notations in Table 1.

Network model
We consider the multi-clouds system, which consists of three layers: the remote cloud layer, the edge cloud layer, and the UE layer. We assume that the remote cloud provides all the requested video files $\mathcal{F} = \{1, 2, \dots, F\}$. Since video services generally fragment a video into equally sized chunks, we assume all files are unit-sized. The set of edge cloud servers is denoted as $\mathcal{N} = \{1, 2, \dots, N\}$, and the set of request time slots as $\mathcal{T} = \{1, 2, \dots, T\}$. At each time slot $t$, the edge clouds receive requests and make caching decisions. The request received by edge cloud $n$ for file $f$ is denoted as $q^f_{t,n} \in \{0, 1\}$, where $q^f_{t,n} = 1$ represents a request for file $f$ and $q^f_{t,n} = 0$ signifies no request for file $f$. A variable $x^f_{t,n}$ denotes the transmission decision, i.e., whether the requested video $f$ is transmitted from the remote cloud to edge cloud $n$ at time $t$. If not, we have $x^f_{t,n} = 0$, and $x^f_{t,n} \in (0, 1]$ otherwise. $x^f_{t,n} = 1$ means that a transmission channel is fully used by edge cloud $n$, which occurs only under unicast conditions. Otherwise, if multiple edge cloud servers share a channel under one of the multicast conditions, we assume these edge clouds share the channel equally.

Caching model
At each time $t$, we assume that only one UE under the edge cloud server $n$ will request a video. For each UE, if the requested video has been cached in the upper edge cloud, the edge cloud server can deliver it to the UE directly. Otherwise, the edge cloud server requests the file from the remote cloud.
Each edge cloud has the same maximum capacity $C$. We use a binary variable $y^f_{t,n}$ to indicate whether the requested video $f$ has been stored in edge cloud $n$ at time $t$. If yes, we have $y^f_{t,n} = 1$, and 0 otherwise. Each server stores content limited to its maximum storage capacity:
$$\sum_{f \in \mathcal{F}} y^f_{t,n} \le C, \quad \forall n \in \mathcal{N},\ \forall t \in \mathcal{T}. \tag{1}$$
After the edge cloud gets the requested video, it decides whether to cache the content or not. If the edge cloud storage is not full, we store the video directly. Otherwise, we update the cache space based on the policy.

Transmission model
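The remote cloud delivers the videos to the requesting edge clouds. Figure 1 gives four schemes in our cooperative transmission model, which are described as follows:
• Localcast (LC): If the requested video has been cached in the local edge cloud server at time $t$, the UE can fetch it from the edge cloud directly without requesting it from the remote cloud. We use $\mathcal{N}_{LC} = \{n \mid q^f_{t,n} = 1, y^f_{t,n} = 1\}$ to denote the set of edge clouds from which UEs can get videos at time $t$ through the LC scheme without fetching from the remote cloud. We have:
$$x^f_{t,n} = 0, \quad \forall n \in \mathcal{N}_{LC}.$$
As shown in the LC part of Fig. 1, $N_1$ requests $f_1$, and $f_1$ has been stored in $N_1$.
• Multicast (MC): If the requested video has not been cached in the edge cloud, then the edge cloud requests the file from the remote cloud. If other edge clouds request the same video $f$ at time $t$, these edge clouds can obtain the requested video $f$ through the MC scheme. We use $\mathcal{N}^f_{MC} = \{n \mid q^f_{t,n} = 1, y^f_{t,n} = 0, \forall n \in \mathcal{N} \setminus \mathcal{N}_{LC}\}$ to denote the set of edge clouds that can use multicast transmission to obtain the requested video $f$. Since these edge clouds share one channel equally, we have:
$$x^f_{t,n} = \frac{1}{|\mathcal{N}^f_{MC}|}, \quad \forall n \in \mathcal{N}^f_{MC}.$$
As shown in the MC part of Fig. 1, $N_2$ and $N_3$ simultaneously request $f_2$, which has not been cached.
• XOR-cast (XC): To use fewer transmissions to deliver all the data during the delivery process, we use the network coding technique, in which the transmitted content is encoded at the network nodes and then decoded at the destination. We use XOR coding. We form a special edge cloud set named the exclusive OR (XOR) set, where each edge cloud in the set has not cached its requested video but stores the video files requested by the other edge clouds in the set. We denote this set as $\mathcal{N}_{XC}$ and the video set delivered through the XC scheme as $\mathcal{G}$. The XOR set receives the XOR-encoded bit stream by one transmission. Then, each edge cloud restores its video by decoding the received bit stream with the contents stored in its cache. Since the edge clouds in the XOR set share this transmission equally, we have:
$$x^f_{t,n} = \frac{1}{|\mathcal{N}_{XC}|}, \quad \forall n \in \mathcal{N}_{XC}.$$
As shown in the XC part of Fig. 1, $N_4$ and $N_5$ simultaneously request $f_5$ and $f_4$, which have been cached not by themselves but by each other. We denote the coded XOR information as $f_4 \oplus f_5$. If there are multiple XC combinations, we choose the combination that generates the smallest number of XC sets with the participation of the same number of edge clouds. This preference is based on the effectiveness of our proposed XC approach in significantly reducing internal energy consumption during unicast transmission. While this paper does not explicitly consider the energy consumption associated with XOR operations, such operations still entail a non-negligible energy overhead. Considering a fixed number of edge clouds, our objective is to minimize the number of XC combinations to mitigate the impact of XOR energy consumption.
• Unicast (UC): When the relationship between the requests from edge clouds and the cache list does not satisfy any of the above cases, edge clouds fetch videos directly from the remote cloud by establishing a transmission channel. We denote the UC set as $\mathcal{N}_{UC}$, i.e., the requesting edge clouds that belong to none of the sets above. We have:
$$x^f_{t,n} = 1, \quad \forall n \in \mathcal{N}_{UC}.$$
The caching policy determines what will be cached in the edge clouds, and then the remote cloud classifies the transmissions based on the cache state in the edge clouds. According to the above four cases, we can formulate the joint multicast transmission and cache replacement problem, which aims to minimize the total number of transmissions from the remote cloud to the edge clouds, as:
$$\min \sum_{t \in \mathcal{T}} \sum_{n \in \mathcal{N}} \sum_{f \in \mathcal{F}} x^f_{t,n} \tag{8}$$
subject to the capacity constraint (1) and the transmission assignments of the four schemes above.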
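To make the four delivery schemes concrete, the following minimal Python sketch classifies the requests of one time slot and counts the resulting transmissions from the remote cloud. The function name `classify_requests`, the dictionary-based inputs, and the restriction of XC sets to pairs of edge clouds are illustrative assumptions, not part of the original formulation; the greedy pairing also does not search for the best XC combination.

```python
from collections import defaultdict


def classify_requests(requests, caches):
    """Label each requesting edge cloud as LC, MC, XC or UC.

    requests: dict edge_id -> requested file id (one request per edge cloud)
    caches:   dict edge_id -> set of cached file ids
    Returns (schemes, transmissions). XC sets are limited to pairs here,
    which is a simplifying assumption of this sketch.
    """
    schemes = {}
    transmissions = 0

    # Localcast: the requested file is already in the local cache.
    for n, f in requests.items():
        if f in caches[n]:
            schemes[n] = "LC"

    # Multicast: at least two remaining edge clouds miss the same file.
    by_file = defaultdict(list)
    for n, f in requests.items():
        if n not in schemes:
            by_file[f].append(n)
    for group in by_file.values():
        if len(group) >= 2:
            for n in group:
                schemes[n] = "MC"
            transmissions += 1  # one shared channel per multicast group

    # XOR-cast: pair edge clouds that each cache what the other requests.
    pending = sorted(n for n in requests if n not in schemes)
    for i, a in enumerate(pending):
        if a in schemes:
            continue
        for b in pending[i + 1:]:
            if b in schemes:
                continue
            if requests[a] in caches[b] and requests[b] in caches[a]:
                schemes[a] = schemes[b] = "XC"
                transmissions += 1  # one XOR-coded stream serves the pair
                break

    # Unicast: everything else needs its own channel.
    for n in requests:
        if n not in schemes:
            schemes[n] = "UC"
            transmissions += 1
    return schemes, transmissions
```

With six edge clouds laid out like the Fig. 1 example plus one hypothetical unicast requester, three transmissions serve all six requests instead of the five that pure unicast would need for the five cache misses.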

The COCAM approach
Our modeling problem is a mixed integer programming (MIP) problem [22], which is NP-hard. Solving MIP problems with traditional computational methods has proven challenging in real caching systems because of their low computational efficiency. Thus, we consider a learning approach. We explore the collaboration between different edge cloud servers with a multi-agent reinforcement learning-based algorithm to better adapt to dynamic edge environments.
In this section, each edge cloud operates as an independent agent, while maintaining a cooperative relationship with other edge clouds. We model the cache decision-making problem as a multi-agent extension of the Markov Decision Process (MDP) and introduce a novel multi-agent actor-critic-based caching approach. Our proposed approach aims to minimize the average number of transmissions during the request transmission process. Multi-agent reinforcement learning consists of agents and the environment. Based on the state and the reward from the environment, each agent executes an action according to its policy. Then the environment changes to a new state. An MDP is a mathematical framework for modeling sequential decision-making consisting of state, action, transition probability, and reward. Each agent learns the optimal decision-making sequence through continuous interaction with the environment. We define the basic elements of a multi-agent MDP as follows:

State
The state of agent $n$ at time $t$ is denoted as $s_{t,n} = \{y_{t,n}, q^f_{t,n}\}$, where $q^f_{t,n}$ indicates the current request demands and $y_{t,n} = \{y^f_{t,n}\}_{\forall f \in \mathcal{F}}$ denotes the caching state of edge cloud $n$. We define the neighborhood that can be observed by agent $n$ as $\mathcal{N}_n$. We use $\pi_{t,n}$ to denote the policy of agent $n$. Thus, the policies of the agents adjacent to agent $n$ can be denoted as $\pi_{t,\mathcal{N}_n}$. Each agent can observe the states and policies of its neighbors. Therefore, the joint state of agent $n$ that is fed into the input network is $\hat{s}_{t,n} = \{s_{t,m}\}_{\forall m \in \{n\} \cup \mathcal{N}_n}$.
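As an illustration of how the joint state could be assembled, the sketch below flattens each local state into a 0/1 vector (cache indicators followed by a one-hot request) and concatenates the agent's own vector with those of its neighbors. The vector encoding, the ordering (own state first), and the helper names `local_state` and `joint_state` are assumptions made for this example.

```python
def local_state(cache, request, num_files):
    """Encode s_{t,n}: cache indicators y followed by a one-hot request q."""
    files = range(1, num_files + 1)
    y = [1 if f in cache else 0 for f in files]
    q = [1 if f == request else 0 for f in files]
    return y + q


def joint_state(n, neighbors, caches, requests, num_files):
    """Concatenate agent n's local state with its neighbors' local states."""
    vector = []
    for m in [n] + list(neighbors[n]):
        vector += local_state(caches[m], requests[m], num_files)
    return vector
```

For two agents that observe each other and a catalogue of three files, the joint state has length 2 x (3 + 3) = 12, so the input dimension of the networks grows linearly with the neighborhood size under this encoding.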

Action
An agent decides which video should be replaced in the cache list based on its policy. We denote the action of agent $n$ as $a_{t,n} = v$, where $v \in \{0, 1, 2, \dots, C\}$. If $v = 0$, the requested video will not be cached. Otherwise, the $v$-th content in the cache space of edge cloud $n$ will be replaced by the currently requested video.
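The action semantics, combined with the rule from the caching model that a video is stored directly while free space remains, can be sketched as follows. Representing the cache as an ordered list of slots and the function name `apply_action` are assumptions of this example.

```python
def apply_action(cache_slots, requested, v, capacity):
    """Return the cache after handling one request with action v.

    cache_slots: list of cached file ids (slot 1 is index 0)
    v = 0 keeps the cache unchanged; v in 1..C replaces the v-th slot.
    """
    if not 0 <= v <= capacity:
        raise ValueError("action must lie in {0, 1, ..., C}")
    updated = list(cache_slots)
    if requested in updated:
        return updated  # already cached, nothing to replace
    if len(updated) < capacity:
        updated.append(requested)  # free space: store directly
        return updated
    if v > 0:
        updated[v - 1] = requested  # replace the v-th cached content
    return updated
```

With a capacity of 3, a full cache `[7, 8, 5]`, and a request for video 9, action `v = 2` yields `[7, 9, 5]`, while `v = 0` leaves the cache untouched.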

Reward
The goal is to minimize the average transmission number. We define the negative value of the transmission number as the reward:
$$r_{t,n} = -\sum_{f \in \mathcal{F}} x^f_{t,n}.$$
So the global reward is calculated as:
$$r_t = \sum_{n \in \mathcal{N}} r_{t,n}.$$
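A small numerical sketch of the reward follows. It assumes that the local reward is the negative sum of an agent's transmission shares and that the global reward is the sum of the local rewards; the aggregation by summation and the function names are assumptions of this example.

```python
def local_reward(shares):
    """shares: dict file id -> transmission share x^f_{t,n} of one agent."""
    return -sum(shares.values())


def global_reward(all_shares):
    """all_shares: dict agent id -> shares dict; sums the local rewards."""
    return sum(local_reward(shares) for shares in all_shares.values())
```

For instance, one localcast agent (share 0), two agents splitting a multicast channel (share 0.5 each), and one unicast agent (share 1) give a global reward of -2, i.e., two transmissions in that time slot.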

Network architecture
As shown in Fig. 2, each agent consists of two parts: an actor network (parameterized by $\theta$) and a critic network (parameterized by $\omega$). The actor network and critic network are essential components of a policy network. The actor network receives environmental states as input and generates corresponding action outputs, aiming to learn an optimal policy $\pi_{\theta_n}$ that maximizes the expected return or value function associated with accumulated rewards. On the other hand, the critic network serves as a value function estimation network, evaluating the quality of actions chosen by the actor network within a given state. Its primary objective is to learn a value function $V_{\omega_n}$ capable of estimating the expected return or value based on the current state and the actions selected by the actor network. The actor network consists of two fully connected hidden layers with ReLU activation functions, where the dimensions are determined by the variable state size of the cache. Its output layer is a fully connected layer utilizing a hyperbolic tangent (tanh) activation function. The critic network shares the same architecture as the actor network, comprising two fully connected hidden layers with ReLU activation functions, and its output layer consists of a single unit with a linear activation. Each network contains a target network and a primary network of the same structure. We use the target network to improve the stability and convergence of training. After the primary network has been trained for a certain number of steps, its parameters are used to update the parameters of the target network.
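The layer layout described above can be sketched in dependency-free Python. The hidden width of 16, the uniform weight initialization, and the helper names are assumptions made for illustration; a real implementation would typically use a deep learning framework and would still need to turn the actor outputs into a sampling distribution.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def linear(z):
    return z


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_mlp(n_in, hidden, n_out, out_act, rng):
    """Two ReLU hidden layers followed by an output layer with out_act."""
    sizes = [n_in, hidden, hidden, n_out]
    acts = [relu, relu, out_act]
    return [init_layer(a, b, rng) + (act,) for a, b, act in zip(sizes, sizes[1:], acts)]


def forward(mlp, x):
    """Propagate the input vector x through every layer of the network."""
    for weights, biases, act in mlp:
        x = [act(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x


rng = random.Random(0)
state_dim, cache_size = 12, 3
actor = build_mlp(state_dim, 16, cache_size + 1, math.tanh, rng)  # one output per action
critic = build_mlp(state_dim, 16, 1, linear, rng)                 # single value estimate
```

The actor has `cache_size + 1` outputs because the action space contains the "do not cache" action plus one replacement action per cache slot.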
Agents obtain their policies $\pi$ from their actor networks. The actor network is denoted as a function that seeks the optimal policy $\pi_{t,n} = \pi_{\theta_n}(a_{t,n} \mid \hat{s}_{t,n}, \pi_{t-1,\mathcal{N}_n})$, where $\theta_n$ denotes the actor network parameter of agent $n$. An agent obtains its action by random sampling from the policy distribution. We denote the parameter of the critic network for agent $n$ as $\omega_n$. Thus, $V_{\omega_n}$ denotes the value function of the critic network, trained as an estimate of the expected reward.
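We formulate the expected value equation for edge cloud $n$ as:
$$V_{\omega_n}(\hat{s}_{t,n}, \pi_{t-1,\mathcal{N}_n}) = \mathbb{E}\left[R_{t,n}\right], \quad R_{t,n} = \sum_{k \ge 0} \gamma^k r_{t+k,n},$$
where $\gamma$ denotes the discount factor. At each time $t$, the agent stores the experience $(\hat{s}_{t,n}, a_{t,n}, r_{t,n}, \hat{s}_{t+1,n})$ in the replay memory $B$.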
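We use the temporal difference (TD) algorithm to update the critic network. The loss function of the critic network can be calculated as:
$$L(\omega_n) = \frac{1}{|B|} \sum_{t \in B} \left(R_{t,n} - V_{\omega_n}(\hat{s}_{t,n}, \pi_{t-1,\mathcal{N}_n})\right)^2.$$
The actor network is updated by the policy gradient (PG) algorithm. The loss function of the actor network can be defined as:
$$L(\theta_n) = -\frac{1}{|B|} \sum_{t \in B} \left[\log \pi_{\theta_n}(a_{t,n} \mid \hat{s}_{t,n}, \pi_{t-1,\mathcal{N}_n})\, \tilde{A}_{t,n} - \beta \sum_{a} \pi_{\theta_n} \log \pi_{\theta_n}\right],$$
where $\beta$ denotes a hyperparameter that controls the entropy term, and the advantage function $\tilde{A}_{t,n} = R_{t,n} - V_{\omega_n}(\hat{s}_{t,n}, \pi_{t-1,\mathcal{N}_n})$ is the discounted reward minus a baseline.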
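The quantities involved in these updates can be illustrated numerically. The sketch below assumes the common advantage actor-critic form: a squared-advantage critic loss and an actor loss made of the negative log-probability weighted by the advantage, minus an entropy bonus scaled by β. The exact loss used by COCAM may differ, and the function names are hypothetical.

```python
import math


def discounted_returns(rewards, gamma, bootstrap=0.0):
    """Compute R_t = r_t + gamma * R_{t+1}, starting from a bootstrap value."""
    running, returns = bootstrap, []
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def a2c_losses(returns, values, action_probs, actions, beta):
    """Batch-averaged actor and critic losses (assumed standard A2C form).

    action_probs: one probability distribution over actions per sample
    actions:      index of the action actually taken in each sample
    """
    actor_loss, critic_loss = 0.0, 0.0
    for ret, value, probs, action in zip(returns, values, action_probs, actions):
        advantage = ret - value  # discounted reward minus the baseline
        critic_loss += advantage ** 2
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        actor_loss += -(math.log(probs[action]) * advantage + beta * entropy)
    n = len(returns)
    return actor_loss / n, critic_loss / n
```

In practice the advantage is treated as a constant when differentiating the actor loss, so that only the log-probability and entropy terms contribute to the policy gradient.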
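Then we update the target network parameters for each agent $n$ as:
$$\theta'_n \leftarrow \zeta \theta_n + (1 - \zeta)\theta'_n, \quad \omega'_n \leftarrow \zeta \omega_n + (1 - \zeta)\omega'_n,$$
where $\theta'_n$ and $\omega'_n$ denote the target network parameters and $\zeta$ denotes the target network update parameter. The target network is updated every $\tau$ steps. After the training is completed, each agent can obtain the most effective action strategy for the current state according to its own state in each execution step.
Fig. 2 The COCAM approach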
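A periodic target update can be sketched as below. The sketch assumes a soft (weighted-average) update in which ζ is the weight given to the primary network; the parameters are represented as flat lists, and both function names are illustrative.

```python
def soft_update(target, primary, zeta):
    """Blend primary parameters into the target: zeta * primary + (1 - zeta) * target."""
    return [zeta * p + (1.0 - zeta) * t for t, p in zip(target, primary)]


def maybe_update_target(step, tau, target, primary, zeta):
    """Apply the soft update only every tau steps; otherwise keep the target."""
    if step % tau == 0:
        return soft_update(target, primary, zeta)
    return list(target)
```

Setting `zeta = 1.0` turns this into a hard copy of the primary parameters, which is the other common way of refreshing a target network.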

Algorithm 1
The COCAM algorithm is given in Algorithm 1. Each local agent collects experience tuples by following the current policy until enough samples are collected for batch updating (lines 8 to 15). Then a batch is sampled randomly to update the actor and the critic networks (lines 17 to 20). Every $\tau$ steps, the target network is updated (lines 21 to 23).

Experiment setup
We conduct experiments on a real-world dataset from iQIYI, which contains 300,000 individual videos watched by 2 million users over two weeks. We randomly select 10,000 records from it. Figure 3 illustrates a descending order trend in video request preferences observed in the iQIYI dataset. The popularity distribution of videos exhibits notable skewness, adhering to a Zipf distribution. This implies that a small subset of highly popular videos contributes the majority of the access volume, while a large number of other videos receive minimal attention. Popular videos are frequently accessed, necessitating regular updates to their cached content. Conversely, a substantial proportion of less popular videos are rarely accessed, rendering them ineffective for caching purposes. However, despite their limited popularity, these less popular videos still contribute to users' demand. Therefore, it becomes imperative to design an adaptive cooperative caching and multicasting strategy that captures the distribution and dynamics of video popularity. We divide the dataset into 30 edge areas based on geographic information with the K-means algorithm [39]. We select 20 of them to deploy edge cloud servers (i.e., agents) that provide the video service for users. By default, we set the cache size to 50. We assume that each agent can observe the states of all the other agents from the environment. The key experimental parameters are listed in Table 2.
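For readers without access to the iQIYI trace, a synthetic request sequence with a Zipf-like popularity profile can be generated as in the sketch below. The skew parameter `alpha`, the catalogue size, and the function names are hypothetical choices for illustration and are not fitted to the dataset.

```python
import random


def zipf_weights(num_files, alpha):
    """Normalized popularity of each rank: proportional to 1 / rank**alpha."""
    raw = [1.0 / (rank ** alpha) for rank in range(1, num_files + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_requests(num_requests, num_files, alpha, seed=0):
    """Draw a request trace where file id equals popularity rank."""
    rng = random.Random(seed)
    weights = zipf_weights(num_files, alpha)
    return rng.choices(range(1, num_files + 1), weights=weights, k=num_requests)
```

A larger `alpha` concentrates the requests on fewer videos, which generally makes caching easier; a value near zero approaches uniform popularity.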

Comparisons and results
The contents in different edge cloud servers are related to each other in multicast delivery, leading to the tendency of multicast caches to store similar contents. In contrast, for cooperative caching, the cached contents in different units should be mutually exclusive for better utilization of the limited storage space. We balance these two opposing tendencies by using multi-agent reinforcement learning.
Figure 4 shows the variation of the transmission number of COCAM with increasing training episodes. According to Eq. (8), we measure the performance of our proposed algorithm using the average number of transmissions during the entire request process. The average number of transmissions reflects the efficiency of multicast transmission, which is affected by the caching decision. A lower average number of transmissions means fewer channel resources and higher multicast transmission efficiency for the same requests, which can effectively relieve network transmission pressure.
To evaluate the performance of the COCAM algorithm, we compare it with other algorithms in different cases.

Comparison with non-cooperative caching algorithms
In Fig. 5, we compare the COCAM algorithm with non-cooperative caching algorithms under cooperative transmission in terms of the number of transmissions.
• LRU [40]: The new content replaces the cached content that has been least recently requested.
• LFU [40]: The new content replaces the cached content that has been least frequently requested.
• FIFO [41]: The new content replaces the cached content that was stored earliest.
• LeCaR [42]: It adopts the LRU or LFU algorithm to update the cache according to weights adapted by a regret minimization technique.
• ARC [43]: It dynamically adjusts the size of two queues and performs cache updates based on the LRU algorithm.
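The three simplest baselines can be sketched in a few lines each. These are generic textbook versions written for illustration, not the implementations used in the experiments; in particular, the LFU variant below keeps request counts for evicted videos, which is one of several possible conventions, and all three assume a capacity of at least one.

```python
from collections import Counter, OrderedDict, deque


class LRUCache:
    """Evicts the least recently requested video."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def request(self, f):
        if f in self.items:
            self.items.move_to_end(f)  # mark as most recently used
            return True
        if len(self.items) >= self.capacity:
            self.items.popitem(last=False)  # drop the oldest entry
        self.items[f] = True
        return False


class FIFOCache:
    """Evicts the video that was stored earliest, regardless of hits."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.members = set()

    def request(self, f):
        if f in self.members:
            return True
        if len(self.queue) >= self.capacity:
            self.members.discard(self.queue.popleft())
        self.queue.append(f)
        self.members.add(f)
        return False


class LFUCache:
    """Evicts the cached video with the fewest requests so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = Counter()
        self.members = set()

    def request(self, f):
        self.counts[f] += 1
        if f in self.members:
            return True
        if len(self.members) >= self.capacity:
            victim = min(self.members, key=lambda g: self.counts[g])
            self.members.remove(victim)
        self.members.add(f)
        return False
```

Each `request` call returns `True` on a cache hit, so the hit count of a trace is simply the sum of the returned values.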
In these caching algorithms, each edge cloud server individually caches the content based on its caching decision without combining the cooperative caching among the edge clouds.
Figure 5a shows the comparison result under different numbers of requests. The number of requests ranges from 300 to 1500. Our COCAM algorithm performs better than the other baselines, with an average improvement of 2% to 15% in global benefits. Besides, the variations in the number of requests hardly affect the performance except for the LRU algorithm. This is because LRU works better for popular content and tends to lead to cache pollution in smooth datasets. Figure 5b shows the performance comparison under different edge cloud cache sizes. The cache size ranges from 30 to 90.
From Fig. 5b, it is observed that the transmission number decreases as the edge cloud cache size increases for all methods. Since the requested videos are more likely to be hit locally or served by a multicast transmission as the cache capacity increases, our COCAM algorithm performs better than the other baselines.
Figure 5c shows the comparison under different numbers of edge clouds. We set the number of edge cloud servers ranging from 5 to 25 with a cache size of 50. We compare the results after 1500 requests. We can see that COCAM achieves the minimum transmission number. The performance of our algorithm is significantly better

Comparison with cooperative caching algorithms
In Fig. 6, we compare the COCAM algorithm with the A2C algorithm, which applies cooperative caching under cooperative transmission. A2C [44]: This algorithm uses a single-agent advantage actor-critic algorithm to select the action with the best reward.
As seen in the figure, the curves of the two cooperative caching algorithms converge, with COCAM significantly outperforming the A2C algorithm. Compared to the A2C algorithm, our proposed algorithm results in an average improvement of 4%. This is mainly because COCAM yields more intelligent decision-making that learns the dynamic request pattern based on the global state. The learning-based algorithm adapts well to the multicast environment and is not significantly affected by the variations in the number of edge clouds. With the cooperation of different agents, COCAM shows better and more stable performance.
The two altered algorithm curves are closer in results, indicating that the MC scheme is less effective on this dataset. This phenomenon can be attributed to the observation that users within the same region tend to have similar request preferences, while their activities of accessing the same content may vary across different time slots. It indicates that our XC scheme can effectively leverage this insight to achieve superior performance in the integrated caching and multicasting scenario.

Conclusion
In this paper, we have proposed a joint cache replacement and multicast transmission strategy in the multi-clouds scenario. This strategy could reduce the transmission number efficiently for video delivery. We have designed a multi-agent actor-critic algorithm named COCAM, enabling multiple edge clouds to cooperate to achieve intelligent caching decisions. In addition, we have conducted experiments on a real-world dataset. The evaluation results have shown that our COCAM algorithm could reduce the average transmission number through cooperation between different agents compared to other baselines. In our future work, we will further enhance reinforcement learning algorithms to achieve improved adaptation in resource-constrained and bandwidth-limited multi-clouds environments at a large scale.
$\mathcal{N}_{LC}$ denotes the set of edge clouds from which UEs can get videos at time $t$ through the LC scheme without fetching from the remote cloud. As shown in the LC part of Fig. 1, $N_1$ requests $f_1$, and $f_1$ has been stored in $N_1$.
• Multicast (MC): If the requested video has not been cached in the edge cloud, then the edge cloud requests the file from the remote cloud. If other edge clouds request the same video $f$ at time $t$, these edge clouds can obtain the requested video $f$ through the MC scheme.

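Table 1 Summary of important notations
Notation | Definition
$\beta$ | The hyperparameter of the entropy term
$\gamma$ | The discount factor
$N, \mathcal{N}$ | The number and set of edge clouds
$F, \mathcal{F}$ | The number and set of videos
$\mathcal{G}$ | The video set through the XC scheme
$q^f_{t,n}$ | The request for file $f$ received by the edge cloud $n$ at time $t$
$x^f_{t,n}$ | The variable whether the requested video $f$ is transmitted from the remote cloud to edge cloud $n$ at time $t$
$y^f_{t,n}$ | The variable whether the requested video $f$ has been stored in edge cloud $n$ at time $t$
$\tilde{A}_{t,n}$ | The advantage function

$\mathcal{N}^f_{MC}$ denotes the set of edge clouds that can use multicast transmission to obtain the requested video $f$. As shown in the MC part of Fig. 1, $N_2$ and $N_3$ simultaneously request $f_2$, which has not been cached.
• XOR-cast (XC): We form a special edge cloud set named the exclusive OR (XOR) set, where each edge cloud in the set stores the video files requested by the other edge clouds.
For the LC and UC schemes, $x^f_{t,n} = 0, \forall n \in \mathcal{N}_{LC}$ and $x^f_{t,n} = 1, \forall n \in \mathcal{N}_{UC}$.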

Fig. 3 Number of requests of a content versus its rank on the iQIYI dataset

Fig. 4 The values of transmission number in the training process of COCAM

Figure 7 shows the performance of multicast transmission and coding transmission during the delivery phase in the cooperative caching scenario. COCAM-w/o-MC&XC: We design COCAM-w/o-MC&XC by using COCAM without the MC and XC parts. COCAM-w/o-XC: We design COCAM-w/o-XC by using COCAM without the XC part. The experimental results illustrate that our proposed COCAM algorithm works better than the design-altered COCAM algorithms. This shows that our proposed MC and XC schemes effectively reduce the transmission number.

Fig. 6 Performance comparison with cooperative caching algorithms

Table 2 Simulation parameters