When wavelet decomposition meets external attention: a lightweight cloud server load prediction model
Journal of Cloud Computing volume 13, Article number: 133 (2024)
Abstract
Load prediction tasks aim to predict the dynamic trend of future load based on historical performance sequences, which is crucial for cloud platforms to make timely and reasonable task scheduling decisions. However, existing prediction models are limited in capturing complicated temporal patterns from load sequences. Besides, the frequently adopted global weighting strategy (e.g., the self-attention mechanism) in temporal modeling schemes has quadratic computational complexity, hindering the immediate response of cloud servers in complex real-time scenarios. To address the above limitations, we propose a Wavelet decomposition-enhanced External Transformer (WETformer) to provide accurate yet efficient load prediction for cloud servers. Specifically, we first incorporate the discrete wavelet transform to progressively extract long-term trends, highlighting the intrinsic attributes of temporal sequences. Then, we propose a lightweight multi-head External Attention (EA) mechanism to simultaneously consider the inter-element relationships within load sequences and the correlations across different sequences. Such an external component has linear computational complexity, mitigating the prevalent encoding redundancy and enhancing prediction efficiency. Extensive experiments conducted on Alibaba Cloud’s cluster tracking dataset demonstrate that WETformer achieves superior prediction accuracy and the shortest inference time compared to several state-of-the-art baseline methods.
Introduction
With the rapid advancement of information technology, cloud computing has become integral to modern information infrastructure [1, 2]. Load is a crucial metric for assessing the performance and efficiency of cloud platforms. It can be recorded as time series data through monitoring tools, facilitating the analysis of computing, storage, and network resource status. In the operations and management of cloud platforms, load prediction involves a thorough analysis of current and historical usage patterns [3] of resources to predict future trends, thereby enabling efficient and rational resource planning [4,5,6]. Precise load prediction assists cloud platforms in formulating effective resource allocation strategies [7], ensuring the stable operation of data centers [8, 9]. Besides, prediction efficiency is another significant factor that affects the stability of cloud platforms. Efficient prediction allows for timely responses to workload changes, providing data centers with real-time and flexible scheduling strategies to ensure continuous and reliable service delivery. Hence, creating a lightweight solution that reduces the consumption of computing resources [10] and enhances the overall operational efficiency of data centers [11, 12] is an emerging yet challenging task.
Load trends in cloud data centers typically exhibit randomness and volatility, containing a mix of linear and nonlinear characteristics [13]. Traditional cloud load prediction approaches [14] primarily rely on statistical or machine learning techniques [15]. These methods offer rapid convergence rates and low computational complexity. However, they necessitate sequences with definite periodicity or regularity and only achieve sub-optimal performance when modeling nonlinear and non-stationary load sequences. With the advancement of deep learning in time series prediction, researchers have shifted their interest to Recurrent Neural Networks (RNNs) [16] and their variants (e.g., Long Short-Term Memory (LSTM) [17] and Gated Recurrent Unit (GRU) [18]) to process load sequences [19]. RNNs excel in capturing the extensive nonlinear relationships within sequences and are adept at handling contextual dependencies. Nevertheless, they are prone to getting trapped in local optima when addressing long-term sequence prediction, and errors can accumulate as sequence length increases. Subsequently, scholars have adopted Transformer-based methods [20] to model the global correlations within sequences [21, 22]. These approaches employ multi-head self-attention mechanisms as their core architecture, which demonstrates exceptional performance in capturing long-range dependencies and interactions in sequential data. However, the quadratic space and time complexity of the self-attention mechanism inflates computational consumption and training duration [23], hindering efficient real-time processing of long time series data and limiting their application in large-scale cloud data centers. Furthermore, load sequences in cloud data centers are influenced by a combination of interrelated factors, such as user types, task complexity, and server operational hours [24]. Transformer-based methods, which primarily extract feature information on the temporal scale of sequences, struggle to uncover the intrinsic relationships among multiple factors hidden within load sequences [25].
To address the limitations of existing methods, we propose a lightweight load prediction model named WETformer. It has an encoder-decoder structure that contains two significant components, i.e., the wavelet decomposition module and the multi-head External Attention (EA) module. Specifically, we introduce a multi-scale wavelet transformation to decompose the forecasting sequence into high-frequency and low-frequency components. Thus, WETformer can separately measure the slowly varying parts and the rapidly changing aspects of the sequence (i.e., the global preferences and the local features). Then, inspired by the successful application of EA in object detection tasks, we propose a lightweight multi-head EA mechanism to reduce the computational complexity while considering the global correlations of load sequences. The encoder progressively extracts internal interactive information via alternating decomposition and feature extraction, while the decoder models the trends from the hidden variables and fuses them with the separated detailed information. Such a structure allows WETformer to accurately capture the temporal evolution and interrelations among all the sequences. The main contributions of this paper can be summarized as follows:
- We propose a lightweight yet effective load prediction model, namely WETformer. This model integrates wavelet transformation and an external attention mechanism, enabling fast and accurate predictions of load conditions.
- We employ wavelet transform techniques to decompose the load sequence into different frequencies, highlighting the trending part of the load. Then, we exploit the high-frequency information to capture the temporal evolution from the load sequences.
- We integrate the EA mechanism into the Transformer architecture, significantly reducing computational complexity and conserving prediction inference time.
- Extensive experiments conducted on three real-world datasets demonstrate the superiority of WETformer in prediction accuracy and lightweight performance compared with several state-of-the-art baseline methods.
The rest of the paper is organized as follows. The “Related work” section provides a comprehensive review and analysis of the current state of research and the challenges faced in cloud platform load prediction, delineating the rationale behind the approach presented in this paper. The “Cloud platform load prediction model” section focuses on establishing the proposed model and algorithms, along with a discussion of their properties. The “Experiment and result analysis” section demonstrates the feasibility and effectiveness of the proposed method through experimental validation. Finally, the “Conclusion” section concludes the paper and outlines future research directions.
Related work
To address the challenge of load prediction in cloud computing environments, researchers have proposed various methods aimed at enhancing prediction accuracy to optimize resource allocation and improve system performance. Generally, these methods fall into three categories: statistical learning, RNNs, and Transformers.
Statistical learning prediction methods
In response to the highly variable cloud loads, Gupta et al. [26] proposed an online adaptive method for predicting cloud resource usage time series. This method converts the batch-processing Autoregressive Integrated Moving Average (ARIMA) model into an online model to handle the streaming nature of time series data. Online gradient descent is then used to update the model parameters at each step to cope with the impact of error propagation in iterative multi-step predictions. Finally, fractional differencing is used to capture long-term dependencies in the data. Building on a deep understanding of the characteristics of cloud computing tasks, Zhong et al. [27] introduced a cloud computing load prediction model based on Wavelet Support Vector Machines (WSVM). This model replaces the kernel function of SVM with wavelet functions, combining the time series decomposition features of wavelet transform with the nonlinear regression analysis of SVM to enhance the accuracy of load sequence regression analysis. Kumar et al. [28] proposed a load prediction model that utilizes neural networks and an adaptive differential evolution algorithm. The model learns and extracts patterns from the load, which are then used for further predictions. It employs an evolutionary approach for training, using a set of solutions with uniform distribution to explore the solution space from multiple directions, thus avoiding the influence of initial solution selection. Rahmanian et al. [29] presented an integrated prediction algorithm based on learning automata, which combines the predictions of multiple models, each assigned a weight based on its performance. The learning automata theory is applied to assess the feedback of each prediction model, increasing the weight of models that perform well in predictions and decreasing the weight of those that do not. The final prediction is a weighted combination of the individual model predictions. Gao et al. [30] proposed a clustering-based load prediction method. This method first clusters all tasks into multiple categories, grouping tasks with similar load patterns together. It then trains prediction models for each category to capture the distinct features within each task category and uses the corresponding model for each task to predict its load.
RNNs prediction methods
Among the various methods in cloud load prediction, RNNs have received considerable attention and application due to their outstanding mechanism for learning temporal features and their ability to handle nonlinear relationships. Cheng et al. [31] proposed a novel hybrid load prediction method by combining GRU models with Exponential Smoothing (ES). This method first obtains intermediate results from the GRU model predictions and then applies exponential smoothing to these results to refine the prediction accuracy. Zhu et al. [32] integrated an RNN model with an attention mechanism to introduce an attention-based LSTM encoder-decoder method for load prediction. This method maps historical load sequences into a fixed-length vector through an encoder network, extracting the order and context features of the historical load data. The attention mechanism is incorporated into the decoder network, which maps the context vector back to a sequence for the prediction of batch loads. The raw load and resource usage time series data in cloud platforms often contain noise caused by physical machine failures and other abnormal conditions. To mitigate the impact of such anomalies on prediction outcomes, Bi et al. [33] applied the Savitzky-Golay filter to smooth the load sequences, thereby eliminating outliers and noise. Subsequently, they developed the BG-LSTM prediction model for load and resource usage time series by integrating Bidirectional Long Short-Term Memory networks (BiLSTM) with Grid Long Short-Term Memory networks (GridLSTM). In this model, the BiLSTM layer is tasked with analyzing the time series surrounding the current period, capturing bidirectional dependencies, and encoding information from the reverse period into the current one. The GridLSTM layer then examines the time series across the depth dimension, yielding outputs that reflect both the temporal and frequency domains of the sequence. Predić et al. [34] employed variational mode decomposition to break down intricate cloud load sequences, extracting meaningful patterns from the non-stationary host load data. They then utilized these decomposed modes to train both regular RNN models and RNN models enhanced with attention mechanisms. To refine these models, they optimized the hyperparameters using particle swarm optimization. Dogan et al. [35] leveraged the Discrete Wavelet Transform (DWT) to decompose the input data into various frequency subbands. They introduced an attention mechanism to capture the relevant features within the load sequence, which were fed into the BiGRU network.
Transformer prediction methods
The Transformer is a deep learning model based on the self-attention mechanism, widely applied in various sequence generation tasks. The core principle of this approach is to leverage self-attention mechanisms to capture the interdependencies among elements within a sequence, emphasizing salient pattern features. This enables the model to automatically balance the local relevance of any input and assign greater weights to elements with higher correlation. Qi et al. [36], to predict the demand for various tasks in data centers for better resource allocation and task scheduling, proposed Performer, a large-scale time series prediction model based on the Transformer. This model treats each load as a local dataset, with all worker threads collaborating to train a global prediction model while individual threads train local models. By combining global and local models with an encoder-decoder architecture and employing self-attention, the model achieves good prediction accuracy with lower computational costs. Wu et al. [37] proposed a long-term load prediction framework. The framework employs the Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) method to perform a stepwise decomposition of the input time series data, extracting high-frequency and low-frequency features. Subsequently, the low-frequency features are recursively added to the high-frequency features to generate new inputs. These processed features are then fed into the variant of the Transformer model, known as the Informer, to predict long-term cloud loads. To effectively handle univariate cloud resource load time series with multidimensional latent factors, Zhu et al. [38] propose the Variational Model Decomposition-based Sample Entropy-optimized Transformer (VMDSETformer) for cloud resource scheduling. This approach employs variational model decomposition to break down the time series, utilizing sample entropy calculations to reconstruct subsets of sequences. 
Subsequently, a Transformer-like framework equipped with a multi-head self-attention mechanism is employed to learn deep features and obtain encoded representations for each component sequence. In pursuit of developing a highly accurate cloud load prediction model with reduced inference overhead, Shivani et al. [39] propose the WGAN-gp-Transformer, a novel time series prediction model inspired by the Transformer network and enhanced Wasserstein Generative Adversarial Networks (GANs). This method employs the Transformer network as the generator and employs a Multilayer Perceptron (MLP) as the critic. Compared to the conventional LSTM model, the WGAN-gp-Transformer yields lower prediction errors and demonstrates enhanced prediction efficiency.
Comprehensive analysis
In our research, we conducted a systematic comparison and analysis of the proposed method against existing ones, focusing on four main aspects: (1) the effectiveness in addressing the error accumulation and vanishing gradient issues prevalent in long sequence prediction; (2) the capacity to conduct detailed analysis of the diverse potential patterns or components within the load sequences; (3) the extent to which these methods account for the inter-sequence and inter-batch correlations; and (4) the degree to which the model is lightweight, specifically in terms of reduced inference time. The comparison outcomes are presented in Table 1.
Load prediction models based on statistical learning have been widely applied due to their relatively low computational complexity. However, with the advancement of cloud computing technology and the deepening of its applications, these models exhibit clear limitations in handling the nonlinear relationships and dynamic changes in complex cloud environments [17]. Firstly, most statistical cloud load models are based on simplified assumptions, leading to significant deviations between model predictions and actual operational conditions. Secondly, statistical learning models face difficulties in dealing with the interdependencies and coupling relationships between resources [19]. For instance, a surge in demand for computational resources might necessitate additional storage and network bandwidth, a factor often overlooked in statistical models. RNNs and their variants have shown significant performance advantages in cloud load prediction but still face challenges in practical applications. For example, when handling long sequence data, RNNs may encounter issues with vanishing or exploding gradients, which hinder the model’s effective capture of long-distance dependencies. Additionally, the computational process of RNNs is sequentially dependent, meaning that each time step must wait for the previous one to complete, which limits the model’s parallel processing capabilities [38]. Transformer models, with their global modeling capabilities through the self-attention mechanism, can capture long-distance dependencies in time series data and offer improved prediction accuracy compared to traditional models and RNNs. However, their computational complexity is proportional to the square of the input sequence length, leading to a sharp increase in computational costs when dealing with longer sequences, thus extending the model’s response time [23]. Moreover, the high computational complexity results in increased memory resource consumption, limiting the model’s widespread adoption and application in resource-constrained environments.
To address the aforementioned challenges, this paper leverages the Transformer architecture, incorporating a wavelet decomposition module to extract feature information from load sequences across different frequency domains. Subsequently, we introduce a lightweight EA module to replace the self-attention mechanism. This design aims to achieve precise prediction while reducing the model’s parameter count, thereby enhancing its suitability for real-world cloud computing environments.
Cloud platform load prediction model
In this section, we explore the characteristics of loads in large-scale cloud platforms, which host diverse types of cloud tasks. Each task requires various resources, such as CPU, memory, and I/O, during its execution. The cloud platform monitors the resource usage of each server and stores this information in the form of time-series logs. Analyzing these logs allows for a deeper understanding of the temporal characteristics of the load, enabling us to predict trends in server resource utilization and allocate resources more efficiently for user tasks [40]. The dynamic changes in cloud data center loads are influenced by multiple factors, such as time of day and network bandwidth, each with varying frequencies of occurrence. To analyze the relationship between different potential factors and the load, and to highlight the long-term trend information of the load, we apply wavelet transform to decompose the load sequence into sub-sequences of different frequencies. Subsequently, integrating a lightweight EA module with the Transformer framework, we have developed the WETformer model for predicting load sequences. This model is designed to analyze the temporal coupling and correlations between sub-sequences, as well as their trend variations.
The cloud load prediction model is designed to predict future resource usage and trends by analyzing historical cloud data. Let \(X_{in}^{s,r}(t)=\left[x_{1}^{s,r},x_{2}^{s,r},\ldots ,x_{T}^{s,r}\right]\) represent the historical load sequence, where T is the length of the sequence, \(s\in \{s_{1},s_{2},\ldots ,s_{n}\}\) denotes the server index in the cloud computing cluster, and \(r\in \{CPU,MEM,IO,DISK,\ldots \}\) indicates the type of resource for each server. Define \(X_{out}^{s,r}(t)=\left[x_{T+1}^{s,r},x_{T+2}^{s,r},\ldots ,x_{T+\lambda }^{s,r}\right]\) as the future load sequence that the model needs to predict, with \(\lambda\) being the prediction horizon. The problem of load sequence prediction can then be formalized as follows:
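One compact way to state this mapping, writing \(f_{\Theta }\) for the prediction model with learnable parameters \(\Theta\) (notation assumed here for illustration), is:

\[ X_{out}^{s,r}(t)=f_{\Theta }\left(X_{in}^{s,r}(t)\right) \]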
The important symbols and their explanations used in this article are shown in Table 2.
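The split of a monitored load trace into a history window of length T and a prediction target of length \(\lambda\) can be sketched as below. This is a minimal illustration for a single server and resource type; the function name `make_windows` and the sample values are hypothetical and not taken from the paper.

```python
def make_windows(series, T, horizon):
    """Split one load sequence into (history, future) pairs.

    series  : list of load samples for a single server/resource pair
    T       : length of the history window (X_in)
    horizon : prediction horizon lambda (length of X_out)
    """
    pairs = []
    for start in range(len(series) - T - horizon + 1):
        x_in = series[start:start + T]                     # x_1 .. x_T
        x_out = series[start + T:start + T + horizon]      # x_{T+1} .. x_{T+lambda}
        pairs.append((x_in, x_out))
    return pairs


# Made-up CPU utilisation values, for illustration only.
cpu_load = [0.31, 0.35, 0.40, 0.38, 0.42, 0.47, 0.45, 0.50]
pairs = make_windows(cpu_load, T=4, horizon=2)
```

Each pair supplies one training or evaluation example: the model sees the first element and is asked to produce the second.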
Architecture of WETformer
The core idea of WETformer is to develop a lightweight and high-precision solution for cloud platform load prediction, as depicted in the overall architecture shown in Fig. 1.
The WETformer model, comprising an encoder and a decoder, integrates several wavelet decomposition modules and EA modules. The primary task of the encoder is to model the detailed information within the load sequence. It retains the high-frequency components of the load sequence and eliminates the low-frequency components through the wavelet decomposition modules, while utilizing the EA modules to extract detailed sequence information. In the decoder section, the wavelet decomposition module continuously separates the detail and trend information of the load sequence and extracts the trend component from the intermediate variables. Subsequently, the decoder fuses the feature information extracted by the encoder to produce the prediction outcomes.
Encoder
Wavelet decomposition module
The wavelet decomposition module employs wavelet transform techniques to decompose a sequence into its high-frequency and low-frequency components. Wavelet analysis [41] is a time-frequency analysis method for signals, capable of characterizing the local features of a signal in both the time and frequency domains. It can project signals onto different frequency spaces. The low-frequency component contains the trend characteristics of the sequence and offers high frequency resolution, while the high-frequency component encapsulates the detailed features of the sequence with high temporal resolution. This mechanism inherently addresses the issue of aliasing among multiple potential factors within the signal. The formula for the discrete wavelet transform is as follows:
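In its commonly used form, assuming a scale base \(a_0>1\) and a translation step \(b_0>0\) (the dyadic choice \(a_0=2\), \(b_0=1\) is typical; both symbols are introduced here for illustration), the wavelet family can be written as:

\[ \psi _{j,k}(t)=a_0^{-j/2}\,\psi \left(a_0^{-j}t-kb_0\right),\quad j,k\in \mathbb {Z} \]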
where \(\psi (\cdot )\) represents the mother wavelet, and \(\psi _{j,k}(t)\) represents the wavelet transform function. The discrete wavelet family is a collection of scaled and translated copies of the mother wavelet, with j controlling the scale and k controlling the translation. Applying a multi-scale discrete wavelet transform to cloud load data can enhance the temporal connection of the load data. When the discrete wavelet transform is applied to the sequence X(t), the transformation coefficient is obtained by the following formula:
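Under the usual inner-product definition, and writing \(W_{j,k}\) for the coefficient (a symbol assumed here), this reads:

\[ W_{j,k}=\int _{-\infty }^{+\infty }X(t)\,\overline{\psi _{j,k}(t)}\,dt \]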
where \(\overline{\psi _{j,k}(t)}\) is the conjugate function of \(\psi _{j,k}(t)\).
As depicted in Fig. 2, the original time series X(t) is subject to wavelet decomposition, resulting in low-frequency coefficients, denoted as \(ca_i\), and high-frequency coefficients, denoted as \(cd_i\). The low-frequency component can be recursively decomposed into \(ca_{i+1}\) and \(cd_{i+1}\), with this decomposition process being iteratively applied. Each set of low-frequency or high-frequency coefficients, when combined with the wavelet transform functions, can be reconstructed on its own (single-branch reconstruction) into a low-frequency or high-frequency signal component, as illustrated in the right panel of Fig. 2. Additionally, the last layer of low-frequency components is fused with all high-frequency components to reconstruct the original signal. The wavelet reconstruction formula is as follows:
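Consistent with this description, the reconstruction can be written as the last trend component plus all detail components:

\[ X(t)=a_n+\sum _{i=1}^{n}d_i \]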
where \(d_n\) is the high-frequency signal component reconstructed by a single branch from the nth-layer decomposition coefficients, and \(a_n\) is the low-frequency signal component reconstructed by a single branch from the nth-layer decomposition coefficients. The wavelet decomposition module of the WETformer performs an n-level decomposition of the signal, followed by a reconstruction process that generates n levels of detail (high-frequency) components and one level of trend (low-frequency) component. The high-frequency components across the n levels are then fused. Within this work, \(D,A=\widetilde{WD}(X(t))\) is used to represent the wavelet decomposition module. Here, \(D=d_1+d_2+\cdots +d_n\) represents the high-frequency fusion component, and \(A=a_n\) represents the low-frequency component.
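The decompose-then-fuse procedure described above can be sketched as follows. This is an illustrative sketch only: it assumes the Haar wavelet (the paper does not state here which mother wavelet is used), requires the sequence length to be divisible by \(2^n\), and all function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_step(x):
    """One Haar analysis step: returns (approximation, detail) coefficients."""
    ca = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    cd = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return ca, cd


def haar_inverse_step(ca, cd):
    """One Haar synthesis step: doubles the length of the signal."""
    out = []
    for a, d in zip(ca, cd):
        out.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return out


def single_branch(coeffs, level, is_detail):
    """Reconstruct a full-length component from one coefficient branch only."""
    zeros = [0.0] * len(coeffs)
    if is_detail:
        sig = haar_inverse_step(zeros, coeffs)
    else:
        sig = haar_inverse_step(coeffs, zeros)
    for _ in range(level - 1):  # climb back to the original resolution
        sig = haar_inverse_step(sig, [0.0] * len(sig))
    return sig


def wavelet_decompose(x, n_levels):
    """Return (D, A): fused detail component and level-n trend component."""
    if len(x) % (2 ** n_levels) != 0:
        raise ValueError("length must be divisible by 2**n_levels")
    D = [0.0] * len(x)
    ca = list(x)
    for level in range(1, n_levels + 1):
        ca, cd = haar_step(ca)                      # ca_i, cd_i
        d_i = single_branch(cd, level, is_detail=True)
        D = [u + v for u, v in zip(D, d_i)]         # D = d_1 + ... + d_n
    A = single_branch(ca, n_levels, is_detail=False)  # A = a_n
    return D, A


# Made-up utilisation values, for illustration only.
load = [0.30, 0.34, 0.33, 0.41, 0.52, 0.50, 0.47, 0.45]
D, A = wavelet_decompose(load, n_levels=2)
```

Because the transform is linear, adding `D` and `A` element-wise recovers the original sequence, which matches the reconstruction relation between \(a_n\) and the \(d_i\). With the Haar wavelet and two levels, `A` is simply the mean of each block of four samples.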
External attention mechanism
The self-attention mechanism in Transformers employs a linear combination of self-values to refine input features. Nonetheless, self-attention solely considers the relationships between elements within a sequence, neglecting the potential intrinsic associations across different sequences (such as CPU and memory utilization rates). Moreover, the quadratic computational complexity of the self-attention mechanism incurs additional computational costs on cloud platforms, which constrains its practical application in real-world cloud environments. The EA mechanism [42] has been developed to optimize the computational complexity of self-attention through the use of two external memory units, demonstrating promising results in the field of computer vision. Inspired by this approach, we have applied the EA to develop a load sequence prediction model, which enables the model to accurately predict cloud loads with a lower computational cost. The comparison between the structures of Self-attention and EA is illustrated in Fig. 3.
Figure 3a illustrates the workflow of the self-attention mechanism. An input vector undergoes three distinct linear transformations to produce the Query (Q), Key (K), and Value (V) matrices. The dot product of the Q and K matrices yields an attention weight matrix, which indicates the inter-element correlations within the sequence. This matrix is then normalized and multiplied with the V matrix, resulting in an output incorporating contextual information. The external attention mechanism, depicted in Fig. 3b, enhances the self-attention mechanism with two shared memory units: the key memory unit \(\mathcal {M}_k\) and the value memory unit \(\mathcal {M}_v\). These interact with the input features to determine attention weights, offering a design that not only reduces computational complexity to linear but also accounts for global sample correlations across the dataset.
The EA module first linearly projects the input feature \(\tilde{X}\) onto the query subspace Q, as shown in Eq. 5:
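Assuming the projection is a plain matrix product without a bias term, it takes the form:

\[ Q=\tilde{X}W_q \]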
where \(W_q\) is the weight matrix.
Different from the self-attention mechanism, EA leverages external memory units to preserve globally shared weights, enabling the extraction of dependencies between different sequences within a dataset. EA computes the correlation between the self-query vector and the external memory unit \(\mathcal {M}_{k}\) to generate the attention map. Subsequently, the attention map is multiplied with the external memory unit \(\mathcal {M}_{v}\) to derive a new feature map, as illustrated in Eqs. 6 and 7:
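Following the formulation of EA in [42], and taking \(\tilde{R}\) to denote the resulting feature map, these two steps can be sketched as:

\[ \widehat{\Upsilon }=(\gamma _{i,j})=Norm\left(Q\mathcal {M}_{k}^{T}\right) \]

\[ \tilde{R}=\widehat{\Upsilon }\mathcal {M}_{v} \]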
where \(\widehat{\Upsilon }\) represents the attention map learned from the input data, and \(\gamma _{i,j}\) is the similarity between the ith feature of Q and the jth row of \(\mathcal {M}_{k}\). Here, \(\mathcal {M}_{k}\) and \(\mathcal {M}_{v}\) are two learnable matrices that respectively play the roles of the Key and Value matrices in EA. The input features are then updated from \(\mathcal {M}_k\) and \(\mathcal {M}_{v}\) according to the similarities in \(\widehat{\Upsilon }\). \(Norm(\cdot )\) is a double normalization method [42], including the Softmax function and the layer normalization function. The calculation process is as follows:
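In [42], double normalization applies a Softmax along one axis of the similarity matrix and then rescales along the other axis. A sketch in that form is given here, with \(\tilde{\gamma }\) and \(\hat{\gamma }\) as assumed intermediate symbols; the second step used in WETformer may differ in detail:

\[ \tilde{\gamma }_{i,j}=\left(Q\mathcal {M}_{k}^{T}\right)_{i,j},\qquad \hat{\gamma }_{i,j}=\frac{\exp (\tilde{\gamma }_{i,j})}{\sum _{m}\exp (\tilde{\gamma }_{m,j})},\qquad \gamma _{i,j}=\frac{\hat{\gamma }_{i,j}}{\sum _{m}\hat{\gamma }_{i,m}} \]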
As illustrated in Fig. 4, the computational complexity of the self-attention mechanism is \(O(T^2\alpha )\), whereas the complexity of EA is \(O(T\alpha ^2)\). Typically, the number of input features T far exceeds the channel dimension \(\alpha\), so EA is cheaper by a factor of roughly \(T/\alpha\). Moreover, during the global weighting stage, EA maintains only two external memory units instead of generating key and value vectors for each input feature, which significantly reduces computational overhead. Additionally, since the two memory units are independent of the input features, the weights of \(\mathcal {M}_k\) and \(\mathcal {M}_v\) can be shared across different inputs. This approach preserves the temporal dependencies within load sequences while associating the influences between different sequences.
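A single-head EA forward pass can be sketched in plain Python as below. It is a minimal illustration, not the authors' implementation: the memory size `S`, the random weights, and the helper names are assumptions, and the normalization follows the Softmax-then-rescale pattern of [42].

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def double_norm(scores):
    """Softmax over the T positions (per column), then l1-normalise each row."""
    T, S = len(scores), len(scores[0])
    out = [[0.0] * S for _ in range(T)]
    for j in range(S):
        col_max = max(scores[i][j] for i in range(T))  # numerical stability
        exps = [math.exp(scores[i][j] - col_max) for i in range(T)]
        total = sum(exps)
        for i in range(T):
            out[i][j] = exps[i] / total
    return [[v / sum(row) for v in row] for row in out]


def external_attention(x, w_q, m_k, m_v):
    """x: T x theta input, w_q: theta x alpha, m_k and m_v: S x alpha memories."""
    q = matmul(x, w_q)                            # T x alpha queries
    m_k_t = [list(col) for col in zip(*m_k)]      # alpha x S
    attn = double_norm(matmul(q, m_k_t))          # T x S attention map
    return matmul(attn, m_v), attn                # T x alpha output


def rand_matrix(rows, cols):
    """Random stand-in for learned weights."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
T, theta, alpha, S = 6, 3, 4, 2  # illustrative sizes only
x = rand_matrix(T, theta)
w_q, m_k, m_v = rand_matrix(theta, alpha), rand_matrix(S, alpha), rand_matrix(S, alpha)
out, attn = external_attention(x, w_q, m_k, m_v)
```

No \(T\times T\) matrix is ever formed: both matrix products involving the memories cost on the order of \(T\cdot S\cdot \alpha\) operations, so the cost grows linearly with the sequence length for a fixed memory size.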
As depicted in Fig. 5, the multi-head external attention mechanism involves transforming input features into multiple Q matrices, with the count of matrices referred to as the “number of heads”. Each attention head independently learns different subspace representations of the input features, aiming to capture the diverse information of the input. Specifically, for each head, the similarity between the Q matrix and the shared \(\mathcal {M}_k\) matrix is initially computed. These similarity measures are normalized to generate the corresponding attention weight distribution maps. Following this, the attention map of each head is multiplied with the shared \(\mathcal {M}_v\) matrix to yield the output representation for that head. Finally, integrating the output representations from all attention heads results in a comprehensive output vector.
The WETformer employs a multi-head EA mechanism to capture diverse relationships between representations across different input channels. By utilizing shared memory units \(\mathcal {M}_{k}\) and \(\mathcal {M}_{v}\), it facilitates information interaction across all channels, which serves to reduce the number of model parameters while enhancing performance. The formula for the multi-head EA is as follows:
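In the form used for multi-head EA in [42], and writing \(Q_i\) for the query matrix of the ith head (an assumed symbol), this can be sketched as:

\[ h_i=ExternalAttention\left(Q_i,\mathcal {M}_k,\mathcal {M}_v\right) \]

\[ MultiHead\left(\tilde{X},\mathcal {M}_k,\mathcal {M}_v\right)=Concat\left(h_1,\ldots ,h_H\right)W_{\textbf{0}} \]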
where \(h_i\) is the output of the ith head, H is the number of heads, and \(W_{\textbf{0}}\) is a linear transformation matrix that makes the input and output dimensions consistent. \(\mathcal {M}_k\) and \(\mathcal {M}_v\) are shared external memory units for different heads. As depicted in Fig. 5, the outputs of the heads are concatenated and projected to form the final output of the multi-head attention mechanism.
To prevent network degradation, a residual connection is established between the input embedded features \(\tilde{X}\) and the features \(\tilde{R}\) extracted by the EA mechanism. Residual connections enable the original input to bypass specific layers and directly contribute to the output of subsequent layers, effectively mitigating the issue of vanishing gradients. This process is formalized in Eq. 13:
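Assuming a plain additive skip connection, and writing \(\tilde{X}'\) for the result (a symbol introduced here; a normalization step may additionally be applied), the residual connection reads:

\[ \tilde{X}'=\tilde{X}+\tilde{R} \]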
The load sequence processed by the EA mechanism is subjected to in-depth nonlinear feature learning through a two-layer feedforward network structure. This feedforward network consists of two linear layers, equipped with rectified linear units (ReLU) as the activation function. The ReLU activation function introduces non-linearity by keeping only the positive values, aiding the network in capturing and representing more intricate features. Subsequently, the network adds the input to the output via a residual connection to reinforce the flow of information during the learning process. Therefore, after being processed by the multi-head EA mechanism, the resulting output can be represented by Eq. 14.
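Writing Z for the input to the feedforward network (a symbol assumed here for illustration), a two-layer network with ReLU and a residual connection takes the form:

\[ FFN(Z)=Z+ReLU\left(Z\omega _1+\beta _1\right)\omega _2+\beta _2 \]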
where \(\omega _1\) and \(\omega _2\) represent the weight parameters of the linear layer, \(\beta _1\) and \(\beta _2\) represent the bias term parameters, and the ReLU activation function is used to introduce nonlinear characteristics.
Encoder input/output
In the WETformer model architecture depicted in Fig. 1, the encoder primarily extracts high-frequency feature information from the load sequence. The input to the encoder comprises the past T time steps of the sequence, denoted as \(X(t)\in \mathbb {R}^{T\times \theta }\), where \(\theta\) represents the dimension of the sequence. Since the model does not inherently account for the positional information of the elements within the sequence, which is crucial for capturing dependencies between elements, it is necessary to incorporate positional information encoding into the input sequence X(t). This positional encoding is represented as \(U_{pos}\in \mathbb {R}^{T\times \theta }=\{u_{1},u_{2},\ldots ,u_{T}\}\). Consequently, the input representation for the nth layer of the encoder is given by:
where \(X_{en}^{n-1}\) is the input vector containing position information.
The output of the encoder is the processed hidden state, which encapsulates the representation of high-frequency feature information from the historical load and is employed in the decoder’s predictive task. The computation process of the encoder layer is as described in Eqs. 16 and 17:
where “_” is the low-frequency component eliminated by the wavelet decomposition module in the encoder. \(D_{en}^{n,2}\) represents the output of the nth layer of the encoder, \(n\in \{1,2,...,N\}\), and N represents the total number of layers of the encoder. It is a vector that contains detailed feature information of the load sequence. \(D_{en}^{n,k}\) represents the high-frequency component output by the kth wavelet decomposition module of the nth layer, \(k\in \{1,2\}\).
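To make the separation into low-frequency and high-frequency components concrete, the following sketch applies a multi-level Haar transform and reconstructs a trend of the original length. It rests on assumptions: the text here does not name the wavelet basis (Haar is used for simplicity), the trend is obtained by zeroing all detail coefficients before reconstruction, and the function names are hypothetical.

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar DWT along the time axis; x has shape (T, channels), T even."""
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-frequency coefficients
    detail = (even - odd) / np.sqrt(2.0)   # high-frequency coefficients
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt; returns a series twice as long as the inputs."""
    out = np.empty((2 * approx.shape[0],) + approx.shape[1:], dtype=float)
    out[0::2] = (approx + detail) / np.sqrt(2.0)
    out[1::2] = (approx - detail) / np.sqrt(2.0)
    return out

def split_trend_detail(x, levels):
    """Return (trend, detail), both with the shape of x, so trend + detail == x."""
    if x.shape[0] % (2 ** levels) != 0:
        raise ValueError("sequence length must be divisible by 2**levels")
    approx = x
    for _ in range(levels):
        approx, _ = haar_dwt(approx)       # keep only the approximation
    trend = approx
    for _ in range(levels):
        trend = haar_idwt(trend, np.zeros_like(trend))
    return trend, x - trend

series = np.arange(8.0).reshape(8, 1)
trend, detail = split_trend_detail(series, levels=2)
```

With the Haar basis and two levels, the trend is the mean of each block of four samples, and the detail is what remains after subtracting it.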
To maintain dimensional consistency between the encoder and decoder, the encoder’s output features are linearly projected into a \((T+L)\)-dimensional vector space, which corresponds to the dimension of the decoder’s intermediate vectors. The interaction information from the encoder to the decoder can then be represented as follows:
where \(\widehat{W}\in \mathbb {R}^{(T+L)\times \theta }\) is a learnable transformation matrix.
Decoder
The input layer of the WETformer decoder is subdivided into two parts. The first part, denoted as \(\widetilde{A}_{de}\in \mathbb {R}^{(T+L)\times \theta }\), consists of a sequence formed by the low-frequency component \(\widetilde{A}\) of the load sequence and a placeholder of length L. The second part, denoted as \(\widetilde{D}_{de}\in \mathbb {R}^{(T+L)\times \theta }\), is composed of a sequence of the high-frequency component \(\widetilde{D}\) and a placeholder of the same length. The values of these placeholders are determined by the average of the corresponding low-frequency and high-frequency components, as shown in Eqs. 19 - 21:
where \(\widetilde{D},\widetilde{A}\in \mathbb {R}^{T\times \theta }\) respectively represent the high-frequency and low-frequency components of X(t), \(D_{avg},A_{avg}\in \mathbb {R}^{L\times \theta }\) denote the placeholders for the decoder input, and Concat is the concatenation operation. After adding the positional encoding \(U_{pos}\in \mathbb {R}^{(T+L)\times \theta }=\{u_1,u_2,...,u_{T+L}\}\) to the decoder input, the input representation for the mth layer of the decoder is expressed as:
The decoder architecture consists of two components: an aggregation structure for low-frequency components and a stacked EA mechanism for high-frequency components (see Fig. 1). Each decoder layer comprises an EA module and an encoder-decoder information interaction module. The encoder-decoder information interaction module integrates the high-frequency detail information of historical loads, while the EA module is employed to refine predictive information. The decoder of the WETformer is capable of uncovering underlying trends from intermediate hidden variables, enabling the model to progressively refine predictions of load trends and filter out irrelevant information. Consequently, the decoder can extract the dependencies among detailed information within the EA module. Assuming there are M decoder layers, the decoder can be formalized as follows:
where \(D_{de}^{m,h},A_{de}^{m,h},h\in \{1,2,3\}\) respectively represent the high-frequency component and low-frequency component output by the hth wavelet decomposition module of the mth layer. \(D_{de}^{m,3}\), \(m\in \{1,2,\dots ,M\}\) represents the output of the high-frequency branch of the mth layer decoder, and \(A_{de}^m\) represents the output of the low-frequency branch of the mth layer decoder. \(W_{m,h}\) is the weight matrix used to project the low-frequency information \(A_{de}^{m,h}\) extracted at the hth iteration.
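The construction of the two decoder inputs described at the start of this subsection (each component followed by a placeholder of length L filled with its average) can be sketched as follows. This is an illustration only. It assumes the average is taken over the time axis, per channel, and the function name `build_decoder_inputs` is hypothetical.

```python
import numpy as np

def build_decoder_inputs(a, d, horizon):
    """Append mean-valued placeholders of length `horizon` to each component.

    a, d: low- and high-frequency components, each of shape (T, theta).
    Returns two arrays of shape (T + horizon, theta).
    """
    # Per-channel mean over time, repeated for every future step (assumption).
    a_avg = np.repeat(a.mean(axis=0, keepdims=True), horizon, axis=0)
    d_avg = np.repeat(d.mean(axis=0, keepdims=True), horizon, axis=0)
    a_de = np.concatenate([a, a_avg], axis=0)
    d_de = np.concatenate([d, d_avg], axis=0)
    return a_de, d_de

low = np.arange(12.0).reshape(6, 2)    # toy low-frequency component, T = 6
high = np.ones((6, 2))                 # toy high-frequency component
a_de, d_de = build_decoder_inputs(low, high, horizon=3)
print(a_de.shape, d_de.shape)  # (9, 2) (9, 2)
```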
The purpose of the prediction layer is to aggregate the extracted deep features to accomplish the final load prediction. The output of the prediction layer is the sum of the two decomposed components, which can be represented as follows:
where the parameter \(W_D\) is used to project the high-frequency component \(D_{de}^M\) of the depth transformation to the target dimension.
Experiment and result analysis
In this section, we evaluate our method on three distinct datasets of varying scales and resource configurations from the public repository of Alibaba Cloud [43]. This choice aims to adapt our method to diverse scenarios and cater to the prediction requirements of different data types. To evaluate the efficacy of our approach, we compare it with eight state-of-the-art time series prediction algorithms, focusing on two primary metrics: prediction accuracy and inference time. Subsequently, we perform ablation experiments to investigate the significance and necessity of each module within our proposed model.
Experimental environment and dataset
The method proposed in this paper is implemented in PyTorch 1.7.1 and Python 3.7. The associated experiments were conducted under the hardware environment configuration detailed in Table 3.
The experiments use three datasets published by Alibaba Cloud. Their details are as follows:
(1) cluster-trace-v2017: This dataset includes performance monitoring data spanning 12 hours for around 1,300 computers that concurrently execute online services and batch jobs.
(2) cluster-trace-v2018: The dataset offers insights into resource utilization across 8 days for an estimated 4,000 machines.
(3) cluster-trace-gpu-v2020: This dataset contains operational data for a GPU server cluster comprising over 6,500 GPUs across 1,800 servers, spanning a period of 2 months.
Due to the differences in scenarios and types of data collected, we selected different prediction objectives and server parameters for each dataset. For the cluster-trace-v2017, we chose the CPU utilization, memory usage, and 5-minute system load (indicating the number of tasks executed within 5 minutes) from 600 servers as the predictive targets. Each server provided 144 observation records, with a data sampling interval of 5 minutes. We set the prediction time window to 36 intervals to test the model’s effectiveness in short-term prediction. For the cluster-trace-v2018, we selected the CPU utilization, memory usage, and disk I/O usage from 2000 machines as the predictive targets. Each machine contained approximately 6900 records. We set the prediction length to 720 intervals to evaluate the model’s performance in long-term prediction. For the cluster-trace-gpu-v2020, we chose the CPU utilization, GPU usage, and 1-minute system load (referring to the number of tasks executed within 1 minute) from 500 machines with a substantial number of observations as the predictive targets. We set the prediction length to 500 intervals to test the model’s forecasting capability for medium to long-term time series. During the dataset processing stage, we first removed invalid values and then grouped the data by machine ID. To eliminate the impact of different dimensions, we unified the sampling interval to 5 minutes and normalized the data.
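The normalization and windowing steps can be sketched as below. The text does not state which normalization was applied, so min-max scaling is an assumption, as are the function names. The window sizes follow the cluster-trace-v2017 setting (144 records per server, a prediction window of 36) together with an input length of 30. The random data stand in for one server's three load sequences.

```python
import numpy as np

def min_max_normalize(series):
    """Scale each column of a (time, features) array to [0, 1]."""
    lo = series.min(axis=0, keepdims=True)
    hi = series.max(axis=0, keepdims=True)
    span = np.where(hi - lo == 0, 1.0, hi - lo)   # guard constant columns
    return (series - lo) / span

def make_windows(series, input_len, horizon):
    """Slice a series into (input, target) pairs with a stride of one step."""
    n = series.shape[0] - input_len - horizon + 1
    if n <= 0:
        raise ValueError("series too short for the requested window sizes")
    inputs = np.stack([series[i:i + input_len] for i in range(n)])
    targets = np.stack([series[i + input_len:i + input_len + horizon]
                        for i in range(n)])
    return inputs, targets

rng = np.random.default_rng(0)
raw = rng.uniform(0.0, 100.0, size=(144, 3))      # placeholder data, not real traces
norm = min_max_normalize(raw)
X, Y = make_windows(norm, input_len=30, horizon=36)
print(X.shape, Y.shape)  # (79, 30, 3) (79, 36, 3)
```

In practice the scaling statistics would normally be computed on the training split only, so that no information from the test period leaks into the inputs.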
Evaluation metrics
To evaluate the effectiveness of the proposed method, we utilized four evaluation metrics: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Coefficient of Determination (\(\text {R}^2\)).
RMSE is the square root of the mean of the squared differences between the actual and predicted values. It serves as a primary indicator of the discrepancy between the predicted and actual values and penalizes large errors more heavily than small ones. The formula for calculating RMSE is presented in Eq. 29. A smaller RMSE value indicates a higher accuracy of the model.
where \(\lambda\) represents the length of the test sample, \(y_t\) is the observed value, \(\hat{y}_{t}\) is the predicted value.
MAE is the average of the absolute differences between the predicted and actual values. It reflects the typical magnitude of the prediction error on the same scale as the data. The calculation formula for MAE is as follows.
MAPE is a relative error metric that uses absolute values to avoid positive and negative errors canceling each other out. The closer its value is to 0, the smaller the prediction error of the model. The calculation formula is as follows.
The \(\text {R}^2\) reflects the degree to which the independent variable explains the variation in the dependent variable. The closer its value is to 1, the better the model fit and the higher the accuracy. The calculation formula is as follows.
where \(\bar{y}_{t}\) is the average of the observed values.
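The four metrics follow their standard definitions and can be written compactly. In this sketch, MAPE is expressed as a percentage and assumes that no observed value is zero. The sample arrays are made up for illustration and are not experimental results.

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mae(y, y_hat):
    """Mean absolute error: mean(|y - y_hat|)."""
    return float(np.mean(np.abs(y - y_hat)))

def mape(y, y_hat):
    """Mean absolute percentage error, in percent; undefined if any y is 0."""
    return float(np.mean(np.abs((y - y_hat) / y)) * 100.0)

def r2(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up observed and predicted values.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
scores = {
    "rmse": rmse(y_true, y_pred),
    "mae": mae(y_true, y_pred),
    "mape": mape(y_true, y_pred),
    "r2": r2(y_true, y_pred),
}
```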
Parameter selection
In the training process of the WETformer model, the length of the input sequence and the number of wavelet decomposition layers are two key parameters. Time series prediction relies on historical data to capture long-term dependencies and patterns of the sequence. If the input data is too short, the model may not learn sufficient feature information from the historical data, which can affect the accuracy of its predictions. Conversely, if the input data is too long, the model may learn excessive noise and redundant information, leading to poor generalization and compromised prediction precision. Moreover, excessively long inputs can consume excessive computational resources and storage space, reducing execution efficiency. Additionally, the number of layers in wavelet decomposition affects the amount of valid information contained in the high-frequency and low-frequency components of the sequence. An appropriate number of decomposition layers can help the model better capture the characteristics of time series data, such as trends and periodicity. To accurately assess the impact of these parameters on the model’s predictive performance, this study systematically tested different input lengths and numbers of wavelet decomposition layers. The experiment set input sequence lengths at intervals of 10 minutes from 20 to 120 minutes and wavelet decompositions ranging from 1 to 8 layers. Subsequently, we quantify the specific impact of different input lengths and decomposition layers on the prediction results based on the average values of four evaluation metrics: RMSE, MAE, MAPE, and \(\text {R}^2\). By conducting a comparative analysis of these metrics, we determine the optimal parameter configuration for the WETformer.
Figure 6 presents the heatmap visualizations, both planar and curved surfaces, of the average RMSE, MAE, MAPE, and \(\text {R}^2\) scores for the WETformer on the cluster-trace-v2018 at varying input lengths and wavelet decomposition levels. In these heatmaps, the height of the surface corresponds to the numerical values and colors in the planar graph. Figure 6a, b, and c reveal that at the input length of 20, the model acquires minimal observational data, resulting in substantial prediction errors. As the input length extends, the model’s prediction error diminishes, and the fit is enhanced, peaking in optimal performance at an input length of 60. However, beyond 60, the model’s prediction error begins to climb, and the fit worsens. As shown in Fig. 6(d), with a wavelet decomposition level of one, the data is segregated into only one low-frequency and one high-frequency component, yielding an inadequate fit. Increasing the number of wavelet decomposition levels reduces the prediction error and enhances the model fit, with the best performance achieved at four decomposition levels. Therefore, for the cluster-trace-v2018 dataset, we selected an input length of 60 and four levels of wavelet decomposition for prediction. We utilized the same methodology to determine the optimal parameter configurations for the cluster-trace-v2017 and cluster-trace-v2020 datasets. The experimental results indicate that the optimal input lengths for these datasets are 30 and 40, respectively, with both benefiting from four levels of wavelet decomposition. The remaining parameter settings are as shown in Table 4.
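The parameter search amounts to a grid search over input lengths and decomposition levels. The sketch below shows the control flow only. The selection rule (lowest average RMSE) is an assumption, and `toy_evaluate` is a hypothetical stand-in for training and validating a model. Its values are synthetic and are not the results shown in Fig. 6.

```python
from itertools import product

def select_configuration(evaluate, input_lengths, levels):
    """Return the (input_len, level) pair with the lowest RMSE and that RMSE."""
    best_cfg, best_score = None, float("inf")
    for input_len, level in product(input_lengths, levels):
        score = evaluate(input_len, level)["rmse"]
        if score < best_score:
            best_cfg, best_score = (input_len, level), score
    return best_cfg, best_score

def toy_evaluate(input_len, level):
    """Synthetic error surface with its minimum at (60, 4); NOT measured data."""
    return {"rmse": 0.4 + abs(input_len - 60) / 100 + abs(level - 4) / 10}

# Input lengths 20..120 in steps of 10, and decomposition levels 1..8.
best_cfg, best_score = select_configuration(
    toy_evaluate, range(20, 121, 10), range(1, 9)
)
print(best_cfg)  # (60, 4)
```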
Comparison and analysis of prediction accuracy
In this study, we randomly extracted server performance monitoring data from various datasets and presented selected predictive results. Figure 7 illustrates the efficacy of the WETformer in predicting short sequences of CPU, memory, and 5-minute load on the cluster-trace-v2017 dataset. The figure compares the model’s predictions and actual observation sequences, demonstrating a high correspondence between the predicted curves by the WETformer and the observed data. Figure 8 showcases the model’s predictive details on the cluster-trace-v2018 dataset, which includes long sequences with significant fluctuations and sudden peaks, encompassing CPU, memory, and disk IO. As seen from Fig. 8, the WETformer exhibits a high degree of accuracy in predicting long sequences with notable fluctuation patterns, closely aligning with the actual sequence in both overall trends and specific details. Figure 9 presents the model’s precision in prediction on the cluster-trace-v2020 dataset. The resource sequence data in this dataset contains a large amount of trend information, while the detailed features are relatively few, only containing a small amount of low-amplitude high-frequency fluctuations. As shown in Fig. 9, WETformer achieves a high fitting degree in medium and long-term prediction tasks for server CPU, GPU, and 1-minute load.
To rigorously evaluate the performance of the WETformer model across different metrics, we benchmarked it against leading algorithms such as LSTM [44], GRU [45], TCN [46], Transformer [47], Autoformer [48], Informer [49], VRAM [34], and WA-BiGRU [35]. To ensure a fair comparison, we tested all models under the same hardware setup and optimized their hyperparameters. During the experimental phase, we tested each model multiple times and statistically analyzed their prediction results for key performance indicators such as CPU usage, GPU usage, memory usage, disk I/O, and system load. We then calculated the average performance for each model. The specific experimental results are presented in Tables 5, 6 and 7.
Tables 5, 6 and 7 exhibit the predictive outcomes of the WETformer and baseline models across nine performance metrics on three datasets. Although the models show variations in their predictions for different load sequences, the WETformer generally outperforms the baseline models in most prediction scenarios. As shown in Table 5, the TCN model performs poorly on the cluster-trace-v2017 dataset. This is due to the large number of parameters that the TCN must update during training. With a limited number of samples in the dataset, the model struggles to learn effectively, leading to significant prediction errors. In contrast, the lightweight models with fewer parameters (Autoformer, Informer, and WETformer) exhibit commendable predictive performance, with the prediction errors of the three models being similar and their \(\text {R}^2\) values relatively high.
Table 6 presents the predictive outcomes of various models on the cluster-trace-v2018 dataset. LSTM and its variants, which process time series data recursively, are prone to accumulating prediction errors as the sequence length increases, leading to suboptimal prediction performance. In contrast, models equipped with global modeling capabilities, such as Transformer, Informer, Autoformer, VRAM, WA-BiGRU and WETformer, exhibit significant performance advantages over LSTM and its derivatives. As demonstrated by the experimental results in Table 6, WETformer achieves the best results in predicting memory and I/O utilization compared to the baseline algorithms. Informer, on the other hand, yields the best values in predicting CPU utilization.
Table 7 presents the predictive outcomes of various models on the CPU, GPU, and Load-1 metrics within the cluster-trace-v2020 dataset. WETformer maintains a significant advantage over LSTM, GRU, and TCN in terms of RMSE, MAE, MAPE, and \(\text {R}^2\). Compared to LSTM, WETformer exhibits reductions in average RMSE, MAE, and MAPE of 0.391, 0.366, and 3.037, respectively, and an increase in \(\text {R}^2\) of 0.458. Although the improvement in prediction accuracy of WETformer over Informer is not remarkable, it still offers enhancements of 0.0297 and 0.0393 in RMSE and MAE, respectively. VRAM, WA-BiGRU, and the WETformer all integrate decomposition technology and attention mechanisms. This approach markedly improves the models’ capability to grasp the trends and nuances in load sequences, leading to higher prediction accuracy. Nonetheless, the attention mechanisms employed in VRAM and WA-BiGRU are confined to individual input instances, overlooking the potential inter-sequence correlations. Consequently, their predictive precision is inferior to that of the WETformer.
Efficiency analysis
To compare the prediction efficiency of the models, we measured the time overhead for each model when performing prediction tasks on three datasets.
Figure 10 illustrates a comparison of the inference time between the WETformer and the baseline algorithms. The results indicate that the WETformer has the shortest prediction time among the compared algorithms, while TCN has the longest average time. TCN achieves a larger receptive field by stacking multiple convolutional layers to capture long-distance temporal dependencies. While stacking additional layers improves prediction accuracy, it also increases the number of model parameters and prediction time. Due to the high computational complexity of the multi-head self-attention mechanism, the Transformer has a longer average time, reaching 13.26 ms. WA-BiGRU and VRAM employ versions of recurrent networks that are integrated with attention mechanisms, which involve a considerable number of parameters. Their average inference times are 11.21 ms and 10.72 ms, respectively. In contrast, WETformer employs low-redundancy EA, resulting in an average inference time of 4.85 ms. Autoformer and Informer follow closely behind WETformer in terms of prediction time, with average times of 5.9138 ms and 6.326 ms, respectively. LSTM and GRU process only one time step of the input sequence at a time, which increases the time required for predicting long sequences.
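Average inference times of this kind are typically obtained by timing repeated forward passes after a few warm-up calls. The text does not describe the exact measurement protocol, so the following is only one reasonable way to do it. `dummy_predict` is a hypothetical placeholder for a trained model, and the warm-up and run counts are arbitrary.

```python
import time

def mean_inference_ms(predict, batch, warmup=3, runs=20):
    """Average wall-clock time of predict(batch) in milliseconds."""
    for _ in range(warmup):          # warm-up calls are excluded from timing
        predict(batch)
    start = time.perf_counter()
    for _ in range(runs):
        predict(batch)
    return (time.perf_counter() - start) * 1000.0 / runs

def dummy_predict(batch):
    """Placeholder model: returns the mean of each input window."""
    return [sum(window) / len(window) for window in batch]

batch = [[float(i + j) for j in range(60)] for i in range(32)]
elapsed_ms = mean_inference_ms(dummy_predict, batch)
```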
Ablation experiment
To validate the efficacy and importance of the wavelet decomposition module and the EA module within the model, adjustments were made to the base model architecture as follows: (1) replacement of the EA module with a self-attention module; (2) removal of the wavelet decomposition module; (3) removal of the EA module. Subsequently, comparative experiments were conducted on the modified models alongside the original model using the same datasets to elucidate the specific impact of each module on the model’s performance. To facilitate comparative analysis, we set the experimental results of the original model to 1. The mean results obtained from the three datasets are presented in Fig. 11.
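Setting the original model's results to 1 amounts to dividing each variant's metric by the corresponding value of the full model. Under that reading, a reported increase of 0.79 would correspond to a normalized value of 1.79. The numbers below are hypothetical and serve only to show the arithmetic.

```python
def relative_to_baseline(variant, baseline):
    """Express each metric of a model variant as a multiple of the baseline."""
    return {name: variant[name] / baseline[name] for name in baseline}

# Hypothetical metric values, not taken from the experiments.
full_model = {"rmse": 0.5, "mae": 0.25}
ablated = {"rmse": 0.75, "mae": 0.5}

ratios = relative_to_baseline(ablated, full_model)
increase = {name: value - 1.0 for name, value in ratios.items()}
print(ratios)  # {'rmse': 1.5, 'mae': 2.0}
```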
Figure 11 illustrates that replacing the external attention mechanism with a self-attention mechanism did not significantly change prediction accuracy. The reason is that both mechanisms are capable of capturing global sequence information. However, the prediction time for models with self-attention mechanisms is significantly longer due to their quadratic computational complexity. Models equipped with self-attention exhibit an average prediction time that is 2.7 times longer than those utilizing external attention for the same three datasets.
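The difference between quadratic and linear complexity can be made concrete by counting the multiply-accumulate operations in the two attention products. The counts below ignore the linear projections, normalization, and the rest of the network, so they do not translate directly into wall-clock ratios such as the one reported above. The sequence length matches the 720-step setting, while the feature width `d` and memory size `s` are hypothetical.

```python
def self_attention_macs(n, d):
    """Q·K^T plus attention·V for a length-n sequence: 2 * n^2 * d."""
    return 2 * n * n * d

def external_attention_macs(n, d, s):
    """Input·M_k^T plus attention·M_v with s memory slots: 2 * n * s * d."""
    return 2 * n * s * d

n, d, s = 720, 64, 64   # d and s are illustrative choices
ratio = self_attention_macs(n, d) / external_attention_macs(n, d, s)
print(ratio)  # 11.25
```

The ratio equals n / s, so the gap widens as sequences grow while the memory size stays fixed.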
Removing the external attention module led to a decline in the model’s performance across the three datasets. Compared to the original model, the average RMSE, MAE, and MAPE have increased by 0.79, 0.99, and 1.04, respectively, while the \(\text {R}^2\) has decreased by 0.26. The primary cause for this regression is the external attention mechanism’s capability to harness global information, which is crucial for emphasizing significant detail features. The removal of this module consequently diminishes the model’s proficiency in capturing long-sequence dependency relationships.
Moreover, with the exclusion of the wavelet decomposition module, the model now solely utilizes external attention mechanisms to capture the characteristics of the load sequence at a macro level. This methodology falls short in effectively delineating between the trend and detail components within the sequence. As a result, the model exhibits a deficiency in interpreting complex temporal patterns, leading to a rise in the average RMSE, MAE, and MAPE of the predictive outcomes by 0.75, 0.94, and 0.95, respectively, while diminishing the average \(\text {R}^2\) by 0.22. Overall, the wavelet decomposition module and the external attention module play a crucial role in improving the prediction accuracy and efficiency of the WETformer.
Conclusion
Cloud load prediction models play a crucial role in resource allocation and scheduling in cloud data centers. By predicting the resource usage of cloud servers, these models provide a powerful tool for understanding and analyzing load variation trends, thereby promoting optimal resource planning, ensuring efficient resource utilization, and avoiding resource shortages or excesses. In this study, we propose a novel cloud load prediction model, WETformer, based on wavelet decomposition and EA mechanisms. The model utilizes a wavelet decomposition module to process sequences, separating high-frequency and low-frequency load characteristics. An EA mechanism is applied to extract feature information from the high-frequency components, and an encoder-decoder framework continuously accumulates and integrates these features, resulting in encoded representations of multiple related feature information. In experimental evaluations, the WETformer exhibits superior performance compared to several state-of-the-art load prediction models, providing stable and significant improvements in prediction capabilities. The average RMSE, MAE, MAPE, and \(\text {R}^2\) scores on three datasets were 0.459, 0.363, 3.216%, and 0.977, respectively. In terms of prediction efficiency, the lightweight module setup of the WETformer significantly saves inference time. Compared to baseline models, the average inference time was improved by 48.43%, 39.41%, 66.29%, 61.70%, 24.75%, 29.66%, 59.21%, and 58.36% respectively. In summary, WETformer can serve as an effective cloud load prediction tool, providing services for cloud computing platform management.
In addition, we have conducted a detailed analysis of the experimental data where the prediction accuracy of the WETformer model is lower than that of the baseline. The study found that due to the characteristic differences in different resource load sequences, using a single fixed number of wavelet decomposition layers cannot achieve the best performance for predicting all resource sequences simultaneously. Therefore, in future research, we will optimize and improve the wavelet decomposition module of WETformer. Specifically, we will design an adaptive wavelet decomposition strategy, which will dynamically determine the most effective level of decomposition for each stage based on the sequence characteristics. This strategy aims to maintain sufficient decomposition granularity while avoiding the computation overhead caused by over-decomposition. Besides, we will develop an adaptive wavelet reconstruction method to retain components related to the prediction target and remove irrelevant components. This method will help to reduce the number of parameters in the model, thereby streamlining the computational process. Meanwhile, it will replace the step of configuring wavelet parameters, enhancing the model’s generalization capabilities.
Availability of data and materials
The Alibaba cluster-trace dataset used in this paper is publicly available on GitHub, and the link is provided in the paper.
Data availability
No datasets were generated or analysed during the current study.
References
Xu M, Song C, Wu H, Gill SS, Ye K, Xu C (2022) esDNN: deep neural network based multivariate workload prediction in cloud computing environments. ACM Trans Internet Techn 22(3):75:1–75:24
Singh AK, Saxena D, Kumar J, Gupta V (2021) A quantum approach towards the adaptive prediction of cloud workloads. IEEE Trans Parallel Distributed Syst 32(12):2893–2905
Baig SUR, Iqbal W, Berral JL, Erradi A, Carrera D (2019) Adaptive prediction models for data center resources utilization estimation. IEEE Trans Netw Serv Manag 16(4):1681–1693
Saxena D, Kumar J, Singh AK, Schmid S (2023) Performance analysis of machine learning centered workload prediction models for cloud. IEEE Trans Parallel Distributed Syst 34(4):1313–1330
Kumar J, Singh AK, Buyya R (2021) Self directed learning based workload forecasting model for cloud resource management. Inf Sci 543:345–366
Chen Z, Hu J, Min G, Zomaya AY, El-Ghazawi TA (2020) Towards accurate prediction for high-dimensional and highly-variable cloud workloads with deep learning. IEEE Trans Parallel Distributed Syst 31(4):923–934
Xiao Y, Yao Y, Chen K, Tang W, Zhu F (2022) A simulation task partition method based on cloud computing resource prediction using ensemble learning. Simul Model Pract Theory 119:102595
Liu C, Liu C, Shang Y, Chen S, Cheng B, Chen J (2017) An adaptive prediction approach based on workload pattern discrimination in the cloud. J Netw Comput Appl 80:35–44
Zhang Y, Liu F, Wang B, Lin W, Zhong G, Xu M, Li K (2022) A multi-output prediction model for physical machine resource usage in cloud data centers. Future Gener Comput Syst 130:292–306
Kaur G, Bala A, Chana I (2019) An intelligent regressive ensemble approach for predicting resource usage in cloud computing. J Parallel Distributed Comput 123:1–12
Kholidy HA (2020) An intelligent swarm based prediction approach for predicting cloud computing user resource needs. Comput Commun 151:133–144
Jayakumar VK, Arbat S, Kim IK, Wang W (2022) Cloudbruno: a low-overhead online workload prediction framework for cloud computing. In: IEEE International Conference on Cloud Engineering, IC2E 2022, Pacific Grove, CA, USA, September 26-30, 2022, IEEE, pp 188–198
Yoon MS, Kamal AE, Zhu Z (2017) Adaptive data center activation with user request prediction. Comput Networks 122:191–204
Nguyen HM, Kalra G, Kim D (2019) Host load prediction in cloud computing using long short-term memory encoder-decoder. J Supercomput 75(11):7592–7605
Zhong W, Zhuang Y, Sun J, Gu J (2018) A load prediction model for cloud computing using pso-based weighted wavelet support vector machine. Appl Intell 48(11):4072–4083
Song B, Yu Y, Zhou Y, Wang Z, Du S (2018) Host load prediction with long short-term memory in cloud computing. J Supercomput 74(12):6554–6568
Gupta S, Dileep AD, Gonsalves TA (2020) Online sparse BLSTM models for resource usage prediction in cloud datacentres. IEEE Trans Netw Serv Manag 17(4):2335–2349
Ouhame S, Hadi Y, Ullah A (2021) An efficient forecasting approach for resource utilization in cloud data center using CNN-LSTM model. Neural Comput Appl 33(16):10043–10055
Valarmathi K, Raja SKS (2021) Resource utilization prediction technique in cloud using knowledge based ensemble random forest with LSTM model. Concurr Eng Res Appl 29(4):396–404
Zerveas G, Jayaraman S, Patel D, Bhamidipaty A, Eickhoff C (2021) A transformer-based framework for multivariate time series representation learning. KDD. ACM, Singapore, pp 2114–2124
Zhang Y, Yan J (2023) Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In: ICLR, OpenReview.net, Kigali
Liu Y, Wu H, Wang J, Long M (2022) Non-stationary transformers: Exploring the stationarity in time series forecasting. In: NeurIPS, New Orleans
Zhang J, Duan H, Guo L, Xu L, Wang X (2023) Towards lightweight cross-domain sequential recommendation via external attention-enhanced graph convolution network. In: DASFAA. Lecture Notes in Computer Science, vol 13944. Springer, Tianjin, pp 205–220
Mulia WD, Sehgal N, Sohoni S, Acken JM, Stanberry CL, Fritz DJ (2013) Cloud workload characterization. IETE Tech Rev 30(5):382–397
Zeng A, Chen M, Zhang L, Xu Q (2023) Are transformers effective for time series forecasting? AAAI. AAAI Press, Washington, pp 11121–11128
Gupta S, Dinesh DA (2017) Online adaptation models for resource usage prediction in cloud network. NCC. IEEE, Chennai, pp 1–6
Zhong W, Zhuang Y, Sun J, Gu J (2017) The cloud computing load forecasting algorithm based on wavelet support vector machine. In: ACSW, ACM, Geelong, pp 38:1–38:5
Kumar J, Singh AK (2018) Workload prediction in cloud using artificial neural network and adaptive differential evolution. Future Gener Comput Syst 81:41–52
Rahmanian AA, Ghobaei-Arani M, Tofighy S (2018) A learning automata-based ensemble resource usage prediction algorithm for cloud computing environment. Future Gener Comput Syst 79:54–71
Gao J, Wang H, Shen H (2020) Machine learning based workload prediction in cloud computing. ICCCN. IEEE, Honolulu, pp 1–9
Cheng Y, Wang C, Yu H, Hu Y, Zhou X (2019) GRU-ES: resource usage prediction of cloud workloads using a novel hybrid method. IEEE. IEEE, Zhangjiajie, pp 1249–1256
Zhu Y, Zhang W, Chen Y, Gao H (2019) A novel approach to workload prediction using attention-based LSTM encoder-decoder network in cloud environment. EURASIP J Wirel Commun Netw 2019:274
Bi J, Li S, Yuan H, Zhou M (2021) Integrated deep learning method for workload and resource prediction in cloud systems. Neurocomputing 424:35–48
Predić B, Jovanovic L, Simic V, Bacanin N, Zivkovic M, Spalevic P, Budimirovic N, Dobrojevic M (2024) Cloud-load forecasting via decomposition-aided attention recurrent neural network tuned by modified particle swarm optimization. Complex Intell Syst 10(2):2249–2269
Dogani J, Khunjush F, Seydali M (2023) Host load prediction in cloud computing with discrete wavelet transformation (DWT) and bidirectional gated recurrent unit (bigru) network. Comput Commun 198:157–174
Qi W, Yao J, Li J, Wu W (2022) Performer: A resource demand forecasting method for data centers. In: GPC. Lecture Notes in Computer Science, vol 13744. Springer, Chengdu, pp 204–214
Wu T, Pan M, Yu Y (2022) A long-term cloud workload prediction framework for reserved resource allocation. IEEE. IEEE, Barcelona, Spain, pp 134–139
Zhu J, Bai W, Zhao J, Zuo L, Zhou T, Li K (2023) Variational mode decomposition and sample entropy optimization based transformer framework for cloud resource load prediction. Knowl Based Syst 280:111042
Arbat S, Jayakumar VK, Lee J, Wang W, Kim IK (2022) Wasserstein adversarial transformer for cloud workload prediction. AAAI. AAAI Press, Virtual, pp 12433–12439
Zhang Q, Yang LT, Yan Z, Chen Z, Li P (2018) An efficient deep learning model to predict cloud workload for industry informatics. IEEE Trans Ind Inform 14(7):3170–3178
Rhif M, Ben Abbes A, Farah IR, Martínez B, Sang Y (2019) Wavelet transform application for/in non-stationary time-series analysis: A review. Appl Sci 9(7):1345
Guo M, Liu Z, Mu T, Hu S (2023) Beyond self-attention: External attention using two linear layers for visual tasks. IEEE Trans Pattern Anal Mach Intell 45(5):5436–5447
Alibaba (2020) Cluster data collected from production clusters in alibaba for cluster management research. https://github.com/alibaba/clusterdata. Accessed Dec 2023
Hu J, Wang X, Zhang Y, Zhang D, Zhang M, Xue J (2020) Time series prediction method based on variant LSTM recurrent neural network. Neural Process Lett 52(2):1485–1500
Shu W, Zeng F, Ling Z, Liu J, Lu T, Chen G (2021) Resource demand prediction of cloud workloads using an attention-based GRU model. MSN. IEEE, Exeter, pp 428–437
Bai S, Kolter JZ, Koltun V (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. CoRR. arXiv:1803.01271
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30:5998–6008
Wu H, Xu J, Wang J, Long M (2021) Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In: NeurIPS, virtual. MIT Press, San Diego, pp 22419–22430
Zhou H, Zhang S, Peng J, Zhang S, Li J, Xiong H, Zhang W (2021) Informer: Beyond efficient transformer for long sequence time-series forecasting. AAAI. AAAI Press, Virtual, pp 11106–11115
Funding
This research was supported by the National Key Research and Development Program (2018YFC1406200).
Author information
Contributions
All authors contributed to the study conception and design. Conceptualization was carried out by Zhen Zhang and Shaohua Xu. Data collection and preprocessing were performed by Jinyu Zhang and Zhe Zhu. Visualization was completed by Jinyu Zhang. The first draft of the manuscript was written by Zhen Zhang and Chen Xu. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Consent has been granted by all authors and there is no conflict.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhang, Z., Xu, C., Zhang, J. et al. When wavelet decomposition meets external attention: a lightweight cloud server load prediction model. J Cloud Comp 13, 133 (2024). https://doi.org/10.1186/s13677-024-00698-6
DOI: https://doi.org/10.1186/s13677-024-00698-6