Robust and accurate performance anomaly detection and prediction for cloud applications: a novel ensemble learning-based framework

Effectively detecting run-time performance anomalies is crucial for clouds to identify abnormal performance behavior and forestall future incidents. To be used for real-world applications, an effective anomaly detection framework should meet three main challenging requirements: high accuracy for identifying anomalies, good robustness when application patterns change, and prediction ability for upcoming anomalies. Unfortunately, existing research about performance anomaly detection usually focuses on improving detection accuracy, while little research tackles the three challenges simultaneously. We conduct experiments for existing detection methods on multiple application monitoring data, and results show that existing detection methods usually focus on different features in data, which leads to diverse performance on different data patterns. Therefore, existing anomaly detection methods have difficulty improving detection accuracy and robustness and predicting anomalies. To address the three requirements, we propose an Ensemble Learning-Based Detection (ELBD) framework which integrates existing well-selected detection methods. The framework includes three classic linear ensemble methods (maximum, average, and weighted average) and a novel deep ensemble method. Our experiments show that the ELBD framework realizes better detection accuracy and robustness, where the deep ensemble method can achieve the most accurate and robust detection for cloud applications. In addition, it can predict anomalies in the next four minutes with an F1 score higher than 0.8. The paper also proposes a new indicator ARP_score to measure detection accuracy, robustness, and multi-step prediction ability. The ARP_score of the deep ensemble method is 5.1821, which is much higher than other detection methods.


Introduction
The run-time status of cloud applications can be continuously monitored through system-related metrics, e.g., CPU and memory usage [1]. Performance anomaly detection plays a vital role in operating cloud services and applications [2,3]. Cloud performance anomalies such as degraded response time, often caused by underlying system resource shortages, may severely affect an application's quality of user experience (QoE) and quality of service (QoS). Therefore, effectively analyzing patterns of monitored system-related metrics and identifying abnormal performance in real time is crucial for continuously delivering the business value of a cloud application. In this context, we can highlight three challenging requirements for a performance anomaly detection framework. First, the detection must achieve high accuracy to ensure anomalies can be found as accurately as possible. Second, detection algorithm robustness is essential. Different data distributions exist in multiple monitoring data, which requires a robust algorithm to meet changes in data patterns and maintain performance consistency. Finally, to prevent potential application violations effectively, it is vital to make a multi-step prediction of future anomalies. Existing anomaly detection methods have often been developed using statistics-based [4] or machine learning-based [5,6] methods. Most methods focus on improving detection accuracy. For example, Audibert et al. [7] developed USAD based on an adversarially trained autoencoder and achieved the best detection accuracy. Studies on improving the robustness of detection methods usually use adversarial training, which needs to make a trade-off between robustness and accuracy [8], rather than simultaneously improving accuracy and robustness. In addition, research on anomaly prediction mainly focuses on one-step prediction [9], which has limited effect in preventing potential performance anomalies.
Existing research explores different aspects of the three challenging requirements, but few studies simultaneously tackle the challenges of accuracy, robustness, and multi-step prediction ability. Besides, there are also no effective indicators to measure the combination of the three requirements.
Moreover, the development of performance anomaly detection has to handle two data challenges.
• Missing data labels. Most of the monitoring data does not contain labels that can be immediately used for training a machine learning-based model, and labeling time-series data is often manual and time-consuming.
• Data noise. Monitoring data collected from a distributed network often contains noise, which can significantly influence the performance of the anomaly detection methods and increase false-positive detections.
Thus, for performance anomaly detection, we define our research question as "how to effectively detect and predict performance anomalies with high accuracy and good robustness?". To address the two data challenges, we focus on unsupervised and weakly supervised detection methods and provide feature extraction to filter noise in data. To answer our research question, we first explore existing unsupervised anomaly detection methods and observe their detection performance on different datasets. Then, to improve detection accuracy, robustness, and prediction ability, we develop an Ensemble Learning-Based Detection (ELBD) framework that incorporates classic detection methods rather than enhancing a single model. The contributions in this paper mainly include:
• We characterize four typical base detection methods on different datasets, and the results show that their performance falls short in detection accuracy, robustness, and prediction.
• Based on base detection methods, we propose an ELBD framework including three classic linear ensemble methods (maximum, average, and weighted average) and a deep ensemble method.
• We propose ARP_score to evaluate detection performance in terms of accuracy, robustness, and multi-step prediction.
• We evaluate the methods in the ELBD framework on different datasets, and the results show that the deep ensemble method achieves the highest ARP_score, 5.1821.
The rest of the paper is organized as follows. In Related works section, we review existing performance anomaly detection methods, specifically ensemble learning. In Base performance anomaly detection methods section, we provide base detection methods and an evaluation of their performance. In Ensemble learning-based detection framework section, we propose the ELBD framework and evaluate detection accuracy, robustness, and prediction ability. Finally, discussion and conclusion are provided in Discussion and Conclusion and future work sections.

Related works
Performance anomaly detection is a process of detecting abnormal performance phenomena and predicting anomalies to forestall future incidents [10]. Research on performance anomaly detection is advancing rapidly, and machine learning methods are widely applied [11]. This section briefly reviews machine learning-based anomaly detection methods and specifically highlights ensemble learning.

Machine learning-based anomaly detection methods
Machine learning-based anomaly detection methods can be reviewed in terms of supervised, semi-supervised, and unsupervised learning. Supervised learning methods have high accuracy [5], but they are ineffective for application monitoring data because data labels are usually missing in reality and manually labeling data is time-consuming. Semi-supervised learning methods are developed when few labels exist, and unsupervised learning methods are used when no labels exist. Semi-supervised methods typically outperform unsupervised methods, but unsupervised methods are better suited for actual industrial scenarios [12].
In Table 1, we provide a classification of unsupervised performance anomaly detection methods. The table includes traditional methods such as tree-based,  [23]. For example, Hashemi et al. [24] enhance the robustness of an intrusion detection system in the presence of adversarial examples by utilizing denoising autoencoders. However, there is usually a trade-off between model accuracy and robustness [8], which makes it a challenge to improve model robustness and accuracy simultaneously. In addition, research on anomaly prediction usually focuses on univariate data and one-step prediction. For example, Wu et al. [9] provide a prediction-driven anomaly detection method that relies on Long Short Term Memory (LSTM) with univariate time-series data.
In conclusion, machine learning methods, especially semi-supervised and unsupervised methods, can be considered for performance anomaly detection because fewer labels exist. While different methods usually focus on different data features, we can consider integrating existing methods, for example, LOF, KNN, OCSVM, and IForest, in Table 1. To improve detection accuracy and robustness simultaneously, ensemble learning instead of adversarial training to integrate existing detection methods can be considered. We will introduce related work to ensemble learning next. In addition, we provide a multistep prediction based on multi-variate metrics for performance anomalies in this paper.

Ensemble learning
Ensemble learning is proposed to improve the accuracy and reduce the variance of an automated decision-making system [25]. The primary assumption of ensemble learning is that by combining several base models, the errors of a single model will likely be compensated by other models [26]. For anomaly detection, ensembles that combine anomaly scores by taking their maximum or average can be found in [27]. Research about ensemble learning can be reviewed based on supervised classification ensembles, semi-supervised ensembles, and unsupervised clustering ensembles.
Some research already focuses on ensemble learning with machine learning methods. As for supervised ensemble learning, Tyralis et al. [28] propose an ensemble learning method by combining ten machine learning algorithms and estimating the weights through a k-fold cross-validation procedure. Tama et al. [29] propose a stacked ensemble, which uses three classifiers (random forest, gradient boosting machine, and XGBoost) and provides a generalized linear model (GLM) as a combiner. Adeyemo et al. [30] focus on network intrusion detection and implement two ensemble methods and a deep learning method (LSTM). The two ensemble methods include a homogeneous method that uses an optimized bagged random forest algorithm and a heterogeneous method that is an averaged probability method of a voting ensemble for four standard classifiers. These studies of ensemble learning mainly focus on weight calculations or linear combinations of different base models.
Semi-supervised ensemble learning mainly focuses on expanding the labeled training set and utilizing the expanded training set to do classification or regression [31]. For example, Jian et al. [32] present a sample information-based synthetic minority oversampling technique to balance the labeled dataset and use variable weighted voting for integrating base models. This research focuses on the data label issue with semi-supervised learning, but the ensemble is linear. Unsupervised ensemble learning, also known as consensus clustering, is to find the optimal combination strategy of individual clustering. Ensemble clustering can be classified into three categories, pairwise co-occurrence based methods [33], graph partitioning based methods [34] and median partition-based methods [35]. Unlu et al. [36] provide a weighting policy based on internal clustering quality measures, which gives different importance to individual clustering. This research provides a weighted policy to integrate individual models, but it works linearly. For ensemble learning, we can see that combining different methods is challenging, and many methods focus on linear combinations, which limits information extraction and detection performance improvement. Therefore, to get an effective model that improves detection accuracy and robustness and makes prediction of performance anomaly detection, the nonlinear combination of different base detection methods is further investigated in this paper.

Base performance anomaly detection methods
For performance anomaly detection of cloud applications, we provide feature extraction for original data and explore the performance of four base detection methods in this section.

Problem definition
Multivariate time-series data are sequences of timestamped data points and can be represented as $D$. Each data point is $D_i^t$, where $i \in [1, ..., n]$ is the index of resource metrics, $n$ is the number of resource metrics, and $t \in \mathbb{N}^*$ is the index of timestamps. Multivariate time-series anomaly detection is to learn the characteristics of data $D$ and determine whether an observation $D^{n+1}$ is anomalous or not. For multi-step anomaly prediction, we use data $D$ for training and determine whether $D^{n+1}, D^{n+2}, ..., D^{n+p}$ are anomalous.
In this paper, we first provide the performance of classic detection methods. Then we propose an ELBD framework, which is developed based on ensemble learning and aims to improve detection accuracy and robustness by integrating information extracted by classic detection methods non-linearly. In addition, we implement multistep prediction ability in the deep ensemble method in ELBD framework.

Feature extraction
Multivariate data usually contains noise, which can induce unnecessary variance in a model. Therefore, data needs to be preprocessed through feature extraction to remove redundant information and reduce data dimension. For feature extraction, Principal Components Analysis (PCA) [37] is a classic and widely used method. PCA is an unsupervised method that uses eigenvalue decomposition to compress and denoise data, which makes it suitable as the feature extraction method considering that few or no labels are available in practice.
PCA is a method to transform a dataset with lots of variables into a smaller one containing most of the original information. The process steps of PCA are: 1) getting the covariance matrix of original features; 2) calculating eigenvectors and eigenvalues of the covariance matrix to identify principal components; 3) sorting eigenvalues and selecting eigenvectors with high eigenvalues as feature vectors; 4) recasting original data based on feature vectors. In step 3, the number of selected eigenvectors determines the data dimensions after reduction. In practice, we set the reduction dimension based on a calculated percentage of variance [38]. According to these calculations, PCA achieves principal feature selection and data dimension reduction. Finally, we apply PCA to all monitoring data and use the data with low dimensions as the input of anomaly detection methods.
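The four steps above can be sketched with NumPy as follows. This is a minimal illustration rather than the implementation used in the experiments; the function name `pca_reduce` and the parameter `variance_to_keep` are hypothetical, and the 95% default mirrors the variance level mentioned in the experimental settings.

```python
import numpy as np

def pca_reduce(X, variance_to_keep=0.95):
    """Project X (samples x metrics) onto its leading principal components."""
    # Center each metric so the covariance is computed around the mean.
    X_centered = X - X.mean(axis=0)
    # Step 1: covariance matrix of the original features.
    cov = np.cov(X_centered, rowvar=False)
    # Step 2: eigenvalues and eigenvectors (eigh is meant for symmetric matrices).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 3: sort by decreasing eigenvalue and keep just enough components
    # to reach the requested share of the total variance.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()
    n_components = int(np.searchsorted(explained, variance_to_keep)) + 1
    n_components = min(n_components, X.shape[1])
    # Step 4: recast the original data onto the selected feature vectors.
    return X_centered @ eigvecs[:, :n_components], n_components
```

Choosing the number of components from the cumulative explained variance, instead of fixing it in advance, is what lets the reduced dimension differ per dataset.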

Base detection methods
Different anomaly detection methods usually focus on different features in data, such as density or distance, and therefore perform differently on the same data. To comprehensively understand the characteristics of monitoring data, we select four classic methods (IForest, KNN, LOF, OCSVM) in Table 1 as base detection methods.
IForest is based on the Decision Tree algorithm [39]. Many isolation trees make up an isolation forest to make anomaly detection results more credible. KNN is a distance-based algorithm [16]. It calculates each point's distance (such as Euclidean, Manhattan) with k nearest neighbors and sets the distance as an anomaly score. LOF is a density-based algorithm [6]. By comparing a point's local density to its neighbors' local densities, nodes with lower densities than their neighbors will be considered anomalies. OCSVM is based on Support Vector Machine (SVM) [18]. SVM can project data through a non-linear function to a high-dimensional space, and points are separated into different classes. Because kernel function calculation is time-consuming, it usually works slowly for large-scale data.
For each base method, the input is preprocessed data. The processing of input data includes model initialization, fitting data, and output anomaly scores. Model initialization includes the setup of hyper-parameters, such as anomaly fractions, which can be set based on data characteristics. After fitting the data, an anomaly score vector will be output. We use the anomaly score vector of each detection method to identify anomalies and evaluate the performance of each detection method.
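As a hedged illustration of how an anomaly fraction can turn an anomaly score vector into anomaly labels, the sketch below flags the highest-scoring share of samples. The quantile-based threshold and the helper name `scores_to_labels` are assumptions made for this example; the text does not state how the threshold is derived.

```python
import numpy as np

def scores_to_labels(scores, anomaly_fraction):
    """Label the top `anomaly_fraction` of scores as anomalies (1), the rest as 0."""
    # Assumption: the threshold is the (1 - fraction) quantile of the scores.
    threshold = np.quantile(scores, 1.0 - anomaly_fraction)
    return (scores > threshold).astype(int), threshold

# Hypothetical anomaly scores for ten samples.
scores = np.array([0.1, 0.2, 0.15, 0.9, 0.12, 0.3, 0.11, 0.95, 0.2, 0.18])
labels, threshold = scores_to_labels(scores, anomaly_fraction=0.2)
```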

Experiments and results

Dataset
In our experiments, we use monitoring data from a Decentralized Application (DApp) and two public datasets.
DApp monitoring data. In business scenarios where real-time transactions are required, e.g., energy trading or crowd journalism [40], the Quality of Service (QoS) metrics of a DApp, such as transaction throughput, latency, and failure rates, are critical to the business value. To deliver such a quality-critical DApp in cloud environments, one needs to select cloud services carefully, customize their capacities, and monitor the run-time status of the application. Figure 1 shows a DApp example developed with Hyperledger Fabric 1 . For the DApp, different organizations, which contain many peer nodes, are deployed on different cloud infrastructure services (VMs) and monitored by the tool Prometheus 2 . We use Prometheus to collect real-time data and use Caliper 3 to simulate workload generation. For a running DApp, we mainly collect system resource metrics, which can be seen in Table 2. When the DApp receives transaction requests stably, we add system pressure, such as disk pressure, with stress-ng 4 to inject anomalies manually. We increase disk pressure for 20 minutes every hour. We monitor the DApp for twelve hours and collect data at 15-second intervals. Ultimately, the DApp monitoring data contains 3237 samples and 229 resource-related metrics for our experiments. The general information can be seen in Table 3.
Public dataset. SMD (Server Machine Dataset) is a dataset collected and made publicly available by a large internet company [13]. It contains data collected from many different server machines and includes 38 metrics. Domain experts have labeled anomalies in SMD based on incident reports. SMD is divided into two subsets of equal size: the first half is the training set and the second half is the testing set.
Vichalana is a multivariate time-series dataset that can be used for performance anomaly detection in API Gateways [41]. It has different anomalies, such as high CPU and memory usage. Performance metrics in this dataset are collected when the system operates in normal and anomalous mode. The information of SMD and Vichalana data used in our experiments can be seen in Table 3.

Experimental settings
The DApp monitoring data is collected from a deployed DApp in a cloud environment. We use Microsoft Azure 5 as the cloud environment and deploy the monitor component and DApp separately. The monitor component is   For feature extraction in data pre-processing, PCA needs to retain as much variance information of the original data as possible, such as 95%. Therefore, we set reduction dimensions as 15 for DApp monitoring data, 5 for SMD data, and 6 for Vichalana data based on a calculated percentage of variance [38].
As for each base detection method, their hyper-parameters are set as below. Anomaly fractions need to be determined first. For the DApp monitoring data, because we inject anomalies 20 minutes every hour, the anomaly fraction is set as 0.3. For SMD and Vichalana data, we use the default anomaly fraction, which is 0.1. Next, the hyper-parameters of each base method need to be determined. We set the tree number for IForest to 100. The neighbor number in KNN is 5. In LOF, we set the neighbor number as 20. In OCSVM, we use the Radial Basis Function (RBF) kernel function.
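The settings above could be wired up as in the following sketch. It uses scikit-learn as a stand-in, which is an assumption since the text does not name the library used. Taking the distance to the 5th nearest neighbor as the KNN score and mapping the anomaly fraction to the `nu` parameter of OCSVM are also assumptions, and `base_anomaly_scores` is a hypothetical helper name.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors
from sklearn.svm import OneClassSVM

def base_anomaly_scores(X, anomaly_fraction=0.1, seed=0):
    """Return one anomaly score vector per base method (higher = more anomalous)."""
    iforest = IsolationForest(
        n_estimators=100, contamination=anomaly_fraction, random_state=seed
    ).fit(X)
    # kneighbors() without arguments excludes each point from its own neighbors.
    knn_dist, _ = NearestNeighbors(n_neighbors=5).fit(X).kneighbors()
    lof = LocalOutlierFactor(n_neighbors=20, contamination=anomaly_fraction).fit(X)
    # Assumption: nu plays the role of the anomaly fraction for OCSVM.
    ocsvm = OneClassSVM(kernel="rbf", nu=anomaly_fraction).fit(X)
    return {
        # scikit-learn reports "higher = more normal", so the sign is flipped.
        "IForest": -iforest.score_samples(X),
        "KNN": knn_dist[:, -1],  # distance to the 5th nearest neighbor
        "LOF": -lof.negative_outlier_factor_,
        "OCSVM": -ocsvm.decision_function(X),
    }
```

Flipping the signs gives all four vectors the same orientation, which matters once they are stacked into a single score matrix.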

Evaluation indicators
The performance of these detection methods is evaluated in three aspects: accuracy, robustness, and prediction ability. We use Precision, Recall, and F1 score to indicate accuracy. Precision is the share of data detected as anomalies that are true anomalies, while Recall is the share of real anomaly data that is detected as anomalies. The F1 score is the harmonic mean of Precision and Recall, $F1 = 2 \cdot Precision \cdot Recall / (Precision + Recall)$.
Therefore, we mainly focus on the F1 score for detection accuracy. Our experiment results also evaluate and present the time spent on each unsupervised detection method and the test time for the deep ensemble method. For robustness, we test detection methods on three different datasets and rank detection accuracy to represent performance consistency, which can clearly show the detection performance comparison [42]. We calculate robustness as the average ranking of detection methods on the three datasets. Finally, we normalize the rank to obtain the robustness score, where $Rank_{max}$ is the maximum of the rank numbers and $Rank_{min}$ is the minimum. We evaluate prediction ability with accuracy, which is also represented by the F1 score. We set an F1 threshold of 0.8 and calculate the prediction score from $pt$, the furthest predicted time in minutes. The prediction score considers both the furthest prediction time and prediction accuracy because a longer and more accurate prediction makes it easier to avoid anomalies for applications. Finally, we define the indicator ARP_score for each method considering detection accuracy, robustness, and prediction as:

$$ARP\_score = \frac{1}{d}\sum_{i=1}^{d}\left(F1\_score_i + Robustness\_score_i + Prediction\_score_i\right)$$

Here, $d$ is the number of datasets. For each dataset, we take the total of the three scores as the detection performance of a detection method on that dataset, and we take the average over all datasets as the final indicator of its detection performance.
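The accuracy metrics and the final aggregation can be sketched in a few lines of plain Python. The robustness and prediction scores are passed in as already-computed numbers here, and the values in the example are hypothetical, not results from the experiments; the function names are also illustrative.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute Precision, Recall, and F1 for binary labels (1 = anomaly)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def arp_score(per_dataset_scores):
    """Average, over datasets, of F1 + robustness score + prediction score."""
    totals = [f1 + rob + pred for f1, rob, pred in per_dataset_scores]
    return sum(totals) / len(totals)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

# Hypothetical (F1, robustness, prediction) triples for two datasets.
example_arp = arp_score([(0.8, 1.0, 2.0), (0.6, 0.5, 1.0)])
```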

Experimental results
We apply the four base methods (IForest, KNN, LOF, and OCSVM) to the DApp monitoring data, SMD, and Vichalana data. The performance of their detection accuracy can be seen in Table 4.
For the DApp monitoring data, we can see that KNN has the highest F1 score, 0.8033, suggesting that the data has clustering characteristics, because KNN is good at identifying clusters in data. IForest takes into account different features in the data and usually has good detection performance [43]; this also holds on the DApp monitoring data, where its F1 score is 0.791. If the abnormal features are concentrated in a few dimensions, it will be hard for LOF to detect anomalies. Therefore, LOF has the lowest F1 score, 0.5143, for the DApp monitoring data. The F1 score of OCSVM is 0.737, which is not high enough because the projection through a kernel function does not separate normal and abnormal data very well. For time spent, we can see that IForest and OCSVM take about 0.3 s, which is higher than the other base methods because the calculation of features takes some time, but the time spent is under 0.5 s overall, which is not high. As a result, for the DApp monitoring data, KNN is the best of the four base methods.
For SMD data, we can see that IForest has the highest F1 score, 0.7515, which shows the advantage of IForest for anomaly classification through multiple features. However, F1 scores are not high for the other base methods, suggesting that there is too much noise in this dataset and that the overall distributions of normal and abnormal data are similar. Thus, anomalies may be concentrated in a few features in the SMD data. In addition, the time spent on OCSVM is higher than on the others because the kernel function calculation in OCSVM is time-consuming. IForest, in contrast, has the best detection accuracy and takes about 1.3 s, which makes it the best detection method for SMD.
For Vichalana data, we can see that OCSVM has the highest F1 score, 0.6778, showing that the non-linear projection can classify normal and abnormal data but is not very accurate. The F1 score of IForest is 0.658, slightly lower than OCSVM, which means that abnormal data distribution varies in different features, making it hard to detect. The F1 scores of KNN and LOF are pretty low, showing that the overall distribution of normal and abnormal data is also similar. It is worth noting that the time spent on OCSVM is relatively high because the dataset includes more than 40k samples, and it takes too much time for kernel function calculation in OCSVM.
Here, IForest takes only about 2 s, which is much faster than OCSVM.
In conclusion, we can see that detection accuracy is not high enough for these base detection methods. In addition, the performance of these methods varies for the three datasets. For example, KNN performs the best on the DApp monitoring data but relatively poorly on the SMD and Vichalana data. Furthermore, these detection methods have no prediction ability. Thus, for the three challenges: high accuracy, good robustness, and multi-step prediction, it is critical to develop suitable performance anomaly detection methods for cloud applications.

Ensemble learning-based detection framework
Base detection methods focus on different features in data and have diverse performances. Therefore, it is reasonable to consider that the integration of base methods can extract more features from data and improve detection performance. Furthermore, ensemble learning is proposed with the assumption that by combining several base models, the errors of a single model will be compensated by others. Therefore, we consider integrating base methods with ensemble learning and propose an Ensemble Learning-Based Detection (ELBD) framework, including three classic linear ensemble methods (maximum, average, and weighted average) and a deep ensemble method.

Basic idea
The ELBD framework can be seen in Fig. 2. First, the input is multivariate time-series monitoring data, which can include system-level and service-level data. In this paper, we mainly focus on system resource data. We represent the input data as $D_i^t$, where $i \in [1, ..., n]$ is the index of resource metrics, $n$ is the number of resource metrics, and $t \in \mathbb{N}^*$ is the index of timestamps. Next, pre-processing needs to be done for the input data, including feature extraction and train/test split. Feature extraction has been introduced in Feature extraction section. There is no need to do the train/test split for unsupervised learning. However, the train/test split is important to avoid over-fitting for weakly-supervised learning. Therefore, we do the train/test split for the deep ensemble method, as seen in the experimental settings. After pre-processing, data $D_j^t$, where $j \in [1, ..., d]$ is the index of data dimensions and $d$ is the number of dimensions after reduction, will be the input of anomaly detection methods.
The base method selection provides unsupervised detection methods. In this paper, we manually select four typical base methods, which have been introduced in detail in Base detection methods section. The output of base methods can be assembled as an anomaly score matrix. For the matrix, we provide three linear ensemble methods without training and a deep ensemble method, which needs to be trained with a neural network. The output of anomaly detection methods can be represented as $C_m^t$, where $m$ is the index of all detection methods. We mainly focus on accuracy, robustness, and multi-step prediction ability to evaluate the multiple detection methods.

Linear ensemble methods
The outputs of base methods have different meanings and scales. For example, the anomaly score of IForest is calculated based on path depth, and that of KNN is based on distance. Because all the features should be measured in the same units, we apply z-score normalization [44] to ensure that all outputs have the same scale. The z-score method uses the mean and standard deviation of the original data for normalization so that the processed data has zero mean and unit standard deviation. After normalization, we can represent the anomaly score vector $C_k^t$ of each base method as $O_k^t$. Here, $k$ is the index of base detection methods, $k \in [1, r]$, and $r$ is the number of base methods. Therefore, by taking each anomaly score vector as a column, we can get the anomaly score matrix $M = [O_1, O_2, \ldots, O_r]$. The left side of Table 5 can be seen as an example of the matrix. For matrix $M$, we provide linear ensemble methods first, including maximum ensemble, average ensemble, and weighted average ensemble.
The maximum ensemble is to select the max value of each row in matrix M and form a new anomaly score vector. The average ensemble is to calculate the average of each row and form a new anomaly score vector.
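A minimal NumPy sketch of the normalization step and of the maximum and average ensembles is given below. The raw scores are hypothetical values for five samples and four base methods, not the entries of Table 5, and `zscore_columns` is an illustrative helper name.

```python
import numpy as np

def zscore_columns(raw_scores):
    """Normalize each column (one base method) to zero mean and unit std."""
    mean = raw_scores.mean(axis=0)
    std = raw_scores.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    return (raw_scores - mean) / std

# Hypothetical raw scores: rows are timestamps, columns are base methods.
raw = np.array([
    [0.10, 2.0, 1.1, -0.5],
    [0.12, 2.5, 1.0, -0.4],
    [0.45, 9.0, 3.2, 1.8],
    [0.11, 2.2, 0.9, -0.6],
    [0.30, 6.0, 1.2, 0.9],
])
M = zscore_columns(raw)           # anomaly score matrix
max_ensemble = M.max(axis=1)      # maximum of each row
avg_ensemble = M.mean(axis=1)     # average of each row
```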
A limitation of the average ensemble is that each base detection method contributes equally to the final anomaly scores. However, some methods perform better or worse than others. Therefore, we can consider assigning different weights to these methods, for example, more weight to better methods and less to worse ones. The weighted average ensemble is developed based on this idea. It rests on the assumption that a mixed model which maximizes the information provided by each model has the best weight distribution strategy. Mutual Information (MI) can measure the difference between models and can therefore be used to calculate the weight of each base method [45]. To calculate the mutual information of two models, we first need to convert anomaly scores into anomaly classes (0 or 1). We assume $n$ samples in the two models, $a$ and $b$. Next, we use $N_0^a$ and $N_1^a$ to represent the number of normal and abnormal data in model $a$, and $N_0^b$ and $N_1^b$ to represent the number of normal and abnormal data in model $b$. In addition, $N_0^{ab}$ and $N_1^{ab}$ represent the number of data detected as normal and abnormal by both models. From these counts, we calculate the MI of models $a$ and $b$, normalize it, and average the normalized values to obtain the average mutual information $\sigma_k$ of each base method $k$. $\sigma_k$ is the standardized value of the difference between models, with $\sigma_k \in [0, 1]$. The smaller the value, the greater the difference between the two models. Based on the difference value of each model, we calculate the weights with $w_k = \sigma_k \cdot Z$, where $Z$ is the normalization factor. The new anomaly score vector is then calculated as the weighted average of the normalized scores. In Table 5, we provide five samples as an example to show how the maximum, average, and weighted average ensemble methods work. In the left part of the table, we show the anomaly scores of four detection methods. In the right part, we can easily get the maximum and average anomaly scores.
As for the weighted average ensemble, we assign the weights as (0.39, 0.28, 0.04, 0.29) for base methods based on the calculation. These new anomaly score vectors will be used to identify anomalies and evaluate the performance of these ensemble methods.
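The weighted average can be illustrated with the weights quoted above. In this sketch the score matrix holds hypothetical normalized scores, the mapping of weights to columns is assumed to follow the order in which the base methods are listed, and `weights_from_sigma` assumes that the normalization factor Z simply rescales the values so that the weights sum to one.

```python
import numpy as np

def weights_from_sigma(sigma):
    """Assumed form of w_k = sigma_k * Z, with Z = 1 / sum(sigma)."""
    sigma = np.asarray(sigma, dtype=float)
    return sigma / sigma.sum()

def weighted_average_ensemble(M, weights):
    """Combine the columns of the score matrix M with the given weights."""
    weights = np.asarray(weights, dtype=float)
    return M @ (weights / weights.sum())

# Weights reported in the text for the four base methods.
weights = np.array([0.39, 0.28, 0.04, 0.29])

# Hypothetical normalized anomaly scores (rows: samples, columns: base methods).
M = np.array([
    [1.0, 1.0, 1.0, 1.0],
    [-0.5, 0.2, 1.4, -0.1],
    [2.1, 1.7, -0.3, 1.9],
])
weighted_scores = weighted_average_ensemble(M, weights)
```

With these weights the third column contributes very little, so a method that disagrees with the others barely moves the combined score.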

The deep ensemble method
The ensemble methods above try to combine different anomaly scores linearly. However, the linear combination may not represent the information extracted by each model well. Therefore, we provide a deep ensemble method in Fig. 3, which combines base methods in a nonlinear way by using a Multi-Layer Perceptron (MLP). An MLP is a type of feed-forward neural network. It consists of three kinds of layers: the input layer, the hidden layers, and the output layer. An MLP is suitable for classification or regression problems where inputs are assigned a class or real-value label. Therefore, the deep ensemble method is weakly-supervised and needs to be trained with some labels. Considering that few labels are available in reality, we train the deep ensemble with a small number of labels and then test the trained model. We provide the MLP architecture in Fig. 3. The input layer first receives the anomaly score matrix $M$. We have two hidden layers consisting of an arbitrary number of neurons and use ReLU as the activation function. The output layer has one neuron and outputs the probability using the softmax activation function. $W^{(1)}$ and $b^{(1)}$ are the weights and biases of the first layer, and $W^{(2)}$, $b^{(2)}$ and $W^{(3)}$, $b^{(3)}$ are the weights and biases of the two hidden layers. The output is calculated layer by layer from these parameters.
For the output $h^{(3)}$, we can calculate the difference between the predicted result and the actual result y with the cross-entropy error function below. Here, y is the label at time t. The optimization goal is to minimize this function by constantly adjusting the parameters.
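The forward pass and the error function described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the layer sizes (4, 20, 20, 1) follow the experimental settings reported later, the random initialization is arbitrary, and the single output neuron uses a sigmoid, since a softmax over one neuron would always return 1. The helper names (`dense`, `init_layer`, `forward`) are hypothetical.

```python
import math
import random

def relu(v):
    """Element-wise ReLU activation."""
    return [max(0.0, x) for x in v]

def sigmoid(x):
    """Logistic function mapping a real value to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dense(weights, bias, inputs):
    """Affine map W x + b for one sample; one weight row per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (an arbitrary choice)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(params, scores):
    """h1 and h2 are the ReLU hidden activations; h3 is the output probability."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(dense(w1, b1, scores))
    h2 = relu(dense(w2, b2, h1))
    h3 = sigmoid(dense(w3, b3, h2)[0])  # assumption: sigmoid, not softmax
    return h3

def binary_cross_entropy(y, p, eps=1e-12):
    """Cross-entropy between label y (0 or 1) and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

rng = random.Random(0)
params = [init_layer(20, 4, rng), init_layer(20, 20, rng), init_layer(1, 20, rng)]

# One row of the score matrix M: the four base methods' scores for one sample.
p = forward(params, [0.9, 0.8, 0.2, 0.7])
loss = binary_cross_entropy(1, p)
```

Training would repeat this forward pass over mini-batches and update the parameters by gradient descent on the loss; that step is omitted here to keep the sketch short.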
The deep ensemble method needs to be trained with only a few labels, and the trained model can then be applied to other data to detect anomalies. If we let y be the label at time t + s (where s is the number of steps ahead), we can train a model with prediction ability. In summary, the ELBD framework aims to improve detection accuracy and robustness and to predict anomalies. Experimental results are presented next.
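The label shift that turns detection into prediction can be sketched in a few lines. This is an illustrative helper under the assumption that samples are equally spaced in time; the function name `make_prediction_pairs` and the example values are hypothetical.

```python
def make_prediction_pairs(score_rows, labels, s):
    """Pair the score vector at time t with the label at time t + s.

    With s = 0 this reduces to ordinary detection; with s > 0 the last s
    score rows are dropped because their future labels are unknown.
    """
    if s < 0:
        raise ValueError("s must be non-negative")
    if len(score_rows) != len(labels):
        raise ValueError("score_rows and labels must have the same length")
    n = len(labels) - s
    if n <= 0:
        return [], []
    return score_rows[:n], labels[s:]

# Six time steps of (hypothetical) base-method scores and labels.
rows = [[0.1, 0.2], [0.2, 0.1], [0.6, 0.7], [0.9, 0.8], [0.8, 0.9], [0.2, 0.1]]
labels = [0, 0, 0, 1, 1, 0]

x_train, y_train = make_prediction_pairs(rows, labels, s=2)
```

Here the scores at time steps 0 to 3 are paired with the labels at time steps 2 to 5, so a model fitted on these pairs learns to flag an anomaly two steps before it is labeled.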

Experimental settings
We design two experiments to evaluate the performance of the ELBD framework and compare the results with those in the "Base performance anomaly detection methods" section.
• Performance of methods in the ELBD framework. To evaluate the improvement in detection accuracy and algorithm robustness, we compare the performance of methods in the ELBD framework with the best-performing base detection method. Experiment results can be seen in E1.
• Multi-step prediction of the deep ensemble method. As for the deep ensemble method, we evaluate its multi-step prediction ability, which can be seen in E2.
Fig. 3 The architecture of the deep ensemble method includes four steps: (a) pre-processed data is sent to four (b) base methods; then, after normalization, the (c) ensemble of their outputs forms a score matrix; we finally input the score matrix into an (d) MLP for training

No hyper-parameter exists for the maximum, average, and weighted average ensemble methods. We first do the train/test split for the deep ensemble method. Because there are few labels in real scenarios, we use only 10% of the data, with labels, to train the model. Next, the hyper-parameters of the MLP are the same for the three datasets. The input layer has 4 neurons because we have 4 base methods. In addition, we set 20 neurons in the two hidden layers and 1 neuron in the output layer. We train for 100 epochs and set the batch size to 20. We use the Adam optimizer for stochastic gradient descent with an initial learning rate of $10^{-3}$ during model training. We train the deep ensemble method 10 times. We show error bars in the figures and report the average of the evaluation metrics, such as F1 score and time, in the tables as the final result.
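The settings above can be collected into a small configuration sketch. Two points are assumptions rather than facts from the text: that each hidden layer has 20 neurons, and that the 10% training portion is taken as the first 10% of the time-ordered samples. The names `SETTINGS`, `split_by_time`, and `mean_and_std` are hypothetical.

```python
# Hyper-parameters as reported in the text (layer sizes: assumed 20 per hidden layer).
SETTINGS = {
    "layer_sizes": (4, 20, 20, 1),
    "epochs": 100,
    "batch_size": 20,
    "optimizer": "adam",
    "learning_rate": 1e-3,
    "train_fraction": 0.10,
    "repeats": 10,
}

def split_by_time(rows, labels, train_fraction):
    """Assumed split: the first fraction of the series is used for training."""
    cut = int(len(rows) * train_fraction)
    return (rows[:cut], labels[:cut]), (rows[cut:], labels[cut:])

def mean_and_std(values):
    """Average and (population) standard deviation over repeated runs,
    i.e. the reported metric and the half-length of an error bar."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var ** 0.5

# 200 hypothetical samples, each with four base-method scores.
rows = [[0.0, 0.0, 0.0, 0.0] for _ in range(200)]
labels = [0] * 200
train, test = split_by_time(rows, labels, SETTINGS["train_fraction"])
```

With 200 samples, the split yields 20 training samples (one batch of size 20) and 180 test samples.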

E1: Performance of methods in ELBD framework.
We provide different methods in the ELBD framework to improve detection performance. We apply these methods to the DApp monitoring, SMD, and Vichalana datasets to evaluate them. We compare these methods with the best-performing base method and evaluate the detection accuracy and robustness.
For the DApp monitoring data in Fig. 4, we can see that the F1 score of the weighted average ensemble is higher than those of KNN and the maximum and average ensembles, which shows that ensemble methods can improve the detection accuracy by integrating the information extracted by base methods. In addition, the weighted average ensemble assigns weights to base methods to highlight their different contributions. The most noteworthy thing in Fig. 4 is that the deep ensemble method has the highest F1 score, 0.8381. We train the deep ensemble method with only 10% of the data labeled, but the improvement is significant. The result shows that the nonlinear combination of base methods can extract more information and help improve detection accuracy. As for time spent, in Fig. 5, we can see that the deep ensemble method spends about 0.9 s on testing, and the other ensemble methods spend about 0.8 s. The time spent by each method on the DApp monitoring data is under 1 s, which is low overall.
For SMD data in Fig. 4, we can see that the F1 score of IForest is 0.7515, which is higher than those of the maximum, average, and weighted average ensemble methods. Ensemble methods rely heavily on base methods, and the other base methods (KNN, LOF, and OCSVM) perform poorly. The most important thing is that the deep ensemble has the best F1 score, 0.8152, which is much higher than those of the other methods, showing its superior detection ability by integrating information non-linearly. Figure 5 presents the time spent by these methods. We can see that the maximum, average, and weighted average ensembles spend about 26.2 s, and the deep ensemble spends about 27.8 s. Because ensemble methods rely on base methods, their time spent is mainly due to the kernel function calculation in OCSVM and, for the deep ensemble, the computational cost of the neural network.

For Vichalana data in Fig. 4, we can see that the F1 scores of the maximum and average ensembles are higher than that of OCSVM, which shows the detection performance improvement of ensemble-based methods. In contrast, the weighted average ensemble does not assign weights well. In addition, the deep ensemble has the best F1 score, 0.8438, which greatly improves detection accuracy compared with other methods and shows the advantages of the non-linear combination of base methods. Figure 5 presents the time spent by these methods. We can see that the maximum, average, and weighted average ensembles spend about 190 s, and the deep ensemble spends about 194 s. The time spent is still mainly because the large-scale data makes the kernel function calculation in OCSVM time-consuming. In addition, the neural network's computation adds a small amount of time.
As for algorithm robustness, we provide rank results in Table 6. We rank the detection accuracy of all methods, including base methods and methods in the ELBD framework, and calculate their average rank and robustness score, respectively. In Table 6, we can see that the deep ensemble method has the best detection accuracy on the three different datasets (the DApp monitoring data, SMD, and Vichalana data), which shows that it has not only superior detection accuracy but also outstanding robustness for different data distributions. Other ensemble methods have good robustness compared with base detection methods. In contrast, base methods show performance inconsistency, except for IForest, which has quite good robustness compared with other base detection methods. In conclusion, we can say that methods in the ELBD framework improve detection performance in terms of detection accuracy and robustness, especially the deep ensemble method.

E2: Multi-step prediction of the deep ensemble method.
With the deep ensemble method, we can predict multi-step performance anomalies. We mainly test its prediction ability on the DApp monitoring data. The time interval in the DApp monitoring data is 15 s. Thus, we can use every 4 steps, which is 1 minute, as the prediction step. Then, we predict whether an anomaly will happen after one, two, or three minutes. To evaluate the prediction ability, we present the prediction accuracy with the F1 score in Fig. 6.
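The horizon arithmetic and the evaluation metric can be made concrete with a short sketch. The 15 s sampling interval comes from the text; the function names and the toy label vectors are hypothetical and only illustrate the calculation.

```python
def minutes_to_steps(minutes, interval_seconds=15):
    """Number of sampling steps that cover the given prediction horizon."""
    return (minutes * 60) // interval_seconds

def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the anomaly class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Prediction horizons in minutes mapped to step offsets s.
horizons = {m: minutes_to_steps(m) for m in (1, 2, 3, 4)}

# Toy example: one true positive, one false positive, one false negative.
score = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
```

A one-minute horizon corresponds to s = 4 steps and a four-minute horizon to s = 16 steps; the F1 score is then computed between the labels at time t + s and the predictions made at time t.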
In Fig. 6, we can see that the longer the prediction time, the lower the detection accuracy, which means that it is difficult to predict long-term anomalies because the dependency between data diminishes over time. In addition, we can see that all F1 scores within four minutes are higher than 0.8, which is good detection accuracy. Therefore, we can say that the deep ensemble can predict anomalies in the next four minutes with high accuracy. We also show the time spent testing the prediction ability in Fig. 6. We can see that the testing time is around 1.1 s, meaning that the deep ensemble method can predict anomalies quickly.

For all the detection methods, we provide Table 7 to compare their performance in terms of detection accuracy, algorithm robustness, and multi-step prediction. In the table, we can see that neither base detection methods nor linear ensemble methods have prediction ability. In addition, we can notice that IForest and the weighted average ensemble have good detection accuracy and robustness. The most important thing is that the deep ensemble method addresses all three challenges and has the highest ARP_score, 5.1821, which is much better than those of the other methods.
In conclusion, we provide the performance evaluation of ensemble methods in the ELBD framework. Our experiments show that these methods improve detection accuracy and robustness by integrating extracted information from base methods. Among those, the deep ensemble method has superior detection performance in terms of accuracy, robustness, and multi-step prediction. In addition, results show that the deep ensemble method can predict anomalies in the next four minutes with high accuracy.

Discussion
This paper provides an ELBD framework for performance anomaly detection and prediction of cloud applications. Its ensemble methods are developed based on four base methods to improve detection performance. Our experiments evaluate the performance of methods in the ELBD framework and show an improvement in detection accuracy, robustness, and multi-step prediction ability. However, some aspects of these methods and experiments can still be improved.
To handle noise in monitoring data, we first apply feature extraction when pre-processing data. We use PCA to filter features and reduce data dimensions. PCA is a general feature extraction method that can easily be used on many datasets and improve detection efficiency. However, PCA has some limitations, such as assuming that the features in the data are linearly related. Therefore, other feature extraction methods like AutoEncoder [46] can be considered in the future.
Our experiments show that the performance of the four base detection methods varies across the three datasets. The performance inconsistency arises because each method extracts different features from the data. Moreover, the outputs of these base methods are used as the inputs of the subsequent ensemble methods, so the choice of base methods strongly affects detection performance. In this paper, we manually select the four base detection methods based on their differences. However, a method to automatically select suitable base detection methods while considering data distribution can be researched in the future.
The capacity of the deep ensemble can be tested further. In our experiments, the deep ensemble is trained with few labels and has outstanding performance compared with other detection methods. Next, we can test the effects of different numbers of labels. Also, we can consider replacing the MLP with other deep neural networks like LSTM [47] to improve detection accuracy.
Performance anomaly detection methods can be applied to other monitoring data, such as blockchain-level data in DApps. Furthermore, based on performance anomaly detection, root cause analysis can be researched in the future to localize root causes of performance anomalies. For example, when application response time is high, we need to determine whether the root cause is a cloud resource problem or a service-level delay.

Conclusions and future work
This paper focuses on performance anomaly detection and prediction of cloud applications, which need to satisfy three challenging requirements: high detection accuracy, robustness, and multi-step prediction. Based on our survey, many machine learning-based methods have been developed for performance anomaly detection. However, these detection methods perform inconsistently across datasets and rarely satisfy the three requirements simultaneously. Therefore, based on existing performance anomaly detection methods, we provide an ELBD framework that integrates existing detection methods to address the three requirements.
We first apply four base detection methods (IForest, KNN, LOF, OCSVM) to study the monitoring data characteristics. The results show that these base methods perform differently on datasets with different data patterns. Then, based on these methods, we develop an ELBD framework (maximum, average, weighted average, and deep ensemble) that integrates existing detection methods for improving detection performance. Our experiments show that methods in the ELBD framework significantly improve detection accuracy and robustness, especially the deep ensemble method. In addition, the deep ensemble method has multi-step prediction ability and can predict anomalies in the next four minutes with high accuracy. We also evaluate detection performance with our indicator, and the results show that the deep ensemble method has the highest ARP_score, 5.1821, which is much better than those of the other methods. This paper provides an ensemble-based framework for performance anomaly detection of cloud applications, and the results show that the AI-based deep ensemble method has superior performance in terms of detection accuracy, robustness, and prediction ability. However, some aspects of this research can still be improved. For example, we can perform feature selection for multivariate monitoring data, and more experiments and extensions for deep ensemble methods can be researched in the future. In addition, to apply AI methods that help operators and developers better implement performance management of cloud applications, several future research directions can be discussed based on [48].
Data security. For a running cloud application, large-scale monitoring data is collected, which makes it necessary to consider implementing secure data governance for collected data. Collected performance data is mostly stored in centralized or distributed environments, with a high risk of being attacked or stolen [49]. Blockchain-based data storage has been developed recently in IoT [50]. However, blockchain-based storage technologies still have challenges such as durability, availability, and cost, which need to be explored more in the future.
Data labeling. High-quality labeled data can be very helpful in improving detection accuracy. However, few labels are available in real scenarios, and labeling data manually is onerous and time-consuming. Active learning [51] has been developed to address label scarcity by combining machine and human labor. Therefore, we consider that automated data annotation methods based on active learning can be explored more in the future, for example, to reduce human labor and improve the quality of labels.
Detection efficiency. Besides model robustness and accuracy, efficiency is important for detection methods to meet users' requirements, given the large volume of performance data. Machine learning methods, especially deep learning methods, usually have high detection accuracy but time-consuming model training [52]. Only a few statistical-based methods for improving detection efficiency have been developed [53]. Therefore, improving model efficiency and achieving accurate real-time online detection is worth exploring in the future.
Model explainability. For detected anomalies, it is natural to explore why these anomalies happen. Explainable AI [54] has been researched for deep learning models, which are typically viewed as black boxes. Self-explainable methods like IForest have been explored in this paper. However, the explanation of the deep ensemble method can be explored more in the future. In addition, root cause localization to identify metrics that cause anomalies should be investigated more in the future despite complex dependencies between metrics.