Robust and accurate performance anomaly detection and prediction for cloud applications: a novel ensemble learning-based framework

Xin, Ruyue; Liu, Hongyun; Chen, Peng; Zhao, Zhiming

doi:10.1186/s13677-022-00383-6

Research
Open access
Published: 14 January 2023

Ruyue Xin¹,
Hongyun Liu¹,
Peng Chen² &
Zhiming Zhao¹

Journal of Cloud Computing volume 12, Article number: 7 (2023)

Abstract

Effectively detecting run-time performance anomalies is crucial for clouds to identify abnormal performance behavior and forestall future incidents. To be used for real-world applications, an effective anomaly detection framework should meet three main challenging requirements: high accuracy for identifying anomalies, good robustness when application patterns change, and prediction ability for upcoming anomalies. Unfortunately, existing research about performance anomaly detection usually focuses on improving detection accuracy, while little research tackles the three challenges simultaneously. We conduct experiments for existing detection methods on multiple application monitoring data, and results show that existing detection methods usually focus on different features in data, which will lead to their diverse performance on different data patterns. Therefore, existing anomaly detection methods have difficulty improving detection accuracy and robustness and predicting anomalies. To address the three requirements, we propose an Ensemble Learning-Based Detection (ELBD) framework which integrates existing well-selected detection methods. The framework includes three classic linear ensemble methods (maximum, average, and weighted average) and a novel deep ensemble method. Our experiments show that the ELBD framework realizes better detection accuracy and robustness, where the deep ensemble method can achieve the most accurate and robust detection for cloud applications. In addition, it can predict anomalies in the next four minutes with an F1 score higher than 0.8. The paper also proposes a new indicator $ARP\_score$ to measure detection accuracy, robustness, and multi-step prediction ability. The $ARP\_score$ of the deep ensemble method is 5.1821, which is much higher than other detection methods.

Introduction

The run-time status of cloud applications can be continuously monitored through system-related metrics, e.g., CPU and memory usage [1]. Performance anomaly detection plays a vital role in operating cloud services and applications [2, 3]. Cloud performance anomalies such as degraded response time, often caused by underlying system resource shortages, may severely affect an application’s quality of user experience (QoE) and quality of service (QoS). Therefore, effectively analyzing patterns in monitored system-related metrics and identifying abnormal performance in real time is crucial for continuously delivering the business value of a cloud application. In this context, we can highlight three challenging requirements for a performance anomaly detection framework. First, the detection must achieve high accuracy to ensure anomalies can be found as accurately as possible. Second, detection algorithm robustness is essential. Different monitoring datasets follow different data distributions, which requires a robust algorithm that copes with changes in data patterns and maintains consistent performance. Finally, to prevent potential application violations effectively, it is vital to make a multi-step prediction of future anomalies.

Existing anomaly detection methods have often been developed using statistics-based [4] or machine learning-based [5, 6] methods. Most methods focus on improving detection accuracy. For example, Audibert et al. [7] developed USAD based on an adversarially trained AutoEncoder and achieved the best detection accuracy. Studies on improving the robustness of detection methods usually use adversarial training, which needs to make a trade-off between robustness and accuracy [8], rather than simultaneously improving accuracy and robustness. In addition, research on anomaly prediction mainly focuses on one-step prediction [9], which has limited effect in preventing potential performance anomalies. Existing research explores different aspects of the three challenging requirements, but few studies simultaneously tackle the challenges of accuracy, robustness, and multi-step prediction ability. Besides, there are no effective indicators to measure the combination of the three requirements.

Moreover, the development of performance anomaly detection has to handle two data challenges.

Missing data labels. Most monitoring data does not contain labels that can be immediately used for training a machine learning-based model, and labeling time-series data is often manual and time-consuming.
Data noise. Monitoring data collected from a distributed network often contains noise, which can significantly influence the performance of anomaly detection methods and increase false-positive detections.

Thus, for performance anomaly detection, we define our research question as “how to effectively detect and predict performance anomalies with high accuracy and good robustness?”. To address the two data challenges, we focus on unsupervised and weakly supervised detection methods and provide feature extraction to filter noise in data. To answer our research question, we first explore existing unsupervised anomaly detection methods and observe their detection performance on different datasets. Then, to improve detection accuracy, robustness, and prediction ability, we develop an Ensemble Learning-Based Detection (ELBD) framework that incorporates classic detection methods rather than enhances a single model. The contributions in this paper mainly include:

We characterize four typical base detection methods on different datasets, and the results show that their performance falls short in detection accuracy, robustness, and prediction.
Based on the base detection methods, we propose an ELBD framework including three classic linear ensemble methods (maximum, average, and weighted average) and a deep ensemble method.
We propose $ARP\_score$ to evaluate detection performance in terms of accuracy, robustness, and multi-step prediction.
We evaluate the methods in the ELBD framework on different datasets, and the results show that the deep ensemble method achieves the highest $ARP\_score$, 5.1821.

The rest of the paper is organized as follows. In Related works section, we review existing performance anomaly detection methods, specifically ensemble learning. In Base performance anomaly detection methods section, we provide base detection methods and an evaluation of their performance. In Ensemble learning-based detection framework section, we propose the ELBD framework and evaluate detection accuracy, robustness, and prediction ability. Finally, discussion and conclusion are provided in Discussion and Conclusion and future work sections.

Related works

Performance anomaly detection is a process of detecting abnormal performance phenomena and predicting anomalies to forestall future incidents [10]. Research on performance anomaly detection is advancing rapidly, and machine learning methods are widely applied [11]. This section briefly reviews machine learning-based anomaly detection methods and specifically highlights ensemble learning.

Machine learning-based anomaly detection methods

Machine learning-based anomaly detection methods can be reviewed in terms of supervised, semi-supervised, and unsupervised learning. Supervised learning methods have high accuracy [5], but they are ineffective for application monitoring data because data labels are usually missing in reality and manually labeling data is time-consuming. Semi-supervised learning methods are developed for cases where few labels exist, and unsupervised learning methods are used when no labels exist. Semi-supervised methods typically outperform unsupervised methods, but unsupervised methods are better suited for actual industrial scenarios [12].

In Table 1, we provide a classification of unsupervised performance anomaly detection methods. The table includes traditional methods such as tree-based, kernel-based, distance-based, and density-based methods. They usually focus on different features in data, and their performance varies across datasets, which will be verified in Base performance anomaly detection methods section. Deep learning methods have also developed rapidly in recent years. For example, Su et al. [13] provide a stochastic recurrent neural network named OmniAnomaly for multivariate time series anomaly detection. Deep learning methods can achieve high detection accuracy, but model training is usually time-consuming.

Table 1 A classification of classic unsupervised performance anomaly detection methods

Researchers usually improve algorithm robustness through adversarial training, which uses deep learning methods to defend generated adversarial examples [23]. For example, Hashemi et al. [24] enhance the robustness of an intrusion detection system in the presence of adversarial examples by utilizing denoising autoencoders. However, there is usually a trade-off between model accuracy and robustness [8], which makes it a challenge to improve model robustness and accuracy simultaneously. In addition, research on anomaly prediction usually focuses on univariate data and one-step prediction. For example, Wu et al. [9] provide a prediction-driven anomaly detection method that relies on Long Short Term Memory (LSTM) with univariate time-series data.

In conclusion, machine learning methods, especially semi-supervised and unsupervised methods, can be considered for performance anomaly detection because few labels exist. Since different methods usually focus on different data features, we can consider integrating existing methods, for example, LOF, KNN, OCSVM, and IForest in Table 1. To improve detection accuracy and robustness simultaneously, we can integrate existing detection methods with ensemble learning instead of adversarial training. We introduce related work on ensemble learning next. In addition, we provide a multi-step prediction based on multivariate metrics for performance anomalies in this paper.

Ensemble learning

Ensemble learning is proposed to improve the accuracy and reduce the variance of an automated decision-making system [25]. The primary assumption of ensemble learning is that by combining several base models, the errors of a single model will likely be compensated by other models [26]. For anomaly detection, ensembles that combine anomaly scores by taking the maximum or the average can be found in [27]. Research about ensemble learning can be reviewed based on supervised classification, semi-supervised and unsupervised clustering ensemble.

Some research already focuses on ensemble learning with machine learning methods. As for supervised ensemble learning, Tyralis et al. [28] propose an ensemble learning method by combining ten machine learning algorithms and estimating the weights through a k-fold cross-validation procedure. Tama et al. [29] propose a stacked ensemble, which uses three classifiers (random forest, gradient boosting machine, and XGBoost) and provides a generalized linear model (GLM) as a combiner. Adeyemo et al. [30] focus on network intrusion detection and implement two ensemble methods and a deep learning method (LSTM). The two ensemble methods include a homogeneous method that uses an optimized bagged random forest algorithm and a heterogeneous method that is an averaged probability method of a voting ensemble for four standard classifiers. These studies of ensemble learning mainly focus on weight calculations or linear combinations of different base models.

Semi-supervised ensemble learning mainly focuses on expanding the labeled training set and utilizing the expanded training set to do classification or regression [31]. For example, Jian et al. [32] present a sample information-based synthetic minority oversampling technique to balance the labeled dataset and use variable weighted voting for integrating base models. This research addresses the data label issue with semi-supervised learning, but the ensemble is linear. Unsupervised ensemble learning, also known as consensus clustering, aims to find the optimal combination strategy for individual clusterings. Ensemble clustering can be classified into three categories: pair-wise co-occurrence based methods [33], graph partitioning based methods [34], and median partition-based methods [35]. Unlu et al. [36] provide a weighting policy based on internal clustering quality measures, which gives different importance to individual clusterings. This research provides a weighted policy to integrate individual models, but it also works linearly.

For ensemble learning, we can see that combining different methods is challenging, and many methods focus on linear combinations, which limits information extraction and detection performance improvement. Therefore, to obtain an effective model that improves detection accuracy and robustness and predicts performance anomalies, the nonlinear combination of different base detection methods is further investigated in this paper.

Base performance anomaly detection methods

For performance anomaly detection of cloud applications, we provide feature extraction for original data and explore the performance of four base detection methods in this section.

Problem definition

Multivariate time-series data are sequences of timestamped data points and can be represented as D. Each data point is $D_i^t$, where $i\in [1,...,n]$ is the index of resource metrics, n is the number of resource metrics, and $t\in N^{*}$ is the index of timestamps. Multivariate time-series anomaly detection learns the characteristics of data D observed up to timestamp t and determines whether the next observation $D^{t+1}$ is anomalous or not. For multi-step anomaly prediction, we use data D for training and determine whether each of $D^{t+1}, D^{t+2},..., D^{t+p}$ is anomalous, where p is the number of prediction steps.

In this paper, we first provide the performance of classic detection methods. Then we propose an ELBD framework, which is developed based on ensemble learning and aims to improve detection accuracy and robustness by integrating information extracted by classic detection methods non-linearly. In addition, we implement multi-step prediction ability in the deep ensemble method in ELBD framework.

Feature extraction

Multivariate data usually contains noise, which can induce unnecessary variance in a model. Therefore, the data needs to be preprocessed through feature extraction to remove redundant information and reduce the data dimension. For feature extraction, Principal Components Analysis (PCA) [37] is a classic and widely used method. PCA is an unsupervised method that uses eigenvalue decomposition to compress and denoise data, which makes it suitable as the feature extraction method considering that there are few or no labels in reality.

PCA is a method to transform a dataset with lots of variables into a smaller one containing most of the original information. The process steps of PCA are: 1) getting the covariance matrix of original features; 2) calculating eigenvectors and eigenvalues of the covariance matrix to identify principal components; 3) sorting eigenvalues and selecting eigenvectors with high eigenvalues as feature vectors; 4) recasting original data based on feature vectors. In step 3, the number of selected eigenvectors determines the data dimensions after reduction. In practice, we set the reduction dimension based on a calculated percentage of variance [38]. According to these calculations, PCA achieves principal feature selection and data dimension reduction. Finally, we apply PCA to all monitoring data and use the data with low dimensions as the input of anomaly detection methods.
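The four PCA steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in the experiments: the function name `pca_reduce` and the parameter `variance_to_keep` are hypothetical, and the data is assumed to be an array with one row per timestamp and one column per metric.

```python
import numpy as np

def pca_reduce(data, variance_to_keep=0.95):
    """Project data onto the fewest principal components that retain
    at least `variance_to_keep` of the total variance."""
    centered = data - data.mean(axis=0)
    # Step 1: covariance matrix of the original features.
    cov = np.cov(centered, rowvar=False)
    # Step 2: eigenvalues and eigenvectors (eigh suits symmetric matrices).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 3: sort by decreasing eigenvalue and keep the leading components.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(explained, variance_to_keep)) + 1
    k = min(k, data.shape[1])
    # Step 4: recast the original data onto the selected feature vectors.
    return centered @ eigvecs[:, :k], k
```

The returned `k` plays the role of the reduction dimension that is chosen from the calculated percentage of variance.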

Base detection methods

Different anomaly detection methods usually focus on different features in data, such as density-based and distance-based, and result in diverse performance on data. Therefore, to comprehensively understand the characteristics of monitoring data, we select four classic methods (IForest, KNN, LOF, OCSVM) in Table 1 as base detection methods.

IForest is based on the Decision Tree algorithm [39]. Many isolation trees make up an isolation forest, which makes anomaly detection results more credible. KNN is a distance-based algorithm [16]. It calculates each point’s distance (such as Euclidean or Manhattan) to its k nearest neighbors and uses the distance as an anomaly score. LOF is a density-based algorithm [6]. It compares a point’s local density to its neighbors’ local densities, and points with lower densities than their neighbors are considered anomalies. OCSVM is based on the Support Vector Machine (SVM) [18]. An SVM projects data through a non-linear function into a high-dimensional space, where points are separated into different classes. Because kernel function calculation is time-consuming, OCSVM is usually slow on large-scale data.

For each base method, the input is preprocessed data. The processing of input data includes model initialization, fitting data, and output anomaly scores. Model initialization includes the setup of hyper-parameters, such as anomaly fractions, which can be set based on data characteristics. After fitting the data, an anomaly score vector will be output. We use the anomaly score vector of each detection method to identify anomalies and evaluate the performance of each detection method.
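To make the initialize, fit, and score workflow concrete, the following sketch implements a distance-based detector in plain Python. It is an illustration under stated assumptions rather than the code used in the experiments: the function names are hypothetical, the anomaly score is taken to be the Euclidean distance to the k-th nearest neighbor (averaging over the k neighbors would be an equally plausible choice), and the anomaly fraction is turned into a threshold by flagging the highest-scoring samples.

```python
import math

def knn_anomaly_scores(points, k=5):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(distances[k - 1])
    return scores

def flag_anomalies(scores, anomaly_fraction):
    """Label the top `anomaly_fraction` of scores as anomalies (1), rest as 0."""
    n_anomalies = max(1, round(anomaly_fraction * len(scores)))
    threshold = sorted(scores, reverse=True)[n_anomalies - 1]
    return [1 if s >= threshold else 0 for s in scores]
```

The same two-stage pattern (score vector first, labels second) applies to the other base methods, which differ only in how the score is computed.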

Experiments and results

Dataset

In our experiments, we use monitoring data from a Decentralized Application (DApp) and two public datasets.

DApp monitoring data. In business scenarios where real-time transactions are required, e.g., energy trading or crowd journalism [40], the Quality of Service (QoS) metrics of a DApp, such as transaction throughput, latency, and failure rates, are critical to the business value. To deliver such a quality-critical DApp in cloud environments, one needs to select cloud services carefully, customize their capacities, and monitor the run-time status of the application. Figure 1 shows a DApp example developed with Hyperledger Fabric^{Footnote 1}. For the DApp, different organizations, which contain many peer nodes, are deployed on different cloud infrastructure services (VMs) and monitored by the tool Prometheus^{Footnote 2}. We use Prometheus to collect real-time data and use Caliper^{Footnote 3} to simulate workload generation.

For a running DApp, we mainly collect system resource metrics, which can be seen in Table 2. When the DApp receives transaction requests stably, we use stress-ng^{Footnote 4} to add system pressure, such as disk pressure, and thereby inject anomalies manually. We apply disk pressure for 20 minutes every hour. We monitor the DApp for twelve hours and collect data at 15-second intervals. Ultimately, the DApp monitoring data contains 3237 samples and 229 resource-related metrics for our experiments. The general information can be seen in Table 3.

Table 2 Description of system resource metrics

Table 3 General information of three datasets

Public datasets. SMD (Server Machine Dataset) is a dataset collected and made publicly available by a large internet company [13]. It contains data collected from many different server machines and includes 38 metrics. Domain experts have labeled anomalies in SMD based on incident reports. SMD is divided into two subsets of equal size: the first half is the training set and the second half is the testing set.

Vichalana is a multivariate time-series dataset that can be used for performance anomaly detection in API Gateways [41]. It has different anomalies, such as high CPU and memory usage. Performance metrics in this dataset are collected when the system operates in normal and anomalous mode. The information of SMD and Vichalana data used in our experiments can be seen in Table 3.

Experimental settings

The DApp monitoring data is collected from a deployed DApp in a cloud environment. We use Microsoft Azure^{Footnote 5} as the cloud environment and deploy the monitor component and DApp separately. The monitor component is deployed on a VM with Ubuntu 18.04 as the operating system, 2 CPUs, 4 GB of memory, and 32 GB of storage. The DApp is deployed on VMs with Ubuntu 18.04 as the operating system, 4 CPUs, 16 GB of memory, and 32 GB of storage.

For feature extraction in data pre-processing, PCA needs to retain as much variance information of the original data as possible, such as 95%. Therefore, we set reduction dimensions as 15 for DApp monitoring data, 5 for SMD data, and 6 for Vichalana data based on a calculated percentage of variance [38].

The hyper-parameters of each base detection method are set as follows. Anomaly fractions need to be determined first. For the DApp monitoring data, because we inject anomalies for 20 minutes every hour (roughly one third of the samples), the anomaly fraction is set as 0.3. For SMD and Vichalana data, we use the default anomaly fraction, which is 0.1. Next, the hyper-parameters of each base method need to be determined. We set the tree number for IForest to 100. The neighbor number in KNN is 5. In LOF, we set the neighbor number as 20. In OCSVM, we use the Radial Basis Function (RBF) kernel function.

Evaluation indicators

The performance of these detection methods is evaluated in three aspects: accuracy, robustness, and prediction ability. We use Precision, Recall, and F1 score to indicate accuracy. Precision is about how much of the data detected as anomalies is true anomalies, while recall is about how much of the real anomaly data is detected as anomalies. The F1 score is a function of both Precision and Recall.

$$\begin{aligned} F1\ score = 2*\frac{Precision*Recall}{Precision+Recall} \end{aligned}$$

(1)
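As a small illustration of Eq. 1, Precision, Recall, and the F1 score can be computed from binary label vectors as follows. The function name is hypothetical, and the convention that 1 marks an anomaly and 0 marks normal data is an assumption of this sketch.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1) for binary labels where 1 = anomaly."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Eq. 1: harmonic mean of Precision and Recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```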

Therefore, we mainly focus on the F1 score for detection accuracy. We also report the time spent by each unsupervised detection method and the test time of the deep ensemble method. For robustness, we test detection methods on three different datasets and rank their detection accuracy to represent performance consistency, which clearly shows how the detection methods compare [42]. We calculate robustness as the average ranking of a detection method on the three datasets. Finally, we normalize the rank and get the robustness score:

$$\begin{aligned} Robustness\ score = \frac{Rank-Rank_{max}}{Rank_{min}-Rank_{max}} \end{aligned}$$

(2)

Here, $Rank_{max}$ is the maximum of rank numbers, and $Rank_{min}$ is the minimum of rank numbers. We evaluate prediction ability with accuracy, which is also represented by the F1 score. We set an F1 score threshold of 0.8 and calculate the prediction score with

$$\begin{aligned} Prediction\ score = \sum\limits_{i=1}^{pt}{F1\ score_i} \end{aligned}$$

(3)

Here, pt is the furthest predicted time in minutes. The prediction score considers both the furthest prediction time and prediction accuracy because the longer time and more accurate prediction can make it easier to avoid anomalies for applications. Finally, we define the indicator $ARP\_score$ for each method considering detection accuracy, robustness, and prediction as:

$$\begin{aligned} ARP\_score= & {} \frac{1}{d}\sum\limits_{i=1}^d{(F1\ score_i+Robustness\ score_i}\nonumber \\&+Prediction\ score_i) \end{aligned}$$

(4)

Here, d is the number of datasets. We take the total of these three scores as the detection performance of a detection method on a dataset, and we take the average of these totals over all datasets as the final indicator of the method's detection performance.
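Eqs. 2 to 4 can be combined into a short sketch. Several points are assumptions of this illustration rather than statements from the text: all function names are hypothetical, ranks are assigned per dataset with rank 1 for the highest F1 score, and pt is interpreted as the longest run of consecutive prediction steps, starting from the first, whose F1 score stays at or above the threshold.

```python
def average_ranks(f1_by_dataset):
    """f1_by_dataset maps dataset -> {method: F1}. Returns method -> mean rank."""
    methods = list(next(iter(f1_by_dataset.values())))
    totals = {m: 0.0 for m in methods}
    for scores in f1_by_dataset.values():
        ordered = sorted(methods, key=lambda m: scores[m], reverse=True)
        for rank, method in enumerate(ordered, start=1):
            totals[method] += rank
    return {m: totals[m] / len(f1_by_dataset) for m in methods}

def robustness_score(rank, rank_min, rank_max):
    """Eq. 2: the best rank maps to 1 and the worst rank maps to 0."""
    return (rank - rank_max) / (rank_min - rank_max)

def prediction_score(f1_per_step, threshold=0.8):
    """Eq. 3: sum the F1 scores of the steps up to pt (assumed interpretation)."""
    total = 0.0
    for f1 in f1_per_step:
        if f1 < threshold:
            break
        total += f1
    return total

def arp_score(f1_scores, robustness_scores, prediction_scores):
    """Eq. 4: average over datasets of F1 + robustness + prediction score."""
    d = len(f1_scores)
    triples = zip(f1_scores, robustness_scores, prediction_scores)
    return sum(f + r + p for f, r, p in triples) / d
```

Because the prediction score is a sum over steps rather than an average, a method that predicts further ahead can obtain an $ARP\_score$ well above 3.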

Experimental results

We apply the four base methods (IForest, KNN, LOF, and OCSVM) to the DApp monitoring data, SMD, and Vichalana data. The performance of their detection accuracy can be seen in Table 4.

Table 4 Performance of different detection methods on three datasets. For each dataset, the F1 score of the best detection method is shown in bold

For the DApp monitoring data, we can see that KNN has the highest F1 score, 0.8033, suggesting that the data has clustering characteristics because KNN is good at identifying clusters in data. IForest takes into account different features in the data and usually has good detection performance [43], which also holds on the DApp monitoring data with an F1 score of 0.791. If the abnormal features are concentrated in a few dimensions, it is hard for LOF to detect anomalies. Accordingly, LOF has the lowest F1 score, 0.5143, for the DApp monitoring data. The F1 score of OCSVM is 0.737, which is not high enough because the projection through a kernel function does not separate normal and abnormal data very well. For time spent, we can see that IForest and OCSVM take about 0.3 s, which is longer than the other base methods because the calculation of features takes some time, but the time spent is under 0.5 s overall, which is low. As a result, for the DApp monitoring data, KNN is the best of the four base methods.

For SMD data, we can see that IForest has the highest F1 score, 0.7515, which shows the advantage of IForest for anomaly classification through multiple features. However, F1 scores are not high for the other base methods, suggesting that this dataset contains substantial noise and that the overall distributions of normal and abnormal data are similar. Thus, anomalies may be concentrated in a few features of the SMD data. In addition, the time spent on OCSVM is higher than on the other methods because the kernel function calculation in OCSVM is time-consuming. IForest has the best detection accuracy and takes about 1.3 s, which makes it the best base detection method for this dataset.

For Vichalana data, we can see that OCSVM has the highest F1 score, 0.6778, showing that the non-linear projection can classify normal and abnormal data but is not very accurate. The F1 score of IForest is 0.658, slightly lower than OCSVM, which means that abnormal data distribution varies in different features, making anomalies hard to detect. The F1 scores of KNN and LOF are quite low, showing that the overall distributions of normal and abnormal data are also similar. It is worth noting that the time spent on OCSVM is relatively high because the dataset includes more than 40k samples, and the kernel function calculation in OCSVM takes a long time. Here, IForest only takes about 2 s, which is much faster than OCSVM.

In conclusion, we can see that detection accuracy is not high enough for these base detection methods. In addition, the performance of these methods varies for the three datasets. For example, KNN performs the best on the DApp monitoring data but relatively poorly on the SMD and Vichalana data. Furthermore, these detection methods have no prediction ability. Thus, for the three challenges: high accuracy, good robustness, and multi-step prediction, it is critical to develop suitable performance anomaly detection methods for cloud applications.

Ensemble learning-based detection framework

Base detection methods focus on different features in data and have diverse performances. Therefore, it is reasonable to consider that the integration of base methods can extract more features from data and improve detection performance. Furthermore, ensemble learning is proposed with the assumption that by combining several base models, the errors of a single model will be compensated by others. Therefore, we consider integrating base methods with ensemble learning and propose an Ensemble Learning-Based Detection (ELBD) framework, including three classic linear ensemble methods (maximum, average, and weighted average) and a deep ensemble method.

Basic idea

The ELBD framework can be seen in Fig. 2. First, the input is multivariate time-series monitoring data, which can include system and service level data. In this paper, we mainly focus on system resource data. We can represent input data as $D_i^t$, where $i\in [1,...,n]$ is the index of resource metrics, n is the number of resource metrics, and $t\in N^{*}$ is the index of timestamps. Next, the input data is pre-processed, which includes feature extraction and a train/test split. Feature extraction has been introduced in Feature extraction section. There is no need to do the train/test split for unsupervised learning. However, the train/test split is important to avoid over-fitting for weakly-supervised learning. Therefore, we do the train/test split for the deep ensemble method, as seen in the experimental settings. After pre-processing, data $D_j^t$, where $j\in [1,...,d]$ is the index of data dimensions and d is the number of data dimensions after reduction, will be the input of anomaly detection methods.

The base method selection provides unsupervised detection methods. In this paper, we manually select four typical base methods, which have been introduced in detail in Base detection methods section. The output of base methods can be assembled as an anomaly score matrix. For the matrix, we provide three linear ensemble methods without training and a deep ensemble method, which needs to be trained with a neural network. The output of anomaly detection methods can be represented as $C_m^t$ (m is the index of all detection methods). We mainly focus on accuracy, robustness, and multi-step prediction ability to evaluate the multiple detection methods.

Linear ensemble methods

The outputs of base methods have different meanings and scales. For example, the anomaly score of IForest is calculated based on path depth, and that of KNN is based on distance. Because all the scores should be measured in the same units, we apply z-score normalization [44] to ensure that all outputs have the same scale. The z-score method uses the mean and standard deviation of the original data for normalization so that the processed data has zero mean and unit standard deviation. After normalization, we can represent the anomaly score vector $C_k^t$ of each base method as $O_k^t$. Here, k is the index of base detection methods and $k\in {[1,r]}$, where r is the number of base methods. Therefore, by taking each anomaly score vector as a column, we can get the anomaly score matrix M:

$$\begin{aligned} M = \left[ \begin{array}{cccc} O_{1}^{1} &{} O_{2}^{1} &{} O_{3}^{1} &{} O_{4}^{1} \\ O_{1}^{2} &{} O_{2}^{2} &{} O_{3}^{2} &{} O_{4}^{2} \\ \vdots &{} \vdots &{} \vdots &{} \vdots \\ O_{1}^{t} &{} O_{2}^{t} &{} O_{3}^{t} &{} O_{4}^{t} \\ \vdots &{} \vdots &{} \vdots &{} \vdots \end{array}\right] \end{aligned}$$

The left side of Table 5 can be seen as an example of the matrix. For matrix M, we provide linear ensemble methods first, including maximum ensemble, average ensemble, and weighted average ensemble.

The maximum ensemble is to select the max value of each row in matrix M and form a new anomaly score vector.

$$\begin{aligned} V_{max} = {\underset{k}{\max}}\ O_{k}^{t}, t\in N^{*} \end{aligned}$$

(5)

The average ensemble is to calculate the average of each row and form a new anomaly score vector.

$$\begin{aligned} V_{avg} = \frac{1}{r} \sum\limits_{k=1}^{r} O_{k}^{t}, t\in N^{*} \end{aligned}$$

(6)
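The normalization step and Eqs. 5 and 6 can be sketched in plain Python as follows. The function names are hypothetical, and using the population standard deviation for the z-score is an assumption of this sketch.

```python
from statistics import mean, pstdev

def z_normalize(scores):
    """Rescale one anomaly score vector to zero mean and unit std deviation."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0] * len(scores)
    return [(s - mu) / sigma for s in scores]

def build_score_matrix(score_vectors):
    """Stack the normalized score vectors as columns of matrix M.

    score_vectors[k][t] is the raw score of base method k at timestamp t;
    the result is a list of rows, one row per timestamp."""
    normalized = [z_normalize(v) for v in score_vectors]
    return [list(row) for row in zip(*normalized)]

def maximum_ensemble(matrix):
    """Eq. 5: the largest normalized score in each row."""
    return [max(row) for row in matrix]

def average_ensemble(matrix):
    """Eq. 6: the mean of the normalized scores in each row."""
    return [sum(row) / len(row) for row in matrix]
```

Both ensembles return one score per timestamp, so the result can be thresholded in the same way as the score vector of a single base method.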

A limitation of the average ensemble is that each base detection method contributes equally to the final anomaly scores. However, some methods perform better or worse than others. Therefore, we can consider assigning different weights to these methods. For example, we assign more weight to better methods and less to worse ones. The weighted average ensemble is a method developed based on this idea.

The weighted average ensemble is based on the assumption that a mixed model that maximizes the information provided by each model has the best weight distribution strategy. Mutual Information (MI) can measure the difference between models, which can be used to calculate the weight of each base method [45]. To calculate the mutual information of two models, we first need to convert anomaly scores into anomaly classes (0 or 1). Assume that two models, a and b, each classify the same n samples. We use $N_0^a$ and $N_1^a$ to represent the number of normal and abnormal data in model a, and $N_0^b$ and $N_1^b$ to represent the number of normal and abnormal data in model b. In addition, $N_0^{ab}$ and $N_1^{ab}$ represent the number of data detected as normal and abnormal, respectively, by both models. Then we can calculate the MI of models a and b:

$$\begin{aligned} I(A,B)= & {} N_0^{ab}\log {\frac{n*N_0^{ab}}{N_0^a*N_0^b}} + (N_0^a-N_0^{ab})\log {\frac{n*(N_0^a-N_0^{ab})}{N_0^a*N_1^b}} \nonumber \\&\ + (N_0^b-N_0^{ab})\log {\frac{n*(N_0^b-N_0^{ab})}{N_1^a*N_0^b}} + N_1^{ab}\log {\frac{n*N_1^{ab}}{N_1^a*N_1^b}} \end{aligned}$$

(7)

To normalize it, we can calculate:

$$\phi(A,B)=\frac{I(A,B)}{\sqrt{\left(\sum_{i=0}^1N_i^a\log\frac{N_i^a}n\right)\left(\sum_{i=0}^1N_i^b\log\frac{N_i^b}n\right)}}$$

(8)
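The counts in Eqs. (7) and (8) can be computed directly from two binary label vectors. The sketch below is one possible reading of those equations, not the authors' implementation: the function names are hypothetical, empty cells are assumed to contribute zero (the limit of $x\log x$ as $x \rightarrow 0$), and $\phi$ is assumed to be 0 when either model assigns all samples to a single class.

```python
import numpy as np

def _term(count, n, total_a, total_b):
    # One summand of Eq. (7). An empty cell contributes 0 (assumed convention).
    if count == 0:
        return 0.0
    return float(count * np.log(n * count / (total_a * total_b)))

def mutual_information(a, b):
    """Eq. (7) for two 0/1 label vectors of equal length."""
    a = np.asarray(a, dtype=int)
    b = np.asarray(b, dtype=int)
    n = a.size
    n0a, n1a = int(np.sum(a == 0)), int(np.sum(a == 1))
    n0b, n1b = int(np.sum(b == 0)), int(np.sum(b == 1))
    n0ab = int(np.sum((a == 0) & (b == 0)))  # normal in both models
    n1ab = int(np.sum((a == 1) & (b == 1)))  # abnormal in both models
    return (_term(n0ab, n, n0a, n0b)
            + _term(n0a - n0ab, n, n0a, n1b)   # normal in a, abnormal in b
            + _term(n0b - n0ab, n, n1a, n0b)   # abnormal in a, normal in b
            + _term(n1ab, n, n1a, n1b))

def _entropy_term(labels):
    # sum_i N_i * log(N_i / n); always <= 0.
    labels = np.asarray(labels, dtype=int)
    counts = np.bincount(labels, minlength=2)
    counts = counts[counts > 0]
    return float(np.sum(counts * np.log(counts / labels.size)))

def normalized_mi(a, b):
    """Eq. (8): MI divided by the geometric mean of the two entropy terms."""
    denom = np.sqrt(_entropy_term(a) * _entropy_term(b))
    if denom == 0.0:
        return 0.0  # assumed convention when one model outputs a single class
    return mutual_information(a, b) / denom

same = normalized_mi([0, 0, 1, 1], [0, 0, 1, 1])         # identical labelings
independent = normalized_mi([0, 0, 1, 1], [0, 1, 0, 1])  # unrelated labelings
```

In this sketch, two identical labelings give a value of 1 and two statistically independent labelings give a value of 0.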

Therefore, the average mutual information of each base method with respect to the other base methods is:

$$\sigma_k=\frac1{r-1}\sum_{l=1,l\neq k}^r\phi\left(\lambda^{(k)},\lambda^{(l)}\right),k\in{\lbrack1,r\rbrack}$$

(9)

Here, $\lambda ^{(k)}$ denotes the k-th base method. $\sigma _k$ is a standardized measure of the difference between models, with $\sigma _k\in [0,1]$. The smaller the value, the greater the difference between $\lambda ^{(k)}$ and the other base methods. Based on the difference value of each model, we calculate the weights as $w_k = \sigma _k*Z$, where Z is the normalization factor. The new anomaly score vector can be calculated as:

$$\begin{aligned} V_{w\_avg} = \frac{1}{r} \sum\limits_{k=1}^{r}\sigma _{k} *O_{k}^{t}, t \in N^{*} \end{aligned}$$

(10)
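A minimal sketch of Eqs. (9) and (10) follows. It assumes that the pairwise values $\phi$ are already available as a symmetric matrix and that the normalization factor Z is chosen so that the weights sum to one, which is consistent with the example weights reported for Table 5. Under that assumption the result differs from Eq. (10) only by a constant scale factor. All names and numbers are hypothetical.

```python
import numpy as np

def average_nmi(phi):
    """Eq. (9): mean of phi(k, l) over all l != k for a symmetric r x r matrix."""
    phi = np.asarray(phi, dtype=float)
    r = phi.shape[0]
    off_diagonal_sum = phi.sum(axis=1) - np.diag(phi)
    return off_diagonal_sum / (r - 1)

def weights_from_sigma(sigma):
    """w_k = sigma_k * Z, with Z assumed to be 1 / sum(sigma)."""
    sigma = np.asarray(sigma, dtype=float)
    return sigma / sigma.sum()

def weighted_avg_ensemble(M, w):
    """Weighted row-wise combination of the base-method scores."""
    return np.asarray(M, dtype=float) @ np.asarray(w, dtype=float)

# Hypothetical pairwise normalized MI for r = 3 base methods.
phi = np.array([
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.1],
    [0.5, 0.1, 1.0],
])
sigma = average_nmi(phi)
w = weights_from_sigma(sigma)

# Hypothetical scores: 2 time steps x 3 base methods.
M = np.array([
    [0.2, 0.4, 0.6],
    [1.0, 0.0, 0.0],
])
v_w_avg = weighted_avg_ensemble(M, w)
```

Normalizing the weights keeps the ensemble score on the same scale as the base scores, so a threshold tuned for the average ensemble remains meaningful.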

In Table 5, we provide five samples to show how the maximum, average, and weighted average ensemble methods work. The left part of the table shows the anomaly scores of the four detection methods. The right part shows the resulting ensemble scores; the maximum and average anomaly scores follow directly from each row. For the weighted average ensemble, the calculation above yields the weights (0.39, 0.28, 0.04, 0.29) for the base methods. These new anomaly score vectors will be used to identify anomalies and evaluate the performance of these ensemble methods.

Table 5 Linear ensemble methods example: the left side shows the anomaly scores obtained by each base method; the right side shows the anomaly scores obtained through the ensemble methods


The deep ensemble method

The ensemble methods above combine the anomaly scores linearly. However, a linear combination may not represent the information extracted by each model well. Therefore, we provide a deep ensemble method (Fig. 3) that combines the base methods nonlinearly using a Multi-Layer Perceptron (MLP). An MLP is a type of feed-forward neural network that consists of three kinds of layers: an input layer, one or more hidden layers, and an output layer. An MLP is suitable for classification or regression problems in which inputs are assigned a class or real-valued label. The deep ensemble method is therefore weakly supervised and needs to be trained with some labels. Considering that labels are scarce in practice, we train the deep ensemble with only a small number of labels and then test the trained model.

We provide the MLP architecture in Fig. 3. The input layer first receives the anomaly score matrix M. There are two hidden layers, each consisting of an arbitrary number of neurons and using ReLU as the activation function. The output layer has one neuron and outputs the probability using the softmax activation function. We define $x=[O_1^t, O_2^t, O_3^t, O_4^t]$. $W^{(1)}$, $b^{(1)}$ and $W^{(2)}$, $b^{(2)}$ are the weights and biases of the two hidden layers, and $W^{(3)}$, $b^{(3)}$ are the weights and biases of the output layer. The output is calculated as follows.

$$\begin{aligned} z^{(1)}= & {} W^{(1)}x + b^{(1)}, \nonumber \\ h^{(1)}= & {} ReLu(z^{(1)}), \nonumber \\ z^{(2)}= & {} W^{(2)}h^{(1)} + b^{(2)}, \nonumber \\ h^{(2)}= & {} ReLu(z^{(2)}), \nonumber \\ z^{(3)}= & {} W^{(3)}h^{(2)} + b^{(3)}, \nonumber \\ h^{(3)}= & {} softmax\left( z^{(3)}\right) \end{aligned}$$

(11)

For the output $h^{(3)}$, we calculate the difference between the predicted result and the actual result y with the cross-entropy error function below, where y is the label at time t. The optimization goal is to minimize this function by iteratively adjusting the parameters.

$$\begin{aligned} l = -y^{T}\log {h^{(3)}} \end{aligned}$$

(12)
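The forward pass of Eq. (11) and the loss of Eq. (12) can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the hidden sizes (8 and 8), the random initialization, and the input scores are hypothetical, and the output layer is assumed to have two units (normal, anomalous) with a one-hot label y, because a softmax over a single unit is constant. Training, that is, the gradient updates of the weights, is not shown.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - np.max(z)  # shift for numerical stability; result is unchanged
    e = np.exp(z)
    return e / e.sum()

def forward(x, params):
    """Eq. (11): two ReLU hidden layers followed by a softmax output."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu(W1 @ x + b1)
    h2 = relu(W2 @ h1 + b2)
    return softmax(W3 @ h2 + b3)

def cross_entropy(y, h3, eps=1e-12):
    """Eq. (12): l = -y^T log(h3); eps guards against log(0)."""
    return float(-y @ np.log(h3 + eps))

# Hypothetical layer sizes: 4 base scores -> 8 -> 8 -> 2 classes (assumption).
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]
params = [
    (rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

x = np.array([0.9, 0.2, 0.7, 0.4])  # hypothetical scores O_1^t .. O_4^t
y = np.array([0.0, 1.0])            # one-hot label: anomalous at time t

h3 = forward(x, params)
loss = cross_entropy(y, h3)
```

With a one-hot y, the loss reduces to the negative log-probability that the network assigns to the true class.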

The deep ensemble method needs only a small number of labels for training, and the trained model can then be applied to other data to detect anomalies. If we let y be the label at time $t+s$ (where s is the number of steps ahead), we can train a model with prediction ability. In summary, the ELBD framework aims to improve detection accuracy and robustness and to predict anomalies. Experimental results are presented next.
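The prediction variant described above, in which the label is taken from time $t+s$, amounts to shifting the labels relative to the rows of M before training. The helper below is a hypothetical sketch of that alignment; its name and the sample data are assumptions made for illustration.

```python
import numpy as np

def make_prediction_pairs(M, y, s):
    """Pair the scores at time t with the label at time t + s.

    The last s rows of M have no future label and are dropped,
    as are the first s labels, which have no matching past scores.
    """
    M = np.asarray(M)
    y = np.asarray(y)
    if s < 0:
        raise ValueError("s must be non-negative")
    if s == 0:
        return M, y  # plain detection: label at the same time step
    return M[:-s], y[s:]

# Hypothetical data: 5 time steps, 4 base methods.
M = np.arange(20, dtype=float).reshape(5, 4)
y = np.array([0, 0, 1, 1, 0])

X_s, y_s = make_prediction_pairs(M, y, s=2)
```

Setting s to 0 recovers the detection setting, so the same training code can serve both detection and multi-step prediction.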