

ASOD: an adaptive stream outlier detection method using online strategy

Abstract

In the current era of information technology, blockchain is widely used in various fields, and monitoring the security and status of blockchain systems is of great concern. Online anomaly detection for real-time stream data plays a vital role in monitoring strategies that aim to find abnormal events and status of blockchain systems. However, due to the high requirements of real-time, online scenarios, online anomaly detection faces many problems such as limited training data, distribution drift, and limited update frequency. In this paper, we propose an adaptive stream outlier detection method (ASOD) to overcome these limitations. It first designs a K-nearest neighbor Gaussian mixture model (KNN-GMM) and utilizes an online learning strategy, so it is suitable for online scenarios and does not rely on large training data. The K-nearest neighbor optimization limits the influence of new data locally rather than globally, thus improving stability. Then, ASOD applies a mechanism of dynamic maintenance of Gaussian components and a strategy of dynamic context control to achieve self-adaptation to distribution drift. Finally, ASOD adopts a dimensionless distance metric based on the Mahalanobis distance and proposes an automatic threshold method to accomplish anomaly detection. In addition, the KNN-GMM provides the life cycle and the anomaly index for continuous tracking and analysis, which facilitates cause analysis and further interpretation and traceability. The experimental results show that ASOD achieves near-optimal F1 and recall on the NAB dataset, with improvements of 6% and 20.3% over the average, compared to baselines with sufficient training data. ASOD has the lowest F1 variance among the five best methods, indicating that it is effective and stable for online anomaly detection on stream data.

Introduction

Over the past decade, blockchain technology has attracted considerable attention from both industry and academia because it can be integrated with a large number of everyday applications of modern information and communication technologies [1,2,3].

Blockchain technology now extends well beyond finance. High-value data exchanges are increasingly vital and strategic aspects of digital business. This constant flow of decentralized transactions is supported by intricate, multilevel architectures, which in turn require careful monitoring, an essential part of guaranteeing the integrity of such systems [4,5,6]. The main purpose is to monitor the health status of the system's nodes in real time, to respond to node failures in a timely manner, and to guarantee the continuous and stable operation of the nodes in order to reduce the economic losses caused by node failures [7]. In the context of monitoring strategic blockchain projects, anomaly detection plays an important role in finding abnormal events and system status [8,9,10].

The monitoring data mainly takes the form of streams in the time dimension, but due to the complexity of blockchain and edge computing environments [11,12,13], the data distribution of the stream is complex and dynamically changing, which places high demands on the adaptive capability of the anomaly detection model [14]. Therefore, this paper investigates an adaptive anomaly detection method for stream data.

Stream data generally takes the form of time series, and different types of stream data differ in terms of both data representation and temporal context [15]. The data representation includes the dimensionality of data points, the type of attribute values, etc. The temporal context reflects the dependencies between different data points. For an individual data point in temporal data, anomalies are divided into point anomalies and contextual anomalies. If the data is unordered or the dependencies can be ignored, there are no contextual relationships, and the problem can be handled as point anomaly detection. In this case, what is detected is whether the observation at each moment is anomalous in a context-free scenario. However, on time series we are usually more interested in whether each time point is anomalous in its context, and in making inferences about the anomalous state of a time span wrapping the time point. This means that for any moment t on a time series, its dependence on the previous \(t-1\) moments needs to be considered. Figure 1 shows an example of a time series anomaly.

Fig. 1 Example of anomaly detection on time series. The time series data are derived from a sine curve

The figure reveals several important features that matter when detecting anomalies in stream data online: contextual and point anomalies, anomaly spans, and distribution drift. The details are as follows:

  1. Contextual anomalies: the moment \(t_3\) is a contextual anomaly because the observation at \(t_3\) is out of line with the change in the curve. In contrast, \(t_1\) and \(t_2\) have the same value but conform to the curve's trend, so they are normal points.

  2. Point anomalies: at moment \(t_5\), the value deviates from all past moments, so it is a point anomaly.

  3. Anomaly span: generally, a period of time around an anomaly can be considered to be associated with it, so a period of time around the moment \(t_3\) can be judged as an anomaly span.

  4. Distribution drift: from moment \(t_4\), the observations all change, and the changed distribution conforms to a sine curve with a larger bias. This phenomenon of distribution change is called distribution drift. Offline scenarios generally assume that the data distribution is stable; in the online scenario, however, distribution drift is the primary issue that must be addressed for anomaly detection to be adaptive.

Unlike problems and tasks in the conventional paradigm, anomaly detection is aimed at few, unpredictable or uncertain, rare events, whose unique complexity makes it difficult for general machine learning techniques to achieve good results [16,17,18]. Real-time online anomaly detection carries high requirements; moreover, online anomaly detection of stream data faces many problems such as limited training data, distribution drift, and update frequency limitations [19, 20].

Current sequence anomaly detection methods have some shortcomings when applied to online stream data: (1) the methods depend heavily on training data and have a cold start problem; (2) methods using the offline-training-then-online-update mode cannot discover distribution drift in a timely and effective manner, and suffer from insufficient self-adaptation; (3) in online scenarios, methods have difficulty selecting hyperparameters such as the context window size or anomaly threshold due to the limitations of training data, expert experience, etc.; (4) the existence of distribution drift leads to a lack of explanatory support, as these methods cannot effectively distinguish whether detected anomalies are generated by distribution drift or by genuine abnormal events.

In this work, we propose an adaptive stream outlier detection method (ASOD) for online anomaly detection on stream data. The key contributions of this paper can be summarized as follows:

(1) We propose the K-nearest neighbor Gaussian mixture model (KNN-GMM), which is applicable to online scenarios, and use an incremental update algorithm to eliminate the dependence on training data and achieve self-update. We also limit the impact of new data to a local region, improving stability.

(2) We design a mechanism for dynamic maintenance of Gaussian components and a strategy for dynamic context control, which achieve self-adaptation to the distribution drift of stream data and reduce the dependence of hyperparameters on expert experience and validation sets.

(3) We introduce a new distance measure and an automatic threshold calculation method based on the Gaussian distribution and the Mahalanobis distance. The measure is dimensionless, has low requirements for data preprocessing, and is applicable to different types of stream data.

(4) ASOD provides the life cycle and anomaly index of Gaussian components for continuous tracing and analysis, which explains the types and causes of anomalies to a certain extent and provides a basis for further explanation and traceability. Finally, the experimental results show that ASOD is effective and stable for online anomaly detection on stream data.

The rest of the paper is organized as follows. “Related works” section presents a literature review of the most related work. “Problem and overview” section gives the mathematical description of the problem and introduces the framework of ASOD. “Proposed method” section proposes the KNN-GMM and its online learning algorithm, followed by the online anomaly detection method under dynamic context schedule control. The experiments and results are described and discussed in “Experimental results” section. “Conclusions” section summarizes the conclusion and future work.

Related works

Most anomaly detection methods are unsupervised due to the lack of anomaly labels. Especially in online scenarios, the effective representation, real data distribution, and context dependence of stream data are unknown and may change dynamically, manifesting as distribution drift [21, 22], which brings new challenges to the application of traditional sequence anomaly detection methods on streaming data.

The purpose of anomaly detection is to identify data points that differ excessively from other data. In unsupervised anomaly detection, the threshold for determining whether a deviation is too large is generally set empirically or selected using a validation set. Most existing research methods revolve around how to quantify such deviation and can generally be grouped into the following four types.

(1) Deviation-based

The deviation represents the distance between the real and observed values and is commonly measured with an L-p norm. There are two general ways to obtain the reference value: prediction and reconstruction. Prediction is generally used for sequence data and requires that a generative model of the data points is available so that the predicted values can be used as normal values. Reconstruction is generally used for offline data or unpredictable scenarios, using the reconstructed data as the normal value. When the deviation exceeds the threshold, the data is determined to be abnormal.

Prediction-based anomaly detection methods learn feature representations [23] that capture temporal or sequential dependencies by using historical instances within a time window to predict current instances. Normal instances usually maintain good dependencies and can be well predicted, while anomalous instances usually violate these dependencies, making them unpredictable [24, 25]. Predictive models on temporal sequences are usually designed with Multi-Layer Perceptron (MLP) [26] and Long Short-Term Memory (LSTM) [27] networks as the base network units. MLPs are a class of feed-forward artificial neural networks; an MLP consists of at least three layers: the input layer, hidden layer, and output layer. Except for the input nodes, each node is a neuron using a nonlinear activation function. MLPs are trained using a supervised learning technique known as backpropagation. LSTM is one of the best variants of RNNs, using memory and gating mechanisms to handle temporal context, and it performs well in sequence-related tasks. Due to its good predictive performance, LSTM is used for anomaly detection on sequences [28, 29] with good results.

Reconstruction-based methods assume that anomalies are incompressible or cannot be efficiently reconstructed from a low-dimensional mapping space. Some of the commonly used methods in machine learning are PCA, Robust PCA, Random Projection, and other dimensionality reduction methods [30, 31]. In deep learning, reconstruction-based anomaly detection methods usually contain an auto-encoder that consists of an encoder and a decoder. The encoder maps the original data to a low-dimensional feature space, while the decoder tries to recover the data from the projected low-dimensional space. The parameters of both networks are learned by minimizing the reconstruction loss. In order to minimize the overall reconstruction error, the retained information must be as relevant as possible to the input instances. Using autoencoders for anomaly detection is based on the assumption that normal instances can be reconstructed better than anomalous instances from a compressed feature space. Thus, data can be reconstructed with auto-encoders [32,33,34], where a large reconstruction error represents a large degree of anomaly. Typical examples are fully connected autoencoders (Dense AE), sparse autoencoders (Sparse AE), denoising autoencoders (Denoising AE), contractive autoencoders (Contractive AE), and robust autoencoders (Robust Deep AE) [35]. The advantage of this class of methods is the ability to capture complex features via nonlinear mappings in an attempt to find a general pattern of normal instances; the disadvantages are choosing the correct level of compression and avoiding overfitting.

In recent years, due to the excellent performance of Generative Adversarial Networks (GANs) on complex distributions, many researchers have used them for anomaly detection, such as TadGAN [36], MAD-GAN [37], and TAnoGAN [38]. As an example, TAnoGAN is an unsupervised GAN-based anomaly detection method that first subdivides the original time series into smaller sequences and then uses GAN to learn the distribution of the smaller sequences. To process the sequence data, TAnoGAN uses LSTM as the base network unit in both the generator and discriminator. The reconstruction loss between real data and generated data includes residual loss and discriminant loss. The anomaly score of a smaller sequence is a weighted sum of these two losses.

(2) Classification-based

Classification-based methods classify data into normal and abnormal categories, and in some scenarios, the normal category can contain multiple subcategories. In unsupervised learning, there are two ways to achieve classification: clustering and statistics.

Clustering refers to automatically grouping data and then treating the frequent categories as normal data categories. The clustering process relies on distance, density, and other measures of the similarity between data. For example, with distance-based K-Means [39], One-Class Support Vector Machine (OCSVM) [40], and density-based DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [41], data points belonging to anomalous categories are considered anomalies. Taking OCSVM as an example, it assumes that there is only one class of data in the training data, and then learns a minimum boundary to cover the samples. The distance from the boundary to a sample indicates how similar that sample is to that class, and samples with low similarity are considered anomalous. In addition to machine learning clustering methods, there are also clustering methods using deep learning. Generally, a neural network is used to encode the input data, the final encoded sequence is taken as an effective representation, and the encoded sequences are then clustered, which amounts to clustering the original data.

Statistics define normal categories by statistical properties, and data that meet these properties are classified as normal. This class of methods establishes distinguishing boundaries for normal data, and outliers fall outside the boundaries; representative methods are the Local Outlier Factor (LOF) [42] and Isolation Forest (IF) [43, 44].

LOF uses a local anomaly factor as the anomaly score for each sample, mainly by comparing the density of each point with that of its neighboring points to determine whether the point is an anomaly: the lower the density of a point, the more likely it is to be identified as an anomaly. The density is calculated from the distances between points: the farther apart the points, the lower the density, and the closer they are, the higher the density. Since the density of a point in LOF is calculated from its kth neighborhood rather than globally, it is named the local anomaly factor.

IF is an efficient anomaly detection algorithm similar to the random forest, but the selection of splitting attributes and splitting points (values) is random each time, rather than based on information gain or the Gini index. The process of repeatedly splitting the data until each point is isolated can be represented by a tree structure. The number of splits, i.e., the distance from the node to the root node, is used to measure the degree of anomaly. An outlier is usually isolated within a smaller number of splits.

(3) Distribution-based

The generative distribution of the data is described using a probability distribution, which is assumed to belong to some known class of probability distributions. The true distribution is first fitted to the sample data, and data with a high probability under the distribution are considered normal data points. When the probability of the data coming from this distribution is less than a threshold, the data is determined to be abnormal.

Since it is a common assumption that the data obeys a Gaussian distribution, distribution models based on the Gaussian distribution are usually used. Among them, the Gaussian Mixture Model (GMM) [45] is a typical one, which is a linear combination of several Gaussian distribution functions. Theoretically, a GMM can fit any type of distribution and is usually used when the data in the same set contains several different distributions. Let \(\pi _k\) be the mixing coefficient representing the weight of the kth Gaussian distribution; the coefficients satisfy \(\sum \limits _{k = 1}^K {{\pi _k}} = 1\). In anomaly detection, if the probability of the data to be detected under every Gaussian distribution is less than the threshold, the data is considered anomalous.
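As a concrete illustration of this idea, the sketch below fits a GMM to data assumed to be mostly normal and flags points whose likelihood under the mixture falls below a threshold. It uses scikit-learn's GaussianMixture; the component count, the quantile threshold, and the synthetic data are illustrative assumptions, not settings from the cited work.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train = rng.normal(loc=[0.0, 5.0], scale=0.5, size=(500, 2))   # mostly normal data

gmm = GaussianMixture(n_components=3, random_state=0).fit(train)

# Log-likelihood of new points under the fitted mixture; low values are suspicious.
test = np.array([[0.1, 5.2], [4.0, -3.0]])
log_prob = gmm.score_samples(test)

# An illustrative threshold: the 1% quantile of the training log-likelihoods.
threshold = np.quantile(gmm.score_samples(train), 0.01)
is_anomaly = log_prob < threshold
print(is_anomaly)   # the second, far-away point should be flagged
```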

(4) Impact-based

The characteristics of a data set are described by statistical features or a data structure, and the magnitude of the impact is measured by the change the data causes to these characteristics. The impact caused by anomalous data is much higher than that of normal data, so anomalies can be filtered out by an impact threshold. The Robust Random Cut Forest (RRCF) [46] algorithm introduced by Amazon is a representative of this type. RRCF considers the anomaly score of a data point to be the degree to which the overall tree structure is changed by the inclusion or exclusion of the point. It is an effective anomaly detection method for dynamic streaming data and is mainly an optimization of IF. RRCF designs a robust random cut tree data structure and uses it as a sketch or summary of the input stream. Then, for any given sample, its anomaly degree can be measured by the change in the tree as samples are added to or removed from it.

On the basis of these four types of methods, other researchers have explored how to enhance anomaly detection by combining other learning approaches, with transfer learning being the main one. Transfer learning-based anomaly detection methods [47, 48] use data transfer and feature transfer to enhance anomaly detection: data transfer expands the training set by generating synthetic data through data augmentation for better representation learning of normal instances, and feature transfer extracts representation layers from related problems to improve the accuracy of the anomaly detection model.

The ASOD method proposed in this paper is based on the KNN-GMM, which combines the advantages of two different quantification approaches: classification and distribution. It uses an online learning strategy and a dynamic context control mechanism to overcome the limitations of online anomaly detection.

Problem and overview

This section first outlines the different ways for stream data anomaly detection, then gives a concrete mathematical description of the problem, and finally provides an overall view of the solution framework proposed in this paper.

Anomaly detection for stream data

In practice, anomaly detection for stream data can be divided into offline detection and online detection depending on the scenario, as shown in Fig. 2. The two types of anomaly detection differ in terms of both model and data.

Fig. 2 Online and offline anomaly detection

Offline detection is the more common type, where historical data is used as training data to train an offline anomaly detection model that detects anomalies in the offline data.

Online detection includes hybrid models and online models. Both are deployed online and detect anomalies on real-time online data. However, they differ in training data and update methods. The hybrid model uses offline historical data as training data and needs to be updated periodically, so it cannot solve the problems of cold start and distribution drift. It is nevertheless suitable for scenarios with complex data and low requirements for model updates, since offline training can yield a more accurate discrimination model. The online model, on the other hand, uses online data as training data and learns incrementally, so it can adapt well to changes in distribution and has little reliance on offline data. When computational resources such as GPU or memory cannot handle the offline data, the offline data can be processed in the same way as the online data and then detected by the online model.

In this paper, we utilize an online working strategy and design an adaptive approach to solve the online anomaly detection problem of stream data in unsupervised scenarios.

Problem statement

For stream data, let \(\varvec{S}\) be the random variable to be observed and \(T = \{ 1,2, \cdots ,t\}\) be the ordered set of moments of length t, then the sequence of observations of S on T is

$$\begin{aligned} {\varvec{S}_T} = \left\{ {{\varvec{s}^1},{\varvec{s}^2}, \cdots ,{\varvec{s}^t}} \right\} \end{aligned}$$
(1)

In general, the anomaly detection problem on sequences requires learning a function \(f:X \rightarrow Y\) with \(Y \in \{ 0,1\}\). The input X is the time series \({\varvec{S}_T}\) and the output is the anomaly label for each moment. \({Y^t} = 1\) means that moment t is an anomaly; otherwise, it is normal. Further, a function \(h:Y \rightarrow W\) can be constructed to detect interval anomalies in the time series based on the anomaly window size w, using the time series anomaly detection result as input; if all moments within an interval are labeled as anomalous, then the interval is an anomaly span.
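The interval function h is not specified further in the text; the following is a minimal sketch of one plausible reading, assuming point labels y and an anomaly window size w, and marking an interval as an anomaly span only when every point label inside it is anomalous.

```python
import numpy as np

def interval_anomalies(y, w):
    """Sketch of h: mark a length-w interval as an anomaly span when every
    point label inside it is anomalous (one possible reading of the text)."""
    y = np.asarray(y)
    spans = np.zeros(len(y), dtype=int)
    for start in range(0, len(y) - w + 1):
        if y[start:start + w].all():        # all moments in the interval are anomalous
            spans[start:start + w] = 1
    return spans

print(interval_anomalies([0, 1, 1, 1, 0, 1], w=3))   # -> [0 1 1 1 0 0]
```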

For online adaptive anomaly detection on streaming data, the function f needs to take the observation at each moment as input rather than the entire time series and should satisfy self-renewal, as shown in Eq. 2.

$$\begin{aligned} {f^t},{Y^t} = {f^{t - 1}}({\varvec{s}^t}) \end{aligned}$$
(2)

At moment t, the anomaly detection function \({f^{t - 1}}\) is known and the observation to be processed is \({\varvec{s}^t}\). \({f^{t - 1}}\) outputs the anomaly detection result for moment t and returns \({f^t}\) after learning \({\varvec{s}^t}\).

Under the influence of temporal context, the data at any moment t is dependent on the previous \(t - 1\) moments. For practical purposes, we simplify this dependency to a dependency on a finite number of past moments. The set of moments comprising this finite number of past moments and the current moment is called the context window, and the length is W. The dependency relationship is simplified as:

$$\begin{aligned} f({\varvec{s}^t}){} & {} \Leftrightarrow f\left({\varvec{s}^t}|{\varvec{s}^1},{\varvec{s}^2},...,{\varvec{s}^{t - 1}}\right) \nonumber \\{} & {} \simeq f\left({\varvec{s}^t}|{\varvec{s}^{t - W + 1}},{\varvec{s}^{t - W + 2}},...,{\varvec{s}^{t - 1}}\right) \end{aligned}$$
(3)

The data within the context window form the contextual observation vector \(\varvec{x}\), which has dimension D, i.e. \(\varvec{x} \in {R^D}\). In particular, when the sequence is one-dimensional, we have \(D = W\). At moment t, the observation vector \({\varvec{x}^{(t)}}\) corresponds to the sequence data:

$$\begin{aligned} {\varvec{x}^{(t)}} \Leftrightarrow \left\{ {{\varvec{s}^{t - W + 1}},{\varvec{s}^{t - W + 2}},...,{\varvec{s}^t}} \right\} \end{aligned}$$
(4)

By the context window transformation, we use \({\varvec{x}^{(t)}}\) instead of \({\varvec{s}^t}\) for the temporal context dependence, and the problem becomes:

$$\begin{aligned} {f^t},{Y^t} = {f^{t - 1}}\left({\varvec{x}^{(t)}}\right) \end{aligned}$$
(5)
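The context window transformation of Eq. 4 can be illustrated with a short sketch; the function name to_context_vectors and the use of NumPy are our own choices for illustration.

```python
import numpy as np

def to_context_vectors(stream, W):
    """Convert a one-dimensional stream s^1..s^t into context observation
    vectors x^(t) = (s^{t-W+1}, ..., s^t) of dimension D = W (Eq. 4)."""
    s = np.asarray(stream, dtype=float)
    return np.stack([s[t - W + 1:t + 1] for t in range(W - 1, len(s))])

# Example: a length-6 stream with W = 3 yields 4 context vectors.
print(to_context_vectors([1, 2, 3, 4, 5, 6], W=3))
```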

In this paper, we assume that the observation vectors obey Gaussian distribution and that the samples are independent of each other, then we have:

$$\begin{aligned} p(\varvec{x}) = \frac{1}{(2\pi )^{D/2}{\left| \Sigma \right| ^{1/2}}}e^{\left\{ { - {1 \over 2}{{(\varvec{x} - \varvec{\mu } )}^T}{\Sigma ^{ - 1}}(\varvec{x} - \varvec{\mu } )} \right\} } \end{aligned}$$
(6)

where \(\varvec{\mu }\) is the mean vector of dimension D and \(\Sigma\) is the \(D \times D\) covariance matrix. The mean and covariance of the distribution can be estimated from the observed data samples at any given moment, provided that all observations can be stored and accessed. Assuming that the total number of samples is \(N\), and since the samples are independent, by maximum likelihood estimation we have:

$$\begin{aligned} \widehat{\varvec{\mu }},\widehat{\Sigma }{} & {} = \mathop {\arg \max }\limits _{\varvec{\mu },\Sigma } \prod \limits _{i = 1}^N {p\left(\left\{ {\varvec{x}^{(i)}}\right\} |\varvec{\mu } ,\Sigma \right)} \nonumber \\{} & {} = \mathop {\arg \max }\limits _{\varvec{\mu },\Sigma } \sum \limits _{i = 1}^N {\ln p\left(\left\{ {\varvec{x}^{(i)}}\right\} |\varvec{\mu } ,\Sigma \right)} \end{aligned}$$
(7)

Using the probability density of the distribution in Eq. 6, we obtain the minimization problem:

$$\begin{aligned} \widehat{\varvec{\mu }} ,\widehat{\Sigma } = \mathop {\arg \min }\limits _{\varvec{\mu } ,\Sigma } \sum \limits _{i = 1}^N {\left\{ \frac{1}{2}{\left({\varvec{x}^{(i)}} - \varvec{\mu } \right)^T}{\Sigma ^{ - 1}}\left({\varvec{x}^{(i)}} - \varvec{\mu } \right) + \frac{1}{2}\ln \left| \Sigma \right| \right\} } \end{aligned}$$
(8)

So, the mean and covariance can be estimated by:

$$\begin{aligned} \widehat{\varvec{\mu }} = {1 \over N}\sum \limits _{i = 1}^N {\varvec{x}^{(i)}} \end{aligned}$$
(9)
$$\begin{aligned} \widehat{\Sigma } = {1 \over N}\sum \limits _{i = 1}^N {(\varvec{x}^{(i)} - \widehat{\varvec{\mu }} )({\varvec{x}^{(i)} - \widehat{\varvec{\mu }} )}^T} \end{aligned}$$
(10)
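As a quick numerical check of Eqs. 9 and 10, the biased (1/N) estimators can be written directly in NumPy; the snippet below is only an illustration and uses synthetic data.

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(1000, 4))   # N samples of dimension D

mu_hat = X.mean(axis=0)                                # Eq. 9
diff = X - mu_hat
sigma_hat = diff.T @ diff / len(X)                     # Eq. 10 (biased 1/N estimator)

# np.cov with bias=True computes the same 1/N covariance estimate.
assert np.allclose(sigma_hat, np.cov(X, rowvar=False, bias=True))
```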

In the online scenario, we usually cannot store and access all past observations, so this paper uses the distribution parameters as storable data for incremental updates, and the online form of problem 5 is:

$$\begin{aligned} f^t,Y^t = f^{t - 1}({\varvec{x}^{(t)}}|{\varvec{\mu } ^{(t - 1)}},{\Sigma ^{(t - 1)}}) \end{aligned}$$
(11)

Framework

The adaptive stream outlier detection method for online stream data is shown in Fig. 3. It mainly includes three stages: data processing, context schedule control, and online anomaly detection.

  1. Data processing is responsible for presenting data from multiple sources, such as KPI (Key Performance Indicator) monitoring data from IoT devices, systems on the cloud, databases, or log files, as data sequences, and handing them over to the context schedule control for routing to online anomaly detection.

  2. Context schedule control contains multiple context units with different context window sizes. There are three main functions: data package, context schedule, and context stability analysis. First, the data sequences affected by temporal context are converted into context-free data streams that meet the requirements of the context units, and then routed to the corresponding anomaly detection units. The context stability analysis module is responsible for receiving the anomaly detection results from the different anomaly detection units and maintaining the activation status of each context unit accordingly. The context units in the active state output the final anomaly detection result.

  3. Online anomaly detection contains multiple KNN-GMM units with different dimensions, corresponding to the context units in the context schedule control. After routing finds the KNN-GMM unit whose dimension matches the packaged data, the online anomaly detection unit moves to the anomaly score and online update stages, using the packaged data as input. Each KNN-GMM unit first receives the streaming data assigned by the route, then completes the anomaly detection of the new data based on voter set search, anomaly score, and automatic threshold, and feeds the anomaly detection results back to the context schedule control. Finally, the KNN-GMM learns the new data through an incremental update, yielding a self-updated detection model.

Fig. 3 The overall framework of the adaptive stream outlier detection method

ASOD can be deployed and used in conjunction with other offline anomaly detection methods. Online detection solves the problems of limited training data and distribution drift by providing real-time, adaptive anomaly detection, while offline detection provides more accurate anomaly detection given sufficient training data, expert experience, in-depth anomaly traceability, root cause analysis, etc.

Proposed method

The anomaly detection algorithm proposed in this paper measures the degree of anomaly by calculating the magnitude of deviation from the data distribution and determines whether the data is anomalous by a distance threshold. Under the hypothesis that contextual observations obey a Gaussian distribution, we can fit the data with a Gaussian mixture model (GMM). Compared to a single Gaussian model, which can only represent a simple single-peaked symmetric distribution, a GMM using a linear combination of multiple Gaussian distributions can fit complex distributions such as multi-peaked and asymmetric distributions, which are more common in practical scenarios. GMM works well offline; however, it suffers from the following problems when used online.

  1. The GMM contains a fixed number of Gaussian components, which can be chosen offline empirically or with a validation set. However, this fixed number is often insufficient or excessive when working online. Moreover, the data distribution faces the problem of distribution drift, and the optimal number of Gaussian components changes dynamically, resulting in poor adaptability.

  2. For a new data point, all Gaussian components of the GMM are updated. Such a global update strategy has a large impact when the data point is an outlier. A better way is to update only the locally relevant Gaussian components, minimizing the impact of the data without affecting the learning of new data.

  3. Due to the nature of stream data, anomaly detection is a judgment made based on data from all past moments. It is impossible to decide at the current moment whether an anomaly is a data anomaly or a distribution drift, which leads to a dilemma between discarding and learning the data. GMM and many other models cannot handle this problem well. A better way is to provide both retention and discard strategies that are dynamically adjusted through online learning.

To address these problems, we propose the KNN-GMM. It is an upgraded version of the GMM with incremental updates for online learning and an optimized update strategy.

Core of KNN-GMM

The Gaussian component in the KNN-GMM includes the following important properties in addition to the distribution parameters (mean and covariance).

Definition 1

(Gaussian Efficient Boundary \(\tau\)) For a given Gaussian component, a data point is said to be normal when the probability that the data is generated from this distribution is greater than the threshold, and all normal points form the effective coverage of this Gaussian component. The efficient boundary represents the maximum distance between the normal points within the coverage and the distribution, denoted as \(\tau\).

When the distance between the data and the Gaussian distribution exceeds the boundary, the probability that the data is generated from this distribution is low, and the data can be regarded as abnormal.

Definition 2

(Gaussian Life Cycle \(\ell\)) The lifetime of a Gaussian component is the number of steps for which the Gaussian distribution remains valid, denoted as \(\ell\), with a default maximum lifetime of \(\ell _{\max }\). When a new data instance appears in the coverage, i.e., is observed by the Gaussian component, its lifetime is extended. Any Gaussian component whose lifetime reaches 0 is deleted.

There are two explanations for the deletion: (1) distribution drift, which causes the old Gaussian component to be discarded, or (2) anomaly, where the Gaussian component was created by an anomaly, is not activated for a long time, and is therefore discarded.

Definition 3

(Gaussian Anomaly Index \(\beta\)) The Gaussian anomaly index indicates the anomaly degree of the Gaussian component. It decreases as more data are covered; when the number of data points observed by the Gaussian component is sufficient, the anomaly degree reaches 0 and the component becomes normal. In this paper, we use the following equation to calculate \(\beta\).

$$\begin{aligned} \beta = \left\{ \begin{array}{rl} \frac{\gamma }{\gamma _{max}}, &{} \gamma > 0 \\ 0, &{} \gamma = 0 \\ \end{array}\right. \end{aligned}$$
(12)

where \(\gamma\) denotes the anomaly steps of the Gaussian component. The anomaly step of a new Gaussian component is \({\gamma _{\max }}\), and whenever new data is covered, \(\gamma\) is reduced by 1 until it reaches 0. Conversely, the normal index represents the normal degree, calculated as \(1 - \beta\).

A typical KNN-GMM is illustrated in Fig. 4. Each Gaussian component is centered on its mean point and has a clear coverage and efficient boundary. Any data point comes from at most k related Gaussian components. For example, the point \(\varvec{x}\) is in the coverage of \(\mathrm{{G1}}\) and \(\mathrm{{G2}}\), so its related K-nearest neighbor Gaussian components are \(\mathrm{{G1}}\) and \(\mathrm{{G2}}\). All components are dynamically maintained in the KNN-GMM. For example, \(\mathrm{{G0}}\) is discarded for one of two possible reasons: one is distribution drift, where the distribution interval represented by \(\mathrm{{G0}}\) has changed and no longer exists; the other is a data anomaly, where the data points covered by \(\mathrm{{G0}}\) are mostly outliers that failed to form a valid data distribution during the lifetime and therefore need to be discarded.

Fig. 4 Example of KNN-GMM. Multiple Gaussian components \(\left\{ {\mathrm{{G0,G1,G2,G3,G4}}} \right\}\) are generated based on historical data. The effective Gaussian components within their life cycle are \(\left\{ \textrm{G1,G2,G3,G4}\right\}\). The Gaussian component \(\mathrm{{G0}}\) is discarded because it has not been activated for a long time, ending its life cycle

KNN Gaussian search

For a given data point \(\varvec{x}\), the set of its KNN Gaussians should provide a reasonable explanation for the generation of \(\varvec{x}\), i.e., there is a high probability that \(\varvec{x}\) is generated from the set. Therefore, in the search for KNN Gaussians, all Gaussian components are first sorted by their distance to \(\varvec{x}\), and the candidate set contains at most k Gaussian components. Then the distributions with small generation probability are excluded using the efficient boundary.

We use the Mahalanobis distance to calculate the distance between data points and Gaussian components. Applying the Mahalanobis distance presupposes that the variables conform to a normal distribution, and the assumptions made on the data in this paper are consistent with this requirement. For a multidimensional Gaussian distribution g with mean \(\varvec{\mu }\) and covariance matrix \(\Sigma\), the Mahalanobis distance between a data point \(\varvec{x}\) and this distribution is:

$$\begin{aligned} {d_M}(\varvec{x},g) = \sqrt{{{(\varvec{x} - \varvec{\mu } )}^T}{\Sigma ^{ - 1}}(\varvec{x} - \varvec{\mu } )} \end{aligned}$$
(13)

The Mahalanobis distance is an effective method for calculating the similarity of two unknown sample sets. Unlike the Euclidean distance, which has a scale and treats the differences between the respective variables equally, the Mahalanobis distance is scale-independent. Euclidean distances do not take into account correlations between variables, whereas the Mahalanobis distance eliminates correlations based on the covariance matrix. If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance.

Since the contextual observations obey a Gaussian distribution, the squares of their Mahalanobis distances from a Gaussian component obey a chi-square distribution, and this property is used to quickly compute automatic thresholds as the efficient boundary. If the data \(\varvec{x}\) has dimension D, then the square of the Mahalanobis distance of \(\varvec{x}\) from the Gaussian distribution obeys a chi-square distribution with D degrees of freedom, i.e. \({\left[ {{d_M}(\varvec{x})} \right] ^2} \sim {\chi ^2}(D)\). Taking its \(\alpha\) quantile as the threshold means that a fraction \(\alpha\) of the points lie within the threshold, i.e.

$$\begin{aligned} P\{ {\chi ^2}(D) > {\tau ^2}\} = 1 - \alpha \end{aligned}$$
(14)

For a given \(\alpha\), the efficient boundary \(\tau\) of a Gaussian component can be obtained quickly. In this paper, we set \(\alpha = 0.9\) and use the efficient boundary \(\tau\) as the threshold. The related Gaussian components are then filtered from the KNN candidates, so we finally find all KNN Gaussian components for \(\varvec{x}\). Since the data point \(\varvec{x}\) is within their coverage, \(\varvec{x}\) is a normal data point for each Gaussian component in the set.
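A minimal sketch of this distance and threshold computation is given below, assuming SciPy for the chi-square quantile; the helper names mahalanobis and efficient_boundary are our own and the example values are arbitrary.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis(x, mu, sigma):
    """Mahalanobis distance between point x and a Gaussian (mu, sigma), Eq. 13."""
    d = x - mu
    return float(np.sqrt(d @ np.linalg.inv(sigma) @ d))

def efficient_boundary(D, alpha=0.9):
    """Efficient boundary tau from the chi-square alpha-quantile (Eq. 14)."""
    return float(np.sqrt(chi2.ppf(alpha, df=D)))

mu = np.zeros(2)
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
x = np.array([0.5, -0.5])
print(mahalanobis(x, mu, sigma) <= efficient_boundary(D=2))   # True: x lies inside the 90% coverage
```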

Algorithm 1 gives the pseudocode for searching KNN Gaussians.

Algorithm 1 find_knn_gaussian(\(\varvec{x}, G, k\))
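Since the pseudocode itself is shown as a figure, the following Python sketch reconstructs Algorithm 1 from the description above: sort components by Mahalanobis distance, keep the k closest as candidates, and retain only those whose efficient boundary covers x. It reuses the mahalanobis and efficient_boundary helpers sketched earlier, and the component attributes g.mu and g.sigma are our own assumptions.

```python
def find_knn_gaussian(x, G, k, alpha=0.9):
    """Sketch of Algorithm 1: return at most k Gaussian components of G that
    cover x, ordered by Mahalanobis distance (assumed attributes: g.mu, g.sigma)."""
    tau = efficient_boundary(len(x), alpha)
    candidates = sorted(G, key=lambda g: mahalanobis(x, g.mu, g.sigma))[:k]
    # Keep only the candidates whose efficient boundary covers x.
    return [g for g in candidates if mahalanobis(x, g.mu, g.sigma) <= tau]
```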

Online learning of KNN-GMM

In the online scenario of stream data, the training data for the model arrives batch by batch and there is no access to data from future moments. Usually, due to storage limits, it is not suitable to store all historical data, and data from past moments cannot be trained over multiple iterations. The model can only be updated incrementally based on the current input batch. This section derives an online learning algorithm for the KNN-GMM based on incremental updates to the Gaussian distribution. Since the KNN-GMM model needs to be updated at each time tick, we use a batch size of 1, consistent with the sequence step size. Online learning of the KNN-GMM consists of the following three main steps.

  1. For the new input data, find up to k related Gaussian components and determine its minimum local influence range, i.e., its KNN Gaussian components. This step is done by Algorithm 1.

  2. If the KNN Gaussian set is empty, add a new Gaussian component; otherwise, the affected Gaussian components are updated incrementally using the new input data.

  3. Perform state maintenance on all Gaussian components, including the life cycle, anomaly index, etc., and remove all Gaussian components that have reached the end of their life cycle.

(1) Incremental update of Gaussian component

A key step in the online learning of the KNN-GMM is how to perform incremental updates to the Gaussian components. In the offline scenario, the distribution mean \(\varvec{\mu }\) and covariance \(\Sigma\) are computed from the samples using Eqs. 9 and 10. However, this method is not suitable for online scenarios. The approach used in this paper is to incrementally update the distribution parameters by learning from the input data together with the current parameters.

Let \({\varvec{x}^{(i)}}\) be the observation at moment i, the current moment is t, and the mean \({\varvec{\mu } ^{(t - 1)}}\) and variance \({\Sigma ^{(t - 1)}}\) of moment \(t - 1\) are known. Then, according to Eq. 9, the mean \({\varvec{\mu } ^{(t)}}\) at moment t can be obtained as:

$$\begin{aligned} \varvec{\mu }^{(t)}{} & {} =E(\varvec{x})=\frac{1}{t} \sum \limits _{i=1}^t \varvec{x}^{(i)}=\frac{1}{t}\left( \sum \limits _{i=1}^{t-1} \varvec{x}^{(i)}+\varvec{x}^{(t)}\right) \nonumber \\{} & {} =\frac{1}{t}\left( (t-1) \varvec{\mu }^{(t-1)}+\varvec{x}^{(t)}\right) \nonumber \\{} & {} =\frac{(t-1)}{t} \varvec{\mu }^{(t-1)}+\frac{1}{t} \varvec{x}^{(t)} \end{aligned}$$
(15)

According to Eq. 10, the covariance \({\Sigma ^{(t)}}\) at moment t can be obtained as:

$$\begin{aligned} \Sigma ^{(t)}{} & {} =E\left[ (\varvec{x}-\varvec{\mu })(\varvec{x}-\varvec{\mu })^T\right] \nonumber \\{} & {} =\frac{1}{t} \sum \limits _{i=1}^t\left( \varvec{x}^{(i)}-\varvec{\mu }^{(t)}\right) \left( \varvec{x}^{(i)}-\varvec{\mu }^{(t)}\right) ^T \end{aligned}$$
(16)

Expanding with Eq. 15, we get:

$$\begin{aligned} \Sigma ^{(t)}{} & {} =\frac{1}{t} \sum \limits _{i=1}^t\left[ \begin{array}{c} \left( \varvec{x}^{(i)}-\varvec{\mu }^{(t-1)}\right) \\ +\frac{\varvec{x}^{(t)}-\varvec{\mu }^{(t-1)}}{-t} \end{array}\right] \left[ \begin{array}{c} \left( \varvec{x}^{(i)}-\varvec{\mu }^{(t-1)}\right) \\ +\frac{\varvec{x}^{(t)}-\varvec{\mu }^{(t-1)}}{-t} \end{array}\right] ^T \nonumber \\{} & {} =\frac{1}{t}\left\{ \begin{array}{l} \sum _{i=1}^t\left( \varvec{x}^{(i)}-\varvec{\mu }^{(t-1)}\right) \left( \varvec{x}^{(i)}-\varvec{\mu }^{(t-1)}\right) ^T \\ +\sum \limits _{i=1}^t \frac{\left( \varvec{x}^{(i)}-\varvec{\mu }^{(t-1)}\right) \left( \varvec{x}^{(t)}-\varvec{\mu }^{(t-1)}\right) ^T}{-t} \\ +\sum \limits _{i=1}^t \frac{\left( \varvec{x}^{(t)}-\varvec{\mu }^{(t-1)}\right) \left( \varvec{x}^{(i)}-\varvec{\mu }^{(t-1)}\right) ^T}{-t} \\ +\sum \limits _{i=1}^t \frac{\left( \varvec{x}^{(t)}-\varvec{\mu }^{(t-1)}\right) \left( \varvec{x}^{(t)}-\varvec{\mu }^{(t-1)}\right) ^T}{t^2} \end{array}\right\} \end{aligned}$$
(17)

Since \(\sum \limits _{i = 1}^{t - 1} {\left({\varvec{x}^{(i)}} - {\varvec{\mu } ^{(t - 1)}}\right) } = 0\), the covariance can be simplified as:

$$\begin{aligned} \Sigma ^{(t)}={} & {} \frac{1}{t}\left\{ \begin{array}{l} (t-1) \Sigma ^{(t-1)} \\ +\left( \varvec{x}^{(t)}-\varvec{\mu }^{(t-1)}\right) \left( \varvec{x}^{(t)}-\varvec{\mu }^{(t-1)}\right) ^T \\ +2 \frac{\left( \varvec{x}^{(t)}-\varvec{\mu }^{(t-1)}\right) \left( \varvec{x}^{(t)}-\varvec{\mu }^{(t-1)}\right) ^T}{-t} \\ +\textrm{t} \frac{\left( \varvec{x}^{(t)}-\varvec{\mu }^{(t-1)}\right) \left( \varvec{x}^{(t)}-\varvec{\mu }^{(t-1)}\right) ^T}{t^2} \end{array}\right\} \nonumber \\ ={} & {} \frac{1}{t}\left\{ \begin{array}{l} (t-1) \Sigma ^{(t-1)} \\ +\frac{t-1}{t}\left( \varvec{x}^{(t)}-\varvec{\mu }^{(t-1)}\right) \left( \varvec{x}^{(t)}-\varvec{\mu }^{(t-1)}\right) ^T \end{array}\right\} \end{aligned}$$
(18)

Thus, we obtain the formula for the incremental update based on the distribution parameters of the Gaussian components and the input at the current time. Storing the mean and covariance of each Gaussian component is sufficient to satisfy the incremental update requirement.
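Eqs. 15 and 18 translate directly into a small update routine; the sketch below is only an illustration with our own function name, where mu and sigma are the parameters after t-1 observations and x is the t-th observation.

```python
import numpy as np

def update_gaussian(mu, sigma, x, t):
    """Incremental update of one Gaussian component (Eqs. 15 and 18)."""
    d = x - mu
    mu_new = mu + d / t                                                  # Eq. 15
    sigma_new = ((t - 1) * sigma + (t - 1) / t * np.outer(d, d)) / t     # Eq. 18
    return mu_new, sigma_new

# With t = 1 the update yields mu = x and sigma = 0, matching a newly created component.
print(update_gaussian(np.zeros(2), np.zeros((2, 2)), np.array([1.0, 2.0]), t=1))
```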

(2) Incremental update of KNN-GMM

For a new data point, there are two possible operations in the incremental update: create and update, corresponding to the cases where the set of KNN Gaussian components is empty or non-empty. A typical example of the incremental update for the KNN-GMM is shown in Fig. 5.

  1. Update: the added data point \({\varvec{x}^{(i)}}\) is outside the efficient boundary of G3 and G4 and inside the efficient boundary of G1 and G2, so its KNN Gaussian components are \(\left\{ {G1,G2} \right\}\). Thus, G1 and G2 are incrementally updated with \({\varvec{x}^{(i)}}\) as input, as shown by the fact that the mean points of both distributions move toward the point \({\varvec{x}^{(i)}}\).

  2. Create: for the new data point \({\varvec{x}^{(j)}}\), a new Gaussian component G5 is created as its generating distribution, as it is not within the efficient boundary of any existing Gaussian component and its set of KNN Gaussian components is empty. The mean of G5 is \({\varvec{x}^{(j)}}\) and the covariance is \(\varvec{0}\).

Fig. 5 An incremental update example of KNN-GMM. \(\left\{ G1,G2,G3,G4\right\}\) are effective Gaussian components, \({\varvec{x}^{(i)}}\) and \({\varvec{x}^{(j)}}\) are the new data points observed at moments i, j (\(i < j\)), and we set \(k = 10\)

In summary, the process for online learning of the KNN-GMM is shown as Algorithm 2.

Algorithm 2 online_knn_gmm(\(\varvec{x}^{(t)}, G^{(t - 1)}, k\))
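Again, the pseudocode is shown as a figure, so the sketch below reconstructs one online learning step from the three steps described above. The GaussianComponent class, its default life and anomaly-step parameters, and the tiny covariance ridge are our own illustrative choices; it reuses the update_gaussian and find_knn_gaussian helpers from the earlier sketches.

```python
import numpy as np

class GaussianComponent:
    """A Gaussian component with the extra state used by KNN-GMM (illustrative)."""
    def __init__(self, x, life_max=200, gamma_max=50):
        self.mu = np.asarray(x, dtype=float)
        # Tiny ridge keeps the covariance invertible in this sketch
        # (the paper states the covariance of a new component is 0).
        self.sigma = np.eye(len(x)) * 1e-6
        self.n = 1                                  # observations absorbed so far
        self.life, self.life_max = life_max, life_max
        self.gamma, self.gamma_max = gamma_max, gamma_max

    def absorb(self, x):
        self.n += 1
        self.mu, self.sigma = update_gaussian(self.mu, self.sigma, x, self.n)
        self.life = self.life_max                   # an observation extends the life cycle
        self.gamma = max(self.gamma - 1, 0)         # anomaly steps decrease toward 0

def online_knn_gmm(x, G, k):
    """Sketch of Algorithm 2: one online learning step of the KNN-GMM."""
    knn = find_knn_gaussian(x, G, k)
    if not knn:
        G.append(GaussianComponent(x))              # create: no component covers x
    else:
        for g in knn:                                # update: only the local KNN components learn x
            g.absorb(x)
    for g in G:                                      # state maintenance
        g.life -= 1
    return [g for g in G if g.life > 0]              # drop components whose life cycle ended
```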

Anomaly score

A KNN-GMM consists of multiple Gaussian distributions in the same dimension. The stream observations are assumed to be generated from up to k neighboring Gaussian components. From a classification perspective, each Gaussian component can be viewed as a class, and the probability that the data belongs to each class is equivalent to the probability that it arises from each distribution. So anomaly detection can be treated as a multi-class classification problem, i.e., whether or not the data belongs to a normal class. The closer a data point is to a Gaussian component, the more likely it is to belong to that class. An anomaly is often far away from all normal Gaussian components or within the coverage of an anomalous Gaussian component. Therefore, multiple Gaussian components can be gathered into a decision set that votes on the observed data to complete the anomaly evaluation.

The voter set must include a sufficient number of Gaussian components with valid classification information to determine the class of a new data point and to measure its degree of anomaly. There are two constraints on the validity of the classification information: distance and anomaly index. Gaussian components that are close give more accurate classification information, as the data are more likely to be generated from those distributions. A Gaussian with a low anomaly index gives a “non-anomalous” classification with more confidence; if only Gaussians with high anomaly indices classify a data point as “normal”, there is a high probability that the data is anomalous, and a more distant Gaussian component with a low anomaly index is needed to better determine the anomaly degree. In addition, since a Gaussian component with a low anomaly index may also be far away from the data point, the data point may lie outside its efficient boundary and receive an anomalous classification result. So distance takes priority over the anomaly index in the search for the voter set.

The search starts from the Gaussian component closest to the data point and continues until there are no more unsearched Gaussian components or a Gaussian component with an anomaly index of 0 is found. In the search, the first normal component found gives the most efficient classification information for the given data point and is denoted \({g_{best}}\). This is because Gaussians farther away fall into two cases: either a normal component at a larger distance or a non-normal component. So \({g_{best}}\) is the most important element in the voter set, and in a search strategy sorted by distance, we can stop the search upon reaching \({g_{best}}\).

It is important to note that the voter set is not equivalent to the set of KNN Gaussian components. The set of KNN Gaussian components may be empty, and Gaussian components in the voter set may not belong to that set. Two typical examples of voter set searching are shown in Fig. 6:

  1. No KNN Gaussians: for the new data point \({\varvec{x}^{(m)}}\), its KNN Gaussian set is empty and the search path is \(G2 \rightarrow G1 \rightarrow G3\). Since the anomaly index of G3 is 0, the final voter set is \(\left\{ {G1,G2,G3} \right\}\). The intersection of the voter set for \({\varvec{x}^{(m)}}\) with its set of KNN Gaussians is empty.

  2. With KNN Gaussians: for the new data point \({\varvec{x}^{(n)}}\), the set of KNN Gaussians is \(\left\{ {G6,G4} \right\}\) and the search path is \(G6 \rightarrow G4 \rightarrow G5\). Although \({\varvec{x}^{(n)}}\) is within the coverage of G6 and G4, both have an anomaly index greater than 0, so the search continues until the normal component G5 is found.

Fig. 6 Search voters for anomaly score. Each Gaussian component shows its anomaly index after its name. \(\left\{ G1,G2,G3,G4,G5,G6\right\}\) are effective Gaussian components, \({\varvec{x}^{(m)}}\) and \({\varvec{x}^{(n)}}\) are new data points at moments m and n (\(m < n\))

The algorithm for finding the anomaly voter set is shown below.

Algorithm 3 find_local_voters(\(\varvec{x}, G\))
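The following sketch reconstructs Algorithm 3 from the description above: visit components in order of Mahalanobis distance and stop once a component with anomaly index 0 (g_best) is reached. It reuses the earlier mahalanobis helper and the gamma fields of the illustrative GaussianComponent class.

```python
def find_local_voters(x, G):
    """Sketch of Algorithm 3: collect the voter Gaussian components for x."""
    voters = []
    for g in sorted(G, key=lambda g: mahalanobis(x, g.mu, g.sigma)):
        voters.append(g)
        beta = g.gamma / g.gamma_max if g.gamma > 0 else 0.0   # anomaly index, Eq. 12
        if beta == 0.0:                                         # g_best reached, stop searching
            break
    return voters
```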

In the anomaly vote stage, the higher the anomaly index, the smaller the weight of the Gaussian component in the voter set. The vote weight of each Gaussian component in the voter set is normalized by the normal index and is calculated as:

$$\begin{aligned} {\varvec{w}_i} = \frac{{1 - {\beta _i}}}{{\sum \limits _{i = 1}^{\left| \Lambda \right| } {\left( {1 - {\beta _i}} \right) } }} = \frac{{1 - \frac{{{\gamma _i}}}{{{\gamma _{max}}}}}}{{\sum \limits _{i = 1}^{\left| \Lambda \right| } {\left( {1 - \frac{{{\gamma _i}}}{{{\gamma _{max}}}}} \right) } }} \end{aligned}$$
(19)

Finally, the KNN-GMM scores the anomaly for the given data \(\varvec{x}\) by:

$$\begin{aligned} score\left( \varvec{x} \right) = \sum \limits _{i = 1}^{\left| \Lambda \right| } {{\varvec{w}_i}\frac{{{D_M}(\varvec{x}, {\Lambda _i})}}{\tau }} \end{aligned}$$
(20)

where \({\Lambda _i}\) denotes the ith Gaussian component in the voter set. Data with an anomaly score below 1 are normal. Therefore, the final formula for calculating the anomaly label is:

$$\begin{aligned} y = \max (sign(score\left( \varvec{x} \right) - 1),0) \end{aligned}$$
(21)
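Eqs. 19-21 can be combined into a single scoring routine; the sketch below reuses the earlier helpers, and the fallback to uniform weights when every voter is brand new is our own addition to avoid division by zero.

```python
import numpy as np

def anomaly_score(x, voters, alpha=0.9):
    """Weighted anomaly score over the voter set (Eqs. 19-21)."""
    betas = np.array([g.gamma / g.gamma_max if g.gamma > 0 else 0.0 for g in voters])
    denom = np.sum(1.0 - betas)
    weights = (1.0 - betas) / denom if denom > 0 else np.full(len(voters), 1.0 / len(voters))  # Eq. 19
    tau = efficient_boundary(len(x), alpha)
    dists = np.array([mahalanobis(x, g.mu, g.sigma) for g in voters])
    score = float(np.sum(weights * dists / tau))        # Eq. 20
    label = int(score > 1.0)                            # Eq. 21: score above 1 means anomaly
    return score, label
```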

Online anomaly detection

Online anomaly detection for stream data starts by converting the streaming data into contextual observations, followed by anomaly detection and incremental learning of the contextual observations using the KNN-GMM. The focus of the conversion is choosing an appropriate context window size and data assembly method. For the KNN-GMM, the dimension of the data is a model hyperparameter related to the size of the context window, which is generally selected by empirical settings or validation sets in offline settings, while online it faces the limitations of small validation sets and low availability of expert experience. Therefore, this paper designs a dynamic context schedule control mechanism for adaptively selecting and resizing the context window online.

A KNN-GMM model of a specified dimension is called a KNN-GMM unit, which is the basic component of the context schedule control. Dynamic context schedule control operates on multiple context units, each corresponding to a KNN-GMM unit.

Definition 4

(Context Stability Index \(\omega\)) The context stability index indicates how stable a context unit is over recent time steps, and it is inversely proportional to the number of anomalies. Each context unit uses an evaluation window to count the number of anomalies that occur and accordingly calculates the stability within this window.

Let \(\Omega\) be the evaluation window size and the number of anomalies within the window be \(\varepsilon\). The context stability index of a context unit with dimension d is:

$$\begin{aligned} {\omega _d} = 1 - \frac{{{\varepsilon _d}}}{\Omega } \end{aligned}$$
(22)

Context schedule control includes data package, context schedule and context stability analysis.

  1. Data package: depending on the size of the context window, streaming data is converted into a context-free data stream according to the assembly method. For example, a single-value time series can be viewed as a high-dimensional vector within the context window. This data stream is routed to the KNN-GMM unit of the corresponding dimension for anomaly detection and incremental update. This paper uses the default list of context window sizes \(\left\{ {{2^0},{2^1},{2^2},{2^3},{2^4},{2^5},{2^6},{2^7}} \right\}\).

  2. Context schedule: responsible for maintaining the activation status of context units; only active context units participate in the final determination of anomaly detection results. Context units are divided into two categories: context-free units and dynamic context units. Context-free units, i.e., context units with a context window size of 1, are always active. Obviously, once a context-free unit detects an anomaly, all scenarios that consider context must also be anomalous. Dynamic context units, i.e., context units with a context window size greater than 1, keep at most one unit active at any time. The activated dynamic context unit has the largest stability index in the list of dynamic context units and exceeds the stability threshold. We set the threshold to 0.90 in this paper, which ensures that the activated context unit is valid. Thus both context-free and dynamic contexts are supported by dynamic scheduling.

  3. Context stability analysis: at each time step, the anomaly status within the evaluation window is maintained and the context stability index of each context unit is calculated based on that unit's anomaly detection results.

At each step of stream processing, the data is first packaged according to the predefined list of context units, and the corresponding KNN-GMM unit is found through routing to complete anomaly detection and incremental update. Afterwards, the KNN-GMM unit feeds the anomaly detection result to the context stability analysis module, triggering the update of the context stability index. If any of the activated context units detects an anomaly, the final result is an anomaly; otherwise, it is normal. Finally, the context schedule module activates and deactivates context units based on the stability of each context unit. A simplified sketch of this pipeline follows.
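To make the flow concrete, the following simplified sketch wires the pieces together: one context unit per window size, a context-free unit that always votes, and at most one dynamic unit activated by the stability rule. It reuses the earlier sketches (GaussianComponent, online_knn_gmm, find_local_voters, anomaly_score); the ContextUnit class, its buffer handling, and the evaluation window default are illustrative assumptions rather than the authors' implementation.

```python
from collections import deque
import numpy as np

class ContextUnit:
    """A context unit pairing a window size with a KNN-GMM and a stability record."""
    def __init__(self, window, k=5, eval_window=100):
        self.window = window
        self.buffer = deque(maxlen=window)        # most recent raw observations
        self.gmm = []                              # list of GaussianComponent
        self.k = k
        self.recent = deque(maxlen=eval_window)    # recent anomaly labels

    def step(self, value):
        self.buffer.append(value)
        if len(self.buffer) < self.window:
            return 0                               # not enough context yet
        x = np.asarray(self.buffer, dtype=float)
        voters = find_local_voters(x, self.gmm) if self.gmm else []
        _, label = anomaly_score(x, voters) if voters else (0.0, 0)
        self.gmm = online_knn_gmm(x, self.gmm, self.k)   # incremental update
        self.recent.append(label)
        return label

    def stability(self):
        return 1.0 - sum(self.recent) / self.recent.maxlen   # Eq. 22

def detect_step(value, units, threshold=0.90):
    """One stream step: the context-free unit (window 1) always votes; at most
    one dynamic unit is active, chosen by the highest stability above the threshold."""
    labels = {u.window: u.step(value) for u in units}
    active = [u for u in units if u.window == 1]
    dynamic = [u for u in units if u.window > 1 and u.stability() >= threshold]
    if dynamic:
        active.append(max(dynamic, key=lambda u: u.stability()))
    return int(any(labels[u.window] for u in active))
```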

Experimental results

This section describes the datasets, baseline methods, and important settings used in the experiments. It also presents a comprehensive analysis of the anomaly detection results and a comparison with the baseline methods.

Datasets

To measure the performance of ASOD, we evaluate it on the NAB datasets [49, 50]. NAB is a high-quality collection of both real-world and artificial time series with labeled anomalous spans of behavior. It contains 58 time series data files in 7 categories and is designed to provide data for research in streaming anomaly detection. The majority of the data is real-world, from a variety of sources such as AWS server metrics, Twitter volume, advertisement clicking metrics, traffic data, and more. We select all the real-world data, which covers 5 categories and totals 47 signals, 321,206 data points, and 31,077 anomaly points.

Each file in NAB contains two columns: timestamp and value; the value column is one-dimensional sequence data. NAB also provides two kinds of anomaly labels: anomaly points and anomaly spans; each anomaly point lies within an anomaly span, and each anomaly span is marked by a beginning and an ending timestamp. We use the anomaly span as the anomaly label, because anomalous data often occurs over a period of time rather than at a single point. We train the models that need training in an unsupervised way, and the anomaly labels are only used as ground truth when evaluating detection performance. Information for each dataset is summarized in Table 1. For each dataset, we present the total number of signals and the number of anomaly points and spans. It can be seen that the anomaly rates are all about 9.5%.

Table 1 Dataset summary for each dataset in NAB

Experimental setup

The time series data in the dataset are fed into the anomaly detection algorithms in a streaming fashion. For all methods that work offline and need training, we find that if the amount of data used for offline training is insufficient, they produce a large number of false positives on time series with distribution drift, leading to a low F1-score. Therefore, to compare against online methods such as ASOD, we use 100% of the time series data as offline data for these methods. In offline training, we use 80% of the offline data as the training set and the remaining 20% as a validation set to achieve their best performance (Table 2).

Table 2 Summary of baseline methods and ASOD

We run the experiments on a server with 6\(\times\)Intel(R) Xeon(R) CPU E5-2678 v3@2.50GHz, 30 GB of RAM, and 1\(\times\)NVIDIA RTX A2000 GPU. All algorithms are implemented in Python 3.7, and deep learning methods such as LSTM, MLP, and DenseAE are built on CUDA 11.1, cuDNN 8.0.5, and PyTorch 1.8.1. All algorithms are evaluated under the same conditions.

  • The dependency window size of the time series is set to \(w=19\), which means that 19 past moments are used to predict future moments.

  • The window size is \(w+1=20\) for detection methods using the small window as input.

  • For ASOD, we set \(k=5\) in KNN-GMM, which means that the nearest-neighbor Gaussian set contains at most 5 Gaussian components.

  • To transform the input stream into input vectors, the stride is set to 3 in the training stage and 1 in the test stage (see the windowing sketch below).
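As an illustration of the windowing described in the list above, the following sketch turns a one-dimensional stream into input vectors of length \(w+1=20\); the helper name is ours, not from the paper's code.

import numpy as np

def to_windows(series: np.ndarray, window: int = 20, stride: int = 1) -> np.ndarray:
    # Return an array of shape (n_windows, window) of overlapping windows.
    n = (len(series) - window) // stride + 1
    return np.stack([series[i * stride : i * stride + window] for i in range(n)])

stream = np.arange(100, dtype=float)                      # toy stand-in for a NAB signal
train_vectors = to_windows(stream, window=20, stride=3)   # stride 3 for training
test_vectors = to_windows(stream, window=20, stride=1)    # stride 1 for testing
print(train_vectors.shape, test_vectors.shape)            # (27, 20) (81, 20)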

Anomaly detection result by ASOD

We use Figs. 7 and 8 to illustrate the anomaly detection results of ASOD on two typical time series, one without and one with distribution drift. In each figure, the curve is the original observed stream, the top half contains the anomaly points and anomaly spans detected by ASOD, and the bottom half contains the true anomaly labels.

The data stream in Fig. 7 records vehicle speed, where a low speed generally indicates a traffic jam or an accident, and the overall distribution of the sequence is relatively stable. Compared with the real anomaly labels, ASOD finds all the anomaly spans. Although there are false positives, they are expected and can be explained from a streaming perspective: before moment 300, which can be considered the warm-up phase of the model, detections are based on the limited data seen so far, so some false positives are normal. There are also two anomaly spans detected between moments 600 and 800 that are not labeled as anomalies. However, they are reasonable to regard as anomalies in terms of the data series, since their sudden decrease and increase is consistent with the behavior of the other labeled anomaly intervals.

Fig. 7

Anomaly detection result by ASOD (signal name: speed_7578). The gray spans at the bottom are the true anomaly spans

Figure 8 shows the monitoring data of CPU utilization on a cloud server, which is strongly affected by time of day, business scenarios, and other factors. It can be seen that there is a significant distribution drift around moment 3000. ASOD detects two anomaly spans around moments 1300 and 3000, which are basically consistent with the real anomaly labels. In addition, it adjusts to the overall decrease in utilization after moment 3000, so the moments after 3000 are not identified as anomalies. We can explain this from the perspective of the KNN-GMM: the anomaly index of the Gaussian component created by the anomalous points gradually decreases to 0 within the anomaly span, after which that component acts as \(g_{best}\) in subsequent anomaly determinations, terminating the search for anomaly voters and participating in the anomaly vote. This shows that ASOD has good adaptive properties and can effectively handle the distribution drift present in streaming data; a toy illustration of this behavior follows.
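The adaptation just described can be pictured with the following toy sketch; the incremental update and the linear decay schedule are our assumptions for illustration and are not the exact KNN-GMM update rule.

from dataclasses import dataclass

@dataclass
class GaussianComponent:
    mean: float
    var: float
    anomaly_index: float = 1.0   # high when the component is created by an anomalous point

    def absorb(self, x: float, lr: float = 0.05, decay: float = 0.125) -> None:
        # Incrementally update the component and decay its anomaly index;
        # once the index reaches 0 the component behaves like a normal one
        # (e.g. it can serve as g_best and terminate the anomaly-voter search).
        self.mean += lr * (x - self.mean)
        self.var += lr * ((x - self.mean) ** 2 - self.var)
        self.anomaly_index = max(0.0, self.anomaly_index - decay)

g = GaussianComponent(mean=30.0, var=4.0)    # created at the drifted level
for x in [31.0, 29.5, 30.2, 30.8, 29.9, 30.1, 30.4, 29.7]:
    g.absorb(x)
print(g.anomaly_index)   # 0.0 -- the drifted level is now treated as normal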

Fig. 8

Outlier detection result by ASOD (signal name: ec2_cpu_utilization_5f5533). The gray spans at the bottom are the ground-truth anomaly spans

Comparison of baselines and ASOD

Since the number of positives in anomaly detection is much smaller than the number of negatives, and identifying positives correctly is more important, we use the F1-score and recall to evaluate the performance of the anomaly detection methods.
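Both metrics follow the standard point-wise definitions; a minimal computation with scikit-learn is shown below, where the label arrays are toy placeholders rather than NAB data.

from sklearn.metrics import f1_score, recall_score

y_true = [0, 0, 1, 1, 1, 0, 0, 0, 1, 0]   # ground-truth anomaly labels
y_pred = [0, 0, 1, 1, 0, 0, 1, 0, 1, 0]   # detector output

print(f1_score(y_true, y_pred))      # 0.75
print(recall_score(y_true, y_pred))  # 0.75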

Table 3 shows the F1-score of the baselines and ASOD on the 5 real-world categories of the NAB dataset, where the F1-score for each category is the mean over all time series in that category. It should be noted that for RRCF and ASOD, which use online learning, the nature of online streaming data inevitably produces false positives at the initial stage of each sequence; their performance on actual online data streams is therefore higher than what is reported here.

Table 3 F1-score of baselines and ASOD on NAB. (offline percentage = 100%)

As can be seen from the table, the ASOD method achieves a near-optimal F1-score on the entire dataset (0.5002), close to LSTM (0.5384) and MLP (0.5218), with a 6% improvement over the average F1 (0.4402). In addition, it ranks first in several categories, with the highest F1-score in the advertisement category, a 5% improvement over the second-best method, LSTM.

ASOD can be viewed as an online optimized version of GMM and shares the same online working strategy as RRCF for streaming data. It achieves an 8.1% improvement over GMM (0.4192), which works offline, illustrating the effectiveness of the optimizations introduced in KNN-GMM for online detection. It also improves the F1-score by 8.43% compared to RRCF (0.4159), which works online, illustrating the effectiveness of ASOD's context schedule control and nearest-neighbor optimization.

Refining the per-category statistics down to individual data streams, Fig. 9 compares the F1-scores of ASOD, GMM, and RRCF on the 47 data streams. Under the F1-score, ASOD leads GMM and RRCF in terms of leading counts, with its distribution concentrated in the region where it outperforms the other methods.

Fig. 9

F1 rank comparison of ASOD, GMM, and RRCF. The x coordinate indicates how many of the other detection methods a given method outperforms in F1-score on a data stream, and the y coordinate indicates the number of data streams on which the method achieves that number of leads
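The histogram in Fig. 9 can be reproduced from a per-stream F1 table by counting, for each stream, how many competing methods a given method beats; the scores below are placeholders used only to show the counting logic.

from collections import Counter
import numpy as np

f1 = {                       # method -> per-stream F1 (toy placeholder values)
    "ASOD": np.array([0.55, 0.48, 0.62, 0.40]),
    "GMM":  np.array([0.50, 0.46, 0.41, 0.44]),
    "RRCF": np.array([0.52, 0.39, 0.58, 0.43]),
}

def lead_histogram(method: str) -> Counter:
    others = [m for m in f1 if m != method]
    leads = [sum(f1[method][i] > f1[m][i] for m in others)
             for i in range(len(f1[method]))]
    return Counter(leads)    # x = number of methods beaten, y = number of streams

print(lead_histogram("ASOD"))   # Counter({2: 3, 0: 1}) for these toy values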

Under the same experimental conditions, Table 4 shows the recall of the baselines and ASOD. ASOD achieves a near-optimal recall (0.6939) on the whole dataset, close to LSTM (0.7301) and MLP (0.6954); these three methods are ahead of the others in terms of recall. ASOD improves 20.3% over the average recall (0.4909). In addition, it achieves the highest recall in the traffic volume category (0.9167), a 2% improvement over the second-best method, LSTM.

Table 4 Recall of baselines and ASOD on NAB. (offline percentage = 100%)

The effect of temporal context differs across types of data series, and the data distribution varies greatly, which strongly affects the cross-dataset stability of detection methods. We select the 5 best methods, LSTM, MLP, ASOD, TAnoGAN, and DenseAE, and evaluate their stability across the different categories, as shown in Fig. 10. Stability is expressed as the standard deviation of the F1-score. As can be seen from the figure, the F1-scores of all five methods are higher than the mean F1-score over all methods. ASOD has the lowest standard deviation of F1 while achieving near-optimal anomaly detection results. The low performance fluctuation indicates better stability and adaptiveness to different types of data streams.

Fig. 10

Stability comparison of the top 5 methods in terms of F1-score

It is important to note that for methods that require training, we provide sufficient training data, which is equivalent to offline detection. For time series with distribution drift, such as the one in Fig. 8, these methods usually perform worse in the hybrid mode than in offline detection in our experiments. When the amount of offline data is insufficient, a degradation in the performance of these methods can be observed; Table 5 shows the degradation of LSTM as an example. Therefore, the fact that ASOD achieves close to the best performance also indicates its effectiveness in online learning and anomaly detection. More importantly, the model no longer needs to be updated frequently in scenarios where the data distribution drifts, and the cold-start problem can also be addressed.

Table 5 Performance comparison of LSTM with different offline percentage on NAB

Discussion

The comparative analysis of the experimental results shows that ASOD can effectively perform online anomaly detection on streaming data. Its overall performance is very close to that of the optimal methods, which have sufficient training data and work offline. Below we analyze the factors affecting anomaly detection performance from the perspectives of data and model hyperparameters.

  1.

    Real data distribution. Different types of time series exhibit diverse data distributions, which place different requirements on the model architecture and type and make a general anomaly detection model difficult to build. ASOD proposes a K-nearest neighbor Gaussian mixture model that dynamically maintains its Gaussian components and achieves good performance on different types of data by limiting the impact of newly arriving data points to a local region. The performance of ASOD has minimal variance and shows good stability.

  2.

    Distribution drift. In the online scenario, the data stream imposes limitations on the training data and model parameters, mainly because all time series data cannot be stored and future data cannot be observed. Thus, we cannot effectively sample the real data distribution, which makes the distribution drift present in the stream impossible to resolve offline, so hybrid models require regular updates. To this end, ASOD designs a strategy for dynamically adding and removing Gaussian components and attaches a life cycle property to each component to support online updates and self-adaptation to distribution drift. In addition, dynamic context schedule control reduces the changes in temporal context dependence caused by distribution drift.

  3.

    Threshold. The threshold has no effect on the anomaly scoring, but it determines the labeling of anomaly points and anomaly spans. Whether a point is an anomaly depends on prior knowledge: what degree of deviation should be identified as an anomaly is a domain question that varies across scenarios and relies on expert knowledge. The automatic threshold method proposed in this paper is a reasonable approach based on statistical information from the data, and it can be adjusted in conjunction with expert knowledge in specific applications; a generic statistics-based variant is sketched below for illustration.
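As an illustration only (this is a generic statistics-based threshold, not the paper's automatic threshold method), a point can be flagged when its anomaly score exceeds the running mean by \(k\) standard deviations, where \(k\) encodes domain or expert knowledge.

import numpy as np

def statistical_threshold(scores: np.ndarray, k: float = 3.0) -> float:
    # Return mean + k * std of the observed anomaly scores.
    return float(np.mean(scores) + k * np.std(scores))

scores = np.random.default_rng(0).normal(0.0, 1.0, 1000)   # toy anomaly scores
tau = statistical_threshold(scores, k=3.0)
print(tau, int(np.sum(scores > tau)))   # threshold and number of flagged points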

In addition, ASOD facilitates the interpretation and analysis of anomalies. In general, an anomaly detected on stream data can be interpreted in two ways: the true data distribution has drifted, or the observed data is a genuine anomaly. In ASOD, a detected anomaly is accompanied by the creation of a Gaussian component, and by Gaussian components in the voter set with a high anomaly index whose weight is greater than that of \(g_{best}\). By tracking the K-nearest-neighbor Gaussian components of the anomaly and the Gaussian component created by it, we can distinguish the two cases: if the anomaly index of the created component drops to 0, a distribution drift has occurred, whereas if the component's life cycle ends while its anomaly index is still greater than 0, the event is an anomaly not caused by distribution drift. Continuously analyzing the life cycle and anomaly index of the Gaussian components associated with an anomaly therefore makes the anomaly determination more reasonable and provides a basis for further anomaly interpretation and tracing; a minimal decision rule is sketched below.
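Written as a small decision rule, the tracking logic reads as follows; the attribute names are ours, while the life cycle and anomaly index are the quantities discussed above.

from types import SimpleNamespace

def interpret(component) -> str:
    # Interpret a detected anomaly by tracking the Gaussian component it created.
    if component.anomaly_index == 0:
        return "distribution drift"            # the new level became normal behavior
    if component.life_cycle_ended and component.anomaly_index > 0:
        return "anomaly (not caused by drift)"
    return "undetermined (still tracking)"

print(interpret(SimpleNamespace(anomaly_index=0.0, life_cycle_ended=False)))
# -> distribution drift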

Conclusions

In this paper, we study the problem of online anomaly detection for stream data and propose ASOD to overcome the limitations faced by traditional methods: limited training data, distribution drift, and limited update frequency. Based on the characteristics of streaming data, ASOD applies an incremental update algorithm to eliminate the reliance on training data and realize online updates. At the same time, it dynamically maintains Gaussian components according to their K-nearest neighbors and life cycles. Combined with the strategy of dynamic context control, it adapts to distribution drift. The distance metric used in ASOD is a dimensionless measure based on the Mahalanobis distance, which requires less data preprocessing and is applicable to different types of stream data.

By applying the K-nearest neighbor approach, ASOD limits the impact of new data to a local region, reducing the global impact of anomalies and improving the stability of the method. The whole detection process is adjusted automatically through context schedule control and relies less on hyperparameters that require extensive experience or validation sets. ASOD provides the life cycle and anomaly index of Gaussian components for continuous tracking and analysis, which to a certain extent accounts for the type and cause of an anomaly and provides a basis for further interpretation and traceability. The experimental results show that ASOD achieves near-optimal F1-score and recall on the NAB dataset, while having the lowest F1 variance among the five most effective detection methods.

As future work, for better root cause analysis, we will explore anomaly interpretation and traceability for streaming data in depth. In addition, since we ignore the relations between time series and there are limitations in anomaly detection for multivariate time series, we will also investigate optimizing ASOD for scenarios where temporal and spatial contexts both exist.

Availability of data and materials

The datasets analyzed in this study can be found at https://github.com/numenta/NAB/wiki. The source code used to support the findings of this study is available from the corresponding author upon request.

References

  1. Deepa N, Pham QV, Nguyen DC, Bhattacharya S, Prabadevi B, Gadekallu TR et al (2022) A survey on blockchain for big data: approaches, opportunities, and future directions. Future Gener Comput Syst 131:209–226

  2. Mirsky Y, Golomb T, Elovici Y (2020) Lightweight collaborative anomaly detection for the IoT using blockchain. J Parallel Distrib Comput 145:75–97

  3. Du J, Cheng W, Lu G, Cao H, Chu X, Zhang Z, Wang J (2022) Resource pricing and allocation in mec enabled blockchain systems: An a3c deep reinforcement learning approach. IEEE Trans Netw Sci Eng 9(1):33–44

  4. Sayadi S, Rejeb SB, Choukair Z (2019) Anomaly detection model over blockchain electronic transactions. In 2019 15th international wireless communications & mobile computing conference (IWCMC). IEEE, p 895–900

  5. Zheng P, Zheng Z, Luo X, Chen X, Liu, X (2018) A detailed and real-time performance monitoring framework for blockchain systems. In Proceedings of the 40th international conference on software engineering: software engineering in practice, p 134–143

  6. Lu T, Dai H, Wang B (2018) QoE-orientated resource allocation for wireless VR over small cell networks. In: 2018 10th International Conference on Wireless Communications and Signal Processing (WCSP), pp 1–6. https://doi.org/10.1109/WCSP.2018.8555683

  7. Bogner A (2017) Seeing is understanding: anomaly detection in blockchains with visualized features. In: Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable Computers. ACM, Maui Hawaii, pp 5–8

  8. Soldani J, Brogi A (2022) Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey. ACM Comput Surv (CSUR) 55(3):1–39

  9. Ferrag MA, Maglaras L, Moschoyiannis S, Janicke H (2020) Deep learning for cyber security intrusion detection: Approaches, datasets, and comparative study. J Inf Secur Appl 50(102):419

  10. Hassan MU, Rehmani MH, Chen J (2022) Anomaly detection in blockchain networks: a comprehensive survey. IEEE Commun Surv Tutor 25(1):289–318

  11. Xiao H, Cai L, Feng J, Pei Q, Shi W (2023) Resource optimization of MAB-based reputation management for data trading in vehicular edge computing. IEEE Trans Wirel Commun 22(8):5278–5290

  12. Feng J, Zhang W, Pei Q, Wu J, Lin X (2022) Heterogeneous computation and resource allocation for wireless powered federated edge learning systems. IEEE Trans Commun 70(5):3220–3233

  13. Yu J, Alhilal A, Zhou T, Pan H, Tsang DH (2023) Attention-based qoe-aware digital twin empowered edge computing for immersive virtual reality. arXiv preprint arXiv:2305.08569

  14. Ahmed A, Sajan KS, Srivastava A, Wu Y (2021) Anomaly detection, localization and classification using drifting synchrophasor data streams. IEEE Trans Smart Grid 12(4):3570–3580

  15. Feng Y, Liu Z, Chen J, Lv H, Wang J, Yuan J (2022) Make the rocket intelligent at iot edge: Stepwise gan for anomaly detection of lre with multisource fusion. IEEE Internet Things J 9(4):3135–3149

  16. Chang YY, Li P, Sosic R, Afifi MH, Schweighauser M, Leskovec J (2021) F-fade: Frequency factorization for anomaly detection in edge streams. In Proceedings of the 14th ACM international conference on web search and data mining, p 589–597

  17. Eswaran D, Faloutsos C, Guha S, Mishra N (2018) Spotlight: detecting anomalies in streaming graphs. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, p 1378–1386

  18. Wu W, Li R, Xie G, An J, Bai Y, Zhou J, Li K (2019) A survey of intrusion detection for in-vehicle networks. IEEE Trans Intell Transp Syst 21(3):919–933

  19. Togbe MU, Chabchoub Y, Boly A, Barry M, Chiky R, Bahri M (2021) Anomalies detection using isolation in concept-drifting data streams. Computers 10(1):13

  20. Han S, Wu Q, Zhang H, Qin B, Hu J, Shi X, Liu L, Yin X (2021) Log-based anomaly detection with robust feature extraction and online learning. IEEE Trans Inf Forensic Secur 16:2300–2311

  21. Jain M, Kaur G, Saxena V (2022) A k-means clustering and svm based hybrid concept drift detection technique for network anomaly detection. Expert Syst Appl 193:116510

  22. Nisioti A, Mylonas A, Yoo PD, Katos V (2018) From intrusion detection to attacker attribution: a comprehensive survey of unsupervised methods. IEEE Commun Surv Tutor 20(4):3369–3388

  23. Schmidhuber J (2015) Deep learning in neural networks: An overview. Neural Netw 61:85–117

  24. Ji SY, Jeong BK, Choi S, Jeong DH (2016) A multi-level intrusion detection method for abnormal network behaviors. J Netw Comput Appl 62:9–17

  25. Yan X, Zhang H, Xu X, Hu X, Heng PA (2021) Learning semantic context from normal samples for unsupervised anomaly detection. Proceedings of the AAAI Conference on Artificial Intelligence 35:3110–3118

  26. Farzad A, Gulliver TA (2022) Log message anomaly detection with fuzzy C-means and MLP. Appl Intell 52(15):17708–17717

  27. Graves A, Graves A (2012) Long short-term memory. Supervised sequence labelling with recurrent neural networks, 37–45

  28. Ergen T, Kozat SS (2019) Unsupervised anomaly detection with lstm neural networks. IEEE Trans Neural Netw Learn Syst 31(8):3127–3141

  29. Ding L, Fang W, Luo H, Love PE, Zhong B, Ouyang X (2018) A deep hybrid learning model to detect unsafe behavior: Integrating convolution neural networks and long short-term memory. Autom Constr 86:118–124

  30. Jove E, Casteleiro-Roca JL, Quintián H, Méndez-Pérez JA, Calvo-Rolle JL (2021) A new method for anomaly detection based on non-convex boundaries with random two-dimensional projections. Inf Fusion 65:50–57

  31. Vaswani N, Bouwmans T, Javed S, Narayanamurthy P (2018) Robust subspace learning: Robust pca, robust subspace tracking, and robust subspace recovery. IEEE Signal Proc Mag 35(4):32–55

  32. Thill M, Konen W, Wang H, Bäck T (2021) Temporal convolutional autoencoder for unsupervised anomaly detection in time series. Appl Soft Comput 112:107751

  33. Borghesi A, Bartolini A, Lombardi M, Milano M, Benini L (2019) Anomaly detection using autoencoders in high performance computing systems. Proceedings of the AAAI Conference on Artificial Intelligence 33:9428–9433

  34. Gao H et al (2022) Tsmae: a novel anomaly detection approach for internet of things time series data using memory-augmented autoencoder. IEEE Trans Netw Sci Eng 10(5):2978–2990

  35. Han HG, Zhang HJ, Qiao JF (2020) Robust deep neural network using fuzzy denoising autoencoder. Int J Fuzzy Syst 22(4):1356–1375

  36. Geiger A, Liu D, Alnegheimish S, Cuesta-Infante A, Veeramachaneni K (2020) Tadgan: Time series anomaly detection using generative adversarial networks. In: 2020 IEEE International Conference on Big Data (Big Data), IEEE, pp 33–43

  37. Li D, Chen D, Jin B, Shi L, Goh J, Ng SK (2019) Mad-gan: Multivariate anomaly detection for time series data with generative adversarial networks. In: International conference on artificial neural networks. Springer, pp 703–716

  38. Bashar MA, Nayak R (2020) Tanogan: Time series anomaly detection with generative adversarial networks. In: 2020 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, pp 1778–1785

  39. Wazid M, Das AK (2016) An efficient hybrid anomaly detection scheme using k-means clustering for wireless sensor networks. Wirel Pers Commun 90(4):1971–2000

  40. Schölkopf B, Williamson RC, Smola A, Shawe-Taylor J, Platt J (1999) Support vector method for novelty detection. Adv Neural Inf Process Syst 12

  41. Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In kdd (Vol. 96, No. 34), p 226–231

  42. Breunig MM, Kriegel HP, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p 93–104

  43. Liu FT, Ting KM, Zhou ZH (2008) Isolation forest. In: 2008 eighth ieee international conference on data mining. IEEE, pp 413–422

  44. Liu FT, Ting KM, Zhou ZH (2012) Isolation-based anomaly detection. ACM Trans Knowl Discov Data 6(1):1–39

  45. Reynolds DA (2009) Gaussian mixture models. Encyclopedia of biometrics 741, p 659–663

  46. Guha S, Mishra N, Roy G, Schrijvers O (2016) Robust random cut forest based anomaly detection on streams. In: International conference on machine learning. PMLR, pp 2712–2721

  47. Xu C, Wang J, Zhang J, Li X (2021) Anomaly detection of power consumption in yarn spinning using transfer learning. Comput Ind Eng 152:107015

  48. Michau G, Fink O (2021) Unsupervised transfer learning for anomaly detection: application to complementary operating condition transfer. Knowl-Based Syst 216(106):816

  49. Ahmad S, Lavin A, Purdy S, Agha Z (2017a) The numenta anomaly benchmark [white paper]. https://github.com/numenta/NAB/wiki. Accessed 10 Oct 2022

  50. Ahmad S, Lavin A, Purdy S, Agha Z (2017) Unsupervised real-time anomaly detection for streaming data. Neurocomputing 262:134–147

Funding

This work is supported by the National Key R&D Program of China (No. 2018YFB1800702).

Author information

Contributions

Zhichao Hu: Conceptualization, Methodology, Software, Investigation, Formal Analysis, Writing - Original Draft; Xiangzhan Yu: Resources, Supervision; Likun Liu: Visualization, Investigation; Yu Zhang: Data Curation, Writing - Original Draft; Haining Yu: Visualization, Writing - Review & Editing.

Corresponding author

Correspondence to Haining Yu.

Ethics declarations

Ethics approval and consent to participate

This research does not involve any human or animal studies that may raise ethical concerns; thus, it does not require ethical approval.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Hu, Z., Yu, X., Liu, L. et al. ASOD: an adaptive stream outlier detection method using online strategy. J Cloud Comp 13, 120 (2024). https://doi.org/10.1186/s13677-024-00682-0
