An autonomic prediction suite for cloud resource provisioning
Ali Yadavar Nikravesh^{1}, Samuel A. Ajila^{1} (corresponding author) and Chung-Horng Lung^{1}
DOI: 10.1186/s13677-017-0073-4
© The Author(s). 2017
Received: 25 August 2016
Accepted: 19 January 2017
Published: 3 February 2017
Abstract
Effective resource management is one of the challenges of cloud computing because of its autoscaling feature. Prediction techniques have been proposed to improve cloud resource management. This paper proposes an autonomic prediction suite to improve the prediction accuracy of autoscaling systems in the cloud computing environment. Towards this end, it proposes that the prediction accuracy of predictive autoscaling systems will increase if an appropriate timeseries prediction algorithm is selected based on the incoming workload pattern. To test the proposition, a comprehensive theoretical investigation is provided on different risk minimization principles and their effects on the accuracy of timeseries prediction techniques in the cloud environment. In addition, experiments are conducted to empirically validate the theoretical assessment of the hypothesis. Based on the theoretical and the experimental results, this paper designs a self-adaptive prediction suite that automatically chooses the most suitable prediction algorithm based on the incoming workload pattern.
Keywords
Cloud resource provisioning; Autoscaling; Decision fusion technique; Structural risk minimization; Empirical risk minimization; Multilayer perceptron; Multilayer perceptron with weight decay; Workload pattern; Cloud computing
Introduction
The elasticity characteristic of cloud computing and the cloud’s pay-as-you-go pricing model can reduce the cloud clients’ cost. However, maintaining Service Level Agreements (SLAs) with the end users obliges the cloud clients to deal with a cost/performance tradeoff [1]. This tradeoff can be balanced by finding the minimum amount of resources the cloud clients need to fulfill their SLA obligations. In addition, the cloud clients’ workload varies with time; hence, the cost/performance tradeoff needs to be adjusted in accordance with the incoming workload. Autoscaling systems are developed to balance the cost/performance tradeoff automatically.
There are two main classes of autoscaling systems in the Infrastructure-as-a-Service (IaaS) layer of cloud computing: reactive and predictive. Reactive autoscaling systems are the most widely used autoscaling systems in the commercial clouds. The reactive systems scale a cloud service out or in according to its current performance condition [2]. Although the reactive autoscaling systems are easy to understand and use, they neglect the virtual machine (VM) boot-up time, which is reported to be between 5 and 15 min [3]. Neglecting the VM boot-up time results in under-provisioning, which causes SLA violations. Predictive autoscaling systems try to solve this problem by forecasting the cloud service’s future workload and adjusting the compute and the storage capacity in advance to meet the future needs.
The predictive autoscaling systems generate a scaling decision based on the future forecast of a performance indicator’s value. Therefore, to improve the accuracy of the predictive autoscaling systems, researchers have strived to improve the accuracy of the prediction techniques that are being used in the autoscaling systems (see [4] for a comprehensive overview of the autoscaling prediction techniques). According to [4], the most dominant prediction technique in the IaaS layer of the cloud autoscaling domain is timeseries prediction. Timeseries prediction techniques use the historical values of a performance indicator to forecast its future value. Although in recent years many innovative timeseries prediction techniques have been proposed for the autoscaling systems, the existing approaches neglect the influence of the performance indicator pattern (i.e., how the performance indicator values change over time) on the accuracy of the timeseries prediction techniques. This paper proposes an autonomic prediction suite using the decision fusion technique for the resource provisioning of the IaaS layer of the cloud computing environment. The proposed suite identifies the pattern of the performance indicator and accordingly selects the most accurate technique to predict the near future value of the performance indicator for better resource management. The central hypothesis in this paper that serves as the fusion rule of the prediction suite is:
The prediction accuracy of the predictive autoscaling systems is impacted positively by using different prediction algorithms for the different cloud workload patterns
In order to lay out the theoretical groundwork of the prediction suite, this paper first examines the influence of the cloud service’s incoming workload patterns on the mathematical core of the learning process. Previous studies on the predictive autoscaling techniques in the IaaS layer of cloud computing [2, 5, 6] are limited to the experimental evaluation. To the best of our knowledge, none of the research efforts in the predictive autoscaling domain has investigated the theoretical foundations of the predictive autoscaling techniques. Establishing a formal foundation is essential to obtain a solid and more generic understanding of various autoscaling prediction algorithms. Thus, to support the proposed prediction suite, this paper performs a formal study of the theories that have been used in the predictive autoscaling systems. Further, this paper investigates the components that theoretically affect the accuracy of the models. The theoretical investigation provides a formal analysis and explanation for the behaviors of the timeseries prediction algorithms in the cloud environment with different workload patterns. In addition, this paper proposes four subhypotheses in section Theoretical investigation of the hypothesis.
According to the theoretical discussion, the risk minimization principle that is used by the timeseries prediction algorithms affects the algorithms’ accuracy in the environments with the different workload patterns (see Section Theoretical investigation of the hypothesis). Furthermore, to experimentally validate the formal discussion, this paper examines the influence of the workload patterns on the accuracy of three timeseries prediction models: the Support Vector Machine (SVM) algorithm and two variations of the Artificial Neural Network (ANN) algorithm (i.e., MultiLayer Perceptron (MLP) and MultiLayer Perceptron with Weight Decay (MLPWD)). The SVM and the MLPWD algorithms use the Structural Risk Minimization (SRM) principle, but the MLP algorithm uses the Empirical Risk Minimization (ERM) principle to create the prediction model. Therefore, comparing the MLP with the MLPWD algorithm isolates the influence of the risk minimization principle on the prediction accuracy of the ANN algorithms. In addition, since the SVM and the MLPWD algorithms use the same risk minimization approach, comparing the SVM algorithm with the MLPWD algorithm isolates the influence of the regression model on the prediction accuracy.
The main contributions of this paper are:

Proposing an autonomic prediction suite which chooses the most suitable prediction algorithm based on the incoming workload pattern,

Providing the theoretical foundation for estimating the accuracy of the timeseries prediction algorithms in regards to the different workload patterns,

Investigating the impact of the risk minimization principle on the accuracy of the regression models for different workload patterns, and

Evaluating the impact of the input window size on the performance of the risk minimization principle.
The TPC-W web application and Amazon Elastic Compute Cloud (Amazon EC2) are used as the benchmark and the cloud infrastructure, respectively, in our experiments. It should be noted that this paper is scoped to the influence of the workload patterns on the prediction results at the IaaS layer of cloud computing. Other IaaS management aspects (such as VM migration and the physical allocation of the VMs) are out of the scope of this paper.
The remainder of this paper is organized as follows: the Background and related work section discusses the background and the related work. The Self-adaptive workload prediction suite section proposes a high-level design for the self-adaptive prediction suite. The Theoretical investigation of the hypothesis section describes the principles of the learning theory and mathematically investigates the hypothesis. The Experimental investigation of the hypotheses section presents the experimental results that support the theoretical discussion. The conclusion and the possible directions for future research are discussed in the Conclusions and future work section.
Background and related work
In this section, the background concepts that are used in the paper and the related work are introduced. Subsection Workload is an overview of the workload concept and its patterns. Subsections Decision making and Prediction techniques provide an overview of the most dominant autoscaling approaches in two broad categories: decision making and prediction techniques.
Workload

Cloud workloads can be grouped into the following five patterns:

Static workload is characterized by a constant number of requests per minute. This means that there is normally no explicit need to add or remove processing power, memory, or bandwidth in response to workload changes (Fig. 1).

Growing workload represents a load that rapidly increases (Fig. 2).

Periodic workload represents regular periods (i.e., seasonal changes) or regular bursts of the load at specific dates (Fig. 3).

On-and-off workload represents work that is processed periodically or occasionally, such as batch processing (Fig. 4).

Unpredictable workloads are a generalization of the periodic workloads: they require elasticity but are not predictable. This class of workload represents the constantly fluctuating loads without regular seasonal changes (Fig. 5).
Resource allocation for the batch applications (i.e., the on-and-off workload pattern) is usually referred to as scheduling, which involves meeting a certain job execution deadline [4]. Scheduling is extensively studied in the grid environments [4] and also explored in the cloud environments, but it is outside the scope of this paper. Similarly, the cloud services with a stable (or static) workload pattern do not require an autoscaling system for resource allocation per se. Therefore, this paper considers cloud services with the periodic, growing, and unpredictable workload patterns.
Decision making
The authors in [4] group the existing autoscaling approaches into five categories: rule-based technique, reinforcement learning, queuing theory, control theory, and timeseries analysis. Among these categories, the timeseries analysis focuses on the prediction side of the resource provisioning task and is not a “decision making” technique per se. In contrast, the rule-based technique is a pure decision making mechanism, while the rest of the autoscaling categories play the predictor and the decision maker roles at the same time.
The rule-based technique is the only approach which is widely used in the commercial autoscaling systems [9–11]. The popularity of this approach is due to its simplicity and intuitive nature. The rule-based approaches typically have six parameters: an upper threshold (thrU), a lower threshold (thrL), durU and durL that define how long the condition must be met to trigger a scaling action, and inL and inU which indicate the cool-down periods after the scale out and scale in actions [4]. The performance of the rule-based technique depends heavily on these parameters, and finding appropriate values for them is a tricky task. A common problem in rule-based autoscaling, which occurs due to an inappropriate threshold value, is oscillation in the number of the leased VMs. In fact, the durU and the durL parameters are introduced to decrease the number of the scaling actions and reduce the VM oscillations. Some researchers have proposed alternative techniques to address the VM oscillation problem. For instance, the work in [12] uses a set of four thresholds and two durations. Moreover, some research works (such as [13]) have adopted a combination of the rules and a voting system to generate the scaling actions.
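The six parameters can be illustrated with a minimal sketch of a threshold rule. This is not taken from any commercial autoscaler; the names `Rule` and `simulate` are hypothetical, time is measured in monitoring samples, and the mapping of inU to the scale-out cool-down and inL to the scale-in cool-down is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    thr_u: float  # upper threshold (thrU)
    thr_l: float  # lower threshold (thrL)
    dur_u: int    # consecutive samples above thr_u needed to scale out (durU)
    dur_l: int    # consecutive samples below thr_l needed to scale in (durL)
    in_u: int     # cool-down samples after a scale out (assumed meaning of inU)
    in_l: int     # cool-down samples after a scale in (assumed meaning of inL)


def simulate(rule, samples, vms=1, min_vms=1):
    """Return the number of leased VMs after each performance sample."""
    above = below = cooldown = 0
    history = []
    for value in samples:
        if cooldown > 0:
            # During a cool-down period no scaling decision is taken.
            cooldown -= 1
            above = below = 0
        else:
            above = above + 1 if value > rule.thr_u else 0
            below = below + 1 if value < rule.thr_l else 0
            if above >= rule.dur_u:
                vms += 1
                cooldown, above = rule.in_u, 0
            elif below >= rule.dur_l and vms > min_vms:
                vms -= 1
                cooldown, below = rule.in_l, 0
        history.append(vms)
    return history


rule = Rule(thr_u=80, thr_l=30, dur_u=2, dur_l=2, in_u=1, in_l=1)
vm_counts = simulate(rule, [90, 90, 90, 90, 20, 20, 20])
```

Raising `dur_u` and `dur_l` makes the rule ignore short spikes, which is how the duration parameters damp VM oscillations in this sketch.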
Prediction techniques
The most dominant prediction technique in the cloud autoscaling domain is the timeseries analysis [4]. In order to use the timeseries analysis for the cloud autoscaling purposes, a performance indicator is periodically sampled at fixed intervals. The result is a timeseries containing a sequence of the last observations of the performance indicator. The timeseries prediction algorithms extrapolate this sequence to predict the future value. Some of the timeseries prediction algorithms that are used in the existing cloud resource provisioning systems are Moving Average, Autoregression, ARMA, exponential smoothing, and machine learning approaches [4].
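As an illustration of how a sampled performance indicator is extrapolated, the sketch below shows two of the simplest techniques named above (moving average and exponential smoothing) and the sliding-window framing that turns a timeseries into training samples for a machine learning model. The function names and the parameter `alpha` are hypothetical and are not taken from the cited works.

```python
def sliding_windows(series, window):
    """Pair each run of `window` past observations with the value that follows it."""
    return [(series[i:i + window], series[i + window])
            for i in range(len(series) - window)]


def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = series[-window:]
    return sum(recent) / len(recent)


def exponential_smoothing_forecast(series, alpha):
    """Forecast the next value with simple exponential smoothing (0 < alpha <= 1)."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


cpu_load = [10, 12, 14, 16, 18]
samples = sliding_windows(cpu_load, window=3)
```

The `window` argument of `sliding_windows` plays the role of the input window size whose effect on accuracy is studied later in the paper.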
Moving average generally generates poor results for the timeseries analysis [4]. Therefore, it is usually applied only to remove the noise from the timeseries. In contrast, autoregression is widely used in the cloud autoscaling field. The results in [13] show that the performance of the autoregression algorithm depends on the monitoring interval length, the size of the history window, and the size of the adaptation window. ARMA is a combination of the moving average and the autoregression algorithms. The authors in [14] use ARMA to predict the future workload. Machine learning algorithms are used in [3] and [6] to carry out the prediction task in the cloud resource provisioning problem. The authors in [6] evaluate the Artificial Neural Networks (ANN) and the Linear Regression (LR) algorithms for predicting the future value of the CPU load. The results in [6] show that the ANN prediction model surpasses the LR algorithm in terms of prediction accuracy in the autoscaling domain. In addition, the authors in [3] compare the SVM, the ANN and the LR algorithms and show that the SVM algorithm outperforms the ANN and the LR algorithms in predicting the future CPU utilization, response time, and throughput of a cloud service. Furthermore, the authors in [15] propose a self-adaptive method that uses a decision tree to assign the incoming workload to one of the forecasting methods based on the workload characteristics. According to the results of [15], the overall prediction accuracy increases when different prediction algorithms are used for different workloads. However, to the best of our knowledge, none of the research works in the predictive autoscaling domain investigates the theoretical foundations of the correlation between the different workload patterns and the accuracy of the prediction algorithms.
Therefore, this paper performs a formal study of the theories that are closely related to the regression models used in the predictive autoscaling systems and investigates the workload characteristics that affect the accuracy of the regression models.
Self-adaptive workload prediction suite
This section proposes a high-level architectural design of the self-adaptive workload prediction suite. The self-adaptive suite uses the decision fusion technique to increase the prediction accuracy of the cloud autoscaling systems. Decision fusion is defined as the process of fusing information from individual data sources after each data source has undergone a preliminary classification [16]. The self-adaptive prediction suite aggregates the prediction results of multiple timeseries prediction algorithms to improve the final prediction accuracy. The different timeseries prediction techniques use different risk minimization principles to create the prediction model. The theoretical analysis shows that the accuracy of a risk minimization principle depends on the complexity of the timeseries. In addition, since the complexity of a timeseries is defined by its corresponding workload pattern, the theoretical analysis concludes that the accuracy of a regression model is a function of the workload pattern (see Theoretical investigation of the hypothesis section).
Furthermore, the Experimental investigation of the hypotheses section confirms the theoretical conclusion of the Theoretical investigation of the hypothesis section experimentally. In the experiment, two versions of an ANN algorithm (i.e., multilayer perceptron (MLP) and multilayer perceptron with weight decay (MLPWD)) and the Support Vector Machine (SVM) algorithm are used to predict three groups of timeseries. Each timeseries group represents a different workload pattern. The objective of the experiment is to investigate the correlation between the accuracy of the risk minimization principle and the workload pattern.

The experimental results suggest the following:

To predict the future workload in an environment with the unpredictable workload pattern, it is better to use the MLP algorithm with a large sliding window size.

To predict the future workload in an environment with the periodic workload pattern, it is better to use the MLPWD algorithm with a small sliding window size.

To predict the future workload in an environment with the growing workload pattern, it is better to use the SVM algorithm with a small sliding window size.
The self-adaptive prediction suite uses the experimental results as the fusion rule to aggregate the SVM, the MLP, and the MLPWD prediction algorithms in order to improve the prediction accuracy of the cloud autoscaling systems. The prediction suite senses the pattern of the incoming workload and automatically chooses the most accurate regression model to carry out the workload prediction. Each workload is represented by a timeseries. To identify the workload pattern, the proposed self-adaptive suite decomposes the incoming workload by using the Loess package of the R software suite [17], which splits a workload into its seasonal, trend, and remainder components. If the workload has strong seasonal and trend components which repeat at fixed intervals, then the workload has a periodic pattern. If the trend component is constantly increasing or decreasing, then the workload has a growing pattern. Otherwise the workload has an unpredictable pattern.
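The fusion rule and the pattern test described above can be sketched as a simple lookup. The sketch assumes that the decomposition has already been done elsewhere: `strong_seasonality` is a hypothetical flag standing in for whatever test decides that the seasonal component is strong, and `trend` is the extracted trend component. The function names and the labels "large" and "small" are illustrative only.

```python
# Fusion rule: workload pattern -> (prediction algorithm, sliding window size).
FUSION_RULE = {
    "unpredictable": ("MLP", "large"),
    "periodic": ("MLPWD", "small"),
    "growing": ("SVM", "small"),
}


def is_monotonic(trend):
    """True if the trend is constantly increasing or constantly decreasing."""
    diffs = [b - a for a, b in zip(trend, trend[1:])]
    return bool(diffs) and (all(d > 0 for d in diffs) or all(d < 0 for d in diffs))


def classify_pattern(strong_seasonality, trend):
    """Apply the three pattern rules in the order given in the text."""
    if strong_seasonality:
        return "periodic"
    if is_monotonic(trend):
        return "growing"
    return "unpredictable"


def select_predictor(strong_seasonality, trend):
    """Return the (algorithm, window size) pair chosen by the fusion rule."""
    return FUSION_RULE[classify_pattern(strong_seasonality, trend)]
```

For example, `select_predictor(False, [1, 2, 3])` picks the SVM entry because the trend rises at every step and no strong seasonality was flagged.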
The self-adaptive suite constantly monitors the characteristics of the incoming workload (i.e., seasonal and trend components) and replaces the prediction algorithm according to a change in the incoming workload pattern. To this end, the autonomic system principles are used to design the self-adaptive workload prediction suite.
A typical autonomic system consists of a context, an autonomic element, and a computing environment [20–22]. In addition, the autonomic system receives the goals and gives the feedback to an external environment. An autonomic element regularly senses the sources of change by using the sensors. In the prediction suite, the sensor is the change in the workload pattern (Fig. 7).
The presented cloud autoscaling architecture consists of the cloud workload context, the cloud autoscaling autonomic system, and the cloud computing scaling decisions. The cloud workload context consists of two meta-autonomic elements: workload pattern and cloud autoscaling. In addition, a component for the autonomic manager, knowledge, and goals is added to the architecture.
A careful examination of the strategy design pattern (Fig. 10) shows that the context is in turn designed by using the template design pattern. The intent of the template design pattern is to define the skeleton of an algorithm (or a function) in an operation that defers some steps to subclasses [23].
In a generic strategy design pattern, the context is simply an abstract class with no concrete subclasses. We have modified this by using the template pattern to introduce concrete subclasses that represent the different workload patterns and to implement the workload pattern context as an autonomic element. This way, the cloud workload pattern is determined automatically and the pattern interface is passed on to the predictor autonomic element, which then invokes the appropriate prediction algorithm for the workload pattern. The training and the testing (i.e., the prediction) are then carried out with the selected algorithm.
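A minimal sketch of this combination of the strategy and the template patterns is given below. All class names are hypothetical, and `StubPredictor` is only a placeholder that repeats the last observation; in the actual suite a trained SVM, MLP, or MLPWD model would take its place.

```python
from abc import ABC, abstractmethod


class Predictor(ABC):
    """Strategy interface: one implementation per prediction algorithm."""

    name = "abstract"

    def train(self, history):
        self._history = list(history)

    @abstractmethod
    def predict(self):
        ...


class StubPredictor(Predictor):
    """Placeholder strategy: forecasts the last observed value."""

    def __init__(self, name):
        self.name = name

    def predict(self):
        return self._history[-1]


class WorkloadContext(ABC):
    """Template: the forecast skeleton is fixed, subclasses pick the algorithm."""

    def forecast(self, history):
        predictor = self.make_predictor()   # step deferred to the subclass
        predictor.train(history)            # training
        return predictor.name, predictor.predict()  # testing (prediction)

    @abstractmethod
    def make_predictor(self):
        ...


class PeriodicContext(WorkloadContext):
    def make_predictor(self):
        return StubPredictor("MLPWD")


class GrowingContext(WorkloadContext):
    def make_predictor(self):
        return StubPredictor("SVM")


class UnpredictableContext(WorkloadContext):
    def make_predictor(self):
        return StubPredictor("MLP")
```

Each concrete context corresponds to one workload pattern, so swapping the context object is enough to swap the prediction algorithm while the train-then-predict skeleton stays unchanged.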
Theoretical investigation of the hypothesis
Machine learning can be classified into supervised learning, semi-supervised learning, and unsupervised learning. Supervised learning deduces a functional relationship from the training data that generalizes well to the whole dataset. In contrast, unsupervised learning has no training dataset and the goal is to discover the relationships between the samples or reveal the latent variables behind the observations [5]. Semi-supervised learning falls between supervised and unsupervised learning by utilizing both labeled and unlabeled data during the training phase [24]. Among the three categories of machine learning, supervised learning is the best fit to solve the prediction problem in the autoscaling area [5]. Therefore, this paper investigates the theoretical foundation of supervised learning.
To accept or reject the hypothesis, we start with the formal definition of the machine learning and then explore the risk minimization principle as the core function of the learning theory. The definitions in the following subsections are taken from [25].
Formal definition of the machine learning process
The learning process can be described through three components:
1. A generator of random vectors x. The generator uses a fixed but unknown distribution P(x) to independently produce the random vectors.
2. A supervisor, which is a function that returns an output vector y for every input vector x, according to a conditional distribution function P(y|x). The conditional distribution function is fixed but unknown.
3. A learning machine that is capable of implementing a set of functions f (x, w), w ∈ W, where x is a random input vector, w is a parameter of the function, and W is a set of abstract parameters that are used to index the set of functions f (x, w) [25].
To improve the accuracy, the functional risk R(w) should be minimized over a class of functions f (x, w), w ∈ W. The problem in minimizing the functional risk is that the joint probability distribution P(x, y) = P(y|x)P(x) is unknown and the only available information is contained in the training set.
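For orientation, in the standard formulation of statistical learning theory the functional risk is usually written as the expected loss over the joint distribution. The loss symbol L below is notation assumed here for illustration; it measures the discrepancy between the supervisor's response y and the learning machine's output f (x, w):

\[ R(w) = \int L\left(y, f(x, w)\right)\, dP(x, y), \qquad w \in W \]

Because the integral is taken with respect to the unknown P(x, y), it cannot be evaluated directly from a finite training set.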

The components of the learning process map to the autoscaling problem as follows:

Supervisor’s response is analogous to the timeseries of workload values, which is determined by P (x, y).

Independent observations are equivalent to the training dataset and indicate the historical values of the workload.

Learning machine maps to the Predictor component.
In the autoscaling problem domain, P (x, y) refers to the workload distribution. Suppose that we have a set of candidate predictor functions f (x, w), w ∈ W and we want to find the most accurate function among them. Given that only the workload values for the training duration are known, the functional risk R(w) cannot be calculated for the candidate predictor functions f (x, w), w ∈ W; hence, the most accurate prediction function cannot be found.
Empirical risk minimization
The empirical risk minimization (ERM) principle assumes that the function \( f\left({x}_i,\ {w}_l^{*}\right) \), which minimizes the empirical risk E(w) (i.e., the error measured on the training set) over the set w ∈ W, results in a functional risk \( R\left({w}_l^{*}\right) \) which is close to the minimum.
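The ERM principle can be illustrated with a minimal sketch: the empirical risk is taken to be the mean loss over the training pairs, and the candidate with the smallest value is selected. The squared loss, the three candidate functions, and the training values are assumptions made for the example only.

```python
def empirical_risk(predict, samples):
    """Mean squared error of `predict` over the (x, y) training pairs."""
    return sum((y - predict(x)) ** 2 for x, y in samples) / len(samples)


def erm_select(candidates, samples):
    """Return the name of the candidate function with the smallest empirical risk."""
    return min(candidates, key=lambda name: empirical_risk(candidates[name], samples))


# Hypothetical training set and a tiny family of functions f(x, w) = w * x.
training = [(1, 2.0), (2, 4.1), (3, 5.9)]
candidates = {
    "w=1": lambda x: 1.0 * x,
    "w=2": lambda x: 2.0 * x,
    "w=3": lambda x: 3.0 * x,
}
best = erm_select(candidates, training)
```

Note that the selection uses the training pairs only; whether the chosen function also has a low functional risk depends on the capacity of the function set, which is the subject of the following paragraphs.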
According to the theory of the uniform convergence of empirical risk to actual risk [26], the convergence rate bounds are based on the capacity of the set of functions that are implemented by the learning machine. The capacity of the learning machine is referred to as the VC-dimension (for Vapnik-Chervonenkis dimension) [27], which represents the complexity of the learning machine.
Applying the theory of uniform convergence to the autoscaling problem domain implies that the convergence rate bounds in the autoscaling domain are based on the complexity (i.e., VC-dimension) of the regression model that is used in the Predictor component.
Equation (4) determines the bound of the regression model’s error. Based on this equation, the probability of error of the regression model is less than the frequency of error in the training set plus the confidence interval. According to Eq. (4), the ERM principle is appropriate when the confidence interval is small (i.e., the functional risk is bounded by the empirical risk).
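For orientation, a commonly cited form of this bound in statistical learning theory is shown below. It is given as an assumption about the general shape of Eqs. (4) and (5), and the exact constants in the authors' equations may differ. Here l is the training dataset size, h is the VC-dimension, and the bound holds with probability at least 1 − η:

\[ R(w) \le E(w) + C_0\left(\frac{l}{h}, \eta\right), \qquad C_0\left(\frac{l}{h}, \eta\right) = \sqrt{\frac{h\left(\ln\frac{2l}{h} + 1\right) - \ln\frac{\eta}{4}}{l}} \]

The first term is the frequency of error on the training set and the second term is the confidence interval, which shrinks as the ratio of l to h grows.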
Structural risk minimization
Equations (4) and (5) show the bound of the regression model’s error and the confidence interval. In Eqs. (4) and (5), l is the size of the training dataset and h is the VC-dimension or the complexity of the regression model. According to Eq. (5), when \( \frac{l}{h} \) is large, the confidence interval becomes small and can be neglected. In this case, the functional risk is bounded by the empirical risk, which means the probability of error on the testing dataset is bounded by the probability of error on the training dataset.
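The effect of the ratio of l to h can be checked numerically with a small sketch. The formula used here is one commonly cited form of the VC confidence term and is an assumption; it is not necessarily identical to the authors' Eq. (5). The function name and the default value of `eta` are hypothetical.

```python
import math


def confidence_term(l, h, eta=0.05):
    """Assumed VC confidence term for training size l and VC-dimension h."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)


# Fixed training size, increasing model complexity.
simple_model = confidence_term(l=1000, h=5)
complex_model = confidence_term(l=1000, h=100)
```

With the training size held fixed, the term grows as h grows, which matches the statement that a small ratio of l to h makes the confidence interval too large to neglect.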
On the other hand, when \( \frac{l}{h} \) is small, the confidence interval cannot be neglected and even E(w) = 0 does not guarantee a small probability of error. In this case, to minimize the functional risk R(w), both E(w) and \( {C}_0\left(\frac{l}{h},\eta \right) \) (i.e., the empirical risk and the confidence interval) should be minimized simultaneously. To this end, it is necessary to control the VC-dimension (i.e., complexity) of the regression model. In other words, when the training dataset is complex, the learning machine increases the VC-dimension to shatter^{1} the training dataset. By increasing the VC-dimension, the regression model becomes strongly tailored to the particularities of the training dataset and does not perform well on new data (the overfitting situation).
Therefore, the structural risk minimization (SRM) principle describes a general model of the capacity (or complexity) control and provides a tradeoff between the hypothesis space complexity (i.e., the VC-dimension) and the quality of fitting the training data.
Workload pattern effects on prediction accuracy of empirical and structural risk minimizations

Hypothesis 1a: The structural risk minimization principle performs better in the environments with the periodic and growing (i.e., predictable) workload patterns.

Hypothesis 1b: The empirical risk minimization principle performs better in the environments with the unpredictable workload pattern.

Hypothesis 1c: Increasing the window size does not have a positive effect on the performance of the structural risk minimization principle in the cloud computing environments.

Hypothesis 1d: Increasing the window size improves the performance of the empirical risk minimization principle in the unpredictable environments and has no positive effect on the performance of the empirical risk minimization principle in the periodic and the growing environments.
Stating these subhypotheses provides a basis for proving the main hypothesis of this research. To systematically prove the subhypotheses, this section provides theoretical reasoning to explain the behaviors of the empirical and the structural risk minimization principles with regard to the different workload patterns in the cloud computing environment.
As shown in the Empirical risk minimization section, \( \frac{l}{h} \) determines whether to use the empirical or the structural risk minimization. In this paper we assume the training dataset size (i.e., l) is static; therefore, for small values of h, the fraction \( \frac{l}{h} \) is large. In this case, the confidence interval is small and the functional risk is bounded by the empirical risk.
In environments with the predictable workload patterns (i.e., periodic or growing) the training and the testing datasets are not complex. Thus, in such environments h is small and both the empirical and the structural risk minimizations perform well. However, it is possible that the empirical risk minimization becomes overfitted to the training dataset. The reason is that, although the periodic and the growing workloads follow a repeatable pattern, it is highly probable that some of the data points in the training dataset do not follow the main pattern of the timeseries (i.e., noise data). The noise in the data increases the complexity of the regression model. Increasing the complexity (i.e., VC-dimension) increases the confidence interval as well as the probability of error (see Eq. (5)), which reduces the ERM accuracy. On the other hand, the SRM principle controls the complexity by neglecting the noise in the data, which reduces the confidence interval. Therefore, in the environments with the periodic and the growing workload patterns the SRM approach is expected to outperform the ERM approach (hypothesis 1a).
The same reasoning applies to the environments with the unpredictable workload pattern. In the unpredictable environments there is no distinctive workload trend and none of the data points should be treated as noise. The ERM approach therefore increases the VC-dimension to shatter all of the training data points. Because the training and the testing datasets follow the same unpredictable pattern, increasing the VC-dimension also helps the prediction model to predict the fluctuations of the testing dataset. In contrast, the SRM approach controls the VC-dimension to decrease the confidence interval. Therefore, the SRM approach cannot capture the fluctuating nature of the unpredictable workload pattern and trains a less accurate regression model compared to the ERM approach (hypothesis 1b).
In the machine learning domain, window size refers to the input size of the prediction algorithm. Increasing the window size provides more information for the prediction algorithm and is expected to increase the accuracy of the prediction model. However, increasing the input size makes the prediction model more complex. To manage the complexity, the SRM approach compromises between the accuracy and the VC-dimension. Therefore, increasing the window size does not necessarily affect the accuracy of the SRM prediction model (hypothesis 1c).
Furthermore, because the ERM approach cannot control the complexity of the regression model, increasing the window size increases the VC-dimension of the prediction model. In the predictable environments (i.e., the periodic and the growing patterns) the training and the testing datasets are not complex and the ERM principle is able to capture the timeseries behaviors by using smaller window sizes. However, increasing the window size in the predictable environments increases the noise in the training dataset, which causes a bigger confidence interval and reduces the accuracy of the prediction model. On the other hand, due to the fluctuations in the unpredictable datasets, none of the data points in the training dataset should be considered as noise. Therefore, in the unpredictable environments increasing the window size helps the ERM principle to shatter more training data. Because the training and the testing datasets follow the same unpredictable pattern, increasing the window size also improves the precision of the ERM principle in predicting the fluctuations of the testing dataset (hypothesis 1d).
Experimental investigation of the hypotheses section experimentally investigates the theoretical discussion of this section and evaluates the four subhypotheses.
Summary
The research in the learning theory provides a rich set of knowledge in learning the complex relationships and patterns in the datasets. Vapnik et al. show that the proportion of the training dataset size to the complexity of the regression model determines whether to use the empirical or the structural risk minimization [25]. In the autoscaling domain, the Predictor component corresponds to the learning machine of the learning process. Therefore, to improve the accuracy of the Predictor component, the risk minimization principle should be determined based on the complexity of the prediction techniques (i.e., the VC-dimension) and the training dataset size. The workload pattern complexity is the main driving factor of the Predictor component’s VC-dimension. Four subhypotheses are introduced in order to examine the risk minimization principles vis-à-vis the different workload patterns.
Experimental investigation of the hypotheses
The main goal of the experiment presented in this section is to verify the behavior of the empirical and the structural risk minimization principles in environments with the periodic, growing, and unpredictable workload patterns. Various learning algorithms have been used as the predictor for autoscaling purposes (see Prediction techniques section), and each uses either the empirical or the structural risk minimization. In our previous work [2], we used the SVM algorithm, which is based on the structural risk minimization, and the ANN algorithm, which uses the empirical risk minimization. The experimental results in [2] showed that in the environments with the periodic and the growing workload patterns the SVM algorithm outperforms the ANN algorithm, whereas ANN is more accurate in forecasting the unpredictable workloads. These results support the theoretical discussion in Workload pattern effects on prediction accuracy of empirical and structural risk minimizations section. In this paper, however, the goal is to zero in on two different implementations of the ANN algorithm in order to compare the effect of the structural and the empirical risk minimizations on the ANN prediction accuracy. Therefore, this experiment uses two implementations of the ANN algorithm (i.e., MLP and MLPWD) to isolate the influence of the risk minimization principle on the prediction accuracy: MLP uses the ERM principle and MLPWD uses the SRM principle. In addition, since both the MLPWD and the SVM algorithms use the SRM principle, the accuracy of MLPWD is compared with that of SVM to isolate the impact of the regression model structure on the accuracy of the machine learning algorithms.
Sections Multilayer perceptron with empirical risk minimization, Multilayer perceptron with structural risk minimization, and Support vector machines briefly explain MLP, MLPWD, and SVM algorithms, respectively. Sections Training and testing of MLP and MLPWD, Evaluation metrics, and Experimental results describe the experiment and the results.
Multilayer perceptron with empirical risk minimization
There are different variations of the Artificial Neural Network (ANN), such as backpropagation, feedforward, time delay, and error correction [5]. MLP is a feedforward ANN that maps the input data to the appropriate output.
Multilayer perceptron with structural risk minimization

Structure given by the architecture of the neural network.
Structure given by the learning procedure.
Structure given by the preprocessing.
The nested structure can be created by appropriately choosing Lagrange multipliers γ_{1} > γ_{2} > … > γ_{n}. According to Eq. (10), the well-known weight-decay procedure corresponds to the structural risk minimization [25].
The authors of [29] have shown that the conventional weight-decay technique can be regarded as a simplified version of the structural risk minimization in neural networks. Therefore, in this paper we use the MLPWD algorithm to study the accuracy of the structural risk minimization for predicting the different classes of workload.
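To make the weight-decay idea concrete, here is a minimal sketch. It is not the WEKA implementation used in the experiments; it is plain gradient descent on a one-feature linear model, where a penalty proportional to the squared weight adds the term γ·w to the gradient. The function name, the data, the learning rate, and the value of γ are all hypothetical.

```python
def fit_linear(xs, ys, gamma=0.0, lr=0.05, epochs=2000):
    """Fit y ~ w*x + b by gradient descent on
    mean squared error + (gamma / 2) * w**2 (the weight-decay penalty)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) * 2 / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) * 2 / n
        w -= lr * (grad_w + gamma * w)  # decay term pulls w toward zero
        b -= lr * grad_b
    return w, b

# Hypothetical data, roughly y = 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 2.1, 3.9, 6.2, 7.8]

w_erm, _ = fit_linear(xs, ys, gamma=0.0)    # empirical risk only
w_decay, _ = fit_linear(xs, ys, gamma=1.0)  # empirical risk + weight decay
print(f"w without decay: {w_erm:.3f}, w with decay: {w_decay:.3f}")
```

The penalized fit ends with a smaller weight than the unpenalized one. A larger γ confines the solution to a smaller subset of the hypothesis space, which is how a decreasing sequence of multipliers induces a nested structure.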
Support vector machines
Support Vector Machine (SVM) is used for many machine learning tasks, such as pattern recognition, object classification, and, in the case of time-series prediction, regression analysis. Support Vector Regression (SVR) is the methodology by which a function is estimated from the observed data. In this paper, the terms SVR and SVM are used interchangeably.
Experimental setup
In this experiment, workload represents the arrival rate of web service requests. Workload is a key performance indicator of a given web service and can be used to calculate other performance indicators of that service (such as utilization and throughput). Furthermore, monitoring the workload of a web service is straightforward and can be carried out by instrumentation. Therefore, in this experiment the workload of the web service is the target class of the prediction techniques.
The goal of this experiment is to compare the accuracy of the MLP, the MLPWD, and the SVM algorithms for predicting the periodic, the growing, and the unpredictable workload patterns. The components required to conduct this experiment are a benchmark to generate the workload patterns, an infrastructure to deploy the benchmark, and an implementation of the prediction algorithms. A Java implementation of TPC-W [30] and Amazon EC2 are used as the benchmark and the infrastructure, respectively. In addition, the implementations of the MultiLayer Perceptron and Support Vector Machine algorithms in the WEKA tool are used to carry out the prediction task.
The MLP algorithm in the WEKA tool [31] has various configuration parameters, including a parameter that switches the weight decay feature on or off (i.e., the decay parameter). To use the empirical risk minimization, the decay parameter is left at its default value (i.e., off); to use the structural risk minimization, it is switched on.
Hardware specification of servers for experiment
Server  Memory  Processor  Storage
Client  1 GB  4 core  8 GB
Web server  1 GB  4 core  8 GB
Database  2 GB  8 core  20 GB
On the client side, a customized script is used along with the TPC-W workload generator to produce the growing, the periodic, and the unpredictable workload patterns. In this experiment, workload represents the arrival rate of webpage requests. Each workload pattern is generated for 500 min. To improve the accuracy of the results, the experiment is repeated 10 times for each workload pattern. On the web server machine, the total number of user requests is written to the log files every minute. This results in 10 workload trace files for each workload pattern, each with 500 data points. We refer to the workload trace files as the actual workloads in the rest of this paper.
Training and testing of MLP and MLPWD
In our previous work [1] we showed that in the autoscaling domain the optimum training duration for the ANN and the SVM algorithms is 60% of the experiment duration. Therefore, in this experiment the first 300 data points (i.e., 60%) of each actual workload trace file are used as the training dataset and the remaining 200 data points are used for testing.
Another important factor in the training and the testing of time-series prediction algorithms is the dimensionality of the datasets (i.e., the number of features in the dataset). In this experiment, the actual datasets have only one feature, which is the number of requests that arrive at the cloud service per minute. Therefore, in order to use the machine learning prediction algorithms, the sliding window technique is used. The sliding window technique uses the last k samples of a given feature to predict the future value of that feature. For example, to predict the value of b_{k+1}, the sliding window technique uses the values [b_{1}, b_{2}, …, b_{k}]. Similarly, to predict b_{k+2}, the technique updates the historical window by adding the actual value of b_{k+1} and removing the oldest value (i.e., the sliding window becomes [b_{2}, b_{3}, …, b_{k+1}]). Setting the sliding window size is not a trivial task. Smaller window sizes usually do not reflect the correlation between the data samples thoroughly, while bigger window sizes increase the chance of overfitting. Thus, in this experiment the effect of the sliding window size on the prediction accuracy of MLP and MLPWD is studied as well.
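The sliding window construction described above can be sketched as follows. This is a minimal illustration, assuming the trace is held in a plain Python list; the function name and the sample values are hypothetical and do not come from the experiment's trace files.

```python
def sliding_windows(series, k):
    """Turn a univariate series into (window, target) pairs.

    Each window holds k consecutive values and the target is the value
    that immediately follows the window.
    """
    if k < 1 or k >= len(series):
        raise ValueError("window size k must satisfy 1 <= k < len(series)")
    return [(series[i:i + k], series[i + k]) for i in range(len(series) - k)]

# Hypothetical requests-per-minute trace, for illustration only.
trace = [12, 15, 11, 18, 20, 17, 22]
pairs = sliding_windows(trace, k=3)
for window, target in pairs:
    print(window, "->", target)
```

Each pair becomes one training or testing instance with k input attributes, so a larger k gives the learner more inputs per instance and leaves slightly fewer instances overall.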
MLP and MLPWD configurations
Parameter Name  MLP Value  MLPWD Value

Learning Rate (ρ)  0.3  0.3 
Momentum  0.2  0.2 
Validation Threshold  20  20 
Hidden Layers  1  1 
Hidden Neurons  (attributes + classes)/2  (attributes + classes)/2 
Decay  False  True 
SVM configuration
Parameter Name  Value 

C (complexity parameter)  1.0 
kernel  RBF Kernel 
regOptimizer  RegSMOImproved 
Evaluation metrics
The MAE metric is a linear score in which all of the individual errors are weighted equally. The RMSE, in contrast, squares the errors before averaging them and therefore gives large errors more weight, which makes it most useful when large errors are particularly undesirable [34].
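The difference between the two metrics can be seen in a small sketch that uses their standard definitions. The function names and the sample values are hypothetical and chosen so that both prediction sets have the same MAE.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: every error contributes in proportion to its size."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: squaring gives large errors more weight."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [10, 10, 10, 10]
small_errors = [12, 8, 12, 8]       # four errors of size 2
one_large_error = [10, 10, 10, 18]  # a single error of size 8

print(mae(actual, small_errors), rmse(actual, small_errors))
print(mae(actual, one_large_error), rmse(actual, one_large_error))
```

Both prediction sets have an MAE of 2, but the RMSE of the set with one large error is twice that of the set with four small errors.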
Experimental results
Table 4 Average MAE and RMSE values (periodic pattern)
Phase  Window size  MLP MAE  MLPWD MAE  SVM MAE  MLP RMSE  MLPWD RMSE  SVM RMSE
Training  2  6.88  4.16  4.65  8.55  6.65  7.31
Training  3  6.7  4.12  4.62  8.32  6.32  7
Training  4  6.5  4.11  4.62  8.12  6.12  6.99
Training  5  5.95  4.05  4.52  8  6.44  6.8
Training  6  5.78  4.02  4.52  7.56  6.12  6.7
Training  7  5.68  3.88  4.32  7.5  6.2  6.7
Training  8  5.68  3.95  4.3  7.12  6.21  6.6
Training  9  5.51  4.02  4.3  6.9  6.18  6.8
Training  10  4.98  4  4.31  6.52  6.18  6.7
Testing  2  6.2  6  6  8.31  8  8.1
Testing  3  6.3  5.9  6  8.31  7.9  7.98
Testing  4  6.3  5.8  6.1  8.34  7.9  8.05
Testing  5  6.99  5.9  6.2  8.62  7.8  8.15
Testing  6  7.15  5.7  6.1  8.77  7.4  8
Testing  7  7.25  5.72  6  9.12  7  7.71
Testing  8  7.98  5.75  6  9.15  7  7.65
Testing  9  8.56  5.66  5.8  10.36  7.1  7.5
Testing  10  9.2  5.58  5.7  11.89  6.9  7.6
Table 5 MAE and RMSE values (growing pattern)
Phase  Window size  MLP MAE  MLPWD MAE  SVM MAE  MLP RMSE  MLPWD RMSE  SVM RMSE
Training  2  2.5  2.1  1.7  3.9  4.02  3.8
Training  3  2.8  2.3  1.75  3.9  3.82  3.7
Training  4  2.7  2.3  1.8  4.1  3.87  3.6
Training  5  2.7  2.5  1.8  3.98  3.89  3.7
Training  6  2.6  2.4  1.8  3.88  3.71  3.6
Training  7  2.7  2.4  1.81  3.84  3.81  3.6
Training  8  2.66  2.33  1.78  3.78  3.62  3.5
Training  9  2.8  2.22  1.78  3.95  3.7  3.3
Training  10  2.57  2.25  1.78  4  3.7  3.4
Testing  2  3.77  3  2.5  4.4  4  3.7
Testing  3  3.85  3.6  2.5  4.91  4.21  3.7
Testing  4  3.55  3.5  2.6  4.92  4.5  3.65
Testing  5  3.64  3.41  2.4  4.71  4.22  3.6
Testing  6  3.89  3.42  2.3  4.52  4.31  3.7
Testing  7  3.84  3.31  2.2  5.11  4  3.7
Testing  8  3.95  3.02  2.2  5.52  3.99  3.4
Testing  9  4.12  3  2.2  5.98  3.95  3.5
Testing  10  4.1  2.8  2.2  6.02  3.9  3.7
Table 6 MAE and RMSE values (unpredictable pattern)
Phase  Window size  MLP MAE  MLPWD MAE  SVM MAE  MLP RMSE  MLPWD RMSE  SVM RMSE
Training  2  1.4  1.74  1.81  2.6  2.9  3.15
Training  3  1.42  1.73  1.73  2.61  2.88  3.2
Training  4  1.43  1.72  1.78  2.59  2.87  3.31
Training  5  1.4  1.73  1.75  2.55  2.89  3.15
Training  6  1.35  1.69  1.73  2.4  2.91  3.19
Training  7  1.46  1.66  1.72  2.48  2.98  3.2
Training  8  1.44  1.65  1.74  2.31  2.74  3.2
Training  9  1.48  1.66  1.66  2.2  2.65  3.16
Training  10  1.44  1.65  1.68  2.15  2.74  3.17
Testing  2  2.6  2.82  3.12  3.01  3.41  3.31
Testing  3  2.5  2.8  3.1  3  3.4  3.64
Testing  4  2.34  2.77  2.9  3  3.38  3.7
Testing  5  2.21  2.76  2.88  2.98  3.41  3.71
Testing  6  1.98  2.44  2.89  2.9  3.42  3.78
Testing  7  1.65  2.4  2.85  2.8  3.21  3.88
Testing  8  1.42  2.1  2.91  2.7  3.2  3.9
Testing  9  0.98  2.11  2.92  2.4  3.11  4.1
Testing  10  0.98  2.1  2.9  2.2  2.8  4.18
Hypothesis 1.a: the SRM principle performs better in the environments with the predictable workload patterns
Based on the results, the SRM principle is more accurate than the ERM principle for forecasting the predictable workload patterns (i.e., the periodic and the growing workloads).
Hypothesis 1.b: the ERM principle performs better in the environments with the unpredictable workload patterns
Hypothesis 1.c: increasing the window sizes does not have a positive effect on the performance of the SRM principle
According to Tables 4 and 5, in the periodic and the growing environments increasing the window size does not affect the accuracies of the MLPWD and the SVM algorithms. The reason is that the SRM principle controls the prediction model’s complexity by neglecting some of the training data points. As a result, increasing the window size neither increases nor decreases the accuracy of the prediction models.
When the window size is increased in the unpredictable environments, the MLPWD accuracy slightly improves while the SVM accuracy slightly decreases (Table 6). However, these changes in the accuracies of MLPWD and SVM are negligible. Therefore, it can be concluded that for all of the workload patterns, increasing the window size has no substantial effect on the prediction accuracy of the SRM principle.
Hypothesis 1.d: increasing the window size improves the performance of the ERM principle in the unpredictable environments and has no positive effect on the performance of the ERM principle in the predictable environments
Based on Fig. 16, for the smaller window sizes in the periodic environment the MLP accuracy is close to the MLPWD and the SVM accuracies. However, as the window size increases, the MLP accuracy decreases. Similarly, in the environments with the growing workload pattern, the MLP prediction accuracy has a decreasing trend, although it changes less as the window size increases. This is because increasing the window size of the MLP algorithm leads to overfitting, which decreases the MLP accuracy. As shown in Fig. 16, during the training phase the MLP accuracy increases as the window size increases, which shows that the MLP algorithm becomes overfitted to the training dataset. The results confirm that in the environments with the periodic workload pattern, increasing the sliding window size has no positive effect on the prediction accuracy of the ERM principle.
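The overfitting argument can be checked against the MLP columns of the periodic-pattern table (Table 4). The sketch below copies the training and testing MAE values for window sizes 2 to 10 from that table and computes the gap between them; the variable names are illustrative.

```python
# MLP MAE values for the periodic pattern, copied from Table 4.
window_sizes = [2, 3, 4, 5, 6, 7, 8, 9, 10]
mlp_train_mae = [6.88, 6.7, 6.5, 5.95, 5.78, 5.68, 5.68, 5.51, 4.98]
mlp_test_mae = [6.2, 6.3, 6.3, 6.99, 7.15, 7.25, 7.98, 8.56, 9.2]

# A growing (test - train) gap is the usual symptom of overfitting.
gaps = [round(test - train, 2) for train, test in zip(mlp_train_mae, mlp_test_mae)]
for k, gap in zip(window_sizes, gaps):
    print(f"window={k:>2}  test MAE - train MAE = {gap:+.2f}")
```

The gap widens steadily with the window size: the training error keeps falling while the testing error rises, which matches the overfitting explanation given above.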
Unlike the growing and the periodic patterns, increasing the window size has a positive effect on the prediction accuracy of the MLP algorithm in the environment with the unpredictable workload pattern. The reason is that in the unpredictable environments there are many fluctuations in the data; therefore, with a small window the ERM prediction models cannot extract the relationships between the features thoroughly. Increasing the window size increases the input size of the algorithms, which improves the prediction accuracy of the ERM principle.
Experimental results conclusion
The results of the experiments support the theoretical conclusion presented in Section Workload pattern effects on prediction accuracy of empirical and structural risk minimizations, which suggests the use of the SRM principle in the environments with the growing and the periodic workload patterns. In addition, the experimental results show that increasing the window size does not improve the SRM accuracy. On the other hand, for the environments with the unpredictable workload pattern, it is better to use the ERM principle with the bigger window sizes. Based on the experimental results, Section Selfadaptive workload prediction suite proposes an autonomic prediction suite which chooses the most accurate prediction algorithm based on the incoming workload pattern.
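As a quick check of this conclusion, the following sketch picks, for each workload pattern, the algorithm with the lowest testing MAE at window size 10, using the values reported in Tables 4 to 6. It is only an illustrative lookup with hypothetical names, not the selection logic of the proposed prediction suite.

```python
# Testing MAE at window size 10, copied from Tables 4-6.
test_mae = {
    "periodic": {"MLP": 9.2, "MLPWD": 5.58, "SVM": 5.7},
    "growing": {"MLP": 4.1, "MLPWD": 2.8, "SVM": 2.2},
    "unpredictable": {"MLP": 0.98, "MLPWD": 2.1, "SVM": 2.9},
}

def most_accurate(pattern):
    """Return the algorithm with the lowest testing MAE for a workload pattern."""
    scores = test_mae[pattern]
    return min(scores, key=scores.get)

choices = {pattern: most_accurate(pattern) for pattern in test_mae}
print(choices)
```

The SRM-based algorithms (MLPWD and SVM) come out ahead for the periodic and the growing patterns, and the ERM-based MLP comes out ahead for the unpredictable pattern.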
Conclusions and future work
This paper proposed a self-adaptive prediction suite that aims to improve the accuracy of predictive autoscaling systems for the IaaS layer of cloud computing. The prediction suite uses the decision fusion technique and facilitates the selection of the most accurate prediction algorithm and window size with respect to the incoming workload pattern. The proposed architecture uses the strategy and the template design patterns, which guarantee the automatic runtime selection of the appropriate prediction algorithm as well as the detection of a suitable workload pattern and an appropriate window size. To lay out the theoretical foundation of the prediction suite, this paper proposed and evaluated a main hypothesis and four subhypotheses on the accuracy of several time-series prediction models in the IaaS layer of cloud computing. According to the main hypothesis, the prediction accuracy of the predictive autoscaling systems can be increased by choosing an appropriate time-series prediction algorithm based on the incoming workload pattern.
To the best of our knowledge, the theoretical foundation of the predictive autoscaling systems has not been investigated in the existing research works. Therefore, this paper performs a formal study of the theories that are closely related to the accuracy of predictive autoscaling systems. To evaluate the main hypothesis, we have proposed four subhypotheses concerning the influence of the risk minimization principle on the prediction accuracy of the regression models in the environments with different workload patterns. To test these subhypotheses, the theoretical fundamentals of the prediction algorithms were investigated through analyzing the learning theory and the risk minimization principles.
Based on the formal analysis, the structural risk minimization outperforms the empirical risk minimization for predicting the periodic and the growing workload patterns, but the empirical risk minimization is a better fit for forecasting the unpredictable workload pattern. Furthermore, experiments were conducted to validate the theoretical discussion. In the experiments, the influence of the risk minimization principle on the accuracy of the MLP and the MLPWD algorithms for predicting different workload patterns was examined. Moreover, the experiments compared the accuracy of the MLPWD and the SVM to isolate the impact of the regression model’s structure on the prediction accuracy. The experimental results support the theoretical discussion. The results also show that increasing the sliding window size has a positive impact only on the accuracy of the MLP algorithm in the environments with the unpredictable workload pattern. In the other environments (i.e., the growing and the periodic workload patterns), increasing the window size does not improve the prediction accuracies of the MLP, the MLPWD, and the SVM algorithms. The theoretical analysis and the experimental results demonstrated that using an appropriate prediction algorithm based on the workload pattern increases the prediction accuracy of the autoscaling systems. Thus, based on the theoretical and experimental results in this paper, we can accept the main hypothesis, that is, the prediction accuracy of time-series techniques is positively impacted by using different prediction algorithms for the different cloud workload patterns.
In the current work, we assume that the database tier has no negative impact on the autoscaling prediction accuracy. Investigating the impact of the database tier on the prediction accuracy warrants further research. In addition, we aim to investigate the relationship between the database tier autoscaling, the workload patterns, and the sliding window sizes. Finally, the autonomic elements in Fig. 10 will be redesigned to include more time-series algorithms and possibly more workload patterns.
Shattering definition: A model f with parameter vector θ shatters a set of data points (x_{1}, x_{2}, …, x_{n}) if, for every assignment of labels to the data points, there exists a θ such that f makes no error on that set of data points.
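The definition can be illustrated with a very small model family. The sketch below assumes a one-dimensional threshold classifier with a threshold θ and an orientation flag, and checks by brute force whether every binary labeling of a set of distinct points can be reproduced. The classifier and the sample points are hypothetical and serve only to illustrate the definition.

```python
from itertools import product

def predict(x, theta, sign):
    """1-D threshold classifier: label 1 on one side of theta, 0 on the other."""
    return 1 if sign * (x - theta) > 0 else 0

def shatters(points):
    """Check whether the threshold family realizes every labeling of distinct points."""
    pts = sorted(points)
    # Candidate thresholds: below all points, between neighbours, above all points.
    thetas = [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]
    for labels in product([0, 1], repeat=len(pts)):
        realizable = any(
            all(predict(x, theta, sign) == y for x, y in zip(pts, labels))
            for theta in thetas
            for sign in (1, -1)
        )
        if not realizable:
            return False
    return True

print(shatters([1.0, 2.0]))       # True: all four labelings are reachable
print(shatters([1.0, 2.0, 3.0]))  # False: the labeling (0, 1, 0) is not reachable
```

This family shatters two points but not three, so its VC-dimension is 2.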
Declarations
Acknowledgements
We would like to express our thanks to the departmental technical and administrative staff who provided resources and support to AYN during his PhD research work.
Authors’ contributions
This research work is primarily based on AYN’s PhD research and thesis report, which was co-supervised by SAA and CL. All authors contributed to the technical aspects and the writing of the paper. AYN designed and implemented the experiments based on guidance from SAA and CL. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
1. Nikravesh AY, Ajila SA, Lung CH (2015) Evaluating sensitivity of autoscaling decisions in environments with different workload patterns, Proceedings of the 39th IEEE International Computers, Software & Applications Conference Workshops, pp 690–695
2. Nikravesh AY, Ajila SA, Lung CH (2015) Towards an autonomic autoscaling system for cloud resource provisioning, Proceedings of the 10th International Symposium on Software Engineering for Adaptive and Self-Managing Systems, pp 33–45
3. Ajila SA, Bankole AA (2013) Cloud client prediction models using machine learning techniques, Proceedings of the IEEE 37th Computer Software and Application Conference, p 143
4. Lorido-Botran T, Miguel-Alonso J, Lozano JA (2014) A review of autoscaling techniques for elastic applications in cloud environments. Journal of Grid Computing 12(4):559–592
5. Bankole AA (2013) Cloud client prediction models for cloud resource provisioning in a multitier web application environment, Master of Applied Science Thesis, Electrical and Computer Engineering Department, Carleton University
6. Islam S, Keung J, Lee K, Liu A (2012) Empirical prediction models for adaptive resource provisioning in the cloud. Journal of Future Generation Computer Systems 28(1):155–165
7. Fehling C, Leymann F, Retter R, Schupeck W, Arbitter P (2014) Cloud computing patterns: fundamentals to design, build, and manage cloud applications, 1st edn. Springer-Verlag Wien publisher, ISBN 9783709115688
8. Workload Patterns for Cloud Computing (2010) [Online], Available: http://watdenkt.veenhof.nu. Accessed 3 July 2010
9. Amazon Elastic Compute Cloud (Amazon EC2) (2013) [Online], Available: http://aws.amazon.com/ec2/. Accessed 10 Feb 2013
10. RackSpace, The Open Cloud Company (2012) [Online], Available: http://rackspace.com. Accessed 12 June 2012
11. RightScale Cloud management (2012) [Online], Available: http://www.rightscale.com/homev1?utm_expid=4119285885.eCMJVCEGRMuTt8X6n9PcEw.1. Accessed 20 June 2012
12. Hasan MZ, Magana E, Clemm A, Tucker L, Gudreddi SLD (2012) Integrated and autonomic cloud resource scaling, Proceedings of IEEE Network Operation Management Symposium, pp 1327–1334
13. Kupferman J, Silverman J, Jara P, Browne J (2009) Scaling into the cloud, Technical report, Computer Science Department, University of California, Santa Barbara
14. Roy N, Dubey A, Gokhale A (2011) Efficient autoscaling in the cloud using predictive models for workload forecasting, Proceedings of 4th IEEE International Conference on Cloud Computing, pp 500–507
15. Herbst NR, Huber N, Kounev S, Amrehn E (2013) Self-adaptive workload classification and forecasting for proactive resource provisioning, Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering, pp 187–198
16. Benediktsson JA, Kanellopoulos I (1999) Classification of multisource and hyperspectral data based on decision fusion. Journal of IEEE Transactions on Geoscience and Remote Sensing 37(3):1367–1377
17. Local polynomial regression fitting. [Online], Available: http://stat.ethz.ch/Rmanual/Rdevel/library/stats/html/loess.html. Accessed 10 Feb 2010
18. Garlan D, Schmerl B (2002) Model-based adaptation for self-healing systems, Proceedings of the 1st Workshop on Self-healing systems, pp 27–32
19. Sterritt R, Smyth B, Bradley M (2005) PACT: personal autonomic computing tools, Proceedings 12th IEEE International Conference and Workshops on Engineering of Computer-Based Systems, pp 519–527
20. Bigus JP, Schlosnagle DA, Pilgrim JR, Mills WN III, Diao Y (2002) ABLE: a toolkit for building multiagent autonomic systems. IBM Syst J 41(3):350–371
21. Littman ML, Ravi N, Fenson E, Howard R (2004) Reinforcement learning for autonomic network repair, Proceedings of International Conference on Autonomic Computing, pp 284–285
22. Dowling J, Curran E, Cunningham R, Cahill V (2006) Building autonomic systems using collaborative reinforcement learning. Journal of Knowledge Engineering Review 21(03):231–238
23. Gamma E, Helm R, Johnson R, Vlissides J (1994) Design patterns: elements of reusable object-oriented software, 1st edn. Addison-Wesley Professional publisher, ISBN 0201633612 (22nd printing, July 2001)
24. Wang S, Summers RM (2012) Machine learning and radiology. Journal of Medical Image Analysis 16(5):933–951
25. Vapnik V (1992) Principles of risk minimization for learning theory, Proceedings of Advanced Neural Information Processing Systems Conference, pp 831–838
26. Vapnik V, Chervonenkis A (1978) Necessary and sufficient conditions for the uniform convergence of means to their expectations. Journal of Theory Probability 3(26):7–13
27. Sewell M (2008) VC-Dimension, Technical report, Department of Computer Science, University College London
28. Sewell M (2008) Structural risk minimization, Technical report, Department of Computer Science, University College London
29. Yeh C, Tseng P, Huang K, Kuo Y (2012) Minimum risk neural networks and weight decay technique, Proceedings of 8th International Conference on Emerging Intelligent Computing Technology and Applications, pp 10–16
30. TPC-W benchmark. [Online]. Available: http://www.tpc.org/tpcw/. Accessed 10 Feb 2010
31. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software. Newsletter of ACM SIGKDD Explorations 11(1):10–18
32. Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer Series in Statistics publisher, ISBN 9780387848587
33. Witten I, Frank E (2011) Data mining: practical machine learning tools and techniques with Java implementations, 3rd edn. Morgan Kaufmann publisher, ISBN 9780123748560 (pbk)
34. Chai T, Draxler R (2014) Root mean square error (RMSE) or mean absolute error (MAE) – arguments against avoiding RMSE in the literature. Journal of Geoscience Model Development 7(1):1247–1250