Decision making in cloud environments: an approach based on multiple-criteria decision analysis and stochastic models

Araujo, Julian; Maciel, Paulo; Andrade, Ermeson; Callou, Gustavo; Alves, Vandi; Cunha, Paulo

doi:10.1186/s13677-018-0106-7

Research
Open access
Published: 27 March 2018

Julian Araujo¹,
Paulo Maciel¹,
Ermeson Andrade²,
Gustavo Callou²,
Vandi Alves¹ &
Paulo Cunha¹

Journal of Cloud Computing volume 7, Article number: 7 (2018)

Abstract

Cloud computing is a paradigm that provides services through the Internet. The paradigm has been influenced by previously available technologies (for example cluster, peer-to-peer, and grid computing) and has now been adopted by almost all large organizations. Companies such as Google, Amazon, Microsoft and Facebook have made significant investments in cloud computing, and now provide services with high levels of dependability. The efficient and accurate assessment of cloud-based infrastructure is fundamental in guaranteeing, as far as possible, both business continuity and uninterrupted public services. This paper presents an approach for selecting the cloud computing infrastructures that, in terms of dependability and cost, best suit both company and customer needs. We use stochastic models to calculate dependability-related metrics for different cloud infrastructures. We then use a Multiple-Criteria Decision-Making (MCDM) method to rank the best cloud infrastructures, taking customer service constraints such as reliability, downtime, and cost into consideration. A case study demonstrates the practicability and usefulness of the proposed approach.

Introduction

Cloud computing has enabled the emergence of several service-oriented resources, such as Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS) [27]. The development and use of such cloud-based services have resulted in an increased number of users, and higher levels of data produced by different devices and applications. Several corporations and institutions have shown an interest in cloud computing, and because of this many cloud computing platforms have been proposed. Google, Amazon, Microsoft, and Facebook are examples of companies that are investing heavily in cloud computing services [30, 37].

Cloud computing has grown rapidly, and has gained popularity because it offers several benefits including on-demand self-service, virtualization, geographic distribution, and resilience [3]. These benefits are particularly attractive because they can offer flexibility guarantees for customer service constraints such as downtime and cost, which are negotiated by cloud providers through Quality of Service (QoS) guarantees. However, providing cloud services according to the customer needs and specific constraints remains a challenge. Parameters such as reliability, capacity-oriented availability and cost are relevant factors in the negotiation of such services [6, 10]. Therefore, an efficient and accurate assessment of cloud infrastructures considering availability, reliability and cost requirements is fundamental in allowing customers to identify a cloud infrastructure that suits their needs and preferences.

To provide uninterrupted cloud services, cloud managers must evaluate and improve dependability aspects of cloud infrastructures, such as availability and transaction loss; this is because users require a reasonable level of confidence in such infrastructures to efficiently plan and operate their business [19]. Some services may be considered mission-critical, and depending on the number of data operations involved, it may be essential to deploy redundancy strategies [24]. Such strategies can lead to avoidance of outage due to issues such as database deadlock, data loss, or network failure. Cloud outages can cause significant financial losses to an organization, and in extreme cases may result in the failure of the business [4].

In this context, dependability models like Reliability Block Diagrams (RBDs) and Stochastic Petri Nets (SPNs) can be useful when comparing cloud infrastructures [7, 28]. Cloud infrastructures differ from one another in many aspects, and this results in a significant challenge for cloud users attempting to identify the infrastructure that best suits their needs [34, 44]. There are always trade-offs when considering different cloud alternatives; for example, robust cloud infrastructures may result in unnecessary costs to guarantee against events that are very unlikely to happen, while simple infrastructures may result in loss of critical data. MCDM methods, which consist of techniques to solve such multi-criteria problems, are essential because they can assist cloud users in choosing the best cloud infrastructure, and can take into account multiple criteria like capacity-oriented availability, reliability, downtime, or cost.

MCDM methods are designed to analyze and give recommendations on situations involving a large number of alternatives and conflicting criteria. In [33], the authors presented a case study to compare different MCDM methods in order to select IaaS services. Garg et al. [14] proposed a framework based on an Analytic Hierarchy Process (AHP) to rank cloud providers. In [23], the authors presented a multi-attribute group decision-making (MAGDM) approach for selecting cloud providers. Unlike these works, we present an approach based on an MCDM method and stochastic models to evaluate, rank and find a set of optimal cloud environments considering dependability (e.g., capacity-oriented availability and reliability) and cost requirements.

The process of choosing a cloud infrastructure can be slow, tedious and costly; it can also generate conflicts of interest considering a set of alternatives. In recognition of the importance of making an appropriate cloud infrastructure choice, we propose a novel approach which implements a Multiple-Criteria Decision-Making (MCDM) method to rank the best infrastructure, and takes customer service constraints such as dependability and cost into consideration. Although there are several methods for multi-criteria decision-making, we adopted the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) method [11] due to its simplicity and ease of application. The results allow cloud customers to identify and choose the cloud infrastructure that best suits their needs, in a fast and efficient manner. Specifically, our contributions are:

We design and implement a strategy that combines a decision-making method with the capacity of stochastic models to obtain dependability-related metrics (like reliability and capacity-oriented availability).
Our modeling strategy is based on hierarchical and heterogeneous modeling for planning cloud infrastructures, which allows the evaluation of cloud infrastructures with complex redundant mechanisms and maintenance policies.
We develop a tool (called MiPACE) to support the planning of cloud infrastructures, which considers customer service constraints and assists in the decision-making process.
We demonstrate the feasibility of our approach by showing real case scenarios and identifying a set of ideal cloud infrastructures.

The remaining sections are organized as follows. “Background” section describes some general concepts. “Related work” section presents the related work. “Adopted strategy and base cloud architecture” section shows the proposed approach for ranking cloud infrastructures according to customers’ needs, and details the base cloud architecture adopted by this paper. “Hierarchical models and cost equations” section illustrates the hierarchical modeling process and cost equations. “MiPACE: a multi-criteria tool for planning and analysis of cloud environments” section presents the developed tool used to support the decision-making process. “Results and discussion” section illustrates the proposed approach through a real-world case study. “Final remarks” section concludes the paper and presents future directions.

Background

This section introduces fundamental concepts of multiple-criteria decision-making, dependability modeling, stochastic Petri nets, and reliability block diagrams.

Multiple-criteria decision-making

In our daily lives, we do not consider just one criterion when making a decision, but rather compare and evaluate more than one alternative simultaneously. When purchasing a cloud service for example, security, processing power, networking throughput, and storage capacity may all be considered as main criteria. It would be unusual for the cheapest cloud service to have the highest reliability and unlimited storage, and it is necessary to evaluate all potential impacts when making decisions that involve long-term commitment and budget allocation. Thus, companies must consider multiple criteria when determining the best cost-benefit ratio. Multiple-Criteria Decision-Making (MCDM) methods have been developed to support the decision-making process in problems that exhibit multiple conflicting criteria, and thus provide techniques for finding a set of optimal solutions.

A large number of MCDM techniques have been proposed, each with different perspectives and theories. Some techniques are used to solve ranking problems, such as the Analytic Hierarchy Process (AHP), Analytic Network Process (ANP), Elimination and Choice Expressing Reality (ELECTRE III), and Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) approaches [11]. Other approaches combine monitoring and historical data with ranking techniques for decision making [15]. In this paper, the concept of the TOPSIS method has been adopted for ranking cloud infrastructures. TOPSIS is a useful technique for ranking and selecting a number of externally determined alternatives, using distance measures such as Euclidean, Manhattan and Minkowski [39]. These distance measures order alternative solutions from the best to the worst by means of scores or pairwise comparisons. The method is based on five steps. In the first step, the set of alternative solutions is assembled, taking the defined criteria into account. In the second step, the values representing each criterion are normalized; this allows all criteria to be treated in a similar way, independent of the metric adopted. In the third step, the criteria are weighted, and the distances from each solution to an ideal point (the optimal solution) and to an anti-ideal point are calculated. In the fourth step, the relative closeness to the ideal solution is calculated. In the last step, the set of alternatives is ranked according to the relative closeness [18].
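The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `topsis`, the use of vector normalization in step 2, and the three example infrastructures with their availability, reliability and cost values are all assumptions made for the example.

```python
import math


def topsis(matrix, weights, benefit, p=2):
    """Rank alternatives with a basic TOPSIS procedure (illustrative sketch).

    matrix  : step 1, one row per alternative and one column per criterion
    weights : one weight per criterion
    benefit : True if the criterion is to be maximized, False if minimized
    p       : Minkowski order (1 = Manhattan, 2 = Euclidean)
    """
    n_crit = len(weights)
    # Step 2: normalize each criterion column (vector normalization is assumed).
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    # Step 3a: apply the criterion weights to the normalized values.
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    # Step 3b: build the ideal and anti-ideal points, criterion by criterion.
    cols = list(zip(*v))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]

    def dist(a, b):
        # Minkowski distance of order p between two points.
        return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

    # Step 4: relative closeness to the ideal point (1 = ideal, 0 = anti-ideal).
    scores = []
    for row in v:
        d_pos, d_neg = dist(row, ideal), dist(row, anti)
        scores.append(d_neg / (d_pos + d_neg))
    # Step 5: rank alternative indices from best to worst.
    ranking = sorted(range(len(matrix)), key=lambda i: scores[i], reverse=True)
    return scores, ranking


# Hypothetical alternatives: availability, reliability, yearly cost.
alternatives = [
    [0.9900, 0.80, 1000.0],
    [0.9990, 0.90, 1500.0],
    [0.9999, 0.95, 4000.0],
]
scores, ranking = topsis(alternatives, [0.4, 0.3, 0.3], [True, True, False])
print(ranking, scores)
```

Passing `p=1` switches the distance to Manhattan, and any other positive `p` gives the corresponding Minkowski distance. The sketch does not guard against a criterion column of zeros or a set of identical alternatives, both of which would cause a division by zero.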

Dependability modeling and evaluation

The dependability [21] of a system is defined as its justifiably trusted ability to deliver a set of services. Dependability requirements encompass the concepts of reliability, availability, maintainability, performability, and testability. This paper focuses on availability and reliability modeling, and analysis of cloud infrastructures. Availability is the probability that the system is working (even if not at its full capacity) over time, whereas reliability is the probability that the system will deliver a set of services over a given period of time [21, 25]. The steady state availability (A) may be calculated by Eq. 1:

$$ A = \frac{MTTF}{MTTF + MTTR} $$

(1)

where MTTF and MTTR denote the mean time to failure and mean time to repair, respectively.
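As a small worked example of Eq. 1, the sketch below computes the steady-state availability and the corresponding expected yearly downtime. The function names and the MTTF and MTTR values are assumptions chosen for illustration; they are not taken from the case study.

```python
def steady_state_availability(mttf_hours, mttr_hours):
    """Eq. 1: A = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def annual_downtime_hours(availability, hours_per_year=8760.0):
    """Expected downtime per year implied by a steady-state availability."""
    return (1.0 - availability) * hours_per_year


# Hypothetical component: fails on average once a year, repaired in 8 hours.
a = steady_state_availability(8760.0, 8.0)
print(a, annual_downtime_hours(a))
```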

For any given time period represented by the interval (0,t), R(t) is the probability that the component has continued to function (i.e. has not failed) from 0 until t. When an exponentially distributed time to failure (TTF) is considered, reliability is represented by

$$ R(t)= \exp \left[ -\int_{0}^{t} \lambda \left(t^{'}\right)dt^{'} \right] $$

(2)

where $\lambda (t^{'})$ is the instantaneous failure rate. For an exponentially distributed TTF, the failure rate is a constant λ, and Eq. 2 reduces to $R(t) = e^{-\lambda t}$.
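Eq. 2 can be evaluated numerically for any failure-rate function. The following is a minimal sketch that integrates the failure rate with the trapezoidal rule; the function name `reliability`, the step count and the example rates are assumptions for illustration only.

```python
import math


def reliability(hazard, t, steps=1000):
    """Eq. 2: R(t) = exp(-integral from 0 to t of hazard(u) du).

    The integral is approximated with the trapezoidal rule.
    """
    h = t / steps
    area = 0.5 * (hazard(0.0) + hazard(t))
    area += sum(hazard(i * h) for i in range(1, steps))
    return math.exp(-area * h)


# Constant failure rate (exponential TTF): matches exp(-lambda * t).
lam = 1.0 / 1000.0  # hypothetical rate, one failure per 1000 hours
print(reliability(lambda u: lam, 500.0), math.exp(-lam * 500.0))

# Hypothetical failure rate that grows linearly with time (wear-out).
print(reliability(lambda u: 1e-6 * u, 500.0))
```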

We also adopt the Capacity-Oriented Availability (COA) as defined in [25, 42]. COA takes into account how much of a service a system is actually delivering; it therefore considers not only the states of availability or unavailability, but also the impact of these conditions on service delivery. The COA calculation considers pc_i as the operational processing capacity, or the amount of resource available, at state s_i, and π_i as the probability of being at state s_i∈S, where S is the set of reachable states. The maximum capacity of the system is N. Thus, we can calculate the COA by Eq. 3.

$$ COA = \frac{\sum_{s_{i}\in S}~pc_{i} \times \pi_{i} }{N} $$

(3)
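As an illustration of Eq. 3, the sketch below computes the COA from a list of (pc_i, π_i) pairs. The function name and the three-state example (a system with a maximum capacity of two VMs) are hypothetical.

```python
def capacity_oriented_availability(states, max_capacity):
    """Eq. 3: COA = sum(pc_i * pi_i) / N.

    states       : iterable of (pc_i, pi_i) pairs, one per reachable state
    max_capacity : N, the maximum capacity of the system
    """
    states = list(states)
    total_prob = sum(p for _, p in states)
    if abs(total_prob - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    return sum(pc * p for pc, p in states) / max_capacity


# Hypothetical system with N = 2 VMs:
# both VMs up (95%), one VM up (4%), no VM up (1%).
print(capacity_oriented_availability([(2, 0.95), (1, 0.04), (0, 0.01)], 2))
```

In this example the plain availability (at least one VM up) is 0.99, while the COA is lower because the state with a single VM delivers only half of the capacity.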

Redundancy techniques

In several application domains, different techniques have been adopted to increase the dependability of systems. These techniques are traditionally classified into four groups: fault prevention, fault removal, fault forecasting, and fault tolerance [38]. Unlike other techniques, fault tolerance (redundancy) aims to provide correct service delivery even in the presence of faults. Redundancy refers to extra resources that are not necessary for the execution of the faultless task, but must be applied if faults occur to guarantee the service delivery.

The redundancy techniques for fault tolerance include active-standby and active-active redundancy [5]. In an active-active redundancy mechanism, both the main elements (e.g., resource and service) and the redundant elements are permanently active. The users do not perceive the occurrence of faults, nor does performance degradation take place. In contrast, active-standby mechanisms are characterized by fault detection followed by recovery actions, which require extra processing time. This type of strategy uses two component types: active, and standby. The active module usually provides the service for all environments; if the active module fails, the standby component assumes control. Standby modules are classified as hot, warm, or cold, depending on the level of service restoration [24].

Stochastic Petri net

Petri nets are very well suited for modeling several system types. This is because concurrency, synchronization, communication mechanisms, and deterministic and probabilistic delays are naturally represented. In general, a Petri net is a bipartite directed graph, in which places (represented by circles) denote local states, and transitions (depicted as rectangles) represent actions. Arcs (directed edges) connect places to transitions, and vice versa. The original Petri net does not have the notion of time for analyzing performance and dependability; the introduction of event durations results in a timed Petri net.

A stochastic Petri net (SPN) [26] is a special type of timed Petri net that associates probabilistic delays with transitions by using the exponential distribution. It is a high-level model from which a Continuous-Time Markov Chain (CTMC) can be generated and evaluated automatically [42]. This characteristic is particularly useful when the system’s state space is large and/or the interactions between system components are complex. SPNs may also be evaluated through simulation, which may be the alternative when non-phase-type distributions [42] are required and/or the system state space is infinite.

Reliability block diagram

A Reliability Block Diagram (RBD) [12] is a combinatorial model, initially proposed as a technique for calculating the reliability of a system by using block diagrams. The technique has also been extended to calculate other dependability metrics, such as availability [21, 24, 25]. An RBD may be the model of choice for computing availability- and reliability-related metrics for passive redundant mechanisms and/or systems with independent components [25]. In RBDs, models are usually obtained by serial and parallel composition of components and subsystems.

In a series arrangement, the whole system is no longer operational if a single component fails. This means that all components must be operational for the serial system to succeed. If a system with n independent components is considered, the reliability (instantaneous availability or steady-state availability) is obtained by the product of the components’ reliabilities (instantaneous availabilities or steady-state availabilities). In a parallel arrangement, the whole system is considered operational even if only a single component is operational, because there are a total of n possible success paths. For a system with n independent components, the unreliability (instantaneous unavailability or steady-state unavailability) is obtained by the product of the components’ unreliabilities (instantaneous unavailabilities or steady-state unavailabilities). k-out-of-n redundancy may also be represented by RBDs. k-out-of-n RBD models can represent more general compositions than simple series or parallel configurations. In fact, simple series and parallel configurations are special cases of k-out-of-n compositions [24, 25, 31, 42].
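The three compositions can be written down directly. The sketch below is illustrative only: the function names are hypothetical, the components are assumed independent, and the k-out-of-n formula additionally assumes n identical components with the same probability p.

```python
import math


def series(probs):
    """Series block: all components must work, so probabilities multiply."""
    return math.prod(probs)


def parallel(probs):
    """Parallel block: the system fails only if every component fails."""
    return 1.0 - math.prod(1.0 - p for p in probs)


def k_out_of_n(k, n, p):
    """k-out-of-n block with n identical, independent components.

    The system works if at least k of the n components work.
    """
    return sum(math.comb(n, i) * p ** i * (1.0 - p) ** (n - i)
               for i in range(k, n + 1))


p = 0.9  # hypothetical component reliability
print(series([p, p, p]), k_out_of_n(3, 3, p))    # n-out-of-n is a series system
print(parallel([p, p, p]), k_out_of_n(1, 3, p))  # 1-out-of-n is a parallel system
print(k_out_of_n(2, 3, p))                       # a 2-out-of-3 arrangement
```

The first two printed lines show the special cases mentioned above: n-out-of-n coincides with the series composition, and 1-out-of-n coincides with the parallel composition.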

Failure critical index

In general, component importance ranking indicates the impact of a particular component on the overall system reliability. Based on certain system characteristics, various measures are calculated to estimate the component importance, and this often relates the contribution of a component to the system failure. Birnbaum introduced this concept, which can be considered one of the most widely used reliability importance indices [21]. The Birnbaum importance of a component i is equal to the degree of improvement in system reliability when the reliability of the component is increased by one unit [21]. In other words, RI (reliability importance) is the partial derivative of system reliability with respect to the reliability of each individual component [17].

The RI of component i can be computed as

$$ I^{B}_{i} = R_{s}\left(1_{i},\textbf{p}^{i}\right) - R_{s}\left(0_{i},\textbf{p}^{i}\right) $$

(4)

where $I^{B}_{i}$ is the reliability importance of component i, pⁱ is the component reliability vector with the ith component removed, 0_i represents the condition when component i fails, and 1_i describes the condition when i is working.

$I^{B}_{i}$ depends on the structure of the system and the reliability of the other components. The RI of a component i is determined by the reliability of the other components, excluding i [32].
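Eq. 4 can be evaluated for any system whose reliability is available as a function of the component reliabilities. The sketch below is a minimal illustration; the function names and the example structure (one component in series with a parallel pair) and its reliability values are assumptions.

```python
def birnbaum_importance(structure, reliabilities, i):
    """Eq. 4: I_i = R_s(1_i, p) - R_s(0_i, p).

    structure     : function mapping a list of component reliabilities
                    to the system reliability
    reliabilities : current component reliabilities
    i             : index of the component being assessed
    """
    working = list(reliabilities)
    working[i] = 1.0  # component i is certainly working
    failed = list(reliabilities)
    failed[i] = 0.0   # component i has certainly failed
    return structure(working) - structure(failed)


def example_system(p):
    """Hypothetical structure: component 0 in series with (1 parallel 2)."""
    return p[0] * (1.0 - (1.0 - p[1]) * (1.0 - p[2]))


rel = [0.9, 0.8, 0.8]
for idx in range(3):
    print(idx, birnbaum_importance(example_system, rel, idx))
```

In this example the series component has a much higher importance than either redundant component, and the value computed for component i does not change when the reliability of i itself is modified.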

Related work

The increasing number of cloud platforms, together with the competition among various cloud providers, has resulted in a situation whereby customers may find the selection of a dependable and cost-effective cloud infrastructure difficult. In this context, several approaches have been proposed to assist cloud customers with the identification of a suitable cloud infrastructure.

Rehman [35] presented a cloud selection approach based on historical QoS to rank cloud services; the proposed approach captures variations in each time-slot, and a service selection decision is then made. All decisions are then aggregated to find the best option. Lee et al. [22] proposed a hybrid multi-criteria decision-making model for a cloud service selection problem using balanced scorecard (BSC), fuzzy Delphi method (FDM) and fuzzy analytical hierarchy process (FAHP). Sachdeva et al. [36] combined a hybrid TOPSIS method with an intuitionistic fuzzy set to select appropriate cloud solutions to manage big data projects in a group decision-making environment. Garg et al. [14] proposed a framework called SMICloud, which compares different cloud providers based on user requirements. The framework considers a set of attributes (e.g. accountability, agility, assurance of service, performance, cost, and usability) when prioritizing and ranking the best services, with the ranking mechanism based on an Analytic Hierarchy Process (AHP). Liu et al. [23] presented a multi-attribute group decision-making (MAGDM) approach for the process of choosing an adequate cloud service vendor. This approach considered objective attributes (i.e., cost and time), as well as subjective attributes such as TOE (Technology, Organization, and Environment). To demonstrate the usefulness of the approach, a hypothetical example was given. Unlike these works, which use data from other works or estimate data from case studies for ranking, we use the results obtained from the proposed hybrid approach to generate a set of optimal solutions. The approach combines RBD and SPN models to represent and evaluate cloud infrastructures with complex redundant mechanisms and maintenance policies.

Other work has evaluated the dependability of cloud infrastructures. Wei et al. [45] presented a hierarchical approach that combines reliability block diagrams and general stochastic Petri nets in evaluating the dependability of a virtual data center. Andrade et al. [1] developed a framework for transforming elements of SysML diagrams into deterministic and stochastic Petri nets. The work focused on modeling and analysis of cloud service management, with the aim of maximizing the use of cloud computing resources at the lowest possible cost. Sousa et al. [41] proposed a modeling strategy for planning cloud infrastructure when considering dependability and cost requirements. This approach is based on hierarchical and heterogeneous modeling that combines combinatorial and state-space models to represent and evaluate cloud infrastructures. Dantas et al. [9] described stochastic models for evaluating private cloud architectures, and presented a comparative cost study of public and private cloud providers. Although these works are interesting, they are only concerned with evaluating scenarios and presenting a comparison of the quantitative results.

The works discussed above have attempted to solve one side of the problem only: either the cloud selection perspective, or the dependability and cost evaluation aspect. However, there are various kinds of cloud environments with conflicting requirements that need scientific approaches to judge which one should be chosen. To fill this research gap, a strategy that combines the need to rank different service constraints with the capacity to study alternative cloud infrastructures is presented.

Adopted strategy and base cloud architecture

This section first introduces the proposed strategy for modeling, analysis, and ranking of cloud computing environments. After that, the base cloud architecture adopted for this study is detailed.

A strategy for decision making in cloud computing environments

The strategy is based on stochastic models and an MCDM method to rank a set of cloud infrastructures, taking into account availability, capacity-oriented availability, reliability and cost requirements. The proposed strategy can be used by service providers or individuals who are interested in building and selecting their own cloud computing environments. Figure 1 illustrates the proposed strategy. The macro activities consist of (i) experimental design, (ii) creation of availability and cost models, (iii) assessment, and (iv) the decision-making process.

Experimental design (i): This activity comprises two steps: defining the base cloud environment, and designing the experiment. The first step determines the nature of the cloud environment in terms of its components and their interactions, and defines a base cloud environment. The experiment is designed to investigate the effects of variations in one or more parameters on the base cloud environment. This is accomplished by generating distinct scenarios from the base cloud environment (e.g. redundant nodes and repairing service), and investigating the impact of these modifications on adopted metrics such as reliability and capacity-oriented availability.

Create dependability and cost models (ii): The modeling strategy comprises two steps: creation of models, and hierarchical composition. The first step aims to identify a set of individual components from the base infrastructure to be modeled through RBDs. These models are useful to analyze the reliability/availability of simple and complex systems. They can also be used to model variations in the base architecture defined from the design of experiments, such as the redundancy of nodes or virtual machines. Nevertheless, RBDs cannot easily handle detailed failure/repair behavior, and SPNs are therefore adopted to model complex redundant mechanisms and maintenance policies. In the second step, we combine the main advantages of both kinds of model into a hierarchical model, which is then used to perform the analysis. We also propose equations for estimating the costs of the cloud environment, considering associated maintenance and operational costs.

Assessment (iii): This is a macro activity in which a hierarchical model, comprising RBD and SPN models, is used to evaluate the impact that different redundant mechanisms and maintenance policies have on the steady-state availability and reliability of an environment. The hierarchical model solution is computed by passing the outputs of the SPN models (the lower-level sub-models) as inputs to the higher level sub-models represented by the RBDs. Cost equations are also solved, to estimate the cost of the infrastructure under analysis. The results obtained from the dependability models and cost equations are then used in the next step to assist in the decision-making process.

Decision making (iv): At this stage, the cloud infrastructures are ranked based on any of the following distance measures: Euclidean, Manhattan and Minkowski. First, the criteria (e.g., availability or cost) are defined, together with the objective for each criterion, which can be to minimize or to maximize it. The weights of each criterion are then defined according to the decision maker’s preference. Lastly, the set of alternative solutions is ranked, based on the chosen distance measure. If the results are satisfactory for the desired criteria, the proposed strategy is complete. Otherwise, adjustments are made in the criteria, and the macro activity steps are repeated. Note that this macro step is automated, and the tool developed for this is described in “MiPACE: a multi-criteria tool for planning and analysis of cloud environments” section.

The cloud environment

The base cloud architecture used for this study is depicted in Fig. 2, and comprises three main components: the main node, standby node, and the front-end. The main node consists of a virtual machine (VM) hosted on physical hardware (Hw). The virtual machine is represented by an operating system (OS) and an application service (APP). The application running in the VM is a digital library service. It should be noted, however, that the hardware in the main node supports an OS, a management server (Mng) and a VM. The management server executes the cloud services in the operating system. The standby node is used to ensure high levels of availability, and it assumes the role of the main node when a failure occurs; this node has the same components as the main node. The front-end is responsible for supervising and controlling the entire cloud environment through a specific cloud management tool. It is important to highlight that the remote storage volume can be accessed by the VMs, and is managed through the front-end. All of the components are interconnected by a private network. Note that more complex scenarios were derived from the base cloud architecture using the strategy described above; they are presented in the “Results and discussion” section.

The cloud operational mode is described as follows. The main node (and its VM) and the front-end must be in working order for the system to be operational. However, if the standby and main nodes fail, the cloud becomes unavailable. The roles of the standby and main nodes are swapped when the VM is restored. The objective of the standby node is to maximize the availability of the cloud infrastructure, which can be established through a Service Level Agreement (SLA).

Hierarchical models and cost equations

This section describes the hierarchical models designed to represent the base cloud environment previously presented (see Fig. 2). RBDs are used to represent the dependability relationship between independent subsystems, while detailed or more complex fail and repair mechanisms are modeled using SPNs. This approach enables the representation of many kinds of dependency between components, and avoids the well-known issue of state-space explosion [43]. Furthermore, this section also presents the proposed equations for estimating the cloud environment costs, which consider associated maintenance and operational costs.

Availability models for the base cloud environment

A hierarchical model was created to compute dependability-related metrics for the cloud environment described in “The cloud environment” section. Assuming cloud environments only, the architecture can be divided into three sub-models: front-end, main node, and standby node. The base cloud environment illustrated in Fig. 2 is modeled through RBDs, and the respective availability (reliability) is given by:

$$ P_{s} = P_{PE} \times (1 - (1 - P_{mn})(1 - P_{sn})), $$

(5)

where P_PE, P_mn and P_sn are the front-end, main node and standby node availability (reliability), respectively.

Equation 6 computes the availability for the front-end sub-model, which is composed of three components connected in series: hardware, operating system, and management server. The front-end component is responsible for identifying and managing the underlying virtualized resources (i.e., the servers, network, and storage). The hardware component corresponds to the physical parts of a computer system (i.e., the memory, CPU, network, etc.). The cloud OS primarily manages the operation of one or more virtual machines within a virtualized environment, while the management server executes the cloud services in the operating system.

$$ P_{s} = P_{Hw} \times P_{OS} \times P_{Mg} $$

(6)

where P_Hw, P_OS and P_Mg are the hardware, operating system, and management server availability (reliability), respectively.

Equation 7 computes the availability for the main node. This node represents the computer resources for the deployment of virtual machines, and is composed of five components in series: hardware, operating system, management server, virtual machine, and service. Similar to the main node, the standby node is composed of five components in series: hardware, operating system, management server, virtual machine, and service. We assumed that the components of the standby node have the same dependability characteristics as the main node; i.e., the same MTTFs and MTTRs.

$$ P_{s} = P_{Hw} \times P_{OS} \times P_{Mg} \times P_{Vm} \times P_{Sv} $$

(7)

where P_Hw, P_OS, P_Mg, P_Vm and P_Sv are the hardware, operating system, management server, virtual machine, and service availability (reliability), respectively.

The availability model representing the base cloud environment depicted in Eq. 5 operates in a hot-standby redundancy configuration (indicated by the parallel configuration). That is, when the main node fails, the redundant component replaces it without a delay in activation. This type of redundancy improves the system availability, because when the main node fails, the hot-standby node automatically takes its place. Nevertheless, RBD equations cannot easily handle detailed failure/repair behavior. The warm-standby and cold-standby replication mechanisms cannot be fully represented in RBD models, due to the dependency between states of components. Therefore, such mechanisms are represented by SPNs. More specifically, in this paper the warm-standby and cold-standby replication mechanisms are adopted for the main node and virtual machine components; Fig. 3 presents an example of an SPN model for a node with cold-standby redundancy. Note that the hierarchical model solution is computed by passing the outputs of lower-level sub-models as inputs to the higher-level sub-models. For example, the results from an SPN model representing a redundant VM are passed as values to the base cloud environment model. The base model is then solved to compute dependability metrics.
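The hierarchical evaluation of Eqs. 5 to 7 can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, every MTTF and MTTR value is invented for the example, the node model uses the five series components listed in the text, and the VM availability `p_vm` stands in for a value that would come from a lower-level sub-model such as an SPN.

```python
def availability(mttf, mttr):
    """Steady-state availability of a single component (Eq. 1)."""
    return mttf / (mttf + mttr)


def front_end(p_hw, p_os, p_mg):
    """Eq. 6: hardware, operating system and management server in series."""
    return p_hw * p_os * p_mg


def node(p_hw, p_os, p_mg, p_vm, p_sv):
    """Eq. 7: the five node components in series."""
    return p_hw * p_os * p_mg * p_vm * p_sv


def base_cloud(p_fe, p_mn, p_sn):
    """Eq. 5: front-end in series with the main/standby parallel pair."""
    return p_fe * (1.0 - (1.0 - p_mn) * (1.0 - p_sn))


# Hypothetical component parameters in hours (not from the case study).
p_hw = availability(8760.0, 8.0)
p_os = availability(2880.0, 1.0)
p_mg = availability(788.0, 1.0)
p_sv = availability(788.0, 1.0)

# Placeholder for the output of a lower-level sub-model of a redundant VM.
p_vm = 0.9995

p_fe = front_end(p_hw, p_os, p_mg)
p_mn = node(p_hw, p_os, p_mg, p_vm, p_sv)
p_sn = p_mn  # standby node assumed to have the same characteristics
print(base_cloud(p_fe, p_mn, p_sn))
```

Replacing `p_vm` with the result of a different lower-level model changes the node and system values without touching the higher-level equations, which is the point of the hierarchical composition.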

Cold standby model

A component with cold standby redundancy relies on an inactive spare module that waits to be activated when the (main) active module fails. Hence, when the main module fails, the spare module takes a certain amount of time to become active. This period is named the mean time to activate (TACT). Because the spare component is switched off, it is assumed not to fail until it becomes operational.

Figure 4 depicts an SPN model that illustrates this mechanism. The model uses two virtual machines and four places: VM1_ON, VM1_OFF, VM2_ON, and VM2_OFF. The places represent the operational and failure states of the main and spare modules. The spare module (VM2) is initially deactivated, so no tokens are stored in places VM2_ON and VM2_OFF. When the main module (VM1) fails, the transition TACT is fired, and consequently the spare module is activated. The immediate transition D_VM2 represents the deactivation of the spare module when the main module is recovered. This redundancy mechanism fails if both modules fail. Thus, the operating mode can be expressed as

$$ Cold_{Operational}~=~(\texttt{VM1\_{ON}=1~OR~VM2\_{ON}=1}) $$

(8)

where a token in place VM1_ON or VM2_ON defines the operational state of the environment.
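To make the cold-standby behavior concrete, the sketch below solves a small continuous-time Markov chain that approximates the mechanism. It is a simplified stand-in for the SPN and not the authors' model: the four states, the exponential activation time with mean TACT, the independent repairs, and all numeric values are assumptions made for illustration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def steady_state(q):
    """Steady-state distribution pi of a CTMC with generator q (pi q = 0)."""
    n = len(q)
    # Transpose q so each row is one balance equation, then replace the last
    # equation with the normalisation condition sum(pi) = 1.
    a = [[q[j][i] for j in range(n)] for i in range(n)]
    a[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    return solve_linear(a, b)


def cold_standby(mttf, mttr, tact):
    """Return (availability, pi) for an assumed 4-state cold-standby chain.

    State 0: VM1_ON, spare switched off       (operational)
    State 1: VM1_OFF, spare being activated   (down)
    State 2: VM1_OFF, VM2_ON                  (operational)
    State 3: VM1_OFF, VM2_OFF                 (down)
    """
    lam, mu, act = 1.0 / mttf, 1.0 / mttr, 1.0 / tact
    q = [
        [-lam, lam, 0.0, 0.0],          # main module fails
        [mu, -(mu + act), act, 0.0],    # main repaired, or spare activated
        [mu, 0.0, -(mu + lam), lam],    # main repaired (spare deactivated), or spare fails
        [mu, 0.0, mu, -2.0 * mu],       # either module repaired
    ]
    pi = steady_state(q)
    # Eq. 8: operational when VM1_ON = 1 or VM2_ON = 1, i.e. states 0 and 2.
    return pi[0] + pi[2], pi


# Hypothetical parameters in hours; NOT the paper's values.
avail, pi = cold_standby(mttf=2880.0, mttr=0.5, tact=0.1)
print(f"Cold-standby availability (sketch): {avail:.8f}")
```

In this sketch the probability of state 0 equals the availability of a single non-redundant module, so the gain from the spare is exactly the probability of state 2.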

Warm standby model

A component with warm standby redundancy relies on a spare module that is not in active service and waits to be activated when the active module fails. The difference from cold standby redundancy is that the spare module can fail while it waits: the active module, and the spare module once activated, fail at rate λ, whereas the spare module fails at rate ϕ while it is de-energized, with 0≤ϕ≤λ.

Figure 5 illustrates an SPN warm standby model. The warm standby model has an active module with a full failure rate λ_F (1/MTTF_VM1), and the standby is operating with a reduced failure rate α_F (1/MTTF_OPVM1). This redundancy mechanism has the spare module configured, but unavailable; it also ensures that the environment has continuously mirrored data. The spare module is activated in the presence of a fault in the environment, and consequently the time before activation will be shorter than in the cold standby approach.

When the main module fails, the secondary module is fully activated and replaces the faulty main component. The transition TACT represents the activation event. Places VM1_ON and VM1_OFF represent the operational and non-operational states of the main module. Places OPVM2_ON and OPVM2_OFF represent the spare module in the operational state when not available. Places VM2_ON and VM2_OFF represent the situation in which the secondary module fails before being activated, because it is regularly synchronized with the main module. When the main module fails, the transition TACT is fired to activate the spare module, as in the cold standby model. The immediate transition D_VM2 behaves as it does in the cold standby model. The entire model fails if both modules fail. Thus, the operating mode can be expressed as

$$ {\begin{aligned} Warm_{Operational}~=~(\texttt{VM1\_{ON}=1~OR~OPVM2\_{ON}=1}) \end{aligned}} $$

(9)

where a token in places VM1_ON or OPVM2_ON represents the operational state of the environment.

Figure 6 depicts the SPN active-active (A/A) redundancy model. From this model, it is possible to estimate the capacity-oriented availability (COA) of a service running on a set of VMs that are hosted on a node. The NVM and NND parameters allow such a representation, where n>1. The places VM_ON, ND_ON, VM_OFF, and ND_OFF represent the operational and failure states of the VMs and nodes. The transition DE is enabled when there are no tokens in place ND_ON, that is, when all nodes have failed. Thus, VMs become unavailable either when they fail themselves or when all nodes fail. The VM_DW place represents the failure state of the VMs when all nodes are down. The RVM transition represents the return of the VMs to the operational state once the nodes have been repaired.

Equation 10 presents the COA calculation for the active-active model considering a scenario with two virtual machines and one physical node, i.e., NVM = 2 and NND = 1.

$$ {\begin{aligned} COA&=((\texttt{P}\{(\#\texttt{VM1\_ON})=(1 \times \texttt{NVM})\}~\times (1 \times \texttt{NVM}))\\ &\quad + (\texttt{P}\{(\#\texttt{VM1\_ON})=((1 \times \texttt{NVM})-1)\} \\ & \quad \times ((1 \times \texttt{NVM})-1)))/(1 \times \texttt{NVM}) \end{aligned}} $$

(10)
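The COA expression in Eq. 10 weights each state by the fraction of VMs that are running in it. The sketch below generalises this to any distribution over the number of running VMs; the function name and the probabilities are hypothetical, and in practice the probabilities would come from solving the SPN model.

```python
def capacity_oriented_availability(prob_vms_up, nvm):
    """COA = sum over k of P{#VM_ON = k} * k, divided by NVM.

    prob_vms_up maps k (number of VMs in the ON place) to its probability.
    The k = 0 term contributes nothing, which is why Eq. 10 omits it.
    """
    return sum(p * k for k, p in prob_vms_up.items()) / nvm


# Hypothetical steady-state probabilities for NVM = 2; NOT results from the paper.
prob = {2: 0.990, 1: 0.008, 0: 0.002}
coa = capacity_oriented_availability(prob, nvm=2)
print(f"COA = {coa:.4f}")
```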

Cost model

The cost model uses the concept of Total Cost of Ownership (TCO) for evaluating and comparing the costs of cloud computing environments. TCO is the process of identifying cost categories other than price, transport, and operational [29, 46]. From the details of each experiment described above (such as the number of nodes, and service availability and unavailability), we allocate a period of time in which to estimate the total cost of each cloud environment under study. The estimate includes the cost of maintenance, operation, and rent of the cloud environment. Equation 11 estimates the total cost of the infrastructure.

$$ TCe = Cr + Cm + Cop $$

(11)

Cr, which is represented by Eq. 12, allows determination of the costs associated with the rent of the cloud infrastructure. $\sum Lc$ represents the monetary value paid for the components that make up the infrastructure, that is, the amount of investment made in equipment and facilities to keep the infrastructure in operation. N is the number of nodes deployed, T is the assumed time period, and Av is the availability of the infrastructure as a service.

$$ Cr = \sum Lc \times N \times T \times Av $$

(12)

Equation 13 is used to estimate the maintenance costs (represented by Cm). Dwt is the downtime period. Lb_Dw represents the maintenance labor cost per hour when a failure occurs. Sf is a service factor; i.e., customers may pay more or less depending on the contracted service, which affects the priority level for problem resolution. N and VM represent the number of nodes and virtual machines allocated by the contract, respectively. T is the period of service specified in the contract, while $\sum Cr$ represents the costs related to the replacement of cloud components.

$$ Cm = (Dwt \times Lb_{Dw} \times Sf \times N \times VM \times T) + \sum Cr $$

(13)

Equation 14 represents Cop, and allows the calculation of the operational costs of the cloud environment. Ec is the energy consumption, and E_p is the electricity price. Lb_Up represents the monetary value of each hour spent on keeping the infrastructure operational, while T, Sf, Av, N, and VM represent the same parameters as presented in the previous equations.

$$ {\begin{aligned} Cop &= (Ec \times E_{p} \times N \times T \times Av)\\ & \quad + (Lb_{Up} \times Sf \times Av \times N \times VM \times T) \end{aligned}} $$

(14)
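Equations 11 to 14 can be transcribed directly into code. The sketch below does so term by term; the class and function names are hypothetical, the units are left to the caller, and the example values are invented for illustration rather than taken from Table 1.

```python
from dataclasses import dataclass


@dataclass
class CostInputs:
    lc_sum: float           # sum of Lc: value paid for the infrastructure components
    n: int                  # N: number of nodes
    vm: int                 # VM: number of virtual machines
    t: float                # T: time period
    av: float               # Av: availability
    dwt: float              # Dwt: downtime period
    lb_dw: float            # Lb_Dw: maintenance labor cost per hour
    lb_up: float            # Lb_Up: operation labor cost per hour
    sf: float               # Sf: service factor
    replacement_sum: float  # sum of component replacement costs in Eq. 13
    ec: float               # Ec: energy consumption
    e_p: float              # E_p: electricity price


def rent_cost(c: CostInputs) -> float:
    """Cr, Eq. 12."""
    return c.lc_sum * c.n * c.t * c.av


def maintenance_cost(c: CostInputs) -> float:
    """Cm, Eq. 13."""
    return c.dwt * c.lb_dw * c.sf * c.n * c.vm * c.t + c.replacement_sum


def operational_cost(c: CostInputs) -> float:
    """Cop, Eq. 14."""
    energy = c.ec * c.e_p * c.n * c.t * c.av
    labor = c.lb_up * c.sf * c.av * c.n * c.vm * c.t
    return energy + labor


def total_cost(c: CostInputs) -> float:
    """TCe, Eq. 11."""
    return rent_cost(c) + maintenance_cost(c) + operational_cost(c)


# Hypothetical inputs, chosen only to exercise the formulas.
example = CostInputs(lc_sum=1.0, n=2, vm=2, t=10.0, av=0.5, dwt=2.0,
                     lb_dw=3.0, lb_up=4.0, sf=1.0, replacement_sum=5.0,
                     ec=1.0, e_p=0.1)
print(f"TCe = {total_cost(example):.2f}")
```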

MiPACE: a multi-criteria tool for planning and analysis of cloud environments

This section presents the details of the developed tool. MiPACE was developed to support the planning of cloud infrastructures under customer service constraints, and to assist in the decision-making process. It allows analysts, technicians, managers, and users of cloud services to plan and analyze cloud scenarios. The tool is written in the programming language C, and the features implemented are described below.

(i) Mercury tool: The Mercury tool [40] was developed by the MODCS research group, and allows the creation and evaluation of performance and dependability models. It implements the following formalisms: Continuous Time Markov Chains (CTMCs), Reliability Block Diagrams (RBDs), Energy Flow Models (EFMs), and Stochastic Petri nets (SPNs). The Mercury tool is used along with the MiPACE tool to create hierarchical models, and to solve the experimental study design scenarios.
(ii) Integration module: Because MiPACE does not implement the RBD and SPN formalisms, this module was implemented to integrate the results obtained from the Mercury tool into MiPACE. That is, an input file is created with all of the results obtained from the design of experiment studies, and then uploaded into the MiPACE tool.
(iii) Design of experiment editor: This feature allows users to plan experiments. Initially, it is necessary to choose a number of factors to be combined. The user then indicates the number of levels for each factor. Note that the tool supports the full factorial method, which involves testing every combination of factor levels. As explained earlier, the experiment design is adopted to investigate the effects of variations of one or more parameters in the base cloud environment; we therefore generate distinct SPN models from the SPN model that represents the base cloud environment, and investigate the impact of such modifications on the adopted metrics. Thus, the purpose of this feature is to provide a set of scenarios that will be modeled and analyzed by the Mercury tool.
(iv) Ranking generator: When the results obtained from the experiment study designs have been uploaded into MiPACE, the tool ranks a set of optimal solutions. At this stage, the user of the tool must define the criteria function (e.g. availability or cost) and the objective, i.e., whether each criterion is to be minimized or maximized. The user can then choose the distance measure for ranking the solutions, such as the Euclidean, Manhattan or Minkowski distances [16]. These distance measures are used for similarity comparisons. Finally, the user can add weights to the criteria function previously defined to prioritize one variable over another; for example, the user could use this option to prioritize cost over high availability.
(v) Report tool: The results that consider each criterion are displayed in the tool panel. MiPACE also generates two output files containing the ranking of the architectures, with one output file used for visualization and the other for plotting purposes. If necessary, the user can change the criterion function or objective, and repeat the ranking step.
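The ranking step can be sketched as follows. This is one plausible reading of a weighted, distance-based ranking and is not a description of MiPACE's implementation: the min-max normalisation, the ideal point, the function names, and the sample data are all assumptions. Setting p = 1 gives the Manhattan distance, p = 2 the Euclidean distance, and other values the general Minkowski distance.

```python
def normalise(values, maximise):
    """Min-max scale values to [0, 1] so that 1 is always the best level."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    if maximise:
        return [(v - lo) / (hi - lo) for v in values]
    return [(hi - v) / (hi - lo) for v in values]


def rank(alternatives, objectives, weights, p=2.0):
    """Order alternatives by weighted Minkowski distance to the ideal point.

    alternatives: {name: {criterion: value}}
    objectives:   {criterion: True to maximise, False to minimise}
    weights:      {criterion: non-negative weight}
    """
    names = list(alternatives)
    scores = {name: [] for name in names}
    for criterion, maximise in objectives.items():
        column = normalise([alternatives[n][criterion] for n in names], maximise)
        for name, value in zip(names, column):
            scores[name].append((weights[criterion], value))

    def distance(name):
        # The ideal point has every normalised criterion equal to 1.
        return sum(w * abs(1.0 - v) ** p for w, v in scores[name]) ** (1.0 / p)

    return sorted(names, key=distance)


# Hypothetical scenarios; NOT results from the case study.
scenarios = {
    "A1": {"availability": 0.999, "cost": 12000.0},
    "A2": {"availability": 0.9999, "cost": 20000.0},
    "A3": {"availability": 0.99, "cost": 9000.0},
}
objectives = {"availability": True, "cost": False}

balanced = rank(scenarios, objectives, {"availability": 0.5, "cost": 0.5}, p=2.0)
cost_first = rank(scenarios, objectives, {"availability": 0.1, "cost": 0.9}, p=1.0)
print(balanced, cost_first)
```

Changing the weights or the distance measure can change the ranking, which is why the tool lets the user repeat the ranking step with different settings.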

Results and discussion

This section discusses a case study to illustrate the applicability of the proposed approach when considering availability, capacity-oriented availability, reliability and cost requirements. The approach assists individuals in identifying an ideal cloud infrastructure, and takes service constraints into account. The availability and cost models are useful during the design and analysis of cloud infrastructures, because they represent the characteristics of cloud environments. The results obtained by the evaluation of these models serve as the input to the MiPACE tool, which then finds a set of optimal solutions.

Evaluation of the base cloud environment

The first part of this case study aims to demonstrate the applicability of the availability models, and presents the results obtained for the base cloud environment. The base cloud environment represented in Fig. 2 was modeled (shown in Eqs. 6 and 7), and the availability models were combined to represent the whole cloud infrastructure. Equation 5 illustrates the RBD model for the base cloud environment. The reliability importance index (RI; see Footnote 1) was adopted to identify which component of the system required further attention to increase the availability level. Considering only the Front-End and Main nodes of the devices present in Fig. 2, the RI indices for the nodes were 0.153201 and 0.219544, respectively. The main node is therefore the most critical component, and the most important candidate for a redundancy mechanism. Three redundancy strategies (hot, cold, and warm) were used to increase the availability levels, and these mechanisms are presented in Eqs. 4 and 5, respectively.

Table 1 shows the parameter values adopted for estimating the cost of the cloud environment. The E_p and Ec parameters represent the energy price (in USD) and the energy consumption per kilowatt-hour [13], respectively. Such parameters only take the servers into account. The Lb_Up (operation) and Lb_Dw (maintenance) parameters indicate the labor cost per hour, while R_t represents the rental rate for the cloud infrastructure. The type of service is categorized as gold, silver, or bronze, and these reflect the capacity of the cloud maintenance team to support different quality levels; a reduction of 10% in the mean time to repair is assumed for the silver service in comparison to the gold service, and a reduction of 20% is assumed for the bronze service in comparison to the gold service.

Table 1 Cost parameters
