As we found in this systematic review, Cloud providers can make use of several technologies and mechanisms to offer HA services. The authors of [9] classify HA solutions into two categories: middleware approaches and virtualization-based approaches. They propose a framework to evaluate VM availability against three types of failures: a) application failure, b) VM failure, and c) host failure. The authors use OpenStack, Pacemaker, OpenSAF, and VMware to apply their framework, which considers both stateful and stateless HA applications.
In our research, however, we organize solutions into three layers (underlying technologies, services, and middlewares), keeping in mind that each layer may compose one or many solutions from the layers beneath it to accomplish its goals (Fig. 6).
Our classification is a simplified view of the framework proposed by the Service Availability Forum (SAForum) (Fig. 7). The SAForum focuses on producing open specifications that address the requirements of availability, reliability, and dependability for a broad range of applications (not only Clouds).
There are three types of services in its Application Interface Specification (AIS): Management Services, Platform Services, and Utility Services. According to [10], Management Services provide the basic standard management interfaces that should be used in the implementation of all services and applications. Platform Services provide a higher-level abstraction of the hardware platform and operating systems to the other services and applications. Utility Services provide some of the common interfaces required in highly available distributed systems, such as checkpointing and messaging.
SAF also proposes two frameworks: the Software Management Framework (SMF), which is used for managing middleware and application software during upgrades while taking service availability into account; and the Availability Management Framework (AMF), which provides functions (e.g., a set of APIs) for the availability management of applications and middleware [10], such as component registration, life cycle management, error reporting, and health monitoring.
We consider that our three-layer classification covers the SAF framework, since the SAF specifications can be allocated among our layers. The next subsections present the solutions found in our systematic review, focusing on the services layer.
Underlying technologies
The bottom layer is a set of underlying technologies that give a Cloud provider a plethora of possibilities for providing high availability using commodity systems.
Virtualization is not a new concept, but Cloud providers use it as a key technology for enabling infrastructure operation and easy management. According to [11], the main factor behind the increased adoption of server virtualization within Cloud Computing is the flexibility it offers in reallocating workloads across physical resources. Such flexibility allows Cloud providers, for instance, to execute maintenance without stopping developers' applications (which run on VMs) and to implement strategies for better resource usage through the migration of VMs. Server virtualization is also suited to the fast provisioning of new VMs through the use of templates, which enables providers to offer elasticity services to application developers [12].
Virtualization can also be used to implement HA mechanisms at the VM level, such as failure and attack isolation, and checkpoint and rollback as recovery mechanisms. Beyond that, virtualization can be used at the network level with the same objectives by virtualizing network functions (see the discussion of Network Function Virtualization (NFV) in [13]).
There are several hypervisor options: from the open-source community, Xen8 and the Kernel-based Virtual Machine (KVM)9; and from proprietary vendors, VMware10 and Microsoft's Hyper-V11.
Services
The second layer is composed of many services that can be implemented and configured according to Cloud provider requirements or management decisions. For instance, if a provider has a checkpoint mechanism implemented in its infrastructure, it should configure the checkpoint service, which could mean setting it as an active or a passive checkpoint and choosing the update frequency. The next subsections describe the main services and report how related studies used them.
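To make this kind of decision concrete, the sketch below models a checkpoint service configuration in Python; the field names and defaults are our own illustration and do not correspond to any specific provider's API.

```python
from dataclasses import dataclass

# Hypothetical checkpoint-service configuration; names and defaults are
# illustrative, not taken from any real Cloud provider API.
@dataclass
class CheckpointConfig:
    mode: str = "passive"          # "active" or "passive" checkpointing
    update_interval_ms: int = 500  # how often state is propagated to replicas
    replicas: int = 2              # number of checkpoint copies to keep

# A provider enabling the service for a tenant might instantiate:
config = CheckpointConfig(mode="active", update_interval_ms=100, replicas=3)
print(config)
```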
Redundancy
The redundancy service can offer different levels of availability depending on the redundancy model, the redundancy strategy, and the redundancy scope (Fig. 8).
The redundancy model refers to the many different ways HA systems can combine active and standby replicas of hosted applications. AMF describes four models: 2N, N+M, N-way, and N-way active [14]. The 2N model ensures one standby replica for each active one.
The N+M model is an extension of the 2N model and ensures that more than two system units (a unit meaning a virtual machine, for instance) can take active or standby assignments from an application. N represents the number of units able to handle active assignments and M represents those with standby assignments. Note that, in the N+M model, a unit that handles active assignments will never handle standby assignments.
The N-way model is similar to the N+M model, with the difference that it allows a unit to handle both active and standby assignments from different application instances.
Lastly, the N-way active redundancy model comprises only active assignments; it does not allow standby assignments, but permits an application instance to be assigned as active on several units. Due to its simplicity, the 2N model is preferred in terms of implementation [15, 16].
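These models can be pictured as assignment tables mapping application instances to units. The following Python sketch, with entirely illustrative names, contrasts the 2N and N+M assignments; note how N+M keeps active and standby roles on disjoint sets of units.

```python
# Sketch of AMF redundancy models seen as assignment tables mapping
# service instances to units (VMs); names are illustrative only.

def assign_2n(instances, units):
    """2N: each instance gets exactly one active and one standby unit."""
    assert len(units) >= 2 * len(instances)
    table = {}
    for i, inst in enumerate(instances):
        table[inst] = {"active": units[2 * i], "standby": units[2 * i + 1]}
    return table

def assign_n_plus_m(instances, active_units, standby_units):
    """N+M: N units handle active assignments, M units handle standbys;
    a unit never mixes the two roles."""
    table = {}
    for i, inst in enumerate(instances):
        table[inst] = {
            "active": active_units[i % len(active_units)],
            "standby": standby_units[i % len(standby_units)],
        }
    return table

print(assign_2n(["app1"], ["vm1", "vm2"]))
# {'app1': {'active': 'vm1', 'standby': 'vm2'}}
```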
The redundancy strategy is divided into two classes: active and passive redundancy [17]. In the active strategy, there are no standby replicas and all application replicas work in parallel; when one node fails, the tasks executing at the failed node can be resumed on any remaining node. In passive redundancy, there is one working replica while the remaining replicas are standbys; when the main node fails, any standby replica can resume the failed node's tasks. Note that the active strategy also helps to provide load balancing to applications. However, maintaining consistency is simpler in the passive model, so this strategy is used in several proposals [15].
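A minimal sketch of the passive strategy, assuming failure detection is handled elsewhere, is a promotion step that turns a standby replica into the active one:

```python
# Minimal sketch of passive-redundancy failover: when the active node
# fails, a standby replica is promoted. All names are illustrative.

replicas = {"vm1": "active", "vm2": "standby", "vm3": "standby"}

def on_node_failure(failed, replicas):
    """Promote the first standby to active when the active node fails."""
    if replicas.get(failed) != "active":
        return                          # a failed standby needs no promotion
    del replicas[failed]
    for node, role in replicas.items():
        if role == "standby":
            replicas[node] = "active"   # resume the failed node's tasks here
            break

on_node_failure("vm1", replicas)
print(replicas)  # {'vm2': 'active', 'vm3': 'standby'}
```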
With respect to scope, one can replicate the application itself, the VM that hosts the application, or the complete physical server hosting the application. The authors of [15] propose to use all these approaches in a model-based framework that selects and configures high-availability mechanisms for a cloud application. The framework constructs a model of the running system and selects the proper HA services according to the benefits and costs of each service, as well as the required availability level. In contrast, the proposal described in [16] focuses on the VM scope only.
Data replication
Data replication is used to maintain state consistency between replicas. The main problem associated with this service is how to govern the trade-off between consistency and resource usage [18]. In Clouds, replication may be achieved either by copying the state of a system (checkpoint-based) or by replaying input to all replicas (lock-step based) [16] (see Fig. 9).
The lock-step strategy is also called "State Machine Replication"; its main goal is to send the same operations to be executed by all replicas of an application in a coordinated way, thus guaranteeing message order and state. This strategy can be found in the TClouds platform [19], where it is applied to the state maintenance of application replicas and also to maintaining the consistency of objects stored in a set of cloud storage services. The same strategy is applied in the Cloud-Niagara middleware [20] in order to offer a monitoring service that checks resource usage and sends failure notifications with minimal delay. Following this same strategy, Perez-Sorrosal et al. [21] propose a multi-version database cache framework to support the elastic replication of multi-tier stateless and stateful applications. In this framework, the application and database tiers are installed at each replica and a multicast protocol maintains data consistency between replicas. The main focus of this proposal is elasticity, but the solution can also cope with failures, since the replication protocol uses virtual synchrony to guarantee the reliable execution of the replicas.
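The essence of lock-step replication is that a single total order of operations is imposed and every deterministic replica applies it. The Python sketch below illustrates this with an in-process coordinator; it is our own simplification, not the protocol of any of the cited systems.

```python
# Minimal sketch of the lock-step ("State Machine Replication") strategy:
# the same operations are delivered, in the same order, to every replica,
# so deterministic replicas stay consistent. Illustrative code only.

class Replica:
    def __init__(self, name):
        self.name, self.state = name, 0

    def apply(self, op, arg):
        if op == "add":
            self.state += arg           # a deterministic operation

class LockStepCoordinator:
    """Imposes a single total order of operations on all replicas."""
    def __init__(self, replicas):
        self.replicas, self.log = replicas, []

    def submit(self, op, arg):
        self.log.append((op, arg))      # totally ordered operation log
        for r in self.replicas:         # deliver to every replica in order
            r.apply(op, arg)

replicas = [Replica("r1"), Replica("r2"), Replica("r3")]
coord = LockStepCoordinator(replicas)
coord.submit("add", 5)
coord.submit("add", 2)
print([r.state for r in replicas])  # [7, 7, 7] -- identical states
```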
Checkpoint-based replication involves propagating frequent updates of an active application to its standby replicas. It is desirable that an application have its checkpoint replicas distributed over different entities to increase reliability, guarding it against failures [10]. The checkpoint service can be implemented in a centralized fashion, in which all checkpoint replicas are allocated to the same entity, or in a distributed one, in which replicas are located in different entities of a cluster.
Remus is a production-level solution implemented on Xen to offer high availability following this strategy [22]. The authors of that solution point out that lock-step replication results in an unacceptable resource usage overhead, because communication between applications must be accurately tracked and propagated to all replicas. In contrast, checkpoints between active and standby replicas occur periodically, at intervals of milliseconds, providing a better trade-off between resource usage overhead and update propagation. Taking a similar approach, Chan and Chieu [23] introduce a cost-effective solution that utilizes VM snapshots coupled with a smart, on-demand snapshot collection mechanism to provide HA in a virtualized environment. The main idea behind this proposal is to extend the snapshot service (a common service offered by virtualized infrastructures) to include checkpoint data of a VM.
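In contrast to replaying every operation, checkpoint-based replication only copies state periodically. The following sketch mimics a Remus-like loop in miniature, with a dictionary standing in for the dirty memory pages a real VM checkpoint would transfer; the interval and state are illustrative.

```python
# Sketch of checkpoint-based replication in the spirit of Remus: instead
# of replaying every input, the active replica's state is copied to the
# standby at a fixed interval. Illustrative code only.

import copy
import time

active_state = {"counter": 0}
standby_state = {}

CHECKPOINT_INTERVAL = 0.05  # e.g. tens of milliseconds, as in Remus

def run_with_checkpoints(active, interval, rounds):
    global standby_state
    for _ in range(rounds):
        active["counter"] += 1                 # application keeps working
        time.sleep(interval)
        standby_state = copy.deepcopy(active)  # periodic state propagation

run_with_checkpoints(active_state, CHECKPOINT_INTERVAL, rounds=3)
print(standby_state)  # lags the active copy by at most one interval
```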
While Remus and similar approaches fit IaaS Clouds well because they provide an application-agnostic VM-based checkpoint, Kanso and Lemieux [7] argue that in a PaaS Cloud the checkpoint service must be performed at the application level, in order to cope with internal application failures that may remain unnoticed in a VM-based HA system. Therefore, the authors propose that each application send its current state to the HA system through a well-defined checkpoint interface.
In [24], the authors propose BlobCR, a checkpoint framework for High-Performance Computing (HPC) applications on IaaS. Their approach targets both the application and process checkpoint levels through a distributed checkpoint repository.
In [16], the authors present a solution focusing on HA for real-time applications. The proposed middleware derives from other technologies, such as Remus, Xen, and OpenNebula. For instance, continuous checkpointing, in which asynchronous checkpoints are saved to a backup VM to provide HA in case of failures, was inherited from Remus.
Monitoring
Monitoring is a crucial service in an HA Cloud. Through this service, the health of applications is continuously observed to support other services. The primary goal is to detect when a replica is down, but robust implementations can also follow the health indicators of an application (CPU and memory utilization, disk and network I/O, time to respond to requests), which helps to detect when a replica is malfunctioning [17]. Monitoring can also be performed at the virtual and physical machine levels (Fig. 10).
The surveyed papers show that there are two basic types of monitoring: push-based and polling-based. The latter is the most common and involves a set of measuring controllers periodically sending an echo signal to the hosted applications. This check can be sent to the operating system that hosts the application (through standard network protocols like ICMP or SNMP) or directly to the application through a communication protocol, e.g., HTTP in the case of web applications [17].
Polling can also be performed by a backup replica against the active replica, in order to check its status and automatically convert the backup into the active one when necessary [15, 20]. This type of monitoring can be carried out by a monitoring agent that is external to the application, or by an agent implemented directly in the application through a standardized API that handles messages sent by the Cloud. Through this intrusive approach, the internal state of the applications can be monitored, enabling the earlier detection of adverse conditions and making it possible to offer services such as checkpointing [7].
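A minimal polling controller can be written with nothing beyond the standard library; in the sketch below, the health-check URL and interval are hypothetical, and an HTTP probe stands in for ICMP/SNMP echoes.

```python
# Sketch of polling-based monitoring: a controller periodically sends an
# echo request to each monitored application, here over HTTP.

import time
import urllib.request

TARGETS = ["http://app-replica-1:8080/health"]   # hypothetical endpoints

def poll_once(url, timeout=2):
    """Return True if the application answers its health endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False                 # no answer: the replica is suspect

def polling_loop(targets, interval=5):
    while True:
        for url in targets:
            if not poll_once(url):
                print(f"replica at {url} failed the health check")
        time.sleep(interval)

# polling_loop(TARGETS)  # would run forever, flagging suspect replicas
```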
In push-based monitoring, the application (or a cloud monitoring agent deployed with it) is responsible for sending messages to the measuring controller when necessary. In this case, the controller is informed whenever a meaningful change occurs in the monitored application [25]. Push-based monitoring can also be implemented following a publish/subscribe communication model. This type of monitoring is employed by Behl et al. [26] to provide fault tolerance to web service workflows: fault monitoring is implemented through ZooKeeper's Watches, which are registered to check whether a ZooKeeper ephemeral node (an application, in this case) is alive; in case of failure, the monitoring controller is notified about the crash. An et al. [16] point out that the highly dynamic environment of cloud computing requires timely decisions, which can be achieved with publish/subscribe monitoring. In this case, the monitoring controllers are subscribers and the monitoring agents are publishers.
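The sketch below illustrates the publish/subscribe variant with a trivial in-process broker, our own stand-in for systems such as ZooKeeper; the controller subscribes to a health topic and agents publish only on meaningful changes.

```python
# Sketch of push-based monitoring with publish/subscribe: agents publish
# events only when something meaningful changes, and the monitoring
# controller subscribes to the topic. All names are illustrative.

from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        for cb in self.subscribers[topic]:
            cb(event)

broker = Broker()
# The monitoring controller is a subscriber...
broker.subscribe("app/health", lambda e: print("controller notified:", e))
# ...and the agent publishes only when a meaningful change occurs.
broker.publish("app/health", {"replica": "vm2", "state": "degraded"})
```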
One important aspect to observe is that both approaches (push and poll) can be combined in a Cloud environment. The high-availability platform proposed by Chan and Chieu [23] uses polling to check periodically for host failures, while monitoring agents running in the hosts push notifications to the monitoring controller. An et al. [16] propose a hierarchical monitoring strategy that combines the publish/subscribe communication model for global-level monitoring with polling at the local level.
Failure detection
Failure detection is an important service present in most HA solutions. It aims to identify system faults (at the application, virtual machine, or physical machine level) and to provide the information needed by the services capable of treating problems, so as to maintain service continuity (Fig. 11).
In [17], the authors list some mechanisms used to detect faults, such as ping, heartbeat, and exceptions. From this perspective, failure detection can be classified into two categories according to the detection mechanism: reactive [23, 26] and proactive [20]. The first approach waits for KEEP ALIVE messages and flags a failure after a period of time elapses without any such message. The second approach is more robust and is capable of identifying abnormal behaviors in the environment by checking the monitoring service and interpreting the collected data to verify whether there are failures.
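A reactive detector therefore reduces to bookkeeping of KEEP ALIVE arrival times plus a timeout rule, as in the sketch below; the timeout value and node names are ours.

```python
# Sketch of a reactive failure detector: it records the arrival time of
# each KEEP ALIVE message and declares a node failed once no message has
# been seen within the timeout window. Illustrative code only.

import time

class HeartbeatDetector:
    def __init__(self, timeout=3.0):
        self.timeout = timeout
        self.last_seen = {}

    def keep_alive(self, node):
        """Called whenever a KEEP ALIVE message arrives from a node."""
        self.last_seen[node] = time.monotonic()

    def failed_nodes(self):
        """Nodes whose last KEEP ALIVE is older than the timeout."""
        now = time.monotonic()
        return [n for n, t in self.last_seen.items()
                if now - t > self.timeout]

detector = HeartbeatDetector(timeout=3.0)
detector.keep_alive("vm1")
# ...if vm1 stops sending KEEP ALIVEs, it will eventually show up here:
print(detector.failed_nodes())
```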
Due to its simplicity, the reactive type is implemented more often. The work presented in [26] proposes fault tolerance through the replication of BPEL processes, with ZooKeeper responsible for detecting crashed replicas using a callback mechanism called watches. Likewise, in [23], the authors handle failure detection through heartbeats hosted in each node: the absence of heartbeats after a period of time indicates a failure, and hence the recovery process begins.
The authors of [20] propose an intelligent system that relies on a proactive mechanism of monitoring and notification, as well as on a mathematical model responsible for identifying the system's faults.
Other studies provide few details about the failure detection process. For instance, in [27], failure detection is implemented together with failure mitigation (recovery) in a process called Fault Injection, which aims to evaluate the framework's capacity to handle failover. Also, in [7], the authors propose an HA middleware inside the VMs that monitors applications and restarts them in case of failure.
In [16], the authors propose an architecture with an entity called the Local Fault Manager (LFM), located in every physical host. It is responsible for collecting resource information (memory, processes, etc.) and transferring it to the next layer, which is responsible for decision making, similarly to a monitoring service. Moreover, the LFM also runs the High-Availability Service (HAS), which keeps the primary and backup VMs synchronized and is responsible for activating the backup VM when a failure is detected in the primary VM.
Recovery
The recovery service is responsible for ensuring fault tolerance, supported by services such as redundancy [17]; that is, it preserves HA even during crashes at the application, virtual machine, or physical machine level. Recovery can be classified as smart [15, 16, 20] or simple [23, 28] (Fig. 12). Smart recovery uses other services and mechanisms (such as monitoring and checkpointing) to provide an efficient restoration with minimal losses for the application. In simple recovery, the broken application is just rebooted on a healthy node, so the service continues to be provided, but all state data is lost.
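The contrast between the two classes can be sketched as follows; the checkpoint store and application structure are hypothetical, and the point is only that smart recovery restores saved state while simple recovery starts from scratch.

```python
# Sketch contrasting the two recovery classes: simple recovery reboots
# the application on a healthy node and loses state; smart recovery
# additionally restores the latest checkpoint. Illustrative names only.

def simple_recovery(app, healthy_node):
    """Reboot on a healthy node; all state data is lost."""
    app["node"] = healthy_node
    app["state"] = {}                         # fresh start
    return app

def smart_recovery(app, healthy_node, checkpoints):
    """Reboot on a healthy node and restore the last checkpointed state."""
    app["node"] = healthy_node
    app["state"] = checkpoints[app["name"]]   # minimal loss of work
    return app

checkpoints = {"web": {"sessions": 42}}
app = {"name": "web", "node": "vm1", "state": {"sessions": 42}}
print(simple_recovery(dict(app), "vm2"))              # state emptied
print(smart_recovery(dict(app), "vm2", checkpoints))  # state restored
```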
The smart recovery proposed in [15] is guaranteed through a fault-tolerance mechanism that keeps an application backup synchronized with the active application but deployed in a different VM. The authors of [16] work in a similar way, using the Remus project as a base and applying a VM failover technique with two VMs (primary and backup) that periodically synchronize their states and can switch from the primary VM to the backup when needed. In [20], recovery is achieved using an active replication technique, in which a controller manages a priority list through the Backup-IDs of resources. After a failure, a broadcast is made and the node at the top of the list assumes the execution.
Furthermore, the authors of [23] opt for simple recovery after a failure by using merged snapshots, in which the faulty agent requests any available snapshot from the manager. The work in [28] also uses simple recovery, in which the VMs are monitored by a VM wrapper that identifies unavailability and performs reboots.
Middleware
At the upper layer, we have middlewares that use the services to provide HA to applications. Their main goal is to manage how these services operate, to configure them, and to make decisions according to the acquired information.
OpenSAF [10] is an open-source project that offers services implementing the SAForum Application Interface Specification (AIS). For instance, OpenSAF implements the Availability Management Framework (AMF), the middleware responsible for maintaining service availability. It also implements the checkpoint service (CPSv), which provides a means for processes to store checkpoint data incrementally and can be used to protect applications against failures. For a detailed description of all SAF services implemented by OpenSAF, please see [10].
Since OpenSAF is general purpose, some studies use it to implement their Cloud solutions. For instance, the authors of [7] propose an HA middleware for achieving HA at the application level by using an SAF redundancy strategy. The middleware is responsible for monitoring physical and virtual resources, and for repairing them or restarting VMs in case of failure. They also propose an HA integration scheme: an integration-agent, with which a Cloud user interacts in order to provide information about its application and its availability requirements (such as the number of replicas and the redundancy model); and an HA-agent, which is responsible for managing the state of state-aware applications and for abstracting the complexity of the APIs needed to execute the checkpoint service.
OpenStack12 is an open-source platform for public and private Clouds used to control large pools of computation, storage, and networking resources. OpenStack has several components, each responsible for a specific aspect of the Cloud environment. For instance, the component named Nova is responsible for handling VMs and for providing the different flavors and images that describe details about the CPU, memory, and storage of a VM. Another component is Neutron, which is responsible for network management functions, such as the creation of networks, ports, routers, and VM connections. Within the HA scope, we highlight the component called Heat, OpenStack's orchestration tool. Using Heat, one can deploy multiple composite Cloud applications onto OpenStack's infrastructure, using both the AWS CloudFormation template format and the Heat Orchestration Template (HOT). In terms of HA, Heat makes it possible to monitor resources and applications at three basic levels13: 1) application level; 2) instance level; and 3) stack level (group of VMs). In case of failure, Heat tries to solve the problem at the current level; if the problem persists, it tries to solve it at a higher level. However, restarting resources can take up to a minute. Heat can also automatically increase or decrease the number of VMs, in conjunction with Ceilometer (another OpenStack service) [25].
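The escalation behaviour can be summarized by the following sketch, which is our own illustration of the level-by-level logic rather than actual OpenStack code:

```python
# Sketch of escalating recovery as described for Heat: try to fix a
# failure at the application level first, then at the instance level,
# then at the stack level. Our own illustration, not OpenStack code.

LEVELS = ["application", "instance", "stack"]

def recover(failure, try_fix_at):
    """try_fix_at(level, failure) -> True if the failure was resolved."""
    for level in LEVELS:             # escalate only if the fix fails
        if try_fix_at(level, failure):
            return level
    return None                      # unresolved at every level

# Example: a failure that only an instance-level restart can resolve.
outcome = recover({"id": "f1"},
                  lambda level, f: level != "application")
print(outcome)  # 'instance'
```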
The paper [28] presents an OS-like virtualization cloud platform that offers a dual-stack API in the shell: one stack, called "Kumoi", is used to manipulate data centers directly, while the other, called "Kali", is used to build up the cloud computing stack. With this cloud platform, the authors provide several HA services, such as checkpointing, monitoring, failure detection, recovery, and elasticity. One should note that these services are provided at the VM level. The authors also present a qualitative comparison between their tool and several others, such as OpenStack, Nimbus, and OpenNebula.
The solution proposed in [20] is a high-availability and fault-tolerance middleware built on checkpoint, watchdog, and log services for applications in a cloud environment. The authors claim that two capabilities are responsible for reaching the middleware's objectives: notifications without delay and the monitoring of resources, which are achieved through an analytic model that identifies the nature of a fault. The Cloud-Niagara algorithm is presented and performs adjustments at the nodes through resource calculations. The mean time to recover of the proposed solution is compared to that of other systems and evaluated on OpenStack, where Cloud-Niagara operates, by executing processes from real applications (a PostgreSQL database (DB), the File Transfer Protocol (FTP), etc.). This evaluation shows the variation in CPU usage under different loads from the execution of application processes, highlighting the importance of monitoring for effective replica instantiation.