Skip to main content

Advances, Systems and Applications

Fault tolerant software systems using software configurations for cloud computing


Customizable software systems consist of a large number of different, critical, non-critical and interdependent configurations. Reliability and performance of configurable system depend on successful completion of communication or interactions among its configurations. Most of the time users of configurable systems very often use critical configurations than non-critical configurations. Failure of critical configurations will have severe impact on system reliability and performance. We can overcome this problem by identifying critical configurations that play a vital role, then provide a suitable fault tolerant candidate to each critical configuration. In this article we have proposed an algorithm that identifies optimal fault tolerant candidate for every critical configuration of a software system. We have also proposed two schemes to classify configurations into critical and non-critical configurations based on: 1) Frequency of configuration interactions (IFrFT), 2) Characteristics and frequency of interactions (ChIFrFT). These schemes have played very important role in achieving reliability and fault tolerance of a software system in a cost effective manner. The percentage of successful interactions of IFrFT and ChIFrFT are 25 and 40% higher than that of the NoFT scheme. In NoFT scheme none of the configurations are supported by fault tolerance candidates. Performance of IFrFT, ChIFrFT, and NoFT schemes are tested using a file structure system.


Customization of a software system varies with user requirements or target platform. Programmers employ preprocessor directives, command-line arguments, setup files, configuration files to customize a software system. Using product-line technology, it is possible to generate a program tailored to individual user requirements by using program generators. Program generation process leverage software system features where a feature is a visible behavior or characteristic of a software program [1]. According to product-line technology, any customizable option that can be selected during the compile or load time is called a feature of a program. Program generator generates a program depending on features selected by the user(s).

Every program or software system will have functional and non-functional properties. Functional properties are activities that are to be performed or a behavior that has to be exhibited by a program when a specified condition is met. Most of the non-functional properties are related to performance, such as accessibility, fault-tolerance, reliability, scalability, recoverability, maintainability and availability of program(s). Most of the time users will be interested in non-functional properties such as allocation and release of memory, emailing electronic advertisements to users. A software configuration consists of non-functional properties customized to a specific set of features. For example, preprocessor directives customized to a particular set of requirements and processed during compilation of software system will become configurations.

Generally software systems consists of several different configurations, in turn, each configuration consist of many features. Similar to a configuration model with a few configurations produce several variants of software systems in [2, 3], a feature model that consists of few features can produce several numbers of configurations. An interaction is a communication between two or more configurations of a software system to exchange data. A software system that consists of hundreds or thousands of configurations having different failure behavior is prone to failure.

Configurations that perform important, common or default operations are called critical configurations. During the execution of software system many non-critical configurations interact with critical configurations. Critical configurations are specialized versions of non-critical configurations similar to many features are extended versions of important features [4].

A configuration is said to be failed configuration if it either produce errors during its execution or fail to successfully complete its task. A software system that consists of configurations which are prone to failure will have unpredictable behavior and performance anomalies making it unusable or untrustworthy. Communication (parameters, return values, etc.) between two or more configurations is called Configurations Interaction. Incomplete/failed communication between two or more configurations is said to be Failed Interaction. The significance level of critical configuration will come down when it is involved in failed interactions. Failed interactions have become common than exceptions [5] in modern complex software systems. Further, by classifying the configurations into two categories as given below, we can improve software systems reliability and fault tolerance.

1) Frequently used or Critical configurations, and

2) Less frequently used or non-critical configurations

After the classification of configurations into the categories mentioned above,frequently used configurations are backed up by fault tolerant candidates to improve the reliability and performance of software systems.

According to software reliability engineering, the main approaches to build reliable software systems are 1) Fault Forecasting [6, 7], 2) Fault Prevention, 3) Fault Removal [8] and 4) Fault Tolerance [9]. Fault prevention and fault tolerance techniques are leveraged in the development of large and reliable complex software systems. To tolerate faults, design diversity technique proposed by Avizienis et al. [10], employs functionally equivalent yet independently designed and developed software system modules. Since this technique incur high development and maintenance cost it is employed only in the development and maintenance of machine critical systems (disaster control, threat of loss of life).

Software systems usually consist of hundreds of configurations. Although provisioning redundant configurations for every configuration of large and complex software system help to tolerate failures, it would be very expensive and results in high upfront investment cost. It is possible to minimize investment cost by provisioning redundant configurations only for frequently configurations. Microsoft has reported that it has reduced 80% of failures by fixing top 20% of most frequently reported bugs or crashes in its windows and office suite. Generally 80% of end users use only 20% of software application features [11].

Our contributions in this paper are as follows:

  • Classification of configurations of software systems into following categories: 1) Frequently used (Critical) configurations and 2) Less frequently used (non-critical) configurations.

  • Compare the performance of the following proposed strategies 1) Frequently-used (Critical) configurations enabled with fault tolerance, 2) All configurations of a software system enabled with fault tolerance, 3) None of the configurations have fault tolerance support.

  • Demonstrate practicality of the proposed strategies on a file structures system that consists of several configurations.

Organization of this paper is as follows: “Related work” section presents the related work, “System model and failure model” section present system model and failure models used in this work, “Overview of proposed approach” section introduces an overview of the proposed approach and system architecture, “Classification of configurations” section presents a mathematical model to classify configurations, “Fault tolerance schemes” section presents various fault tolerant schemes, selection of most suitable fault tolerant candidate to each configuration, finally the “Experimental setup and performance evaluation” section presents experimental setup and results discussion.

Related work

In cloud computing environment occurrence of failures is random in nature and also there may be unknown types of failures. It is important either to eliminate failures permanently or minimize the impact of failures. Fault tolerant approaches are broadly classified into two categories. 1) Reactive schemes and 2) Proactive schemes.

Reactive schemes swing into action as soon as software system failed. Some of the popular reactive techniques are:

a) Checkpoint-Restart: In this approach the overall execution time of a job/task depends on checkpoint intervals. Shorter checkpoint intervals cause frequent checkpointing, large number of checkpoints thus resulting in much time spent in checkpoint activity, higher disk space required to store checkpoint images. Longer checkpoint intervals cause high resource wastage due to re-execution of a job from its longer distant previous checkpoint. A checkpoint interval that is very good for some workloads may be worse for other workloads. For example, in case of transmission failures shorter checkpoint intervals completely eliminate retransmission of lengthy messages of TCP applications (deal with longer messages) while checkpointing UDP (deal with short messages) applications with longer intervals eliminate the need for large storage space to save checkpoint images and reduce the cumulative compute power overhead of all checkpoints. Research on finding most suitable checkpoint intervals for different workloads, yet to happen. Checkpointing identical operations performed by many users on several computing machines will lead to wastage of resources and increased cost. To overcome this problem a technique proposed by Zhou et al. in [12] checkpoint identical parts of all virtual machines which host and render identical services.

b) Job migration: Job migration can be further classified into preemptive and non-preemptive migration. In preemptive migration mechanism a job undergoes migration whenever a monitoring system detects misbehavior of the job. Accuracy of preemptive migration schemes depends on fitness of failure prediction algorithms. Incorrect predictions lead to unnecessary job migrations, increased resource wastage and longer job turnaround time. In this approach Mean Time To Migrate (MTTM) consists of time taken i) by prediction algorithm, ii) to notify run time environment iii) to complete the migration task and iv) how best the prediction algorithms know types of failures that occur. In preemptive approach Mean time to migrate a job is less compare to non-preemptive approach. Preemptive approach is more suitable for critical machine control systems such as aviation, nuclear control systems, etc.

c) Replication: In replication mechanism a job is replicated over multiple compute platforms in which replicas are executed in parallel. Voting mechanism is used in case of more than two replicas to decide correct output. As an alternative method to parallel execution of replicas, the concept of primary and secondary was proposed. In this approach secondary job takes over the role of primary as soon as primary job failed. Byzantine Fault Tolerance (BFT) approach [13] proposed by Zhang et al. make use of 3k+1 replicas to overcome K faults. In gossip approach proposed by Lim et al. [14] 2K+1 replicas are used to tolerate K faults. For further reading about job migration techniques interested reader may refer following articles [1519].

To improve the fault tolerance of distributed applications in a cloud computing environment, Zhao et al. in [20] proposed a fault tolerant middleware that consist of three components: 1) A low level messaging protocol that render reliable, totally ordered multicast service between primary and backup members of a group. 2) A leader-determined membership protocol to ensure that every member of a group has a consistent view of primary member and other members belong to its group, also shorten the time required to elect a primary member among several members of the group. 3) A virtual determiner framework that transforms every non-deterministic operation of a primary member to virtually deterministic operation while guaranteeing that all backups shall receive results in the same order as that of the primary member.

Proactive schemes: In proactive schemes hardware and computing environment continuously monitored for occurrence of failures by employing best failure prediction algorithms. Prediction algorithms take decisions based on the outcome of intermittent results and communication messages. Whenever a monitoring system detects deviation in behavior of a job due to failure of underlying hardware or computing environment, then the job is migrated to fault free computing environment. Some of the schemes in proactive fault tolerance are as given below:

a) Self Healing: Self healing systems detect and respond to malfunctions that occur during run time. They have the ability to recognize failures and take suitable action immediately to minimize the impact of failures. Functional and interaction failures are common that occur during run-time in component based software systems. An approach proposed by Nicolo P. [22] exploits redundancy of components to mask failures. Cliff Saran in [23] highlighted some of the difficulties involved in usage of self-managing systems such as common software standards across the industry and proprietary system management interfaces. Quality of service of an application is ensured by dealing with its quality attributes. In [24] V. Nallur et al. have proposed a self-optimizing architecture based on service level agreements. Using this architecture any service hosted on the cloud will dynamically self-manage depending on service provider inputs to minimize the cost of service. As far as possible this architecture ensures that the user service level agreements are satisfied while the service is being re-configured according to service provider input.

Execution of parallel applications in a cloud computing environment requires multiple processing nodes. Processing nodes that have poor coordination/synchronization will produce incorrect results and would lead to wastage of resources. An algorithm proposed in [25] schedule parallel applications based on virtual machine characteristics to reduce consumption of physical resources such as network elements and power within the data center.

b) Software Rejuvenation: This approach deal with handling transient failures caused by software aging phenomenon. It involves resetting a part of the internal state of a job followed by restarting the execution of a job. Fault tolerant software systems with two-version redundant structures and single-version rejuvenation were proposed in [26] and [27] respectively. An approach based on backup of virtual machines in the cloud [28] was proposed by Xinyi et al. to improve system reliability. For more information on software rejuvenation interested reader may refer to [2932].

c) Preemptive Migration: Migration of processes, operating systems and virtual machines enable load balancing, efficient resource usage, fault management and system maintenance. Migration of virtual machines, computing platform across homogeneous and heterogeneous hosts help administrators manage data centers, clusters in easy and efficient manner. Some of the research articles published in this area are [3336]. Proactive recovery techniques such as staggered proactive recovery, rebooting a compromised node, refreshing the state of a node do not guarantee hundred percent recovery of nodes and are not sufficient enough to prevent Byzantine faults permanently. Since the reboot and recovery operations are performed as part of the critical execution sequence, they take more time to achieve a stable state by recovering failed node. A proactive recovery mechanism based on service migration [21] was proposed to overcome limitations of reboot and recovery mechanisms. In this scheme pool of spare nodes are used to migrate long-running applications off the critical path, run as background/separate tasks, thereby minimizing the reboot and recovery time.

Fault tolerance can be considered during the design, development and architecture of large, complex software systems [6]. Considering virtual machines and cloud elasticity property Xiaomin Zhu et al. have proposed an algorithm, FASTER, [37] that overlap workload tasks to improve reliability of scientific workloads in virtual clouds. Seigmund et al. in [38] have proposed an approach that treats customizable systems as a black box and detect performance relevant feature interactions on a set of specially selected configurations. The pair-wise heuristic technique was used to and measure a set of configurations that select cover all pairwise interactions among configurations relevant to performance.

Liebig et al. in [39] have analyzed the variability characteristic in forty software programs ranging from small to large scale and claimed that structural interactions occur betweens two features and such interactions in turn may cause performance feature interactions. Sincero et al. in [40] proposed a model that deals with feature attributes, interaction among features that predict configurations non-functional properties. Software feature behavior that produce an unexpected outcome indicates malfunctioning of software feature. Zheng et al. [41] and Slawinska et al. [42] have proposed collaborative solutions for specific applications such as MPI. Parnas et al. [43] and Shooman et al. [44] highlighted the impact of the structure of an application on its performance, reliability and correctness.

Fault tolerance techniques proposed in this article are based on replication. As there exist many robust, efficient, optimal fault tolerant schemes in the literature related to hardware, middleware, platform levels, we assume that there are no failures in hardware, middleware, and platform levels. Fault tolerance in hardware, middleware, computing platform is out of the scope of this article. In this article we address fault tolerance at the application level by exploiting application characteristics such as structure and feature’s behavior. In this direction we propose a scheme to identify frequently used configurations which play vital role in a software system, then enable them configurations using suitable fault tolerance candidate to improve reliability and fault tolerance of a software system. Also, we investigate the importance of configurations interactions on system performance, reliability and fault tolerance.

System model and failure model

  • System Model: The various operations of file structures techniques involve interaction among several configurations and successful execution of their interactions. Pictorial representation of the system is shown in Fig. 1. File structure system has several configurations, each of which have their functionally equivalent yet an independent configuration written in different programming logic. Every independent configuration act as fault tolerant candidate. Fault tolerance of the system depends on the number of redundant configurations, type of failures and number of failures. In our file structure system each critical configuration is supported by only one fault tolerant candidate and two types of failures are dealt with. Requests are generated to save new data records, fetch existing data records from the file structure. Pack operation is invoked by all techniques of file structures for every request to create a new record. Similarly unpack operation is invoked for every request to fetch data record(s) from file structures.

    Fig. 1
    figure 1

    System Model

    Value of record fields are passed to/returned from one configuration to another configuration. During this process faults are introduced by flipping some bits of the fields of a record through random selection. When faults are introduced in non-critical configurations, then such configurations are treated as permanently failed configurations. Subsequent requests for such configurations are considered as unsuccessful/failed interactions. When faults are introduced in a critical configuration that has the support of the fault tolerant candidate, then subsequent requests will be delegated to the fault tolerant candidate. However, upon failure of fault tolerant candidate subsequent requests targeted to it are treated as failed requests.

  • Failure Model: In the file structure system we have considered 1) Response failures: a) Value failure, b) State transition failure and 2) Omission failures: a) Receive omission failure, b) Send omission failure as given in Table 1. Value failures and state transition failures are injected randomly into the record fields to flip values of some of the bits of record fields. Value failures are injected into the parameters and return value of functions. State transition failures occur when we change the value of variable(s) used in conditional constructs resulting in execution of different part of a program. For example block of statements belong to the else part of an if statement instead of true part. Omission failures occur during read and write operations on file structure. During the execution of a request to create a new record some of the bits of record fields are set and reset to introduce failures and such failures contribute to receive omission failures. Similarly, while reading record fields some of the record fields are modified by setting and resetting bits and such failures are called send omission failures. As shown in Table 1 response failures occur in pack, indexes, balanced and hashing techniques while omission failures are generated in unpack, indexes, balance trees, hashing, and storage compaction techniques. For example hash key is modified while a record is being hashed. As a result of modified key, hash algorithm may or may not produce collisions, that is, it might produce collision for a key to which collision should not occur and vice-versa.

    Table 1 Failure models

Overview of proposed approach

Figure 2 depicts a pictorial representation of the proposed architecture to address some of the challenges mentioned above. It consists of two parts: 1) Classification of configurations and 2) Selection of the most suitable, optimal fault tolerant candidate to every critical configuration. Procedure to achieve fault tolerance of a software system is as follows:

  • We have used two versions of the file structure software system. In first one, all configurations of a software system are supported by their replicas. In the second version, only critical or frequently used configurations, which contribute to 20% of all configurations, are supported by suitable fault tolerant candidates that are functionally equivalent yet independent configurations developed using different programming constructs/logic to mask failures.

    Fig. 2
    figure 2

    System architecture of the proposed work

  • After the completion of all interactions, interaction value of each configuration is calculated based on the number of successful interactions.

  • Configurations are classified into most frequently used (critical) and less frequently used (non-critical) categories based on their interaction values.

  • Effectiveness of each fault tolerance strategy is measured and compared with other strategies, then most suitable fault tolerance strategy for a given critical configuration is selected.

Classification of configurations

Configurations interaction graph building

A software system can be modeled as a weighted directed graph denoted by G, where C i represent ith configuration and an edge denoted by I jk represent interaction between configurations C j and C k , that is, configuration C j invokes C k . In this process every configuration, say C i , that is involved in successful interaction has a non-negative interaction value P(C i ). At the end of all interactions the percentage of successfully executed interactions between every pair of configurations, for example, between C j and C k , are calculated using Eq. 1

$$ M\left(I_{jk}\right) = \frac {F_{jk}} {\sum_{k=1}^{N} {F_{jK}}} $$

Where F jk represent the number of times configuration C j invoked C k and N is the number of configurations present in a software system. If there is no interaction between C j and C k then F jk =0. Value of I jk increased by one for every successful interaction between C j and C k . An interaction I jk will have a larger value if C j interacts with C k more number of times than other configurations. For a software system having N configurations, the configuration graph represent NN matrix, denoted by M, that can be obtained from Eq. 1. Each element of M, denoted by m jk , represent interactions between jth and kth configurations. A configuration interaction value between every pair of configurations is calculated according to the steps given below:

  • If jth configuration during its life time never interact with kth configuration, then the interaction value of jth configuration with kth configuration will be zero, that is, M(I jk ).m jk =0

  • If jth configuration interacts with itself, in case of recursive configurations, then its interaction value represented by m jj is incremented by one for every successful interaction.

  • If jth configuration interact with none of the configurations except kth, then the interaction value will be calculated as \(I_{jk}=\frac {1}{N}\) for all k=1 to N except j. Matrix M is a stochastic matrix since all the elements in a row of M sum to one and each element lies in the range [0, 1].

Configurations classification

Software system configurations are categorized based on :

1) Frequency of interactions between configurations (IFrFT)

2) Characteristics such as structure (critical or non-critical) and frequency of interactions (ChIFrFT)

IFrFT: fault tolerance based on frequency of interactions

In this approach all configurations are treated as equal by setting their individual interaction value to zero at the beginning of execution of a software system. Configurations that are involved in most of the interactions during execution of software system are considered as critical configurations. Reliability and performance of any software system depend on successful execution of its critical configurations. For an increased number of failed critical configurations and configuration interactions reliability and performance of software system deteriorate significantly. By employing fault tolerant candidates to critical configurations, successful execution of most of interactions, it is possible to improve the reliability and performance of a software system. In this context, we propose a mathematical model to find frequently used configurations of a software system based on frequency of interactions. The proposed mathematical model is as follows:

  • Configuration C i is assigned to a value \(\frac {V_{i}} {\sum _{j=1}^{N} {V_{j}}}\), where N is the number of configurations, V i is the significance of ith configuration.

  • Compute the interaction value of ith configuration C i , denoted by P(C i ), for i=1 to N, using

    $$ P\left(C_{i}\right) = \frac {1-\alpha} {N} + \alpha \sum\limits_{j \in S\left(C_{i}\right)} { P(C_{j}) M\left(I_{ji}\right)} $$

    where S(C i ) is a set of configurations that invoke interactions with configuration C i . Variable α (0≤α≤1) in Eq. 2, indicate the interactions influence, in terms of interaction value, between C i and C j . Therefore, an interaction value of C i is derived from configurations that have invoked it for interaction. Interaction value of C i depends on |S(C i )|, P(C j ), M(I ji ). The larger value of C i indicates that it has been invoked very frequently by other configurations. Using Eq. 2 we find frequently used configurations of a software system. Value of α is varied between zero and one until we get stabilized interaction values for 10000 interactions.

Using Eq. 2 we obtain configurations that are frequently used and play important role in improving the reliability of a software system.

ChIFrFT: fault tolerance based on characteristics and interaction frequencies of configurations

In this approach software configurations are categorized into two sets 1) A set, denoted by C, of configurations which perform critical tasks (checking for uniqueness, size of index table, search operation, and collisions are treated as characteristics) 2) A set, denoted by UC, of configurations which perform non-critical tasks (the padding of special symbol in fixed-length records). Initial interaction values of configurations that belong to the set C are assigned to higher values compare to configurations of UC. Since the reliability and performance of software systems depend on successful execution of their critical configurations, we can improve the reliability, performance by provisioning fault tolerant candidates to critical configurations.

Performance (defined in terms of interaction value and the number of successful interactions) of a critical configuration, say C i , where C i C is computed by using Eq. 3 given below:

$$ P\left(C_{i}\right) = {\left(1-\alpha\right)} \frac{\beta} {\lvert{C}\rvert}+ \alpha \sum_{j \in S\left(C_{i}\right)} { P\left(C_{j}\right) M\left(I_{ji}\right)} $$

For a non-critical configuration C k , where C k NC, we compute the performance using Eq. 4 given below:

$$ P\left(C_{k}\right) = {\left(1-\alpha\right)} \frac{1-\beta} {\lvert{NC}\rvert}+ \alpha \sum_{j \in S\left(C_{k}\right)} { P\left(C_{j}\right) M\left(I_{jk}\right)} $$

Where S(C k ) is the set of configurations that invoke configuration C k , sum of |C| and |NC| is equal to N. In ChIFrFT approach the parameter β is used to find the importance of configurations belong to sets C and UC. For \(\beta \le \frac {\lvert {\textit {C} \rvert }} {\lvert {NC}\rvert }\) performance of ChIFrFT approach is same as that of IFrFT, that is, frequently used and less frequently used configurations are treated equally. However, ChIFrFT performs better than IFrFT for \(\frac {\lvert {\textit {C} \rvert }} {\lvert {NC}\rvert } \le \beta \le 1\), that is, frequently used configurations are treated with higher priority than less frequently used configurations.

Fault tolerance schemes

Classification of software fault tolerance techniques

Software fault tolerance schemes are classified into two major categories. 1) Proactive schemes and 2) Reactive schemes

  • Proactive Fault Tolerance Schemes: Proactive fault tolerance schemes are adopted in computing systems which suffer severely due to failures. Proactive schemes are costlier as they need lots of resources to support simultaneous execution of primary job and its replicas. Some of the proactive fault tolerance schemes are:

    1. 1.

      Preemptive migration: Preemptive migration [45] techniques require very good failure prediction algorithms. For instance prediction of faults related to performance, resource outages and functional failures for timely actions. Effectiveness of preemptive migration schemes depends on the correctness of the output of prediction algorithms.

    2. 2.

      N-Version programming: In N-version programming [46] N programs, N≥2, that are functionally equivalent yet independent as shown in Fig. 3 are generated from the initial formal specification defined using any specification language. Some of the specification languages are Clear, OBJ3, CafeOBJ, ACT, ASL, ASL+, SPECTRUM, and Larch.

      Fig. 3
      figure 3

      N-Version programming

    3. 3.

      N-Copy programming: N-Copy programming as shown in Fig. 4 is the data diverse complement of N-Version Programming. All copies of programs execute in parallel by taking data produced by data re-expression algorithms [47].

      Fig. 4
      figure 4

      N-Copy programming

    4. 4.

      Byzantine fault tolerance: A Byzantine fault is an incorrect operation that occurs in a distributed environment. Byzantine faults can be classified into 1) Omission failure: Failure of a resource means requested resource might not exist or unavailable due to busy 2) Execution failure: Failure due to sending incorrect or inconsistent data, corrupting local state or responding with incorrect data, for example, round-off errors propagated from one function to another function [48].

      To provide Byzantine fault tolerance in presence of m faulty processors, (1) There must be at least 3m+1 processors, (2) Must exist, at least 2m+1 communication paths between a pair of processors, (3) There must be m+1 rounds of messages exchanged and (4) Processors must be synchronized within a known skew of each other.

  • Reactive Fault Tolerance Schemes: Reactive fault tolerance schemes act on the recovery process after the occurrence of failure. These schemes take less amount of resources, more time for recovery, compare to proactive schemes. Some of the reactive schemes are:

    1. 1.

      Recovery Block programming: Recovery block [49] is a popular technique employed in software fault tolerance domain. Figure 5 depicts the pictorial representation of recovery block technique. This technique makes use of redundant program modules. Redundant program modules are executed in sequential manner. For example, in Fig. 5 RB2 executed when RB1 failed, RB3 executed when both RB1 and RB2 failed. Similarly redundant block RB n is executed when all the redundant blocks other than RB n , that is RB1 to RBn−1, are failed.

      Fig. 5
      figure 5

      Recovery Block programming

      This technique fails when all redundant program modules failed. Probability of failure of recovery block technique, denoted by F, can be calculated by using the equation given below:

      $$ \textit{F} = \prod\limits_{i=1}^{n} f_{i} $$
    2. 2.

      Process checkpointing: Checkpointing is a technique that is commonly used to reduce execution time of long running programs in the presence of failures [50]. In this technique, the status of the program/job under execution is saved intermittently. Program/job resumes its execution from the most recent checkpoint instead of from the beginning of the program/job upon occurrence of failure. For a program/job, having n modules with n-1 equally spaced checkpoints, as shown in Fig. 6, expected execution time of a program/job is given by,

      $$ {{} \begin{aligned} E\left(T\left({~}_{x,n}\right)\right)&=\left(\frac{1}{\gamma}+E(R)\right)\left[(n-1)\left(\phi_{c}\left(-\gamma\right) e^{\frac{\gamma x}{n}}-1\right)\right.\\& \quad +\left.\left(e^{\frac {\gamma x}{n}}-1\right)\right] \end{aligned}} $$
      Fig. 6
      figure 6

      Checkpoint Technique

      where x is program/job processing requirement, random variable C is a checkpoint duration, R is a random variable that represents repair time, γ is the rate at which failures occur according to a Poisson distribution, ϕ c is the Laplace-Stieltjes transformation (LST), E(R) is the expected repair time.

    3. 3.

      Process migration: It is an act of transferring state information of live processes from one machine to another across homogeneous or heterogeneous hardware, platforms [51]. Process migration technique to realize its full potential must be able to perform the following tasks 1) A process must preserve its internal state during migration, 2) Must be efficient and 3) Must be as transparent as possible to the external environment.

Selection of best fault tolerance scheme

Each fault tolerant scheme will have several unique candidates based on their configurations and characteristics such as response time, resources like computing power, storage, etc. For example, replication fault tolerance schemes are storage and compute power intensive whereas process migration schemes are time intensive. A fault tolerance scheme, say F i , may have J candidates F ij for J=1 to n.

Given the interaction value of a configuration, values of all eligible fault tolerant schemes (set of redundant configurations which are functionally equivalent, but independent) that are suitable to C t as arguments to Algorithm 1, it selects the most optimal fault tolerant candidate from the set of eligible fault tolerant candidates. Each fault tolerant scheme will have several candidates depending on the characteristics and constraints to be satisfied. To select the most suitable and optimal fault tolerant scheme to a given critical configuration above algorithm uses failure probability of every fault tolerant scheme. For the set of constraints of a critical configuration given as input to the algorithm, it generates a list of suitable fault tolerant candidates whose failure probability is less than the configuration interaction value. Finally the best FT candidate that has minimum failure probability among all possible candidates will be assigned to the critical configuration.

Experimental setup and performance evaluation

Using C++ language we have developed a software system for file structure techniques proposed by Michael J. Folk in [54]. Details of the software system are given in Table 2. It consists of several configurations which are functionally equivalent yet independent to achieve fault tolerance. The file structure system involves following operations: 1. Fixed-length record packing 2. Variable-length record packing 3. Unpacking of fixed-length records 4. Unpacking of variable-length records 5. Searching for a record using Relative Record Number(RRN) 6. Create Primary Index 7. Create Secondary Index 8. Balanced Trees 9. Hashing 10. Dynamic Hashing and 11. Storage Compaction

Table 2 Overview of software system considered

Figure 7 shows a sample code snippet for record packing, unpacking, and index operations.

Fig. 7
figure 7

Code snippet for pack, unpack and index operations

Table 3 shows the list of various operations, role and specifications, possible failures of each operation.

Table 3 Possible failures in each operation

Table 4 shows the list of various failures caused by failed interactions and their descriptions.

Table 4 List of possible failures and their description

We have used two versions of a file structure system 1) With all of its configurations enabled, 2) Functionally equivalent yet independent versions of frequently used (critical) configurations that are 1%, 5%, 10%, 15%, and 20% of all configurations which act as a fault tolerant plane. The file structure system is deployed on VMWare vCenter Server 4.1 that runs on PowerEdge T340 having following configuration: Intel Xeon E5-2430, 2.2 GHz, 16 GB RAM and vSphere Cleint 4.1 installed on the remote desktop computer. Execution requests to multiple groups that consists of both frequently used and less frequently used configurations are generated using a synthetic workload generator. Every execution request randomly generates values for fields of a data record. The operations listed in Table 3 cause interactions between dependent configurations. By monitoring configuration interactions for failure, information about failed interactions logged for further processing.

Fault tolerance techniques IFrFT, ChIFrFT, AllFT, NoFT and FT candidate selection algorithm are implemented in C++ language. In IFrFT and ChIFrFT techniques frequently used configurations are supported by fault tolerant candidates. None of the configurations are supported by fault tolerant candidates in NoFT technique. In AllFT technique every configuration is supported by their replicas. The percentage of failed configurations varied from 10 to 100% in steps of 10. The maximum number of frequently used configurations is set to 20% of all configurations. Value for parameter α in Eq. 2 varied between 0.7 and 0.9 to identify the best value as it is used in [52, 53].

  • Among all the schemes performance of AllFT is better since every configuration is supported by its replica. In this scheme replicated configurations help the system to overcome most of the failures for less number of interaction requests, for e.g. 2000 and 4000.

  • Performance of ChIFrFT is better than IFrFT and NoFT in all the experimental scenarios since it considers both frequency of interactions and characteristics of configurations. This signifies that both characteristics and frequency of interactions of configurations play important role in improving performance and reliability of software systems.

  • Reliability and performance of software systems deteriorate as the number of failed interactions increased. This effect can be nullified by provisioning fault tolerant candidates to frequently used configurations.

  • For any failure percentage between 10 and 100, proposed IFrFT and ChIFrFT techniques perform better than NoFT while ChIFrFT outperform IFrFT.

  • In case of lesser number of interaction requests, the software system has better performance and reliability when frequently used configurations that are enabled with fault tolerant candidates increased from 1 to 20%. However, performance and reliability of software system worsen for a large number of interaction requests.

We study the impact of a K percent of critical or frequently used configurations on system performance, reliability for different values of K (that is, 1%, 5%, 10%, 15%, and 20%) and interaction requests (2000, 4000, 6000, 8000, and 10000).

  • ChIFrFT has consistently better performance compared to IFrFT as the percentage of frequently used configurations increased.

  • The rate at which the amount of successful interactions decrease depends on the percentage of frequently used and failed configurations.

  • When frequently used configurations are 1% of all configurations of the system and the number of failed configurations increased from 10% to a hundred percent, then the number of successful interactions decrease at a faster rate as shown in Fig. 8. Reason for this is only 1% of critical configurations are supported by fault tolerant candidates. Under this scenario, when all the configurations are failed, only fault tolerant candidates which are 1% of all configurations serve requests.

    Fig. 8
    figure 8

    Critical configurations are 1% of all configurations

  • There is no much difference in percentage of successful interactions for ChIFrFT, IFrFT, and NoFT techniques when frequently used configurations are only 1% of all configurations as shown in Fig. 8.

Figure 9 shows the performance of a system when frequently used configurations are 5% of all configurations. As the percentage of failed configurations increased from 10 to 100 in steps of 10, the number of successful interactions will decrease. However, in this case the number of successful interactions is higher compared to the scenario where frequently used configurations are 1% of all configurations. When all the frequently used configurations are failed then fault tolerant candidates, which are equivalent to 5% of all configurations serve requests.

Fig. 9
figure 9

Critical configurations are 5% of all configurations

In this case performance of ChIFrFT, IFrFT techniques are little better than NoFT technique because of 5% of configurations are supported by fault tolerant candidates.

Figure 10 shows the performance of the file structure system when frequently used configurations are 10% of all configurations. In this case system performance is better compare to previous scenarios in which frequently used configurations are 1% and 5%. Furthermore, for a larger number of failed configurations NoFT technique suffer severely compare to ChIFrFT and IFrFT techniques.

Fig. 10
figure 10

Critical configurations are 10% of all configurations

Figure 11 shows the performance of the file structure system when frequently used configurations are Fifteen percent of all configurations. Large number of interaction requests, higher number of failed configurations lead to a large number of failed requests, even though every configuration has its replica.

Fig. 11
figure 11

Critical configurations are 15% of all configurations

As the number of interaction requests increased from 2000 to 10000 in steps of 1000, system undergo performance deterioration due to the large number of failed interaction requests. The reason is that every configuration in AllFT technique is supported by only one backup configuration. After primary configuration failed, its backup configuration also may fail to serve requests.

Figure 12 shows the performance of the system when frequently used configurations are 20% of all configurations. Since 20% of all configurations constitute a considerably large number of configurations that are supported by fault tolerant candidates, performance in this case is better when compared to scenarios in which critical configurations are 1%, 5%, 10%, 15% and 20%. However, 10000 interaction requests in this case have significant impact on performance of AllFT technique. That is, in spite of having replicas for all critical configurations they are prone to failure as the number of interaction requests increased. Also NoFT technique suffer from performance deterioration due to large number of interaction requests.

Fig. 12
figure 12

Critical configurations are 20% of all configurations

Figures 13, 14, 15, 16 and 17 showcase the importance of configuration interactions on performance and reliability of a software system. More the number of interaction requests, then higher the likely hood of reduced performance of a software system. Figure 13 shows the percentage of interactions that are successfully executed when frequently used configurations are 1% of all configurations. This percentage decreases as the number of interaction requests increased.

Fig. 13
figure 13

Critical configurations are 1% of all configurations

Fig. 14
figure 14

Critical configurations are 5% of all configurations

Fig. 15
figure 15

Critical configurations are 10% of all configurations

Fig. 16
figure 16

Critical configurations are 15% of all configurations

Fig. 17
figure 17

Critical configurations are 20% of all configurations

Figures 14, 15, 16, 17 shows the percentage of successful interactions when frequently used configurations are 5, 10, 15, and 20% of all configurations, respectively.


In this research article we have proposed fault tolerant techniques based on frequency of interactions, characteristics of configurations. IFrFT technique based on frequency of interactions makes use of interaction values of configurations to measure the reliability and performance of a software system. ChIFrFT technique is based on frequency of interactions and characteristics (critical operations such as payment transactions and control systems) makes use of structural information of configurations. Performance of NoFT, AllFT, IFrFT and ChIFrFT techniques compared for various percentages of critical configurations, number of interaction requests. While AllFT technique perform better than all the other techniques ChIFrFT approach has consistent performance compare to IFrFT and NoFT. Performance of IFrFT and ChIFrFT techniques, in terms of the number of successful interactions, has increased by 25 and 40% respectively compare to NoFT. Our experimental results show that either large number of interaction requests, or more number of failed critical configurations or both will result in performance deterioration of software systems in spite of having fault tolerant support for the complete software system.



All configurations enabled with fault tolerant candidates


Characteristics and interaction frequency based fault tolerance


Interaction frequency based fault tolerance


None of the configurations of a software system enabled with fault tolerant candidates


Recovery block


Relative record number


  1. BCzarnecki K, Eisenecker U (2000) Generative Programming: Methods, Tools, and Applications. Addison-Wesley Publishing Co.

  2. Siegmund N, Rosenmuller M, Kastner C, Giarrusso PG, Apel S, Kolesnikov SS (2011) Scalable prediction of non-functional properties in software product lines In: 2011 15th International Software Product Line Conference, 160–169, Munich.

  3. Batory D, Höfner P, Kim J (2011) Feature interactions, products, and composition In: Proceedings of the 10th ACM international conference on Generative programming and component engineering, 13–22.. ACM, Portland.

  4. Dart S (1990) Spectrum of Functionality in Configuration Management Systems. Software Engineering Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, Technical Report CMU/SEI-90-TR-011.

  5. Talbert N (1998) The cost of COTS. Computer 31: 46–52.

  6. Gokhale SS, Trivedi KS (2002) Reliability prediction and sensitivity analysis based on software architecture In: 13th International Symposium on Software Reliability Engineering, 64–78.

  7. Wu G, Wei J, Qiao X, Li L (2007) A bayesian network based qos assessment model for web services In: Proc. Int’l Conf. Services Computing (SCC’07), 498–505.

  8. Tsai WT, Zhou X, Chen Y, Bai X (2008) On testing and evaluating service-oriented software. IEEE Comput 41(8):40–46.

    Article  Google Scholar 

  9. Lyu MR (1996) Handbook of Software Reliability. McGraw-Hill, New York.

    Google Scholar 

  10. Avizienis A (1995) The methodology and n-version programming. In: Lyu MR (ed)Software fault tolerance, 23–46.. Wiley, Chichester,

    Google Scholar 

  11. Rooney P (2002) Microsoft’s CEO:80-20 rule applies to bugs, not just features. ChannelWeb.

  12. Zhou A, Wang S, Zheng Z, Hsu C, Lyu M, Yang F (2016) On Cloud Service Reliability Enhancement with Optimal Resource Usage. IEEE Trans Cloud Comput 4(4):452–466.

    Article  Google Scholar 

  13. Zhang Y, Zheng Z, Lyu MR (2011) BFTCloud: a byzantine fault tolerance framework for voluntary-resource cloud computing, 444–451.. 2011 IEEE 4th International Conference on Cloud Computing, Washington, DC.

  14. Lim J, Suh T, Gil J, Yu H (2014) Scalable and Leaderless Byzantine Consensus in Cloud Computing Environments. Information Systems Frontiers, Springer 16(1):19–34.

    Article  Google Scholar 

  15. Ganesh A, Sandhya M, Shankar S (2014) A Study on Fault Tolerance methods in cloud computing. In: IEEE Int’l. Advanced Computing Conference (IACC):844–9.

  16. Vacca JR (2013) Cyber security and IT infrastructure protection. Syngress. PaperBack ISBN: 9780124166813.

  17. Kaur J, Kinger S (2013) Analysis of different techniques used for fault tolerance. IJCSIT: Int. J Comput Technol 4(2):737–41.

    Google Scholar 

  18. Egwutuoha IP, Chen S, Levy D, Selic B (2012) A fault tolerance framework for high performance computing in cloud, Cluster, Cloud and Grid Computing (CCGrid) In: Proceedings of the 12th IEEE/ACM international symposium. 13-16 May, 709–710.

  19. Bala A, Chana I (2012) Fault tolerance- challenges, techniques and implementation in cloud computing, ISSN (Online): 16940814. IJCSI Int J Comput Sci 9(1).

  20. Zhao W, Wenbing Z, Melliar-Smith PM, Moser LE (2010) Fault Tolerance Middleware for Cloud Computing In: 2010 IEEE 3rd International Conference on Cloud Computing, 67–74, Miami.

  21. Zhao W, Zhang H (2009) Proactive service migration for long-running Byzantine fault-tolerant systems. lET Softw 3(2):154–164.

    Google Scholar 

  22. Nicolo P (2013) A frame work for self-healing software systems In: IEEE 35th International Conference on Software Engineering (ICSE), 1397–1400.

  23. Saran C (2017) Cloud self-healing software be the Itdirector’s way of cutting support costs?

  24. Nallur V, Bahsoon R, Xin Y (2009) Self-optimizing architecture for ensuring quality attributes in the cloud In: 2009 Joint Working IEEE/IFIP Conference on Software Architecture & European Conference on Software Architecture, 281–284, Cambridge.

  25. Liu J, Wang S, Zhou A, Kumar SAP, Yang F, Buyya R (2016) Using Proactive Fault-Tolerance Approach to Enhance Cloud Service Reliability. IEEE Trans Cloud Comput PP(99):1–1.

  26. Trivedi TS (1982) Probability and Statistics with Reliability, Queuing, and Computer Science Applications. Prentice-Hall, Englewood Cliffs.

    MATH  Google Scholar 

  27. Hecht H (1976) Fault Tolerant Software for Real-Time Applications. ACM Comput Surv 8(4):391–407.

    Article  MATH  Google Scholar 

  28. Li X, Qi Y, Chen P, Zhang X (2016) Optimizing backup resources in the cloud In: 2016 IEEE 9th International Conference on Cloud Computing (CLOUD), 790–797, San Francisco.

  29. Liu J, Zhou J, Buyya R (2015) Software rejuvenation based fault tolerance scheme for cloud applications In: 2015 IEEE 8th International Conference on Cloud Computing, 1115–1118, New York.

  30. Bruneo D, Distefano S, Longo F, Puliafito A (2013) Scarpa M Workload-Based Software Rejuvenation in Cloud Systems. IEEE Trans Comput 62(6):1072–1085.

    Article  MathSciNet  MATH  Google Scholar 

  31. Araujo J, Matos R, Maciel P, Matias R (2011) Software aging issues on the eucalyptus cloud computing infrastructure In: 2011 IEEE International Conference on Systems, Man, and Cybernetics, 1411–1416, Anchorage.

  32. Langner F, Andrzejak A (2013) Detecting software aging in a cloud computing framework by comparing development versions In: 2013 IFIP/IEEE International Symposium on Integrated Network Management (IM 2013), 896–899, Ghent.

  33. Clark C, Fraser K, Hand A, Hansen J, Jul E, Limpach C, Pratt I, Warfield A (2005) Live Migration of Virtual Machines In: Proceedings of the Symposium on Networked Systems Design and Implementation, 273–286.

  34. Sapuntzakis C (2002) Optimizing the Migration of Virtual Computers In: Proc. 5th Symp. Operating Systems Design and Implementation. Available:

  35. Fu K, Frans Kaashoek M, Mazières D (2005) Fast and secure distributed read-only file system In: Proceedings of the 4th conference on Symposium on Operating System Design & Implementation, San Diego.

  36. Lublin U, Liguori A (2007) KVM Live Migration. KVM Forum, Tucson.

    Google Scholar 

  37. Zhu X, Wang J, Guo H, Zhu D, Yang LT, Liu L (2016) Fault-tolerant scheduling for real-time scientific workflows with elastic resource provisioning in virtualized clouds. IEEE Trans Parallel Distrib Syst 27(12):3501–3517.

  38. Siegmund N, Kolesnikov S, Kastner C, Apel S, Batory D, Rosen Muller M, Saake G (2012) Predicting performance via automated feature-interaction detection In: 2012 34th International Conference on Software Engineering (ICSE), 167–177, Zurich.

  39. Liebig J, Apel S, Lengauer C, Kastner C, Schulze M (2010) An analysis of the variability inforty preprocessor-based software product lines In: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, 105–114.. ACM,Cape Town.

  40. Sincero J, Schroder-Preikschat W, Spinczyk O (2010) Approaching non-functional properties of software product lines: Learning from products In: APSEC, 147–155.. IEEE,

  41. Zheng Q (2010) Improving MapReduce fault tolerance in the cloud In: 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 1–6, Atlanta.

  42. Slawinska M, Slawinski J, Sunderam V (2010) Unibus: Aspects of heterogeneity and fault tolerance in cloud computing In: Parallel and Distributed Processing, Workshops and Phd Forum (IPDPSW), 1–10.

  43. Parnas DL (1975) The influence of software structure on reliability In: Proceedings of the international conference on Reliable software, 358–362, Los Angeles.

  44. Shooman ML (1976) Structural models for software reliability prediction In: Proceedings of the 2nd international conference on Software engineering, 268–280, San Fransisco,

  45. Engelmann C, Vallee GR, Naughton T, Scott SL (2009) Proactive Fault Tolerance using Preemptive Migration In: 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing, 252–257, Weimar.

  46. Avizienis A (1995) The methodology of n-version programming. In: Lyu MR (ed)Software Fault Tolerance, 23–46.. Wiley, Chichester.

    Google Scholar 

  47. Ramos BLC (2007) Challenging malicious inputs with fault tolerance techniques.

  48. Castro M, Liskov B (2002) Practical Byzantine Fault Tolerance and Proactive Recovery. ACM Trans Comput Syst (Assoc Comput Mach) 20(4):398–461. CiteSeerX:

  49. Kim K, Welch H (1989) Distributed execution of recovery blocks: An approach for uniform treatment of hardware and software faults in real-time applications. IEEE Trans Comput 38(5):626–63.

    Article  Google Scholar 

  50. Nicola VF (1995) Checkpointing and the Modeling of Program Execution Time. Software Fault Tolerance. Wiley.

  51. Petri S, Langendörfer H (1995) Load balancing and fault tolerance in workstation clusters migrating groups of communicating processes. ACM SIGOPS Operating Systems Review 29(4):25–36.

  52. Brin S, Page L (1998) The anatomy of a large-scale hypertextual Web search engine In: Proceedings of the seventh international conference on World Wide Web 7, 107–117, Brisbane.

  53. Inoue K, Yokomori R, Yamamoto T, Matsushita M, Kusumoto S (2005) Ranking significance of software components based on use relations. IEEE Trans Softw Eng 31:213–225.

    Article  Google Scholar 

  54. Folk MJ, Riccardi G (1997) File Structures: An Object-Oriented Approach with C++. Addison-Wesley Publishing Co.

Download references


We thank following members of REVA University for continuous support and motivation: Our beloved Chancellor of REVA University Dr. P. Shyamaraju, Dr. V. G. Talawar, Advisor, Dr. S. Y. Kulkarni, Vice-Chancellor, Dr. M. Dhanamjaya, Registrar, Dr. N. Ramesh, Director of Placement Training and Planning, Dr. B. P. Divakar, Director of Research and Innovation, Dr. Sunilkumar S. Manvi, Director C and IT, and every member of C and IT. Also we thank Dr. N. R. Shetty, Advisor, NMIT, Dr. H. C. Nagaraj, Principal, NMIT for their constant encouragement and support. We thank G. Anitha for her comments on independent configurations and proofreading that helped us to improve the manuscript. We express our heartfelt thanks to the reviewers for their invaluable time spent on a review process and very critical suggestions which played significant role in improving the technical content of the manuscript and also helped us in the thinking process. We thank the editor and every member of the editorial office of the Journal of Cloud Computing, Springer for their continuous support while processing research article. Last but not the least we thank Manojkumar Pandey, Arunkumar Reddy, Swetha Ramapriya, Shristi Srivastsav for their assistance in coding files structures concepts.


Not Applicable.

Availability of data and materials

Data and materials related to this research article will be uploaded as early as possible in GitHub portal.

Author information

Authors and Affiliations



Both authors have made intellectual contribution to the research in the field of fault tolerant computing in cloud computing. Authors have conducted experiments and responsible to writing, editing manuscript. Both authors have read this research article and approved the final manuscript.

Corresponding author

Correspondence to Mylara Reddy Chinnaiah.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional information

Authors’ information

Mr. Mylara Reddy C. is an Assistant Professor at School of Computing and Information Technology, REVA University, Bangalore, Karnataka, India. He received M. Tech. degree from Visvesraya Technological University in 2006. He has published few papers in international conferences and peer reviewed journals such Journal of Computer Networks, Elsevier, International Journal of Computer and Electrical Engineering. He is pursuing Ph. D. as part time research scholar under the supervision of Dr. Nalini N. at Nitte Meenakshi Institute of Technology, Bangalore, affiliated to Visvesvaraya Technological University, Karnataka, India. His research interest include cloud computing, fault tolerance, Computer networks.

Nalini N. received her Ph.D. degree in Computer Science and Engineering from Visvesvaraya Technological University in 2006. She is working as professor in the Department of Computer Science and Engineering, Nitte Meenakshi Institute of Technology, Bangalore, India. She has published several research articles in international conferences and peer reviewed journals. She has successfully supervised six students for Ph.D. degree. She is the recipient of several awards from industry, social organizations and academia. Her research interest include security in wireless communication systems, fault tolerant computing in wireless mobile systems and sensor networks, cryptography and network security, Fault tolerance framework for cloud computing applications, machine learning.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Chinnaiah, M., Niranjan, N. Fault tolerant software systems using software configurations for cloud computing. J Cloud Comp 7, 3 (2018).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI:


  • Configurable software systems
  • Fault tolerance
  • Reliability
  • Configurations interactions