Cloud computing can be seen as a style of computing in which data and programs are stored, accessed and processed over the internet. With characteristics such as on-demand service, broad network access, resource pooling and rapid elasticity, cloud computing has been widely adopted across businesses and industries. This growth has led to an increase in the demand for cloud resources and services. One major characteristic of cloud computing is rapid elasticity, which means that resources are provisioned and released on demand. Consequently, the reliability and availability of these resources must be ensured; one way to achieve this is by introducing fault tolerance into the system.
Fault tolerance in the cloud can be defined as the ability of the cloud to withstand unexpected changes (network congestion, hardware faults, software faults) without seriously affecting the performance of the system. Various models and approaches have been proposed for fault tolerance. This paper is tailored towards discussing and comparing such models and approaches.
The methodology used in gathering information and preparing this paper involved three steps: 1. gathering related works; 2. categorizing the works into clusters; and 3. analyzing the related works.
The first step involved using library resources to search well-known scientific databases (IEEE, ACM, SIAM, Springer Link, LNCS and Science Direct) in order to gather articles and conference papers through keyword-based search. Keywords used in the search included “fault tolerance”, “cloud computing”, “failure detector” and “FT”, as well as combinations such as “fault tolerance in cloud computing”. The main aim was to pick only the most cited papers published between 2010 and 2017; exceptions were made for popular papers such as [1], [13], [14] and [17], which are relevant and serve as starting points for research on fault tolerance in the cloud. A total of 40 papers were gathered, but after carefully scrutinizing each of them, only 32 were used in preparing this survey.
The second step involved categorizing the papers gathered in the first step into clusters. Two clusters were created, and the papers were assigned to them based on the fault tolerance policy adopted. One cluster contains more works than the other because 70% of the papers gathered adopted the reactive fault tolerance policy in developing their fault tolerance approach.
The final step involved examining each cluster as well as carefully studying and analyzing each paper separately.
I. Fault Tolerance in Cloud Computing
Fault tolerance in general can be defined as the ability of a system to provide its required services, possibly with degraded performance, in the presence of faults or system failures. Relating this to cloud computing, fault tolerance in the cloud is the ability of the cloud to continue providing services even when unexpected changes (e.g. programming malfunctions, network congestion, hardware or software faults) occur.
Over the years, several fault tolerance approaches have been proposed in order to improve cloud service reliability [1]. Redundancy serves as the key technique for fault tolerance, and practically all fault tolerance approaches are based on the use of redundancy. Several survey papers on fault tolerance have shown that replication and checkpointing are the two most widely adopted techniques.
Faults can be categorized as follows:
• Network fault: these are faults that occur in the network; they may be due to link failure, packet loss, network partitioning, or inadequate or improper network configuration.
• Physical fault: these are faults that occur in the hardware. Such a fault can cause the system to fail instantly, although a hardware fault can also be small and go undetected until it leads to a serious failure. Examples include a burned-out cooling fan that leads to overheating, or a fault in the CPU, storage or memory.
• Application fault: this sort of fault occurs in applications; the application may stop working or behave in an unusual manner.
• Processor fault: this sort of fault occurs in the processor and may be caused by the operating system crashing.
• Process fault: this occurs when a process running on the processor fails, which may be due to a shortage of resources or to software bugs.
One major failure faced in the cloud is VM failure, which may lead to tasks not being executed. Major reasons for VM failure include the following:
o Inadequate server resources: this occurs when server resources are over-committed or insufficient, which may lead to VM failure.
o Incompatible server hardware: this occurs when the hardware is not compatible with the VM. In the virtualization concept, the VM abstracts the image of the server hardware; it is therefore important that the hardware supports the VM's functionalities.
o Conflicting VM tasks: this occurs when a scheduled task assigned to a VM is not compatible with that VM. When this happens, it takes a long time for a timeout error to be generated, and new tasks cannot be executed because the execution of the incompatible task continues in the background.
II. Fault Tolerance Requirements
Various parameters must be considered when developing a fault tolerance technique. These parameters include throughput, performance, scalability, response time, reliability, security and availability.
• Throughput: this defines the number of tasks that can be executed in a given time; its value should be high. Throughput can be calculated as Throughput = N / T, where N is the number of tasks and T is the time taken to execute them successfully (a short sketch of these metrics is given after this list).
• Response time: this is the amount of time taken for the proposed fault tolerance technique to respond. The value for response time should be minimized [2].
• Reliability: this aspect aims to provide a correct and acceptable result within a required period of time. In other words, it refers to the attribute of a system being able to run continuously without failure [3].
• Performance: this parameter checks the efficiency of the system. This can be improved by accommodating acceptable delays while reducing response time [4].
• Scalability: this parameter defines the capability of an algorithm to tolerate faults regardless of the number of nodes in the system.
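To make the throughput and response time parameters concrete, the following is a minimal Python sketch; the task counts and timing values are illustrative assumptions, not measurements from any system discussed in this survey.

```python
# Minimal sketch of the throughput and response time metrics described above.

def throughput(completed_tasks: int, elapsed_seconds: float) -> float:
    """Throughput = number of successfully executed tasks / time taken."""
    return completed_tasks / elapsed_seconds

def mean_response_time(response_times: list[float]) -> float:
    """Average time the fault tolerance mechanism takes to respond."""
    return sum(response_times) / len(response_times)

if __name__ == "__main__":
    print(throughput(120, 60.0))                  # 2.0 tasks per second
    print(mean_response_time([0.2, 0.3, 0.25]))   # 0.25 seconds
```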
III. Cloud Fault Tolerance Policies
Over the years, numerous approaches have been proposed to provide fault tolerance in the cloud. These approaches can be categorized into two types, proactive fault tolerance and reactive fault tolerance, which together are known as the cloud fault tolerance policies. In this paper, we consider these policies as the clusters used to categorize fault tolerance approaches. Fig. 1 shows the fault tolerance policies and the techniques associated with each of them.
A. Proactive Fault Tolerance
Proactive fault tolerance avoids failure by predicting faults before they occur and employing preventive measures, for example by replacing an allegedly faulty component with an operational one before the fault occurs. Proactive fault tolerance ensures that the system runs smoothly and jobs get done without experiencing faults. Software rejuvenation, self-healing and preemptive migration are proactive fault tolerance techniques and are briefly discussed below.
• Software rejuvenation: this technique requires the system to go through periodic reboots, and after each reboot the system starts with a fresh state [5]. Assumptions are made as to when the system is expected to fail, and a rejuvenation interval is set so that it ends before the system fails; at the end of each rejuvenation interval the system is restarted and the operating environment as well as all processes are reinitialized (a minimal sketch is given after this list).
• Self-healing: in this technique, several instances of a particular application run on different virtual machines, which allows failures of the application to be handled automatically [4].
• Preemptive migration: in this technique, a feedback-loop mechanism is used, which requires the application to be constantly monitored and analyzed. An executing job is preempted, and its state is saved and migrated to another system [4].
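As a rough illustration of the software rejuvenation idea above, the sketch below periodically terminates and restarts a worker process before an assumed failure point. The interval, cycle count and placeholder workload are assumptions chosen for the example, not part of any technique surveyed here.

```python
import multiprocessing as mp
import time

# Hypothetical software rejuvenation sketch: the worker is restarted at a fixed
# rejuvenation interval chosen to end before the system is expected to fail.

REJUVENATION_INTERVAL = 2.0  # seconds; must be shorter than the expected time to failure

def worker() -> None:
    # Placeholder workload; its state is reinitialized on every restart.
    while True:
        time.sleep(0.5)

def rejuvenation_loop(cycles: int = 3) -> None:
    for _ in range(cycles):
        proc = mp.Process(target=worker)
        proc.start()                       # fresh state for this cycle
        proc.join(REJUVENATION_INTERVAL)   # let it run for one interval
        proc.terminate()                   # periodic "reboot" before a fault is expected
        proc.join()

if __name__ == "__main__":
    rejuvenation_loop()
```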
B. Reactive Fault Tolerance
Reactive fault tolerance measures are applied to reduce the effect of faults that have already occurred in the cloud, as well as to recover from them. This policy can lead to performance degradation or temporary service downtime; nevertheless, most current fault tolerance techniques employ it and commonly depend on a checkpoint/restart mechanism.
• Restart: this technique operates at the application level; if a task is suspected to have failed, the task is aborted and restarted [7]. A timeout mechanism is used to trigger the abort-and-restart process, and the timeout has to be chosen carefully: a short timeout might lead to a task being aborted before completion, while a long timeout might lead to unnecessary delays.
• Checkpointing: this is similar to the restart technique, but a preventive component is used to avoid fully restarting the task or process. The system state is saved at regular time intervals, in other words a checkpoint is added at each interval. Whenever a failure occurs, the system rolls back to the most recent checkpoint, so any work performed after that checkpoint is lost (a minimal sketch is given after this list).
• Replication: this technique makes several copies of a particular piece of data and runs these copies on different resources until the task crashes or completes. This technique adds redundancy to the system.
• Task resubmission: this technique handles faults by resubmitting the task either to the same resource on which it was previously running or to a different resource. Task resubmission is the technique most widely used in scientific workflow systems [5].
• Job migration: this technique involves migrating a failed job from one machine to another, typically because that job cannot be executed on the original machine or because the machine does not have the resources required to process it. The job is therefore moved to another machine with enough resources to execute it.
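The following is a minimal Python sketch of the checkpointing technique described above: state is saved at a fixed interval and restored from the most recent checkpoint when a simulated failure occurs. The file name, checkpoint interval and failure probability are illustrative assumptions.

```python
import pickle
import random

# Checkpoint/rollback sketch: the loop state is saved every CHECKPOINT_INTERVAL
# iterations; on a simulated failure the computation rolls back to the most
# recent checkpoint, losing any work done after it.

CHECKPOINT_FILE = "state.ckpt"
CHECKPOINT_INTERVAL = 10

def save_checkpoint(state: dict) -> None:
    with open(CHECKPOINT_FILE, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint() -> dict:
    with open(CHECKPOINT_FILE, "rb") as f:
        return pickle.load(f)

def run(total_iterations: int = 100) -> dict:
    state = {"i": 0, "total": 0}
    save_checkpoint(state)
    while state["i"] < total_iterations:
        state["i"] += 1
        state["total"] += state["i"]
        if state["i"] % CHECKPOINT_INTERVAL == 0:
            save_checkpoint(state)          # periodic checkpoint
        if random.random() < 0.01:          # simulated transient failure
            state = load_checkpoint()       # roll back; recent work is lost
    return state

if __name__ == "__main__":
    print(run())
```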
The proactive and reactive fault tolerance policies both have their pros and cons. Even though it is more efficient to proactively predict faults and take measures to prevent them from occurring, proactive techniques are used less often than reactive techniques, which can cause performance degradation and service downtime in the system. This is because reactive fault tolerance techniques are relatively simple to implement compared to proactive techniques. It should also be noted that reactive fault tolerance techniques may not scale well in systems with low VM availability, because once a failure occurs on one VM, the availability of the other VMs in the system decreases; this is evident in techniques such as job migration, replication and task resubmission.
Table I shows a comparison of the various fault tolerance policies alongside their techniques and the concepts behind them. Various tools are used to implement fault tolerance techniques; Table II contains a detailed comparison of these tools in terms of their programming framework, the fault tolerance techniques for which they are used, and the environment in which they operate.
IV. Failure Detection Strategies
In this section, various failure detection strategies are described and evaluated. Based on the results of this survey of proposed fault tolerance mechanisms, the heartbeat failure detection strategy is the most widely used; this section also explains why that is the case. The system model considers a distributed system with a finite set of processes. These processes are connected via reliable communication channels and communicate by sending and receiving packets. It is also assumed that each of these processes supports IP multicast communication.
• Heartbeat strategy: this is the most widely used failure detection strategy. In this strategy, every monitored process p sends an “I am alive” message to a monitoring process q at every time interval. If q does not receive the “I am alive” message before the timeout elapses, q adds p to its list of suspected processes; if q later receives an “I am alive” message from p, it removes p from the list of suspected processes. Two parameters must be set: the heartbeat period and the timeout delay. The heartbeat period is the time interval at which an “I am alive” message is sent, while the timeout delay is the time q waits for an “I am alive” message from p before suspecting it (a minimal sketch is given after this list).
• Pinging strategy: this strategy is similar to the heartbeat strategy, but here a process q monitors a process p by periodically sending “Are you alive?” messages, and p is expected to reply with an “I am alive” message within a set time. If p times out without sending an “I am alive” message, q adds p to its list of suspected processes; if q later receives an “I am alive” message from p, it removes p from the list of suspected processes. Just as in the heartbeat strategy, two parameters must be set: the interrogation period and the timeout delay. The interrogation period is the time interval at which q sends an “Are you alive?” message to p, while the timeout delay is the time q waits for p's “I am alive” reply.
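A minimal sketch of the heartbeat strategy follows, assuming a single monitor q and a monitored process p; the heartbeat period and timeout delay values are arbitrary choices for the example.

```python
import time

# Heartbeat failure detector sketch: q records the arrival time of each
# "I am alive" message from p and suspects p once no heartbeat arrives
# within the timeout delay.

HEARTBEAT_PERIOD = 1.0   # p sends "I am alive" every second
TIMEOUT_DELAY = 3.0      # q suspects p after 3 seconds of silence

class HeartbeatMonitor:
    def __init__(self) -> None:
        self.last_heartbeat: dict[str, float] = {}
        self.suspected: set[str] = set()

    def on_heartbeat(self, sender: str) -> None:
        """Called when an 'I am alive' message arrives from `sender`."""
        self.last_heartbeat[sender] = time.monotonic()
        self.suspected.discard(sender)      # trust the process again

    def check(self) -> set[str]:
        """Add silent processes to the suspected list and return it."""
        now = time.monotonic()
        for proc, seen in self.last_heartbeat.items():
            if now - seen > TIMEOUT_DELAY:
                self.suspected.add(proc)
        return self.suspected

if __name__ == "__main__":
    q = HeartbeatMonitor()
    q.on_heartbeat("p")
    time.sleep(HEARTBEAT_PERIOD)            # p stays within the timeout
    print(q.check())                        # set() -> p not suspected yet
```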
There are several advantages to using the heartbeat strategy over the pinging strategy. First, the pinging strategy sends more messages than the heartbeat strategy. Second, estimating the timeout delay is easier for the heartbeat strategy, because only the transmission delay of the “I am alive” messages needs to be considered, whereas in the pinging strategy the transmission delays of both the “Are you alive?” messages and the “I am alive” messages must be considered.
Several existing failure detectors are based on the heartbeat strategy; these include:
Chen FD [5]: this strategy uses the arrival times of past heartbeats to estimate the arrival time of the next heartbeat. From this estimation an expected arrival time is set, together with a constant safety margin; the margin is constant because the model assumes a probabilistic behavior.
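A simplified sketch of this kind of estimation is given below: the mean inter-arrival time of recent heartbeats predicts the next arrival, and a constant safety margin is added. The window of arrival times and the margin value are illustrative assumptions rather than the exact estimator of [5].

```python
# Simplified Chen-style estimation: past heartbeat arrival times give a mean
# inter-arrival time, which predicts the next arrival; adding a constant safety
# margin alpha yields the time after which the sender is suspected.

def next_suspicion_time(arrival_times: list[float], alpha: float = 0.5) -> float:
    """Return the time after which the monitored process is suspected."""
    intervals = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    mean_interval = sum(intervals) / len(intervals)
    expected_arrival = arrival_times[-1] + mean_interval
    return expected_arrival + alpha        # constant safety margin

if __name__ == "__main__":
    heartbeats = [0.0, 1.02, 2.01, 2.98, 4.00]   # example arrival times (s)
    print(next_suspicion_time(heartbeats))        # ~5.5 s
```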
Bertier FD [13]: the failure detector proposed in this paper combines two models: the Chen FD model described above and Jacobson's estimation [13, 14]. The first is used to estimate the expected arrival time, while the second is used to re-estimate the safety margin each time a heartbeat message is received.
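The sketch below illustrates how a Jacobson-style estimation can adapt the safety margin from the error between the expected and actual heartbeat arrival, in the spirit of the Bertier FD; the gain, beta and phi constants are assumptions for the example and not the exact parameters of [13].

```python
# Adaptive safety margin in the style of Jacobson's estimation (as used for
# TCP round-trip timing): the margin grows with the observed arrival jitter.

class AdaptiveMargin:
    def __init__(self, gain: float = 0.125, beta: float = 1.0, phi: float = 4.0) -> None:
        self.gain, self.beta, self.phi = gain, beta, phi
        self.error_estimate = 0.0   # smoothed estimation error
        self.variance = 0.0         # smoothed deviation of the error

    def update(self, actual_arrival: float, expected_arrival: float) -> float:
        """Recompute the safety margin from the latest arrival error."""
        error = actual_arrival - expected_arrival
        self.error_estimate += self.gain * (error - self.error_estimate)
        self.variance += self.gain * (abs(error) - self.variance)
        return self.beta * self.error_estimate + self.phi * self.variance

if __name__ == "__main__":
    margin = AdaptiveMargin()
    print(margin.update(actual_arrival=5.6, expected_arrival=5.5))  # margin grows with jitter
```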
Accrual FD [14]: unlike the models above, the accrual FD does not estimate an arrival time; instead it estimates the mean, the variance and a distribution of inter-arrival times (such as an Erlang or normal distribution). It produces a suspicion level governed by two contrasting properties, one of which must be satisfied: Property 1 (Accruement) and Property 2 (Upper Bound). Property 1 is satisfied if a process (say process p) crashes and stops sending heartbeat messages, which causes the suspicion level to keep increasing; Property 2 is satisfied if process p keeps sending heartbeat messages on time.
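For illustration, the sketch below computes an accrual-style suspicion level from the mean and variance of past inter-arrival times under a normal-distribution assumption; this formulation and the example values are assumptions, not the exact model of [14].

```python
import math
import statistics

# Accrual-style detector sketch: instead of a boolean verdict, it outputs a
# suspicion level that keeps growing while no heartbeat arrives and stays low
# while heartbeats arrive on time.

def suspicion_level(arrival_times: list[float], now: float) -> float:
    intervals = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    mean = statistics.mean(intervals)
    stdev = statistics.pstdev(intervals) or 1e-6   # avoid division by zero
    elapsed = now - arrival_times[-1]
    # Probability that a heartbeat still arrives later than `elapsed`.
    p_later = 1.0 - 0.5 * (1.0 + math.erf((elapsed - mean) / (stdev * math.sqrt(2.0))))
    return -math.log10(max(p_later, 1e-12))        # grows as silence continues

if __name__ == "__main__":
    heartbeats = [0.0, 1.0, 2.0, 3.1, 4.0]
    print(suspicion_level(heartbeats, now=4.5))    # low: heartbeat still plausible
    print(suspicion_level(heartbeats, now=8.0))    # high: process likely crashed
```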
V. Related Work
A. Proactive Fault Tolerance
N. Xiong et al. [12] proposed a dynamic failure detection scheme called the Self-tuning Failure Detector (SFD). SFD can handle unexpected network conditions and the requirements of any number of concurrently running applications. According to the authors, extensive experiments were performed to compare the quality of service of the scheme with existing failure detector schemes (such as [5], [13], [14]); the results show that SFD can automatically adjust its control parameters to provide consistent service while maintaining good performance. Lastly, according to the authors, the scheme is suitable for “one monitor multiple” and “multiple monitor multiple” cases based on parallel theory.
P. Egwutuoha et al. [15] proposed a fault tolerance algorithm for High Performance Computing (HPC) systems in the cloud that minimizes wall-clock execution time in the presence of faults. The algorithm does not rely on spare nodes being provisioned prior to the prediction of a failure; instead, it achieves fault tolerance through an avoidance mechanism that relies on system logs and health monitoring facilities.
Jialei et al. [6] proposed an approach called PCFT (Proactive Fault Tolerance) which uses a health monitoring mechanism to monitor physical machines. The architecture of the PCFT approach consists of two modules: physical machine (PM) fault prediction and optimal target PM selection. The first module monitors CPU temperature and provides a prediction capability for detecting a deteriorating PM; the second searches for optimal physical machines to host the VMs of the deteriorating physical machine.
Lin et al. [16] presented an efficient adaptive failure detection mechanism based on the Volterra series. The mechanism predicts failure by statistically comparing heartbeat arrival rates. In this system, the fault detection unit comprises two modules: an adaptive prediction module and a decision module. The adaptive prediction module predicts failure using a time series model based on heartbeat arrival times, and builds heartbeat arrival patterns by comparing heartbeats received directly from a node with the predictions of the time series model. The decision module decides whether the system will fail based on a decision tree algorithm. The system was evaluated in Beijing and Guangzhou experiment environments, and the authors also implemented the self-adaptive algorithm [17] and the ARIMA algorithm in order to evaluate their algorithm alongside these two. Chen's algorithm and the proposed algorithm showed a similar prediction pattern, although two peaks in Chen's algorithm were slightly higher than those of the proposed system; no such similarity was found between the proposed algorithm and the ARIMA algorithm.
Wenbing Zhao et al. [18] proposed the Low Latency Fault Tolerance (LLFT) middleware, which utilizes a leader/follower replication approach to offer fault tolerance for distributed applications deployed in cloud computing environments. The LLFT middleware integrates a Low Latency Messaging Protocol that provides a totally ordered message delivery service, a Leader-Determined Membership Protocol which ensures that replica groups have a coherent view of their membership, and a Virtual Determinizer Framework which ensures proper ordering of information at the primary replica.
Da-wei Sun et al. [19] proposed a Dynamic Adaptive Fault Tolerance strategy (DAFT). The authors formulated a mathematical model that describes the relationship between system availability and the number of replicas, and also proposed a Dynamic Data Replication (D2R) algorithm, which was shown to increase system availability while minimizing the bandwidth consumed in the cloud.
Ravi Jawahar et al. [20] proposed an approach which uses a conceptual framework called the Fault Tolerance Manager. This framework treats fault tolerance mechanisms as independent modules, validating the fault tolerance properties of these mechanisms and matching users' requirements against the available mechanisms to obtain an all-inclusive solution with the desired properties. The approach requires the addition of a service layer which offers the required properties and provides fault tolerance support to applications by exploiting the virtualization layer.
Anju Bala et al. [21] proposed an intelligent failure prediction model which makes use of machine learning approaches. The model facilitates proactive fault tolerance in the execution of scientific workflows by intelligently predicting task failures, proactively analyzing the workflow data by means of machine learning approaches such as logistic regression, ANN, naïve Bayes and random forest. The model is composed of two modules: the first uses machine learning approaches to predict task failures, and in the second the failures are located after the workflow has been executed in the cloud test bed.
Sagar C. Joshi et al. [22] proposed a fault tolerance mechanism for virtual data center architectures, which requires migrating virtual machines hosted on a failed server to a new location. Furthermore, a load balancing scheme based on clustering was proposed for efficient allocation of VDCs in the data center.
Alain Tchana et al. [23] proposed a fault tolerance approach built on two visions of cloud fault tolerance management: in the first, fault tolerance is managed exclusively by either the cloud customer or the provider; in the second, responsibility is shared between the cloud provider and the customer. The paper also reviewed possible failure situations in the cloud (application, virtual machine and hardware failures). According to the authors, application failures can only be detected and fixed at the customer level, while VM and hardware failures can be detected and repaired at either the customer or the provider level.
K. Ganga et al. [24] presented different fault tolerance techniques in cloud computing and explained how task replication techniques are applied to scientific workflows in the cloud.
J. Al-Jaroodi et al. [25] put forward a delay-tolerant fault tolerance algorithm (An Efficient Fault Tolerance Algorithm) for distributed cloud services. The algorithm provides fault tolerance and load balancing by reducing execution time and minimizing fault discovery and recovery overhead in clouds handling distributed tasks. According to the paper, the algorithm is intended for clouds that handle distributed tasks, where data is downloaded from replicated servers and executed in parallel on multiple independent servers.
Y. Zhang et al. [26] proposed a Byzantine Fault Tolerant framework (BFTCloud) for voluntary-resource cloud computing that uses replication techniques. Voluntary node selection depends on QoS characteristics and reliability performance. The results obtained by the authors from extensive experiments show that BFTCloud is effective in guaranteeing robustness and system reliability in cloud environments.
Gang Chen et al. [5] presented SHelp, a lightweight runtime system that is capable of surviving software faults when running server applications in different virtual machines hosted on one physical machine. SHelp uses checkpoint/restart as its fault tolerance technique. Furthermore, to make the system more effective and efficient, the authors introduced two new techniques: weighted rescue points and a two-level rescue point database. Weighted rescue points enable the system to recover from a fault by rolling back to the latest checkpoint using weight values. With the two-level rescue point database, applications in different virtual machines share fault-related information to enable quick recovery from similar faults encountered in the future.
G. Fan et al. [27] proposed a model-based Byzantine fault detection technique. The system architecture of the technique is built on two concepts: Petri nets, which serve as the underlying formalism, and the Cloud computing Fault Net (CFN), which is used to model the different components of cloud computing. A fault detection strategy was also proposed for cloud applications; this strategy detects faults at runtime.
P. Das et al. [28] put forward a Virtualization and Fault Tolerance (VFT) technique. VFT reduces service time and increases system availability. It consists of a Cloud Manager (CM) module and a Decision Maker (DM) module, which are used to handle faults and manage load balancing and virtualization. VFT is a reactive fault tolerance approach, and the basic mechanism used to achieve fault tolerance in VFT is restart.
Fault tolerance provides availability and reliability in the cloud environment. As more research is conducted on different ways to improve fault tolerance in the cloud, various fault tolerance models and approaches are being proposed. This paper provides a survey of the available fault tolerance policies, alongside the techniques associated with these policies. Tools for implementing these techniques were also discussed. Finally, existing fault tolerance approaches were discussed, analyzed and compared against each other. The paper follows a well-organized structure and contains in-depth information that should be useful for researchers looking to propose a fault tolerance approach.