Читать книгу From Traditional Fault Tolerance to Blockchain - Wenbing Zhao - Страница 31
1.2.4 Fault Tolerance
ОглавлениеRobust software itself is normally insufficient to delivery high dependability because of the possibility of hardware failures. Unless a distributed system is strictly stateless, simply restarting the system after a failure would not automatically restore its state to what it had before the failure. Hence, fault tolerance techniques are essential to improve the dependability of distributed systems to the next level.
There are different fault tolerance techniques that can be used to cater to different levels of dependability requirements. For applications that need high availability, but not necessarily high reliability, logging and checkpointing (which is the topic of Chapter 2), which incurs minimum runtime overhead and uses minimum extra resources, might be sufficient. More demanding applications could adopt the recovery oriented computing techniques (which is the topic of Chapter 3). Both types of fault tolerance techniques rely on rollback recovery. After restarting a failed system, the most recent correct state (referred to as a checkpoint) of the system is located in the log and the system is restored to this correct state.
An example scenario of rollback recovery is illustrated in Figure 1.2. When a system fails, it takes some time to detect the failure. Subsequently, the system is restarted and the most recent checkpoint in the log is used to recover the system back to that checkpoint. If there are logged requests, these requests are reexecuted by the system, after which the recovery is completed. The system then resumes handling new requests.
For a distributed system that requires high reliability, i.e., continuous correct services, redundant instances of the system must be used so that the system can continue operating correctly even if a portion of redundant copies (referred to as replicas) fail. Using redundant instances (referred to as replicas) also makes it possible to tolerate malicious faults provided that the replicas fail independently. When the failed replica is repaired, it can be incorporated back into the system by rolling its state forward to the current state of other replicas. This recovery strategy is called rollforward recovery.
Figure 1.2 The rollback recovery is enabled by periodically taking checkpoints and usually logging of the requests received.
An example scenario of rollforward recovery is shown in Figure 1.3. When the failure of the replica is detected and the replica is restarted (possibly after being repaired). To readmit the restarted replica into the system, a nonfaulty replica takes a checkpoint of its state and transfer the checkpoint to the recovering replica. The restarted replica can rollforward its state using the received checkpoint, which represents the latest state of the system.
Figure 1.3 With redundant instances in the system, the failure of a replica in some cases can be masked and the system continue providing services to its clients without any disruption.
To avoid common mode failures (i.e., correlated faults), it helps if each replica could execute a different version of the system code. This strategy is referred to as n-version programming [1]. Program transformation may also be used to achieve diversified replicas with lower software development cost [4]. A special form of n-version programming appears in the recovery block method for software fault tolerance [8]. Instead of using different versions of the software in different replicas, each module of the system is equipped with a main version and one or more alternate versions. At the end of the execution of the main version, an acceptance test is evaluated. If the testing fails, the first alternate version is executed and the acceptance test is evaluated again. This goes on until all alternate versions have been exhausted, in which case, the module returns an error.