From Traditional Fault Tolerance to Blockchain, by Wenbing Zhao

2
Logging and Checkpointing

Checkpointing and logging are the most essential techniques for achieving dependability in distributed systems [7]. By themselves, they provide a form of fault tolerance that is relatively easy to implement and incurs low runtime overhead. Some information may be lost when a fault occurs (if only checkpointing is used), and the recovery time after a fault is typically longer than that of more sophisticated fault tolerance approaches, but this level of protection may be sufficient for many applications. Furthermore, checkpointing and logging are used at all levels of dependability mechanisms.

A checkpoint of a distributed system refers to a copy of the system state [7]. If the checkpoint is available after the system fails, it can be used to recover the system to the state it was in when the checkpoint was taken. Checkpointing refers to the action of taking a copy of the system state (periodically) and saving that copy to stable storage that can survive the faults being tolerated.
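As a rough illustration of taking and restoring a checkpoint, the following Python sketch saves a copy of a single process's state to a file and reads it back. The function names `save_checkpoint` and `load_checkpoint`, the use of `pickle` for serialization, and the use of a local file as a stand-in for stable storage are all assumptions made for this example and are not taken from the text. Writing to a temporary file and renaming it is one common way to avoid leaving a half-written checkpoint behind if a fault occurs during the write.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Save a copy of `state` to `path` (hypothetical stand-in for stable storage).

    The data is written to a temporary file first and then renamed over the
    target, so a crash in the middle of the write leaves the previous
    checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the bytes to the device
        os.replace(tmp_path, path)  # atomic rename within the same directory
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the state stored in the most recent checkpoint at `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

After a failure, a restarted process would call `load_checkpoint` and resume from the returned state; anything that happened after the last call to `save_checkpoint` is lost unless it was recorded separately.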

To recover the system to the point right before it fails, other recovery information must be logged in addition to periodic checkpointing. Typically, all incoming messages to the system are logged. Other nondeterministic events may have to be logged as well to ensure proper recovery.
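The combination of a checkpoint and a message log can be sketched as follows. This is a minimal, single-process illustration under several assumptions that are not in the text: the class name `LoggedProcess` and its methods are hypothetical, the state is a running total, message handling is deterministic, and the checkpoint and log are kept in memory as stand-ins for stable storage.

```python
import copy


class LoggedProcess:
    """Toy process that checkpoints its state and logs incoming messages."""

    def __init__(self):
        self.state = {"total": 0}
        # In-memory stand-ins for stable storage (an assumption of this sketch).
        self._checkpoint = copy.deepcopy(self.state)
        self._log = []  # messages received since the last checkpoint

    def handle(self, message):
        # Log the message before applying it, so it can be replayed later.
        self._log.append(message)
        self._apply(message)

    def _apply(self, message):
        # Deterministic state transition: the same messages in the same
        # order always produce the same state.
        self.state["total"] += message

    def take_checkpoint(self):
        # Messages already reflected in the checkpoint no longer need replaying.
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()

    def recover(self):
        # Roll back to the checkpoint, then replay the logged messages
        # to reach the state right before the failure.
        self.state = copy.deepcopy(self._checkpoint)
        for message in self._log:
            self._apply(message)
```

For example, if a process handles the message `5`, takes a checkpoint, handles `7` and `1`, and then loses its in-memory state, calling `recover()` restores the checkpointed total and replays the two logged messages. Restoring the checkpoint alone would have lost the effect of the last two messages.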

Checkpointing and logging provide a form of rollback recovery [7] because they can recover the system to a state prior to the failure. In contrast, other approaches accomplish roll-forward recovery: by incorporating process redundancy into the system, a failed process can be recovered to the current state. However, roll-forward recovery protocols typically incur significantly higher runtime overhead and demand more physical resources.
