From Traditional Fault Tolerance to Blockchain - Wenbing Zhao

1.2.2 Fault Detection and Diagnosis

Fault detection is a crucial step in ensuring the dependability of a system. Crash faults are relatively easy to detect: for example, we can periodically probe each component to check on its health. If no response is received after several consecutive probes, the component may be declared as having crashed. However, components might fail in other ways, and a failed component might still respond promptly to each probe. Such faults are nontrivial to detect, especially in a large distributed system. Diagnosis is required to determine that a fault has indeed occurred and to localize its source (i.e., pinpoint the faulty component). To accomplish this, the distributed system is modeled, and sophisticated statistical tools are often used [3]. Some of the approaches to fault detection and diagnosis are introduced in Chapter 3.
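The probing rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `CrashDetector`, the `max_misses` parameter, and the default threshold of three are hypothetical, and the probe results are supplied as plain booleans rather than obtained from real network calls.

```python
class CrashDetector:
    """Suspects a crash after a run of consecutive unanswered probes.

    Hypothetical helper for illustration; the threshold is an assumption.
    """

    def __init__(self, max_misses: int = 3) -> None:
        self.max_misses = max_misses
        self.misses = 0  # consecutive probes that received no response

    def record(self, responded: bool) -> bool:
        """Record one probe result; return True once a crash is suspected."""
        # Any response resets the count, since only consecutive misses matter.
        self.misses = 0 if responded else self.misses + 1
        return self.misses >= self.max_misses


# Simulated probe outcomes: one late recovery, then three misses in a row.
detector = CrashDetector(max_misses=3)
probe_results = [True, False, False, True, False, False, False]
suspected = [detector.record(r) for r in probe_results]
```

Note that such a detector only catches components that stop responding. A component that has failed in some other way but still answers every probe would never be suspected, which is why the diagnosis techniques mentioned above are needed.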

Modern programming language design has made considerable progress in supporting some forms of software fault detection and handling, for conditions such as unexpected input or state. The most notable example is exception handling. A block of code can be enclosed in a try-catch construct. If an error condition occurs during the execution of the code, the catch block is executed automatically. Exceptions may also be propagated upward through the calling chain. If an exception occurs and it is not handled by any developer-supplied code, the language runtime usually terminates the process.
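As a small illustration of exception handling and upward propagation, consider the following Python sketch. Python spells the construct `try`/`except` rather than try-catch. The function names `parse_port` and `load_port`, and the default port value, are hypothetical and chosen only for this example.

```python
def parse_port(text: str) -> int:
    """Convert text to a TCP port number, raising ValueError on bad input."""
    port = int(text)  # int() raises ValueError for non-numeric text
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def load_port(text: str, default: int = 8080) -> int:
    """Call parse_port and fall back to a default if it raises."""
    try:
        return parse_port(text)
    except ValueError:
        # The exception propagated up from parse_port and is handled here.
        return default


valid = load_port("443")        # parsed normally
non_numeric = load_port("abc")  # int() fails, default is used
out_of_range = load_port("70000")  # range check fails, default is used
```

If `parse_port("abc")` were called directly at the top level with no enclosing handler, the `ValueError` would reach the interpreter, which would print a traceback and terminate the program.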

The recovery block method, which is designed for software fault tolerance [8], may be considered an extension of the programming language exception handling mechanism. An important step in recovery blocks is acceptance testing, which is a form of fault detection. The developer is expected to supply an acceptance test for each module of the system. When the acceptance test fails, a software fault is detected. Subsequently, an alternate block of code is executed, after which the acceptance test is evaluated again. Multiple alternate blocks of code may be provided to increase the robustness of the system.
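The control flow of a recovery block might look roughly like the following Python sketch. It is a simplified illustration, not the method as specified in [8]: the function `recovery_block`, the deliberately faulty `buggy_sort`, and the acceptance test are all hypothetical. Giving each block a fresh copy of the input is an assumption added here so that a failed attempt cannot corrupt the state seen by the next alternate.

```python
import copy


def recovery_block(state, blocks, acceptance_test):
    """Try the primary block, then each alternate, until one result is accepted."""
    for block in blocks:
        attempt_state = copy.deepcopy(state)  # assumed checkpoint of the input
        try:
            result = block(attempt_state)
        except Exception:
            continue  # a raised exception counts as a failed attempt
        if acceptance_test(result):
            return result
    raise RuntimeError("all blocks failed the acceptance test")


def buggy_sort(items):
    """Hypothetical faulty primary block: it drops the last element."""
    return items[:-1]


data = [3, 1, 2]


def accept(result):
    """Accept only an ordered result with as many elements as the input."""
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return len(result) == len(data) and in_order


# The primary fails the acceptance test, so the alternate (sorted) is used.
result = recovery_block(data, [buggy_sort, sorted], accept)
```

The quality of the scheme depends heavily on the acceptance test: a test that is too weak lets faulty results through, while one that is too strict rejects correct results from every block.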

