From Traditional Fault Tolerance to Blockchain - Wenbing Zhao - Page 48

EXAMPLE 2.3

Figure 2.6 Normal operation of the Tamir and Sequin checkpointing protocol in an example three-process distributed system.

To see how the checkpointing protocol works, consider the example shown in Figure 2.6. In this example, the distributed system consists of three fully connected processes, i.e., P₀ has a connection with P₁, P₁ has a connection with P₂, and P₂ has a connection with P₀. Therefore, each process has two incoming channels and two outgoing channels connected to its two neighbors.

Assume process P₀ is the checkpointing coordinator. It initiates the global checkpointing by sending a CHECKPOINT message to P₁ and P₂, respectively, along its two outgoing channels. In the meantime, P₁ sends a regular message m₀ to P₀, and P₂ sends a regular message m₁ to P₁.

Upon receiving the CHECKPOINT message from P₀, P₁ stops normal execution and sends a CHECKPOINT message along each of its outgoing channels to P₀ and P₂, respectively. Similarly, P₂ sends the CHECKPOINT message to P₀ and P₁, respectively, once it receives the first CHECKPOINT message.

Due to the FIFO property of the connections, P₀ receives m₀ before it has collected the CHECKPOINT messages from all its incoming channels, and P₁ receives m₁ before it receives the CHECKPOINT message from P₂. According to the protocol rule, such regular messages are logged instead of delivered because normal execution must be stopped once the global checkpointing is initiated. These logged messages will be appended to the local checkpoint once it is taken. In fact, such messages reflect the channel states of the distributed system. These messages won’t be delivered for execution until the process resumes normal execution.
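The logging rule can be illustrated with a minimal Python sketch of a single process. The class name `Process`, its fields, and the use of the delivered-message list as a stand-in for application state are all assumptions made for illustration; they are not taken from the book. The sketch also leaves out the forwarding of CHECKPOINT messages on outgoing channels and covers only the receive side.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Process:
    """Hypothetical receive-side model of one process in the example."""
    name: str
    incoming: frozenset                 # neighbors with a channel into this process
    stopped: bool = False               # True once normal execution is stopped
    markers_seen: set = field(default_factory=set)
    delivered: list = field(default_factory=list)    # messages delivered for execution
    channel_log: list = field(default_factory=list)  # messages logged while stopped
    checkpoint: Optional[dict] = None

    def receive(self, sender: str, msg: str) -> None:
        if msg == "CHECKPOINT":
            # The first CHECKPOINT message stops normal execution.
            self.stopped = True
            self.markers_seen.add(sender)
            if self.markers_seen == self.incoming:
                # CHECKPOINT seen on every incoming channel: take the local
                # checkpoint and append the message log to it.
                self.checkpoint = {
                    "state": list(self.delivered),   # stand-in for process state
                    "channel_log": list(self.channel_log),
                }
        elif self.stopped:
            # Regular message during checkpointing: log it, do not deliver it.
            self.channel_log.append((sender, msg))
        else:
            self.delivered.append((sender, msg))


# P1's view of the example: CHECKPOINT from P0, then m1, then CHECKPOINT from P2.
p1 = Process("P1", frozenset({"P0", "P2"}))
p1.receive("P0", "CHECKPOINT")
p1.receive("P2", "m1")
p1.receive("P2", "CHECKPOINT")
```

In this sketch, m₁ ends up in the `channel_log` portion of P₁'s checkpoint and is not in `delivered`, which mirrors the statement that logged messages capture the channel state.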

When P₀ receives the CHECKPOINT messages from P₁ and P₂, it takes a local checkpoint, C₀,₀, and appends the message log to the checkpoint. Similarly, P₁ takes a local checkpoint when it receives the CHECKPOINT messages from P₀ and P₂, and P₂ takes a local checkpoint when it receives the CHECKPOINT messages from P₀ and P₁.

Subsequently, P₁ and P₂ send their SAVED messages to P₀, i.e., the global checkpointing coordinator. P₀ then informs P₁ and P₂ to resume normal execution with a RESUME message to each of them.
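The whole round described in this example can be replayed with a small single-threaded simulation. This is only a sketch under stated assumptions: the function `run`, the dictionary names, and the fixed round-robin delivery order are hypothetical choices made for illustration, each directed channel is modeled as a FIFO queue, each checkpoint records only the message log (process state is omitted), and only one checkpointing round with P₀ as coordinator is covered.

```python
from collections import deque

PROCS = ["P0", "P1", "P2"]
COORD = "P0"


def run():
    # One FIFO queue per directed channel (sender, receiver).
    chan = {(s, r): deque() for s in PROCS for r in PROCS if s != r}
    stopped = {p: False for p in PROCS}
    markers = {p: set() for p in PROCS}   # senders whose CHECKPOINT has arrived
    log = {p: [] for p in PROCS}          # regular messages logged while stopped
    delivered = {p: [] for p in PROCS}    # messages delivered for execution
    checkpoint = {}                       # process -> message log saved with checkpoint
    saved = set()                         # processes whose SAVED reached the coordinator

    def send_checkpoint(p):
        stopped[p] = True                 # stop normal execution
        for q in PROCS:
            if q != p:
                chan[(p, q)].append("CHECKPOINT")

    def resume(p):
        stopped[p] = False
        delivered[p].extend(log[p])       # logged messages are delivered only now

    def try_resume():
        # Coordinator needs its own checkpoint plus SAVED from all others.
        if COORD in checkpoint and saved == set(PROCS) - {COORD}:
            for q in PROCS:
                if q != COORD:
                    chan[(COORD, q)].append("RESUME")
            resume(COORD)

    def deliver(s, r, msg):
        if msg == "CHECKPOINT":
            if not stopped[r]:
                send_checkpoint(r)        # first CHECKPOINT: stop and forward
            markers[r].add(s)
            if len(markers[r]) == len(PROCS) - 1:
                checkpoint[r] = list(log[r])
                if r == COORD:
                    try_resume()
                else:
                    chan[(r, COORD)].append("SAVED")
        elif msg == "SAVED":
            saved.add(s)
            try_resume()
        elif msg == "RESUME":
            resume(r)
        elif stopped[r]:
            log[r].append(msg)
        else:
            delivered[r].append(msg)

    send_checkpoint(COORD)                # P0 initiates the global checkpointing
    chan[("P1", "P0")].append("m0")       # sent before P1 sees any CHECKPOINT
    chan[("P2", "P1")].append("m1")       # sent before P2 sees any CHECKPOINT

    while any(chan.values()):             # deliver one message per channel per pass
        for (s, r) in sorted(chan):
            if chan[(s, r)]:
                deliver(s, r, chan[(s, r)].popleft())
    return checkpoint, delivered


checkpoint, delivered = run()
```

Under this delivery order, P₀'s checkpoint carries m₀ in its log, P₁'s carries m₁, and P₂'s log is empty; m₀ and m₁ are delivered only after the corresponding process resumes. Note that `try_resume` is invoked both when the coordinator takes its own checkpoint and when a SAVED message arrives, because a SAVED message from one neighbor can reach P₀ before the CHECKPOINT message from the other.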

In a more complicated distributed system, where some processes do not have a direct connection with the coordinator, some of the coordinator’s neighbors must relay the SAVED notification to the coordinator.
