From Traditional Fault Tolerance to Blockchain - Wenbing Zhao
2.2.2.1 Protocol Description.
The finite state machine specifications for the coordinator and the participant are provided in Figure 2.4 and Figure 2.5, respectively. Note that in the specification for the coordinator in Figure 2.4, the normal state is shown twice for clarity: once at the beginning (as ‘init’) and once at the end.
Figure 2.4 Finite state machine specification for the coordinator in the Tamir and Sequin checkpointing protocol.
Figure 2.5 Finite state machine specification for the participant in the Tamir and Sequin checkpointing protocol.
A more detailed explanation of the protocol rules for the coordinator and the participant is given below. In the description of the protocol, the messages exchanged between the processes in between two rounds of global checkpointing are referred to as regular messages (and the corresponding execution is termed normal execution), to differentiate them from the set of control messages introduced by the protocol for the purpose of coordination:
– CHECKPOINT message. It is used to initiate a global checkpoint. It is also used to establish a quiescent point of the distributed system where all processes have stopped normal execution.
– SAVED message. It is used by a participant to inform the coordinator that it has taken a local checkpoint.
– FAULT message. It is used to indicate that a timeout has occurred and the current round of global checkpointing should be aborted.
– RESUME message. It is used by the coordinator to inform the participants that they can now resume normal execution.
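The message vocabulary above can be written down as a small data type. The following is only a sketch: the names `Kind`, `Message`, `CONTROL_KINDS`, and `is_control` are hypothetical and do not come from the book, and the `sender` and `payload` fields are assumptions about what a message would carry.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Kind(Enum):
    REGULAR = auto()      # application traffic between two checkpointing rounds
    CHECKPOINT = auto()   # initiates a round and marks the quiescent point
    SAVED = auto()        # a participant has taken its local checkpoint
    FAULT = auto()        # a timeout occurred; the current round is aborted
    RESUME = auto()       # the coordinator lets participants resume execution


# The four control messages introduced by the protocol.
CONTROL_KINDS = frozenset({Kind.CHECKPOINT, Kind.SAVED, Kind.FAULT, Kind.RESUME})


@dataclass(frozen=True)
class Message:
    kind: Kind
    sender: str             # hypothetical field: name of the sending process
    payload: object = None  # hypothetical field: only used by regular messages


def is_control(msg: Message) -> bool:
    """Return True for protocol control messages, False for regular ones."""
    return msg.kind in CONTROL_KINDS
```

Keeping regular and control messages in one type makes it easy for a receiver to decide whether a message should be delivered to the application, logged, or handled by the checkpointing logic.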
Rule for the coordinator:
◾ At the beginning of the first phase, the coordinator stops its normal execution (including the sending of regular messages) and sends a CHECKPOINT message along each of its outgoing channels.
◾ The coordinator then waits for the corresponding CHECKPOINT message from all its incoming channels.
  – While waiting, the coordinator might receive regular messages. Such messages are logged and will be appended to the checkpoint of its state. A regular message can only arrive on an incoming channel from which the coordinator has not yet received the CHECKPOINT message.
  – The coordinator aborts the checkpointing round if it fails to receive the CHECKPOINT message from one or more incoming channels within a predefined time period.
◾ When the coordinator receives the CHECKPOINT message from all its incoming channels, it proceeds to take a checkpoint of its state.
◾ Then, the coordinator waits for a SAVED notification from every process (other than itself) in the distributed system. If it fails to receive the SAVED message from one or more incoming channels within a predefined time period, it aborts the checkpointing round by sending a FAULT message along each of its outgoing channels. Note that it is impossible for the coordinator to receive any regular message at this stage, because it has already received the CHECKPOINT message on every incoming channel and each sender stopped normal execution before sending it.
◾ When the coordinator receives the SAVED notification from all other processes, it switches to the new checkpoint, and sends a RESUME message along each of its outgoing channels.
◾ The coordinator then resumes normal execution.
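The coordinator rules above can be sketched as a small event-driven class. This is an illustration under stated assumptions, not the book's implementation: the class and method names, the `send(channel, kind)` callback, and the `app_state` attribute are all hypothetical. The sketch also assumes that a timeout in either waiting state is signalled with FAULT and that the coordinator returns to normal execution after an abort; the text only states the FAULT broadcast explicitly for the SAVED wait.

```python
import copy
from enum import Enum, auto


class State(Enum):
    NORMAL = auto()           # 'init' and final state in Figure 2.4
    WAIT_CHECKPOINT = auto()  # waiting for CHECKPOINT on every incoming channel
    WAIT_SAVED = auto()       # waiting for SAVED from every other process


class Coordinator:
    def __init__(self, incoming, outgoing, others, send):
        self.incoming = set(incoming)  # names of incoming channels
        self.outgoing = list(outgoing) # names of outgoing channels
        self.others = set(others)      # every process other than the coordinator
        self.send = send               # hypothetical callback: send(channel, kind)
        self.app_state = None          # application state to be checkpointed
        self.state = State.NORMAL
        self.pending_cp = set()
        self.pending_saved = set()
        self.logged = []
        self.tentative = None
        self.checkpoint = None         # last committed checkpoint

    def _broadcast(self, kind):
        for channel in self.outgoing:
            self.send(channel, kind)

    def start_round(self):
        # Stop normal execution and send CHECKPOINT on every outgoing channel.
        self.state = State.WAIT_CHECKPOINT
        self.pending_cp = set(self.incoming)
        self.pending_saved = set(self.others)
        self.logged = []
        self.tentative = None
        self._broadcast("CHECKPOINT")
        self._advance()

    def on_regular(self, channel, payload):
        # Log regular messages that arrive before that channel's CHECKPOINT.
        if self.state is State.WAIT_CHECKPOINT and channel in self.pending_cp:
            self.logged.append((channel, payload))
            return True
        return False

    def on_checkpoint(self, channel):
        if self.state is not State.NORMAL:
            self.pending_cp.discard(channel)
            self._advance()

    def on_saved(self, process):
        if self.state is not State.NORMAL:
            self.pending_saved.discard(process)
            self._advance()

    def on_timeout(self):
        # Assumption: abort with FAULT in either waiting state, keep old checkpoint.
        if self.state is not State.NORMAL:
            self._broadcast("FAULT")
            self.tentative = None
            self.state = State.NORMAL

    def _advance(self):
        if self.state is State.WAIT_CHECKPOINT and not self.pending_cp:
            # All CHECKPOINTs received: checkpoint state plus logged messages.
            self.tentative = (copy.deepcopy(self.app_state), list(self.logged))
            self.state = State.WAIT_SAVED
        if self.state is State.WAIT_SAVED and not self.pending_saved:
            # All SAVED received: switch to the new checkpoint and resume.
            self.checkpoint = self.tentative
            self._broadcast("RESUME")
            self.state = State.NORMAL
```

A caller would drive the object by invoking `start_round()` and then feeding it `on_regular`, `on_checkpoint`, `on_saved`, and `on_timeout` events as they occur; timers and real channels are deliberately left out of the sketch.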
Rule for the participant:
◾ Upon receiving a CHECKPOINT notification, the participant stops its normal execution and in turn sends a CHECKPOINT message along each of its outgoing channels.
◾ The participant then waits for the corresponding CHECKPOINT message from all its incoming channels.
  – While waiting, the participant might receive regular messages. Such messages are logged and will be appended to the checkpoint of its state. Again, a regular message can only arrive on an incoming channel from which the participant has not yet received the CHECKPOINT message.
  – The participant aborts the checkpointing round by sending a FAULT message along each of its outgoing channels if it fails to receive the CHECKPOINT message from one or more incoming channels within a predefined time period.
◾ Once the participant has collected the set of CHECKPOINT messages, it takes a checkpoint of its state.
◾ The participant then sends a SAVED message to its upstream neighbor (the process from which the participant received the first CHECKPOINT message), and waits for a RESUME message.
◾ Upon receiving a SAVED message (from one of its downstream neighbors), it relays the message to its upstream neighbor.
◾ When it receives a RESUME message, it propagates the message along all its outgoing channels except the one that connects to the process that sent it the message. The participant then resumes normal execution.
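The participant rules above can be sketched in the same event-driven style. Again this is a hedged illustration: the class, its method names, and the `send(dest, kind, origin)` callback are hypothetical, channels are identified by the name of the neighboring process, and the sketch assumes that the participant commits its tentative checkpoint on RESUME and returns to normal execution after a timeout-triggered abort. Handling of a FAULT message received from another process is not shown.

```python
import copy


class Participant:
    def __init__(self, name, incoming, outgoing, send):
        self.name = name
        self.incoming = set(incoming)  # neighbors this process receives from
        self.outgoing = list(outgoing) # neighbors this process sends to
        self.send = send               # hypothetical callback: send(dest, kind, origin)
        self.app_state = None          # application state to be checkpointed
        self.running = True            # True while in normal execution
        self.upstream = None           # sender of the first CHECKPOINT this round
        self.pending_cp = set()
        self.logged = []
        self.tentative = None
        self.checkpoint = None         # last committed checkpoint

    def _reset_round(self):
        self.upstream = None
        self.pending_cp = set()
        self.logged = []
        self.tentative = None
        self.running = True

    def on_checkpoint(self, src):
        if self.upstream is None:
            # First CHECKPOINT: stop normal execution and forward it everywhere.
            self.upstream = src
            self.running = False
            self.pending_cp = set(self.incoming)
            self.logged = []
            for dest in self.outgoing:
                self.send(dest, "CHECKPOINT", self.name)
        self.pending_cp.discard(src)
        if not self.pending_cp and self.tentative is None:
            # CHECKPOINT seen on every incoming channel: take the checkpoint.
            self.tentative = (copy.deepcopy(self.app_state), list(self.logged))
            self.send(self.upstream, "SAVED", self.name)

    def on_regular(self, src, payload):
        # Log regular messages that arrive before that channel's CHECKPOINT.
        if not self.running and src in self.pending_cp:
            self.logged.append((src, payload))
            return True
        return False

    def on_saved(self, src, origin):
        # Relay a downstream neighbor's SAVED toward the coordinator.
        if self.upstream is not None:
            self.send(self.upstream, "SAVED", origin)

    def on_resume(self, src):
        if self.upstream is None:
            return  # not in a round; ignore
        self.checkpoint = self.tentative  # assumption: commit on RESUME
        for dest in self.outgoing:
            if dest != src:
                self.send(dest, "RESUME", self.name)
        self._reset_round()

    def on_timeout(self):
        # Timed out waiting for CHECKPOINT messages: abort the round with FAULT.
        if not self.running and self.pending_cp:
            for dest in self.outgoing:
                self.send(dest, "FAULT", self.name)
            self._reset_round()
```

The `origin` argument lets a relayed SAVED message keep the identity of the process that actually took the checkpoint, which is what allows the coordinator to count one SAVED per process.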