2 Logging and Checkpointing
Checkpointing and logging are the most essential techniques for achieving dependability in distributed systems [7]. By themselves, they provide a form of fault tolerance that is relatively easy to implement and incurs low runtime overhead. Although some information may be lost when a fault occurs (if only checkpointing is used), and the recovery time after a fault is typically longer than that of more sophisticated fault tolerance approaches, this level of protection is sufficient for many applications. Furthermore, checkpointing and logging serve as building blocks for dependability mechanisms at all levels.
A checkpoint of a distributed system refers to a copy of the system state [7]. If the checkpoint is available after the system fails, it can be used to restore the system to the state it was in when the checkpoint was taken. Checkpointing refers to the action of taking a copy of the system state (typically periodically) and saving it to stable storage that can survive the faults to be tolerated.
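As a concrete illustration, the following Python sketch shows one way a single process might save and restore a checkpoint of its state on stable storage. The file names and the use of pickle are illustrative assumptions rather than the interface of any particular checkpointing library; the write-then-rename pattern ensures that a crash in the middle of checkpointing never corrupts the previously saved checkpoint.

import os
import pickle
import tempfile

def save_checkpoint(state, path):
    # Write the state to a temporary file first, then atomically rename it,
    # so an incomplete checkpoint never overwrites the last good one.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())      # force the data onto stable storage
    os.replace(tmp_path, path)    # atomic rename on POSIX file systems

def load_checkpoint(path):
    # Restore the most recent checkpoint after a failure.
    with open(path, "rb") as f:
        return pickle.load(f)

if __name__ == "__main__":
    state = {"balance": 500, "processed_requests": 42}   # illustrative state
    save_checkpoint(state, "process0.ckpt")
    # ... the process fails and is restarted ...
    recovered = load_checkpoint("process0.ckpt")
    assert recovered == state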
To recover the system to the point right before it fails, additional recovery information must be logged on top of periodic checkpointing. Typically, all incoming messages to the system are logged. Other nondeterministic events may have to be logged as well to ensure proper recovery.
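The sketch below extends the previous idea with message logging, under the assumption that message handling is deterministic: every incoming message is appended to a log on stable storage before it is processed, and recovery consists of restoring the latest checkpoint and replaying the messages logged after that checkpoint. The class and field names are hypothetical.

import json

class LoggedProcess:
    # A process that logs every incoming message before handling it.
    # Recovery = restore the latest checkpoint, then replay, in order,
    # the messages logged after that checkpoint was taken.

    def __init__(self, log_path):
        self.log_path = log_path
        self.state = {"count": 0}

    def handle(self, msg):
        # Deterministic state update driven solely by the message content.
        self.state["count"] += msg["amount"]

    def on_message(self, msg):
        # Log first (synchronously, to stable storage), then process.
        with open(self.log_path, "a") as log:
            log.write(json.dumps(msg) + "\n")
            log.flush()
        self.handle(msg)

    def recover(self, checkpoint_state, messages_logged_after_checkpoint):
        # Replay logged messages on top of the checkpointed state;
        # do not re-log them during replay.
        self.state = dict(checkpoint_state)
        for msg in messages_logged_after_checkpoint:
            self.handle(msg)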
Checkpointing and logging provide a form of rollback recovery [7] because they can recover the system to a state prior to the failure. In contrast, there exist other approaches that accomplish roll-forward recovery, that is, a failed process can be recovered to the current state by incorporating process redundancy into the system. However, roll-forward recovery protocols typically incur significantly higher runtime overhead and demand more physical resources.
2.1 System Model
In this section, we define the system model used by the checkpointing and logging algorithms introduced in this chapter. The algorithms are executed in a distributed system that consists of N processes. Processes within the system interact with each other by sending and receiving messages. These processes may also interact with the outside world through message exchanges. An input message to the distributed system from the outside world is often a request message sent by a user of the system, and the corresponding output message from the system is the response message. An example distributed system consisting of four processes is shown in Figure 2.1.
Figure 2.1 An example distributed system.
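A toy model of such a system, useful only for illustration, can be built with threads and in-memory queues standing in for the processes and the network channels; in a real deployment the N processes would run on different nodes and exchange messages over the network. All names below are hypothetical.

import queue
import threading

N = 4
inboxes = [queue.Queue() for _ in range(N)]   # one inbox per process

def send(dest, msg):
    inboxes[dest].put(msg)

def process(pid):
    # P0 sits at the system boundary: it accepts the "request" from the
    # outside world and forwards work to the other processes.
    if pid == 0:
        for peer in range(1, N):
            send(peer, {"from": 0, "body": "request"})
    else:
        msg = inboxes[pid].get()
        print(f"P{pid} received {msg['body']} from P{msg['from']}")

threads = [threading.Thread(target=process, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()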
2.1.1 Fault Model
In such a distributed system, a failure could occur at any process. However, it is assumed that when a process fails, it simply stops execution and loses all of its volatile state (i.e., the fail-stop model [18] is used). In addition, it is assumed that any two processes can establish a reliable connection (such as a TCP connection) for communication. Even though the network may lose messages, the reliable channel effectively masks such losses. Naturally, the reliable connection also ensures the first-in, first-out (FIFO) property between its two endpoints. This assumption further implies that the network does not partition, i.e., it never prevents two or more processes in the system from interacting with each other for an extended period of time.
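For illustration, a reliable FIFO channel can be obtained by layering simple message framing on top of a TCP connection: TCP already masks packet losses and preserves byte ordering between the two endpoints, and the 4-byte length prefix used below (an illustrative choice, not a standard protocol) merely restores message boundaries. A closed connection is interpreted as a fail-stop failure of the peer.

import socket
import struct

def send_msg(sock: socket.socket, payload: bytes) -> None:
    # Prefix each message with its length so the receiver knows where it ends.
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exactly(sock: socket.socket, n: int) -> bytes:
    # Read exactly n bytes; TCP delivers them reliably and in order.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer stopped (fail-stop) or closed the connection")
        buf += chunk
    return buf

def recv_msg(sock: socket.socket) -> bytes:
    (length,) = struct.unpack("!I", recv_exactly(sock, 4))
    return recv_exactly(sock, length)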
2.1.2 Process State and Global State
The state of an individual process is defined by its entire address space in the operating system. A generic checkpointing library (such as Condor [23]) normally saves the entire address space as the checkpoint of the process. Of course, not everything in the address space is relevant according to the application semantics. Hence, the checkpoint of a process can potentially be made much smaller by exploiting application semantics.
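The following sketch, built around a hypothetical service, shows the difference: a generic checkpointing library would save the entire address space, including caches and scratch buffers, whereas an application-level checkpoint saves only the state that recovery actually needs.

import pickle

class OrderService:
    # A process whose full address space may be large (caches, buffers),
    # but whose semantically relevant state is only the order table.

    def __init__(self):
        self.orders = {}                                   # must survive a failure
        self.cache = {}                                    # derived data, can be rebuilt
        self.scratch_buffer = bytearray(10 * 1024 * 1024)  # not worth saving

    def checkpoint(self) -> bytes:
        # Application-specific checkpoint: only what recovery actually needs.
        return pickle.dumps(self.orders)

    def restore(self, blob: bytes) -> None:
        self.orders = pickle.loads(blob)
        self.cache = {}   # rebuilt lazily after recovery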
The state of a distributed system is usually referred to as the global state of the system [5]. It is not a simple aggregation of the states of the processes in the distributed system because the processes exchange messages with each other, which means that a process may causally depend on some other processes. Such dependency must be preserved in a global state. Assume that each process in the distributed system takes checkpoints periodically. We may not be able to use the latest set of checkpoints for proper recovery should the processes fail, unless the checkpointing at different processes is coordinated [5]. To see why, consider the three scenarios illustrated in Figure 2.2, where the global state is constructed by using the three checkpoints, C0, C1, and C2, taken at processes P0, P1, and P2, respectively.
Figure 2.2(a) shows a scenario in which the checkpoints taken by different processes are incompatible, and hence cannot be used to recover the system upon a failure. Let us see why. In this scenario, P0 sends a message m0 to P1, and P1 subsequently sends a message m1 to P2. Therefore, the state of P2 after it has received m1 potentially depends on the state of P1, and the state of P1 may depend on that of P0 once P1 receives m0. The checkpoint C0 is taken before P0 sends the message m0 to P1, whereas the checkpoint C1 is taken after P1 has received m0. The checkpoints are not compatible because C1 reflects the receipt of m0 while C0 does not reflect the sending of m0, that is, the dependency is broken. Similarly, C2 reflects the receipt of m1 while C1 does not reflect the sending of m1.
Figure 2.2 Consistent and inconsistent global state examples.
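The notion of a consistent global state can be made concrete with a small sketch. Assuming, purely for illustration, that each checkpoint records the identifiers of the messages the process has sent and received so far, a set of checkpoints is consistent only if no message is recorded as received without also being recorded as sent (i.e., there are no orphan messages).

def is_consistent(checkpoints):
    # Each checkpoint is a dict with two sets of message ids: 'sent' and 'received'.
    # The cut is consistent if every received message was also recorded as sent.
    all_sent = set()
    all_received = set()
    for ckpt in checkpoints.values():
        all_sent |= ckpt["sent"]
        all_received |= ckpt["received"]
    orphans = all_received - all_sent
    return len(orphans) == 0, orphans

# The scenario of Figure 2.2(a): C0 does not record sending m0, but C1
# records receiving it; likewise C1 does not record sending m1, but C2
# records receiving it.
cuts = {
    "C0": {"sent": set(), "received": set()},
    "C1": {"sent": set(), "received": {"m0"}},
    "C2": {"sent": set(), "received": {"m1"}},
}
ok, orphans = is_consistent(cuts)
print(ok, orphans)   # False, {'m0', 'm1'} -> the cut is inconsistent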
EXAMPLE 2.1
To understand the problem better, consider the following example. Assume that P0 and P1 represent two bank accounts, A and B, respectively. The purpose of m0 is to deposit $100 to account B after P0 has debited account A. P0 takes a checkpoint C0 before the debit operation, and P1 takes a checkpoint C1 after it has received and processed the deposit request (i.e., m0), as illustrated in Figure 2.2(a). If P0 crashes after sending the deposit request (m0), and P1 crashes after taking the checkpoint C1, then upon recovery, P1’s state would reflect a deposit of $100 (from account A) while P0’s state would not reflect the corresponding debit operation. Consequently, $100 would appear