Checkpoint and rollback in asynchronous distributed systems

22 November 2002

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

Vol. 3 (0743166X), 998-1005
https://doi.org/10.1109/infcom.1997.631114

Abstract

This paper proposes a novel algorithm for taking checkpoints and rolling back processes for recovery in asynchronous distributed systems. The algorithm has the following properties: (1) multiple processes can initiate checkpointing simultaneously; (2) no additional messages are transmitted for taking checkpoints; (3) the set of local checkpoints taken by multiple processes denotes a consistent global state; (4) multiple processes can initiate rollback recovery simultaneously; (5) the minimum number of processes is rolled back; and (6) each process is rolled back asynchronously. The number of messages needed to roll back the processes is O(l), where l is the number of channels. The algorithm therefore keeps the system highly available.
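Property (3) can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it only checks the standard consistency condition that a set of local checkpoints must contain no orphan message, that is, no message recorded as received whose send is not recorded. The names `Message`, `orphan_messages`, and `is_consistent`, and the use of per-process event counters as timestamps, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    send_time: int   # sender's local event counter at the send event
    receiver: str
    recv_time: int   # receiver's local event counter at the receive event


def orphan_messages(checkpoints, messages):
    """Return messages whose receipt is inside the receiver's checkpoint
    but whose send is after the sender's checkpoint.

    `checkpoints` maps each process name to the local event counter at
    which it took its checkpoint (hypothetical representation); events
    with a counter <= that value are part of the saved state.
    """
    return [
        m for m in messages
        if m.recv_time <= checkpoints[m.receiver]
        and m.send_time > checkpoints[m.sender]
    ]


def is_consistent(checkpoints, messages):
    """A set of local checkpoints is consistent if it has no orphan message."""
    return not orphan_messages(checkpoints, messages)


# Example log: p1 sends to p2, and later p2 sends to p1.
log = [
    Message("p1", 2, "p2", 3),
    Message("p2", 5, "p1", 6),
]

consistent_cut = {"p1": 4, "p2": 4}    # both sends and receives line up
inconsistent_cut = {"p1": 1, "p2": 4}  # p2 saved a receipt that p1 never saved sending
```

In the second cut, p2's checkpoint includes receiving the first message while p1's checkpoint precedes sending it, so restoring both states would leave p2 holding a message that was never sent. A rollback scheme has to move p2 back past that receipt, which is the kind of dependency that determines which processes are rolled back.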
