Checkpoint and rollback in asynchronous distributed systems
- 22 November 2002
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- Vol. 3 (0743166X), 998-1005
- https://doi.org/10.1109/infcom.1997.631114
Abstract
This paper proposes a novel algorithm for taking checkpoints and rolling back processes for recovery in asynchronous distributed systems. The algorithm has the following properties: (1) multiple processes can simultaneously initiate checkpointing; (2) no additional messages are transmitted for taking checkpoints; (3) the set of local checkpoints taken by multiple processes denotes a consistent global state; (4) multiple processes can simultaneously initiate rollback recovery; (5) the minimum number of processes is rolled back; and (6) each process is rolled back asynchronously. The number of messages for rolling back the processes is O(l), where l is the number of channels. The algorithm therefore keeps the system highly available.
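Property (3) relies on the standard notion of a consistent global state: no message may be recorded as received by one checkpoint unless its sending is recorded by the sender's checkpoint. The following is a minimal Python sketch of that check only, not the paper's algorithm. The `Message` type, the per-process event indices, and the function names are hypothetical and introduced here purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # Hypothetical record of one message; all field names are assumptions.
    sender: str
    receiver: str
    send_seq: int  # sender's local event index at which the message was sent
    recv_seq: int  # receiver's local event index at which it was received


def orphan_messages(checkpoints, messages):
    """Return messages whose receipt is recorded but whose sending is not.

    `checkpoints` maps each process name to the index of the last local
    event included in that process's checkpoint.
    """
    return [
        m
        for m in messages
        if m.recv_seq <= checkpoints[m.receiver]
        and m.send_seq > checkpoints[m.sender]
    ]


def is_consistent(checkpoints, messages):
    """A set of local checkpoints is consistent if it has no orphan messages."""
    return not orphan_messages(checkpoints, messages)


# One message: p1 sends at its event 3, p2 receives at its event 2.
history = [Message("p1", "p2", send_seq=3, recv_seq=2)]

# p2's checkpoint records the receipt, but p1's does not record the send.
inconsistent_cut = {"p1": 2, "p2": 2}

# p1's checkpoint now records the send as well.
consistent_cut = {"p1": 3, "p2": 2}
```

With these inputs, `is_consistent(inconsistent_cut, history)` is false because the message is an orphan, while `is_consistent(consistent_cut, history)` is true. A message that has been sent but not yet received does not violate consistency; it is simply in transit across the cut.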
This publication has 11 references indexed in Scilit:
- Independent checkpointing and concurrent rollback for recovery in distributed systems-an optimistic approach. Published by Institute of Electrical and Electronics Engineers (IEEE), 2003
- An efficient protocol for checkpointing recovery in distributed systems. IEEE Transactions on Parallel and Distributed Systems, 1993
- Rollback recovery in distributed systems using loosely synchronized clocks. IEEE Transactions on Parallel and Distributed Systems, 1992
- Efficient algorithms for crash recovery in distributed systems. Published by Springer Nature, 1990
- Optimal checkpointing and local recording for domino-free rollback recovery. Information Processing Letters, 1987
- Checkpointing and Rollback-Recovery for Distributed Systems. IEEE Transactions on Software Engineering, 1987
- Distributed snapshots. ACM Transactions on Computer Systems, 1985
- Termination detection for diffusing computations. Information Processing Letters, 1980
- Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 1978
- System structure for software fault tolerance. IEEE Transactions on Software Engineering, 1975