Consistent global checkpoints that contain a given set of local checkpoints
- 1 April 1997
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Computers
- Vol. 46 (4), 456-468
- https://doi.org/10.1109/12.588059
Abstract
In this paper, we consider the problem of constructing consistent global checkpoints that contain a given set of checkpoints. We address three important issues related to this problem. First, we define the maximum and minimum consistent global checkpoints containing a set S, and give algorithms to construct them. These algorithms are based on reachability analysis on a rollback-dependency graph. Second, we introduce a concept called "rollback-dependency trackability" that enables this analysis to be performed efficiently for a certain class of checkpoint and communication models. We define the least stringent of these models ("FDAS"), and put it in context with other models defined in the literature. Significant in this is a way to use FDAS to provide efficient rollback recovery for applications that do not satisfy perfect piecewise determinism. Finally, we describe several applications of the theorems and algorithms derived in this paper to demonstrate the capability of our approach to unify, generalize, and extend many previous works.
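To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the paper's rollback-dependency graph algorithm; it only illustrates what "a consistent global checkpoint containing a set S" means, using the standard no-orphan-message definition of consistency. The names `Message`, `is_consistent`, and `consistent_cuts_containing`, and the convention of tagging each send or receive with the index of the latest checkpoint taken before it, are assumptions made for this illustration.

```python
from itertools import product
from typing import NamedTuple


class Message(NamedTuple):
    sender: int
    send_ckpt: int   # index of the sender's latest checkpoint before the send
    receiver: int
    recv_ckpt: int   # index of the receiver's latest checkpoint before the receive


def is_consistent(cut, messages):
    """Return True if the cut has no orphan message.

    cut[p] is the checkpoint index chosen for process p. An event tagged k
    is recorded in checkpoint c exactly when k < c. A message is an orphan
    if its receive is recorded but its send is not.
    """
    return not any(
        m.recv_ckpt < cut[m.receiver] and m.send_ckpt >= cut[m.sender]
        for m in messages
    )


def consistent_cuts_containing(fixed, num_ckpts, messages):
    """Enumerate every consistent global checkpoint that contains `fixed`.

    fixed:     dict mapping process -> required checkpoint index (the set S)
    num_ckpts: num_ckpts[p] is the highest checkpoint index of process p
    Exhaustive search, so only suitable for tiny examples.
    """
    choices = [
        [fixed[p]] if p in fixed else range(num_ckpts[p] + 1)
        for p in range(len(num_ckpts))
    ]
    return [cut for cut in product(*choices) if is_consistent(cut, messages)]


# Hypothetical two-process history used only for illustration.
messages = [
    Message(sender=0, send_ckpt=0, receiver=1, recv_ckpt=0),
    Message(sender=1, send_ckpt=1, receiver=0, recv_ckpt=0),
]

# S = {checkpoint 1 of process 0}; process 1 is free to choose.
cuts = consistent_cuts_containing({0: 1}, [2, 3], messages)
if cuts:
    # Componentwise extremes over all consistent cuts that contain S.
    minimum = tuple(min(c[p] for c in cuts) for p in range(2))
    maximum = tuple(max(c[p] for c in cuts) for p in range(2))
```

In this toy history, fixing checkpoint 1 of process 0 forces process 1 to a checkpoint taken after it sent the second message, so the search returns only the cuts `(1, 2)` and `(1, 3)`. Requiring both checkpoint 1 of process 0 and checkpoint 1 of process 1 returns an empty list, an example of a set S that belongs to no consistent global checkpoint.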
This publication has 29 references indexed in Scilit:
- A recoverable object store. Published by Institute of Electrical and Electronics Engineers (IEEE), 2003
- Crash recovery with little overhead. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- Checkpointing and its applications. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- Message logging: pessimistic, optimistic, and causal. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- Necessary and sufficient conditions for consistent global snapshots. IEEE Transactions on Parallel and Distributed Systems, 1995
- Manetho: transparent rollback-recovery with low overhead, limited rollback, and fast output commit. IEEE Transactions on Computers, 1992
- Error recovery in shared memory multiprocessors using private caches. IEEE Transactions on Parallel and Distributed Systems, 1990
- Deadlock detection in distributed databases. ACM Computing Surveys, 1987
- Optimistic recovery in distributed systems. ACM Transactions on Computer Systems, 1985
- State Restoration in Systems of Communicating Processes. IEEE Transactions on Software Engineering, 1980