A communication-induced checkpointing protocol that ensures rollback-dependency trackability
- 22 November 2002
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- No. 07313071, p. 68-77
- https://doi.org/10.1109/ftcs.1997.614079
Abstract
Considering an application in which processes take local checkpoints independently (called basic checkpoints), this paper develops a protocol that forces them to take some additional local checkpoints (called forced checkpoints) so that the resulting checkpoint and communication pattern satisfies the Rollback Dependency Trackability (RDT) property. This property states that all dependencies between local checkpoints are on-line trackable by using a transitive dependency vector. Compared to other protocols ensuring the RDT property, the proposed protocol is less conservative in the sense that it takes fewer additional local checkpoints. It attains this goal by a subtle tracking of causal dependencies on already taken checkpoints; this tracking is then used to prevent the occurrence of hidden dependencies. As indicated by a simulation study, the proposed protocol compares favorably with other protocols; moreover, it associates on the fly with each local checkpoint C the minimum global checkpoint to which C belongs.
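The transitive dependency vector mentioned in the abstract can be illustrated with a minimal Python sketch. It shows only the generic vector bookkeeping (piggyback on send, component-wise maximum on receive, record on checkpoint), not the paper's forced-checkpoint rule. The class name `Process`, its method names, and the indexing convention (each process starts with an initial checkpoint numbered 0, and its own entry holds the index of its current checkpoint interval) are assumptions made for illustration.

```python
class Process:
    """Hypothetical process that maintains a transitive dependency vector (TDV)."""

    def __init__(self, pid, n):
        self.pid = pid
        self.tdv = [0] * n          # tdv[j]: highest interval of process j we depend on
        self.checkpoints = []       # checkpoints[x]: vector saved with local checkpoint x
        self.take_checkpoint()      # assumed convention: initial checkpoint has index 0

    def take_checkpoint(self):
        # Save the current vector with the checkpoint, then open the next interval.
        self.checkpoints.append(list(self.tdv))
        self.tdv[self.pid] += 1

    def send(self):
        # The vector is piggybacked on every application message.
        return list(self.tdv)

    def receive(self, piggybacked):
        # Merge dependencies: component-wise maximum of local and received vectors.
        self.tdv = [max(a, b) for a, b in zip(self.tdv, piggybacked)]


# Three processes; P0 -> P1, then P1 checkpoints, then P1 -> P2, then P2 checkpoints.
p0, p1, p2 = (Process(i, 3) for i in range(3))
p1.receive(p0.send())
p1.take_checkpoint()
p2.receive(p1.send())
p2.take_checkpoint()
print(p1.checkpoints[1])  # vector recorded with P1's checkpoint 1
print(p2.checkpoints[1])  # vector recorded with P2's checkpoint 1
```

In this run, the vector saved with P2's checkpoint 1 records a dependency on P0 that P2 learned only transitively, through the message from P1. A pattern satisfies RDT when every dependency between checkpoints is visible in such vectors; an RDT protocol adds forced checkpoints wherever this bookkeeping alone would miss a dependency.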
This publication has 11 references indexed in Scilit:
- Volatile logging in n-fault-tolerant distributed systems. Published by Institute of Electrical and Electronics Engineers (IEEE), 2003
- A communication-induced checkpointing protocol that ensures rollback-dependency trackability. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- Consistent global checkpoints that contain a given set of local checkpoints. IEEE Transactions on Computers, 1997
- A unified framework for the specification and run-time detection of dynamic properties in distributed computations. Journal of Systems and Software, 1996
- Necessary and sufficient conditions for consistent global snapshots. IEEE Transactions on Parallel and Distributed Systems, 1995
- Recoverable distributed shared virtual memory. IEEE Transactions on Computers, 1990
- Checkpointing and Rollback-Recovery for Distributed Systems. IEEE Transactions on Software Engineering, 1987
- Distributed snapshots. ACM Transactions on Computer Systems, 1985
- State Restoration in Systems of Communicating Processes. IEEE Transactions on Software Engineering, 1980
- Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 1978