Checkpointing and Rollback-Recovery for Distributed Systems
- 1 January 1987
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Software Engineering
- Vol. SE-13 (1), 23-31
- https://doi.org/10.1109/tse.1987.232562
Abstract
We consider the problem of bringing a distributed system to a consistent state after transient failures. We address the two components of this problem by describing a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system to a consistent state. In contrast to previous algorithms, they tolerate failures that occur during their executions. Furthermore, when a process takes a checkpoint, a minimal number of additional processes are forced to take checkpoints. Similarly, when a process rolls back and restarts after a failure, a minimal number of additional processes are forced to roll back with it. Our algorithms require each process to store at most two checkpoints in stable storage. This storage requirement is shown to be minimal under general assumptions.
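The two-checkpoint bound and the idea of forcing only dependent processes to checkpoint can be illustrated with a small sketch. It is not the algorithm from the article. It assumes a two-phase scheme in which each process keeps one permanent checkpoint and at most one tentative checkpoint, and in which a process asks only those peers it has received messages from since its last checkpoint to join. All names (Process, received_from, checkpoint, can_checkpoint) are hypothetical, and message passing and failures are simulated in a single Python process.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Process:
    name: str
    state: int = 0
    permanent: Optional[int] = None  # last committed checkpoint
    tentative: Optional[int] = None  # checkpoint awaiting a commit/abort decision
    # Peers this process has received messages from since its last checkpoint
    # (an assumed way of tracking which peers must checkpoint with it).
    received_from: Set[str] = field(default_factory=set)

    def take_tentative(self) -> None:
        self.tentative = self.state

    def commit(self) -> None:
        # The tentative checkpoint replaces the permanent one, so a process
        # never holds more than two checkpoints at any moment.
        if self.tentative is not None:
            self.permanent = self.tentative
            self.tentative = None
            self.received_from.clear()

    def abort(self) -> None:
        self.tentative = None

    def stored_checkpoints(self) -> int:
        return sum(c is not None for c in (self.permanent, self.tentative))


def checkpoint(
    initiator: str,
    procs: Dict[str, Process],
    can_checkpoint: Callable[[Process], bool] = lambda p: True,
) -> bool:
    """Phase 1: take tentative checkpoints along the dependency closure.
    Phase 2: commit all of them, or abort all of them."""
    cohort: List[Process] = []
    pending = [initiator]
    seen: Set[str] = set()
    ok = True
    while pending:
        name = pending.pop()
        if name in seen:
            continue
        seen.add(name)
        proc = procs[name]
        if not can_checkpoint(proc):  # simulated refusal or failure
            ok = False
            break
        proc.take_tentative()
        cohort.append(proc)
        pending.extend(proc.received_from)
    for proc in cohort:
        if ok:
            proc.commit()
        else:
            proc.abort()
    return ok
```

In this sketch a process that has exchanged no messages with the initiator, directly or transitively, is never asked to checkpoint, and an aborted attempt leaves every permanent checkpoint untouched.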