Checkpointing and Rollback-Recovery for Distributed Systems

1 January 1987

journal article
Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Software Engineering

Vol. SE-13 (1), 23-31
https://doi.org/10.1109/tse.1987.232562

Abstract

We consider the problem of bringing a distributed system to a consistent state after transient failures. We address the two components of this problem by describing a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system to a consistent state. In contrast to previous algorithms, they tolerate failures that occur during their executions. Furthermore, when a process takes a checkpoint, a minimal number of additional processes are forced to take checkpoints. Similarly, when a process rolls back and restarts after a failure, a minimal number of additional processes are forced to roll back with it. Our algorithms require each process to store at most two checkpoints in stable storage. This storage requirement is shown to be minimal under general assumptions.
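
As an illustration of what a "consistent state" means here, the sketch below checks a set of per-process checkpoints against the usual textbook condition: no checkpoint may record the receipt of a message whose send is not recorded in the sender's checkpoint. This is not the paper's algorithm. The names `Message` and `is_consistent`, and the encoding of a checkpoint as a count of recorded local events, are assumptions made for this example only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Hypothetical record of one message (illustration only)."""
    sender: str
    send_seq: int   # index of the send event in the sender's local history (1-based)
    receiver: str
    recv_seq: int   # index of the receive event in the receiver's local history (1-based)


def is_consistent(checkpoints: dict[str, int], messages: list[Message]) -> bool:
    """Return True if no checkpoint records a receive without the matching send.

    `checkpoints` maps a process name to the number of local events its
    checkpoint covers (assumed encoding: events 1..k are included).
    """
    for m in messages:
        received = m.recv_seq <= checkpoints[m.receiver]
        sent = m.send_seq <= checkpoints[m.sender]
        if received and not sent:
            # The receiver remembers a message the sender never sent.
            return False
    return True


# p sends at its 2nd event; q receives at its 3rd event.
msgs = [Message(sender="p", send_seq=2, receiver="q", recv_seq=3)]
print(is_consistent({"p": 1, "q": 3}, msgs))
print(is_consistent({"p": 2, "q": 3}, msgs))
```

In the first call, q's checkpoint includes the receive but p's checkpoint predates the send, so the set is rejected. This is the kind of dependency that forces additional processes to take checkpoints or roll back together. The check deliberately ignores messages that were sent but not yet received, which a full recovery scheme would also need to handle.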

This publication has 11 references indexed in Scilit:

Low cost management of replicated data in fault-tolerant distributed systems
ACM Transactions on Computer Systems, 1986
Optimistic recovery in distributed systems
ACM Transactions on Computer Systems, 1985
Distributed snapshots
ACM Transactions on Computer Systems, 1985
The inherent cost of nonblocking commitment
Published by Association for Computing Machinery (ACM), 1983
Global States of a Distributed System
IEEE Transactions on Software Engineering, 1982
State Restoration in Systems of Communicating Processes
IEEE Transactions on Software Engineering, 1980
Time, clocks, and the ordering of events in a distributed system
Communications of the ACM, 1978
Reliability Issues in Computing System Design
ACM Computing Surveys, 1978
Process backup in producer-consumer systems
Published by Association for Computing Machinery (ACM), 1977
System structure for software fault tolerance
IEEE Transactions on Software Engineering, 1975