Diskless checkpointing
- 1 January 1998
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Parallel and Distributed Systems
- Vol. 9 (10), 972-986
- https://doi.org/10.1109/71.730527
Abstract
Diskless Checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. As such, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkpointing and present the basic diskless checkpointing scheme along with several variants for improved performance. The performance of the basic scheme and its variants is evaluated on a high-performance network of workstations and compared to traditional disk-based checkpointing. We conclude that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures.
This publication has 28 references indexed in Scilit:
- A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 2002
- Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing. Journal of Parallel and Distributed Computing, 1997
- Application Level Fault Tolerance in Heterogeneous Networks of Workstations. Journal of Parallel and Distributed Computing, 1997
- Impact of checkpoint latency on overhead ratio of a checkpointing scheme. IEEE Transactions on Computers, 1997
- Lightweight logging for lazy release consistent distributed shared memory. Published by Association for Computing Machinery (ACM), 1996
- RAID: high-performance, reliable secondary storage. ACM Computing Surveys, 1994
- Low-latency, concurrent checkpointing for parallel programs. IEEE Transactions on Parallel and Distributed Systems, 1994
- Manetho: transparent rollback-recovery with low overhead, limited rollback, and fast output commit. IEEE Transactions on Computers, 1992
- Failure correction techniques for large disk arrays. Published by Association for Computing Machinery (ACM), 1989
- IGOR: a system for program debugging via reversible execution. ACM SIGPLAN Notices, 1988