Coordinated checkpointing without direct coordination
- 27 November 2002
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- No. 10872191, p. 23-31
- https://doi.org/10.1109/ipds.1998.707706
Abstract
Coordinated checkpointing is a well-known method to achieve fault tolerance in distributed systems. Long-running parallel applications and high-availability applications are two potential users of checkpointing, although with different requirements. Parallel applications need low failure-free overheads, and high-availability applications require fast and bounded recoveries. In this paper we describe a new coordinated checkpoint protocol capable of satisfying both types of applications. The protocol uses time to avoid all types of direct coordination (e.g., message exchanges and message tagging), reducing the overheads to almost a minimum. To ensure that rapid recoveries can be attained, the protocol guarantees small checkpoint latencies. The protocol was implemented and tested on a cluster of workstations connected by a 155 Mbit/s ATM network. Experimental results show that the protocol overheads are very small.
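The abstract does not give the protocol's details, so the following is only a minimal sketch of the general idea of using time instead of messages to coordinate checkpoints. It assumes each process has a clock that differs from any other process's clock by at most `max_skew_ms`, that all processes checkpoint at multiples of a shared `period_ms`, and that a process holds back sends for a short window after its own checkpoint so that no message can reach a slower peer before that peer has checkpointed. The function names, parameters, and the blocking rule are illustrative assumptions, not the authors' algorithm.

```python
def next_checkpoint_time(now_ms: int, period_ms: int) -> int:
    """Return the next multiple of period_ms strictly after now_ms.

    Every process evaluates this on its own clock, so no message
    exchange is needed to agree on when to checkpoint (assumption:
    clocks are loosely synchronized).
    """
    return (now_ms // period_ms + 1) * period_ms


def send_blocked(local_ms: int, period_ms: int, max_skew_ms: int) -> bool:
    """Return True while sends should be held back.

    Hypothetical rule: for max_skew_ms after a local checkpoint instant,
    a peer with a slower clock may not have checkpointed yet, so a
    message sent now could be received before the peer's checkpoint.
    """
    since_checkpoint = local_ms % period_ms
    return since_checkpoint < max_skew_ms


if __name__ == "__main__":
    period, skew = 5000, 500  # illustrative values, in milliseconds
    print(next_checkpoint_time(12000, period))
    print(send_blocked(15200, period, skew))
    print(send_blocked(17000, period, skew))
```

In this sketch the cost of avoiding coordination is the short send-blocking window, whose length depends on how tightly the clocks are synchronized.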