Coordinated checkpointing without direct coordination

27 November 2002

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

No. 10872191, pp. 23-31
https://doi.org/10.1109/ipds.1998.707706

Abstract

Coordinated checkpointing is a well-known method for achieving fault tolerance in distributed systems. Long-running parallel applications and high-availability applications are two potential users of checkpointing, although they have different requirements: parallel applications need low failure-free overheads, whereas high-availability applications require fast and bounded recoveries. In this paper we describe a new coordinated checkpoint protocol capable of satisfying both types of applications. The protocol uses time to avoid all types of direct coordination (e.g., message exchanges and message tagging), reducing the overheads almost to a minimum. To ensure that rapid recoveries can be attained, the protocol guarantees small checkpoint latencies. The protocol was implemented and tested on a cluster of workstations connected by a 155 Mbit/s ATM network. Experimental results show that the protocol overheads are very small.
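
The abstract does not spell out the mechanism, so the following is only a minimal sketch of the general idea of using time in place of coordination messages. It assumes that every process knows a common start time and checkpoint period and decides locally when a checkpoint is due. The class name TimerCheckpointer and the names epoch, interval, due_index and maybe_checkpoint are hypothetical, chosen for illustration; they are not taken from the paper. The sketch ignores clock drift and messages in transit at checkpoint time, both of which a real protocol has to handle.

```python
from dataclasses import dataclass


@dataclass
class TimerCheckpointer:
    """Hypothetical helper: decides locally, from the clock alone, when to checkpoint."""

    epoch: float         # agreed start time in seconds (assumed known to every process)
    interval: float      # checkpoint period in seconds (assumed identical everywhere)
    last_index: int = 0  # index of the most recent checkpoint taken by this process

    def due_index(self, now: float) -> int:
        """Index of the latest checkpoint that should exist at local time `now`."""
        if now < self.epoch:
            return 0
        return int((now - self.epoch) // self.interval)

    def maybe_checkpoint(self, now: float) -> bool:
        """Return True (and record the new index) if a new period has started."""
        idx = self.due_index(now)
        if idx > self.last_index:
            self.last_index = idx
            return True
        return False


# Two processes with slightly different clock readings still agree on the
# checkpoint index without exchanging any message.
p = TimerCheckpointer(epoch=0.0, interval=10.0)
q = TimerCheckpointer(epoch=0.0, interval=10.0)
took_p = p.maybe_checkpoint(25.0)
took_q = q.maybe_checkpoint(25.3)
```

Here both processes arrive at the same checkpoint index from their local clocks, so no message exchange or message tagging is needed for them to agree on which checkpoint they are taking. How the published protocol keeps the resulting set of checkpoints consistent is not described in the abstract and is not modeled here.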
