Improving the performance of coordinated checkpointers on networks of workstations using RAID techniques

24 December 2002

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

Abstract

Coordinated checkpointing systems are popular, general-purpose tools for implementing process migration, coarse-grained job swapping, and fault tolerance on networks of workstations. Though these systems are simple in concept, several design decisions concerning the placement of checkpoint files can affect their performance and functionality. Although several such checkpointers have been implemented for popular programming platforms like PVM and MPI, none has taken this issue into consideration. This paper addresses the issue of checkpoint placement and its impact on the performance and functionality of coordinated checkpointing systems. Several strategies, both old and new, are described and implemented on a network of SPARC-5 workstations running PVM. These strategies range from very simple to more complex, the latter borrowing heavily from ideas in RAID (Redundant Arrays of Inexpensive Disks) fault tolerance. The results of this paper will serve as a guide so that future implementations of coordinated checkpointing can allow their users to achieve the combination of performance and functionality that is right for their applications.
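
The abstract does not spell out the individual placement strategies, so the following is only a minimal sketch of the general RAID parity idea it alludes to, not the paper's implementation. It assumes each process's checkpoint is available as a byte string, that one extra node stores the bytewise XOR of all checkpoints, and that at most one checkpoint is lost at a time. The function names (`pad`, `xor_parity`, `recover`) are hypothetical and chosen for illustration.

```python
from functools import reduce


def pad(checkpoints):
    """Zero-pad checkpoints to a common length so they can be XORed.

    Assumption: the original lengths are recorded elsewhere so that the
    padding can be stripped after recovery.
    """
    size = max(len(c) for c in checkpoints)
    return [c.ljust(size, b"\0") for c in checkpoints]


def xor_parity(blocks):
    """Return the bytewise XOR of a non-empty list of equal-length blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of integers per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def recover(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity."""
    return xor_parity(list(surviving_blocks) + [parity])


# Hypothetical example: three processes, the checkpoint of process 1 is lost.
padded = pad([b"proc0-state", b"proc1", b"proc2-data"])
parity = xor_parity(padded)
rebuilt = recover([padded[0], padded[2]], parity)
```

In this sketch the parity block costs the storage of one checkpoint regardless of the number of processes, but it tolerates only a single failure; whether and how the paper's strategies make this trade-off is not stated in the abstract.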
