Collective operations in application-level fault-tolerant MPI

23 June 2003

conference paper
Published by Association for Computing Machinery (ACM)

p. 234-243
https://doi.org/10.1145/782814.782847

Abstract

Fault-tolerance is becoming a critical issue on high-performance platforms. Checkpointing techniques make programs fault-tolerant by saving their state periodically and restoring this state after failure. System-level checkpointing saves the state of the entire machine on stable storage, but this usually has too much overhead. In practice, programmers do manual checkpointing by writing code to (i) save the values of key program variables at critical points in the program, and (ii) restore the entire computational state from these values during recovery. However, this can be difficult to do in general MPI programs without global barriers.

In an earlier paper, we presented a distributed checkpoint coordination protocol which handles MPI's point-to-point constructs, while dealing with the unique challenges of application-level checkpointing. The protocol is implemented by a thin software layer that sits between the application program and the MPI library, so it does not require any modifications to the MPI library. However, it did not handle collective communication, which is a very important part of MPI. In this paper, we extend the protocol to handle MPI's collective communication constructs. We also present experimental results that show that the overhead introduced by the protocol for collective operations is small.
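The manual checkpointing pattern described in the abstract, steps (i) and (ii), can be illustrated with a minimal single-process sketch. This is not the paper's protocol and involves no MPI. The function names (`save_checkpoint`, `load_checkpoint`, `run`, `demo`), the pickle file format, and the toy summation workload are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Step (i): write the key program variables to stable storage."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old file so a crash never leaves a half-written checkpoint.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Step (ii): return the last saved state, or None on a fresh start."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, checkpoint_every, fail_at=None):
    """Sum 0..n_steps-1 (a hypothetical workload), checkpointing periodically."""
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    step, total = state["step"], state["total"]
    while step < n_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated failure")
        total += step
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(path, {"step": step, "total": total})
    return total


def demo():
    """Crash partway through, then restart from the last checkpoint."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.pkl")
        try:
            run(path, n_steps=100, checkpoint_every=10, fail_at=57)
        except RuntimeError:
            pass  # work done after the last checkpoint is lost
        resumed_from = load_checkpoint(path)["step"]
        total = run(path, n_steps=100, checkpoint_every=10)
        return resumed_from, total
```

Here the restarted run resumes from the last saved step and repeats only the work done after it. With several MPI processes, each process would save its own state, and messages in flight between processes at checkpoint time would also have to be accounted for; this sketch does not attempt that.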

This publication has 3 references indexed in Scilit:

Automated application-level checkpointing of MPI programs
Published by Association for Computing Machinery (ACM), 2003
Manetho: transparent rollback-recovery with low overhead, limited rollback, and fast output commit
IEEE Transactions on Computers, 1992
Distributed snapshots
ACM Transactions on Computer Systems, 1985