Collective operations in application-level fault-tolerant MPI
- 23 June 2003
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 234-243
- https://doi.org/10.1145/782814.782847
Abstract
Fault-tolerance is becoming a critical issue on high-performance platforms. Checkpointing techniques make programs fault-tolerant by saving their state periodically and restoring this state after failure. System-level checkpointing saves the state of the entire machine on stable storage, but this usually has too much overhead. In practice, programmers do manual checkpointing by writing code to (i) save the values of key program variables at critical points in the program, and (ii) restore the entire computational state from these values during recovery. However, this can be difficult to do in general MPI programs without global barriers.

In an earlier paper, we presented a distributed checkpoint coordination protocol which handles MPI's point-to-point constructs, while dealing with the unique challenges of application-level checkpointing. The protocol is implemented by a thin software layer that sits between the application program and the MPI library, so it does not require any modifications to the MPI library. However, it did not handle collective communication, which is a very important part of MPI. In this paper, we extend the protocol to handle MPI's collective communication constructs. We also present experimental results that show that the overhead introduced by the protocol for collective operations is small.
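The manual checkpointing pattern the abstract describes, steps (i) and (ii), can be illustrated with a minimal single-process sketch. This is not the paper's protocol and involves no MPI: the function names `save_checkpoint`, `load_checkpoint`, and `run`, the JSON file format, and the `fail_at` parameter used to simulate a crash are all assumptions made for illustration. It shows only why saving a few key variables is enough to resume a computation, and says nothing about coordinating checkpoints across processes, which is the hard part the paper addresses.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Step (i): save key variables. Write to a temp file, then rename,
    so a crash mid-write never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Step (ii): restore the saved variables, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return json.load(handle)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    fail_at is a hypothetical knob that raises an error at a given step
    to stand in for a hardware failure.
    """
    # Resume from the last checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` once with `fail_at` set and then again without it resumes from the last saved step rather than from zero; work done after the last checkpoint is simply recomputed.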
This publication has 3 references indexed in Scilit:
- Automated application-level checkpointing of MPI programs. Association for Computing Machinery (ACM), 2003
- Manetho: transparent roll back-recovery with low overhead, limited rollback, and fast output commit. IEEE Transactions on Computers, 1992
- Distributed snapshots. ACM Transactions on Computer Systems, 1985