Automated application-level checkpointing of MPI programs
- 11 June 2003
- proceedings article
- Published by Association for Computing Machinery (ACM)
- Vol. 38 (10), 84-94
- https://doi.org/10.1145/781498.781513
Abstract
The running times of many computational science applications, such as protein-folding using ab initio methods, are much longer than the mean-time-to-failure of high-performance computing platforms. To run to completion, therefore, these applications must tolerate hardware failures. In this paper, we focus on the stopping failure model in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated non-blocking checkpointing, and that existing fault-tolerance protocols in the literature are not suitable for implementing this approach. We then present a suitable protocol, which is implemented by a co-ordination layer that sits between the application program and the MPI library. We show how this protocol can be used with a precompiler that instruments C/MPI programs to save application and MPI library state. An advantage of our approach is that it is independent of the MPI implementation. We present experimental results that argue that the overhead of using our system can be small.