Automated application-level checkpointing of MPI programs

11 June 2003

proceedings article
Published by Association for Computing Machinery (ACM)

Vol. 38 (10), 84-94
https://doi.org/10.1145/781498.781513

Abstract

The running times of many computational science applications, such as protein-folding using ab initio methods, are much longer than the mean-time-to-failure of high-performance computing platforms. To run to completion, therefore, these applications must tolerate hardware failures. In this paper, we focus on the stopping failure model in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated non-blocking checkpointing, and that existing fault-tolerance protocols in the literature are not suitable for implementing this approach. We then present a suitable protocol, which is implemented by a coordination layer that sits between the application program and the MPI library. We show how this protocol can be used with a precompiler that instruments C/MPI programs to save application and MPI library state. An advantage of our approach is that it is independent of the MPI implementation. We present experimental results that argue that the overhead of using our system can be small.
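
To make the idea of a coordination layer between the application and the message-passing library more concrete, the following is a minimal sketch in Python. It is not the protocol from the paper. It assumes one plausible design in which the layer piggybacks a checkpoint epoch on every message and classifies each received message relative to the receiver's own epoch. The names `CoordinationLayer`, `Envelope`, `late_log`, and the in-memory `network` dictionary (which stands in for the MPI library) are all hypothetical.

```python
import copy
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Envelope:
    """Application payload plus the sender's epoch, added by the layer."""
    epoch: int
    payload: object


@dataclass
class CoordinationLayer:
    """Hypothetical shim between application code and a message transport."""
    rank: int
    network: dict                 # rank -> deque; stands in for the MPI library
    epoch: int = 0                # number of local checkpoints taken so far
    saved_state: object = None    # last application state that was saved
    late_log: list = field(default_factory=list)

    def send(self, dest, payload):
        # Piggyback the current epoch on every outgoing message.
        self.network[dest].append(Envelope(self.epoch, payload))

    def recv(self):
        env = self.network[self.rank].popleft()
        if env.epoch < self.epoch:
            # Sent before the sender checkpointed, received after we did:
            # log the payload so it could be replayed after a restart.
            self.late_log.append(env.payload)
            kind = "late"
        elif env.epoch > self.epoch:
            # The sender has already checkpointed, but we have not yet.
            kind = "early"
        else:
            kind = "intra-epoch"
        return kind, env.payload

    def checkpoint(self, app_state):
        # No barrier is used here: each process saves its own state and
        # advances its own epoch without waiting for the others.
        self.saved_state = copy.deepcopy(app_state)
        self.epoch += 1


# Usage: two simulated processes sharing one in-memory "network".
network = {0: deque(), 1: deque()}
p0 = CoordinationLayer(0, network)
p1 = CoordinationLayer(1, network)

p0.send(1, "a")                 # sent in epoch 0
p1.checkpoint({"step": 3})      # process 1 moves to epoch 1
kind, payload = p1.recv()       # classified as "late" and logged
```

Because the application only calls `send`, `recv`, and `checkpoint` on the layer, the transport underneath could be swapped without changing application code, which mirrors the abstract's point that the approach is independent of the MPI implementation.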
