The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing
- 1 November 2005
- journal article
- Published by SAGE Publications in The International Journal of High Performance Computing Applications
- Vol. 19 (4), 479-493
- https://doi.org/10.1177/1094342005056139
Abstract
As high performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible communication performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI.