Building and Using a Fault-Tolerant MPI Implementation
- 1 August 2004
- journal article
- research article
- Published by SAGE Publications in The International Journal of High Performance Computing Applications
- Vol. 18 (3), 353-361
- https://doi.org/10.1177/1094342004046052
Abstract
In this paper we discuss the design and use of a fault-tolerant MPI (FT-MPI) that handles process failures in a way that goes beyond the original MPI static process model. FT-MPI allows the semantics and associated modes of failures to be explicitly controlled by an application via modified functionality within the standard MPI 1.2 API. We give an overview of the FT-MPI semantics, architecture design, example usage, and sample applications. We also briefly discuss the consequences of designing a fault-tolerant MPI, both in terms of how such an implementation handles failures at multiple levels internally and how existing applications can use the new features while still remaining within the MPI standard.
This publication has 5 references indexed in Scilit:
- The GrADS Project: Software Support for High-Level Grid Application Development. The International Journal of High Performance Computing Applications, 2001
- Numerical Libraries and the Grid. The International Journal of High Performance Computing Applications, 2001
- HARNESS and fault tolerant MPI. Parallel Computing, 2001
- MPI-FT: Portable Fault Tolerance Scheme for MPI. Parallel Processing Letters, 2000
- HARNESS: a next generation distributed virtual machine. Future Generation Computer Systems, 1999