Failure Resilient Heterogeneous Parallel Computing Across Multidomain Clusters
- 1 May 2005
- journal article
- research article
- Published by SAGE Publications in The International Journal of High Performance Computing Applications
- Vol. 19 (2), 143-155
- https://doi.org/10.1177/1094342005054260
Abstract
We propose lightweight middleware solutions that facilitate and simplify the execution of failure-resilient Message Passing Interface (MPI) programs across multidomain clusters. The system described in this paper leverages H2O, a distributed metacomputing framework, to route MPI message passing across heterogeneous aggregates located in different administrative or network domains. MPI communication is aided by a specially written H2O pluglet; messages that are destined for remote sites are intercepted and transparently forwarded to their final destinations. We demonstrate that the proposed technique is indeed effective in enabling communication by MPI programs across distinct clusters and across firewalls. Only marginally lowered performance was observed in our tests, and we believe the substantially increased functionality would compensate for this overhead in most situations. In addition to enabling multicluster communications, we note that with the increasing size and distribution of metacomputing environments, fault tolerance aspects become critically important. We argue that the fault tolerance model proposed by FT-MPI fits well in geographically distributed environments, even though its current implementation is confined to a single administrative domain. We describe extensions to overcome these limitations by combining FT-MPI with the H2O framework. Our holistic approach allows users to run fault-tolerant MPI programs on heterogeneous, geographically distributed shared machines, without sacrificing performance and with minimal involvement of resource providers.