Fault-tolerance, malleability and migration for divide-and-conquer applications on the grid
- 1 January 2005
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- No. 15302075, 10 pp.
- https://doi.org/10.1109/ipdps.2005.224
Abstract
Grid applications have to cope with dynamically changing computing resources, as machines may crash or be claimed by other, higher-priority applications. In this paper, we propose a mechanism that enables fault-tolerance, malleability (i.e., the ability to cope with a dynamically changing number of processors) and migration for divide-and-conquer applications on the grid. The novelty of our approach is restructuring the computation tree, which eliminates redundant computation and salvages partial results computed by the processors leaving the computation. This enables applications to adapt to dynamically changing numbers of processors and to migrate the computation without loss of work. Our mechanism is easy to implement and deploy in a grid environment. The overhead it incurs is close to zero. We have implemented our mechanism in the Satin system. We have evaluated the performance of our system on the DAS-2 wide-area system and on the testbed of the European GridLab project.
This publication has 15 references indexed in Scilit:
- Enabling Applications on the Grid: A Gridlab Overview. The International Journal of High Performance Computing Applications, 2003
- SRS: A Framework for Developing Malleable and Migratable Parallel Applications for Distributed Systems. Parallel Processing Letters, 2003
- The Cactus Code: a problem solving environment for the grid. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- An enabling framework for master-worker applications on the Computational Grid. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- Experiments with Migration of Message-Passing Tasks. Published by Springer Nature, 2000
- Charlotte: Metacomputing on the Web. Future Generation Computer Systems, 1999
- ATLAS. Published by Association for Computing Machinery (ACM), 1996
- Cilk: An Efficient Multithreaded Runtime System. Journal of Parallel and Distributed Computing, 1996
- Myrinet: a gigabit-per-second local area network. IEEE Micro, 1995
- DIB—a distributed implementation of backtracking. ACM Transactions on Programming Languages and Systems, 1987