Fault-tolerance, malleability and migration for divide-and-conquer applications on the grid

Abstract
Grid applications have to cope with dynamically changing computing resources, as machines may crash or be claimed by other, higher-priority applications. In this paper, we propose a mechanism that enables fault tolerance, malleability (i.e., the ability to cope with a dynamically changing number of processors) and migration for divide-and-conquer applications on the grid. The novelty of our approach is restructuring the computation tree, which eliminates redundant computation and salvages partial results computed by processors that leave the computation. This enables applications to adapt to a dynamically changing number of processors and to migrate the computation without loss of work. Our mechanism is easy to implement and deploy in a grid environment, and the overhead it incurs is close to zero. We have implemented our mechanism in the Satin system and evaluated its performance on the DAS-2 wide-area system and on the testbed of the European GridLab project.
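The idea of salvaging partial results can be illustrated with a small single-process sketch. It is not Satin's API or the paper's algorithm: the abstract gives no implementation details, so the function names, the `result_table` dictionary, and the path-style job identifiers are all assumptions made for illustration. The sketch assumes that a result finished before a processor left is kept in a table keyed by the job's position in the computation tree, so that a re-executed parent job can reuse it instead of recomputing the whole subtree.

```python
# Toy illustration only: all names here are hypothetical, not part of Satin.

def fib(n, job_id, result_table, executed):
    """Divide-and-conquer Fibonacci that reuses salvaged partial results.

    job_id encodes the job's position in the computation tree.
    result_table maps job_id -> result salvaged from a departed processor.
    executed is a one-element list counting jobs that were actually run.
    """
    if job_id in result_table:
        # A salvaged result exists: skip the entire subtree.
        return result_table[job_id]
    executed[0] += 1
    if n < 2:
        return n
    left = fib(n - 1, job_id + "/L", result_table, executed)
    right = fib(n - 2, job_id + "/R", result_table, executed)
    return left + right


def run(n, salvaged):
    """Run the computation and return (result, number of jobs executed)."""
    executed = [0]
    value = fib(n, "root", dict(salvaged), executed)
    return value, executed[0]


if __name__ == "__main__":
    # Recompute everything from scratch.
    print(run(10, {}))
    # Reuse the salvaged result of the left child, fib(9) = 34.
    print(run(10, {"root/L": 34}))
```

In this toy setting, supplying the salvaged result of one subtree leaves the final answer unchanged while reducing the number of jobs that have to be re-executed. A real grid implementation would also have to locate such results across machines, which the sketch does not attempt.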
