Parallel processing on networks of workstations: a fault-tolerant, high performance approach

19 November 2002

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

No. 10636927, p. 467-474
https://doi.org/10.1109/icdcs.1995.500052

Abstract

One of the most sought-after software innovations of this decade is the construction of systems from off-the-shelf workstations that actually deliver, and even surpass, the power and reliability of supercomputers. Using completely novel techniques (eager scheduling, evasive memory layouts, and dispersed data management), it is possible to build an execution environment for parallel programs on workstation networks. These techniques were originally developed in a theoretical framework for an abstract machine that models a shared-memory asynchronous multiprocessor. The network-of-workstations platform presents an inherently asynchronous environment for the execution of parallel programs. This asynchrony raises substantial problems: ensuring the correctness of the computation, and balancing the work automatically among the processors so that a slow processor does not hold up the total computation. A limiting case of asynchrony is a processor that becomes infinitely slow, i.e. fails. Our methodology copes with all these problems, as well as with memory failures. An interesting feature of this system is that it is neither a fault-tolerant system extended for parallel processing nor a parallel processing system extended for fault tolerance. The same novel mechanisms ensure both properties.
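
The abstract names eager scheduling without describing its mechanism. The following is a minimal sketch, not the authors' implementation. It assumes that eager scheduling means an idle worker may be handed any task not yet known to be finished, including one already assigned to another worker, and that tasks are idempotent so duplicate executions are harmless. The function name eager_schedule, the tick-based simulation, and the use of None to model a failed ("infinitely slow") worker are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple


def eager_schedule(
    tasks: Dict[int, Callable[[], int]],
    worker_speeds: List[Optional[int]],
    max_ticks: int = 10_000,
) -> Tuple[Dict[int, int], Dict[int, int]]:
    """Simulate eager scheduling in discrete ticks (illustrative assumption).

    tasks: task id -> idempotent zero-argument callable.
    worker_speeds: ticks a worker needs per task; None models a failed
        worker that accepts a task and never finishes it.
    Returns (results, attempts), where attempts counts how many times
    each task was handed out.
    """
    results: Dict[int, int] = {}
    attempts: Dict[int, int] = {task_id: 0 for task_id in tasks}
    # current[w] is (task_id, finish_tick) for a busy worker, else None.
    current: List[Optional[Tuple[int, Optional[int]]]] = [None] * len(worker_speeds)

    for tick in range(max_ticks):
        for w, speed in enumerate(worker_speeds):
            job = current[w]
            # A live worker reports back once its finish tick is reached.
            if job is not None and job[1] is not None and job[1] <= tick:
                task_id = job[0]
                # Only the first completion is recorded; later duplicates
                # of an already finished task are simply discarded.
                if task_id not in results:
                    results[task_id] = tasks[task_id]()
                current[w] = None
                job = None
            if job is None:
                unfinished = [t for t in tasks if t not in results]
                if not unfinished:
                    continue
                # Eager step: choose the least-attempted unfinished task,
                # even if another worker is already holding it.
                task_id = min(unfinished, key=lambda t: attempts[t])
                attempts[task_id] += 1
                finish = None if speed is None else tick + speed
                current[w] = (task_id, finish)
        if len(results) == len(tasks):
            return results, attempts
    raise RuntimeError("no live worker finished the remaining tasks")


# Worker 0 takes one tick per task; worker 1 has failed and never reports.
squares = {i: (lambda i=i: i * i) for i in range(4)}
results, attempts = eager_schedule(squares, worker_speeds=[1, None])
```

In this run the failed worker takes task 1 and never returns it. Because the task stays unfinished, the live worker is eventually handed it again and completes it, so the computation terminates without any separate failure detector. This mirrors the abstract's point that one mechanism can serve both load balancing and fault tolerance.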
