An efficient and scalable approach for implementing fault-tolerant DSM architectures

1 May 2000

journal article
Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Computers

Vol. 49 (5), 414-430
https://doi.org/10.1109/12.859537

Abstract

Distributed Shared Memory (DSM) architectures are attractive for executing high-performance parallel applications. Because they are made up of a large number of components, however, these architectures have a high probability of failure. We propose a protocol to tolerate node failures in cache-based DSM architectures. The proposed solution is based on backward error recovery and extends the existing coherence protocol to manage both the data used by processors for computation and the recovery data used for fault tolerance. This approach can be applied to both Cache Only Memory Architectures (COMA) and Shared Virtual Memory (SVM) systems. The implementation of the protocol in a COMA architecture has been evaluated by simulation. The protocol has also been implemented in an SVM system on a network of workstations. Both the simulation results and the measurements show that our solution is efficient and scalable.
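The general idea of backward error recovery mentioned in the abstract can be illustrated with a small sketch. It is not the protocol from the article: the class names (`Node`, `Cluster`), the choice of keeping two recovery copies on distinct nodes, and the single global view of current data are all assumptions made for illustration. The sketch only shows the separation between current data and recovery data, and a rollback to the last recovery point after one node fails.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node that holds recovery copies of data items."""
    node_id: int
    alive: bool = True
    recovery: dict = field(default_factory=dict)  # item name -> checkpointed value


class Cluster:
    """Toy model: current data is one shared dict (an assumed simplification)."""

    def __init__(self, n_nodes: int):
        if n_nodes < 2:
            raise ValueError("need at least two nodes to replicate recovery data")
        self.nodes = [Node(i) for i in range(n_nodes)]
        self.current = {}  # data used by the computation

    def write(self, item: str, value) -> None:
        self.current[item] = value

    def _live(self) -> list:
        return [n for n in self.nodes if n.alive]

    def checkpoint(self) -> None:
        """Establish a recovery point: two recovery copies per item (assumption)."""
        live = self._live()
        if len(live) < 2:
            raise RuntimeError("not enough live nodes to replicate recovery data")
        for node in live:
            node.recovery.clear()
        for item, value in self.current.items():
            first = sum(item.encode()) % len(live)  # deterministic placement
            second = (first + 1) % len(live)        # always a distinct node
            for idx in (first, second):
                live[idx].recovery[item] = copy.deepcopy(value)

    def fail(self, node_id: int) -> None:
        """Simulate a node failure: its memory contents are lost."""
        node = self.nodes[node_id]
        node.alive = False
        node.recovery.clear()

    def rollback(self) -> None:
        """Restore current data from surviving recovery copies, then re-replicate."""
        restored = {}
        for node in self._live():
            for item, value in node.recovery.items():
                restored.setdefault(item, copy.deepcopy(value))
        self.current = restored
        self.checkpoint()  # bring every item back to two recovery copies


cluster = Cluster(4)
cluster.write("x", 1)
cluster.write("y", 2)
cluster.checkpoint()
cluster.write("x", 99)   # modified after the recovery point
cluster.write("z", 3)    # created after the recovery point
cluster.fail(0)
cluster.rollback()
```

After the rollback, `cluster.current` holds only the values saved at the last recovery point, so the later writes to `x` and `z` are discarded. Because each item has recovery copies on two distinct nodes in this sketch, losing any single node leaves at least one copy available.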
