An efficient and scalable approach for implementing fault-tolerant DSM architectures
- 1 May 2000
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Computers
- Vol. 49 (5), 414-430
- https://doi.org/10.1109/12.859537
Abstract
Distributed Shared Memory (DSM) architectures are attractive for executing high-performance parallel applications. Because they are made up of a large number of components, however, these architectures have a high probability of failure. We propose a protocol to tolerate node failures in cache-based DSM architectures. The proposed solution is based on backward error recovery and consists of an extension to the existing coherence protocol that manages both the data used by processors for the computation and the recovery data used for fault tolerance. This approach can be applied to both Cache Only Memory Architectures (COMA) and Shared Virtual Memory (SVM) systems. The implementation of the protocol in a COMA architecture has been evaluated by simulation. The protocol has also been implemented in an SVM system on a network of workstations. Both simulation results and measurements show that our solution is efficient and scalable.
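The abstract names the technique (backward error recovery, with the coherence layer tracking both current data and recovery data) but gives no protocol details. The following is only a toy sketch of that general idea, not the authors' protocol. It assumes that each recovery copy is kept on two distinct nodes, that checkpoints are global, and that a failed node loses all of its memory. Every name in it (`ToyDSM`, `Node`, `checkpoint`, `fail`, `rollback`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node's memory: current data plus recovery copies (illustrative only)."""
    node_id: int
    current: dict = field(default_factory=dict)   # data used by the computation
    recovery: dict = field(default_factory=dict)  # data kept for rollback
    alive: bool = True


class ToyDSM:
    """Toy model of backward error recovery in a shared memory made of node memories."""

    def __init__(self, num_nodes):
        if num_nodes < 2:
            raise ValueError("need at least two nodes to replicate recovery data")
        self.nodes = [Node(i) for i in range(num_nodes)]

    def _live(self):
        return [n for n in self.nodes if n.alive]

    def write(self, node_id, key, value):
        node = self.nodes[node_id]
        if not node.alive:
            raise RuntimeError("cannot write on a failed node")
        # Stand-in for coherence: drop every other current copy of this item.
        for n in self._live():
            n.current.pop(key, None)
        node.current[key] = value

    def read(self, key):
        for n in self._live():
            if key in n.current:
                return n.current[key]
        raise KeyError(key)

    def checkpoint(self):
        """Establish a recovery point: each item gets recovery copies on two nodes."""
        live = self._live()
        if len(live) < 2:
            raise RuntimeError("cannot replicate recovery data on fewer than two nodes")
        snapshot = {}
        for n in live:
            snapshot.update(n.current)
        for n in live:
            n.recovery.clear()
        for i, key in enumerate(sorted(snapshot)):
            # Assumption: place the two copies on neighbouring live nodes.
            live[i % len(live)].recovery[key] = snapshot[key]
            live[(i + 1) % len(live)].recovery[key] = snapshot[key]

    def rollback(self):
        """Discard current data and restore the last recovery point."""
        live = self._live()
        restored = {}
        for n in live:
            restored.update(n.recovery)
            n.current.clear()
        for key, value in restored.items():
            holder = next(n for n in live if key in n.recovery)
            holder.current[key] = value
        if len(live) >= 2:
            self.checkpoint()  # re-create the second copy lost with the failed node

    def fail(self, node_id):
        """Simulate a node failure: its memory is lost, then the system rolls back."""
        node = self.nodes[node_id]
        node.alive = False
        node.current.clear()
        node.recovery.clear()
        self.rollback()
```

In this sketch, a write made after the last `checkpoint()` is lost when any node fails, and the surviving nodes resume from the recovery point. A single failure can be tolerated because each recovery copy exists on two different nodes.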