Hardware fault containment in scalable shared-memory multiprocessors

1 May 1997

proceedings article
Published by Association for Computing Machinery (ACM)

Vol. 25 (2), 73-84
https://doi.org/10.1145/264107.264141

Abstract

Current shared-memory multiprocessors are inherently vulnerable to faults: any significant hardware or system software fault causes the entire system to fail. Unless provisions are made to limit the impact of faults, users will perceive a decrease in reliability when they entrust their applications to larger machines. This paper shows that fault containment techniques can be effectively applied to scalable shared-memory multiprocessors to reduce the reliability problems created by increased machine size.

The primary goal of our approach is to leave normal-mode performance unaffected. Rather than using expensive fault-tolerance techniques to mask the effects of data and resource loss, our strategy is based on limiting the damage caused by faults to only a portion of the machine. After a hardware fault, we run a distributed recovery algorithm that allows normal operation to be resumed in the functioning parts of the machine.

Our approach is implemented in the Stanford FLASH multiprocessor. Using a detailed hardware simulator, we have performed a number of fault injection experiments on a FLASH system running Hive, an operating system designed to support fault containment. The results we report validate our approach and show that in conjunction with an operating system like Hive, we can improve the reliability seen by unmodified applications without substantial performance cost. Simulation results suggest that our algorithms scale well for systems up to 128 processors.
