Hardware fault containment in scalable shared-memory multiprocessors
- 1 May 1997
- proceedings article
- Published by Association for Computing Machinery (ACM)
- Vol. 25 (2), 73-84
- https://doi.org/10.1145/264107.264141
Abstract
Current shared-memory multiprocessors are inherently vulnerable to faults: any significant hardware or system software fault causes the entire system to fail. Unless provisions are made to limit the impact of faults, users will perceive a decrease in reliability when they entrust their applications to larger machines. This paper shows that fault containment techniques can be effectively applied to scalable shared-memory multiprocessors to reduce the reliability problems created by increased machine size.

The primary goal of our approach is to leave normal-mode performance unaffected. Rather than using expensive fault-tolerance techniques to mask the effects of data and resource loss, our strategy is based on limiting the damage caused by faults to only a portion of the machine. After a hardware fault, we run a distributed recovery algorithm that allows normal operation to be resumed in the functioning parts of the machine.

Our approach is implemented in the Stanford FLASH multiprocessor. Using a detailed hardware simulator, we have performed a number of fault injection experiments on a FLASH system running Hive, an operating system designed to support fault containment. The results we report validate our approach and show that in conjunction with an operating system like Hive, we can improve the reliability seen by unmodified applications without substantial performance cost. Simulation results suggest that our algorithms scale well for systems up to 128 processors.

This publication has 15 references indexed in Scilit:
- The SGI Origin. Published by Association for Computing Machinery (ACM), 1997
- The Mercury Interconnect Architecture. Published by Association for Computing Machinery (ACM), 1997
- Implementing efficient fault containment for multiprocessors. Communications of the ACM, 1996
- STiNG. Published by Association for Computing Machinery (ACM), 1996
- COMA. Published by Association for Computing Machinery (ACM), 1996
- The MIPS R10000 superscalar microprocessor. IEEE Micro, 1996
- Complete computer system simulation: the SimOS approach. IEEE Parallel & Distributed Technology: Systems & Applications, 1995
- Hive. Published by Association for Computing Machinery (ACM), 1995
- Fault-tolerant computing: fundamental concepts. Computer, 1990
- Efficient synchronization primitives for large-scale cache-coherent multiprocessors. Published by Association for Computing Machinery (ACM), 1989