Algorithm-based fault tolerance on a hypercube multiprocessor
- 1 January 1990
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Computers
- Vol. 39 (9) , 1132-1145
- https://doi.org/10.1109/12.57055
Abstract
The design of a fault-tolerant hypercube multiprocessor architecture is discussed. The authors propose the detection and location of faulty processors concurrently with the actual execution of parallel applications on the hypercube using a novel scheme of algorithm-based error detection. System-level error detection mechanisms have been implemented for three parallel applications on a 16-processor Intel iPSC hypercube multiprocessor: matrix multiplication, Gaussian elimination, and fast Fourier transform. Schemes for other applications are under development. Extensive studies have been done of error coverage of the system-level error detection schemes in the presence of finite-precision arithmetic, which affects the system-level encodings. Two reconfiguration schemes are proposed that allow the authors to isolate and replace faulty processors with spare processors.
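The abstract does not spell out the encodings it uses. As a rough illustration of what algorithm-based error detection for matrix multiplication can look like, the sketch below assumes the classical row and column checksum encoding. It runs in a single process rather than on a hypercube, and every function name and the tolerance value are hypothetical choices made for this example, not taken from the paper. The relative tolerance stands in for the finite-precision concern the abstract raises: with floating-point data, a checksum comparison needs some slack so that rounding error is not reported as a fault.

```python
def matmul(a, b):
    """Plain product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add_column_checksum_row(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum_column(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def mismatch(total, check, tol):
    """True if the two values differ by more than a relative tolerance."""
    return abs(total - check) > tol * max(1.0, abs(total), abs(check))


def locate_error(c_full, tol=1e-9):
    """Return (bad_rows, bad_cols): data rows and columns of the
    full-checksum product whose checksum entries disagree with the data."""
    n = len(c_full) - 1        # number of data rows
    m = len(c_full[0]) - 1     # number of data columns
    bad_rows = [i for i in range(n)
                if mismatch(sum(c_full[i][:m]), c_full[i][m], tol)]
    bad_cols = [j for j in range(m)
                if mismatch(sum(c_full[i][j] for i in range(n)),
                            c_full[n][j], tol)]
    return bad_rows, bad_cols


# Encode the inputs, multiply, then verify the checksums of the result.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c_full = matmul(add_column_checksum_row(a), add_row_checksum_column(b))
clean = locate_error(c_full)

# Simulate a faulty computation of one element of the product.
c_full[1][0] += 0.5
faulty = locate_error(c_full)
```

In this sketch a single corrupted element shows up as exactly one inconsistent row and one inconsistent column, and their intersection locates it. Mapping a located element back to the processor that computed it depends on how the matrix is partitioned across the machine, which this example does not model.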