Algorithm-based fault tolerance on a hypercube multiprocessor

1 January 1990

journal article
Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Computers

Vol. 39 (9), 1132-1145
https://doi.org/10.1109/12.57055

Abstract

The design of a fault-tolerant hypercube multiprocessor architecture is discussed. The authors propose detecting and locating faulty processors concurrently with the execution of parallel applications on the hypercube, using a novel scheme of algorithm-based error detection. System-level error detection mechanisms have been implemented for three parallel applications on a 16-processor Intel iPSC hypercube multiprocessor: matrix multiplication, Gaussian elimination, and fast Fourier transform. Schemes for other applications are under development. The error coverage of the system-level error detection schemes has been studied extensively in the presence of finite-precision arithmetic, which affects the system-level encodings. Two reconfiguration schemes are proposed that isolate faulty processors and replace them with spare processors.
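
To make the idea of algorithm-based error detection for matrix multiplication concrete, here is a minimal single-process sketch of the classic row/column checksum encoding. It is an illustration under stated assumptions, not the authors' hypercube implementation: the function names, the tolerance `tol`, and the sample matrices are all hypothetical, and the abstract does not say how the paper distributes the work across processors. The tolerance stands in for the finite-precision concern the abstract raises, since rounding can make checksums disagree slightly even without a fault.

```python
# Illustrative sketch only: row/column checksum encoding for matrix
# multiplication. All names and values are assumptions for this example.

def matmul(a, b):
    """Plain triple-loop product of two matrices stored as lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def row_checksum(b):
    """Append to each row of b one extra entry holding that row's sum."""
    return [row + [sum(row)] for row in b]

def check(c_full, tol=1e-9):
    """Return (bad_rows, bad_cols): data rows and columns whose sums
    differ from the stored checksums by more than tol.

    tol is an assumed threshold meant to absorb floating-point round-off.
    """
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > tol]
    return bad_rows, bad_cols

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.5, 1.5], [2.5, 3.5]]

# Multiplying the encoded operands yields a product that carries its own
# row and column checksums.
c_full = matmul(column_checksum(a), row_checksum(b))
clean = check(c_full)

c_full[1][0] += 1.0  # inject a single erroneous element
faulty = check(c_full)
```

In this sketch a single corrupted element makes exactly one row check and one column check fail, and their intersection locates the error. In a multiprocessor setting the same information could in principle point to the processor that computed that element, though how the paper maps checks to processors is not stated in the abstract.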

This publication has 19 references indexed in Scilit:

Algorithm-based fault detection for signal processing applications
IEEE Transactions on Computers, 1990
Fault-tolerant FFT networks
IEEE Transactions on Computers, 1988
A fault-tolerant FFT processor
IEEE Transactions on Computers, 1988
Fault-Tolerant Systems For The Computation Of Eigenvalues And Singular Values
Published by SPIE-Intl Soc Optical Eng, 1986
Algorithm-based Fault Tolerance for Parallel Matrix Equation Solvers
Published by SPIE-Intl Soc Optical Eng, 1986
Fault-Tolerant Multiprocessor Link and Bus Network Architectures
IEEE Transactions on Computers, 1985
Fault-Tolerant Computing—Concepts and Examples
IEEE Transactions on Computers, 1984
Fault-Tolerant Matrix Operations On Multiple Processor Systems Using Weighted Checksums
Published by SPIE-Intl Soc Optical Eng, 1984
Algorithm-Based Fault Tolerance for Matrix Operations
IEEE Transactions on Computers, 1984
Fault-secure algorithms for multiple-processor systems
Published by Association for Computing Machinery (ACM), 1984