Algorithm-Based Fault Tolerance for Matrix Operations

1 June 1984

journal article
Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Computers

Vol. C-33 (6), 518-528
https://doi.org/10.1109/tc.1984.1676475

Abstract

The rapid progress in VLSI technology has reduced the cost of hardware, allowing multiple copies of low-cost processors to provide a large amount of computational capability for a small cost. In addition to achieving high performance, high reliability is also important to ensure that the results of long computations are valid. This paper proposes a novel system-level method of achieving high reliability, called algorithm-based fault tolerance. The technique encodes data at a high level, and algorithms are designed to operate on encoded data and produce encoded output data. The computation tasks within an algorithm are appropriately distributed among multiple computation units for fault tolerance. The technique is applied to matrix computations, which form the heart of many computation-intensive tasks. Algorithm-based fault tolerance schemes are proposed to detect and correct errors when matrix operations such as addition, multiplication, scalar product, LU-decomposition, and transposition are performed using multiple-processor systems. The proposed method can detect and correct any failure within a single processor in a multiple-processor system. The number of processors needed just to detect errors in matrix multiplication is also studied.
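The encoding idea in the abstract can be illustrated with checksum matrices for multiplication. The following is a minimal sketch, not the paper's implementation: it assumes integer entries, a single process rather than a multiple-processor system, and at most one corrupted element in the data part of the result. The function names (`column_checksum`, `row_checksum`, `locate_and_correct`) are hypothetical and chosen for illustration only.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Append one extra column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def locate_and_correct(c):
    """Check a full checksum matrix c and repair one bad data element.

    Returns the (row, col) that was corrected, or None if every checksum
    is consistent. Assumption: the fault, if any, hit exactly one element
    of the data part, not a checksum entry.
    """
    n, m = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c[i][j] for i in range(n)) != c[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # Rebuild the element from the row checksum and the other entries.
        c[i][j] = c[i][m] - (sum(c[i][:m]) - c[i][j])
        return (i, j)
    raise ValueError("error pattern is not a single corrupted data element")


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# The product of a column checksum matrix and a row checksum matrix
# carries both row and column checksums of A * B.
C = matmul(column_checksum(A), row_checksum(B))

C[0][1] += 9                      # simulate a fault in one result element
fixed = locate_and_correct(C)     # inconsistent row and column intersect here
```

The faulty element sits at the intersection of the one inconsistent row and the one inconsistent column, so the checksums both locate it and supply the value needed to restore it. With floating-point data the exact equality tests would need a tolerance, which this sketch leaves out.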
