Algorithm-Based Fault Tolerance for Matrix Operations
- 1 June 1984
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Computers
- Vol. C-33 (6), 518-528
- https://doi.org/10.1109/tc.1984.1676475
Abstract
The rapid progress in VLSI technology has reduced the cost of hardware, allowing multiple copies of low-cost processors to provide a large amount of computational capability for a small cost. In addition to achieving high performance, high reliability is also important to ensure that the results of long computations are valid. This paper proposes a novel system-level method of achieving high reliability, called algorithm-based fault tolerance. The technique encodes data at a high level, and algorithms are designed to operate on encoded data and produce encoded output data. The computation tasks within an algorithm are appropriately distributed among multiple computation units for fault tolerance. The technique is applied to matrix computations, which form the heart of many computation-intensive tasks. Algorithm-based fault tolerance schemes are proposed to detect and correct errors when matrix operations such as addition, multiplication, scalar product, LU-decomposition, and transposition are performed using multiple processor systems. The method proposed can detect and correct any failure within a single processor in a multiple processor system. The number of processors needed to just detect errors in matrix multiplication is also studied.
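To make the idea of operating on encoded data concrete, here is a minimal Python sketch of matrix multiplication with checksum encoding. The abstract does not spell out the encoding, so the sketch assumes a simple scheme: a checksum row is appended to the left operand and a checksum column to the right operand, so that their product carries both kinds of checksum. All function names are hypothetical, the arithmetic is exact integer arithmetic, and only a single corrupted data element is handled. It is not the paper's multiprocessor mapping.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def locate_error(c_full):
    """Return (bad_rows, bad_cols): data rows and columns whose
    sums disagree with the stored checksums."""
    n = len(c_full) - 1       # number of data rows
    m = len(c_full[0]) - 1    # number of data columns
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    return bad_rows, bad_cols


def correct_single_error(c_full):
    """Repair one corrupted data element using its row checksum.

    Assumption of this sketch: the checksum entries themselves are
    correct. Any other inconsistent pattern raises ValueError.
    """
    bad_rows, bad_cols = locate_error(c_full)
    if not bad_rows and not bad_cols:
        return c_full
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        m = len(c_full[0]) - 1
        fixed = [row[:] for row in c_full]
        # The row checksum minus the other data elements gives the value.
        fixed[i][j] = c_full[i][m] - sum(
            c_full[i][k] for k in range(m) if k != j)
        return fixed
    raise ValueError("error pattern is not a single data-element error")


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    full = matmul(column_checksum(a), row_checksum(b))
    corrupted = [row[:] for row in full]
    corrupted[0][1] = 99                      # simulate one faulty result
    print(locate_error(corrupted))            # row and column of the fault
    print(correct_single_error(corrupted) == full)
```

The intersection of the one inconsistent row and the one inconsistent column pinpoints the faulty element, which is why a single error can be corrected as well as detected. With floating-point data the equality tests would need a tolerance, which this sketch leaves out.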