Node-covering, error-correcting codes and multiprocessors with very high average fault tolerance
- 1 January 1997
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Computers
- Vol. 46 (9) , 997-1015
- https://doi.org/10.1109/12.620481
Abstract
Structural fault tolerance (SFT) is the ability of a multiprocessor to reconfigure around faulty processors or links in order to preserve its original processor interconnection structure. In this paper, we focus on the design of SFT multiprocessors that have low switch and link overheads, but can tolerate a very large number of processor faults on the average. Most previous work has concentrated on deterministic k-fault-tolerant (k-FT) designs in which exactly k spare processors and some spare switches and links are added to construct multiprocessors that can tolerate any k processor faults. However, after k faults are reconfigured around, much of the extra links and switches can remain unutilized. It is possible within the basic node-covering framework, which was introduced by Dutt and Hayes as an efficient k-FT design method, to design FT multiprocessors that have the same amount of switches and links as, say, a two-FT deterministic design, but have s spare processors, where $s \gg 2,$ so that, on the average, k = 驴(s) (k驴s) processor failures can be reconfigured around. Such designs utilize the spare link and switch capacity very efficiently, and are called probabilistic FT designs. An elegant and powerful method to construct covering graphs or CG's, which are key to obtaining the probabilistic FT designs, is to use linear error-correcting codes (ECCs). We show how to construct probabilistic designs with very high average fault tolerance but low wiring and switch overhead using ECCs like the 2D-parity, full-two, 3D-parity, and full-three codes. This design methodology is applicable to any multiprocessor interconnection topology and the resulting FT designs have the same node degree as the non-FT target topology. We also analyze the deterministic fault tolerance for these designs and develop efficient layout strategies for them. Finally, we compare the proposed probabilistic designs to some of the best deterministic and probabilistic designs proposed in the past, and show that our designs can meet a given mean-time-to-failure (MTTF) specification at much lower hardware costs (switch complexity, redundant wiring area, and spare-processor overhead) than previous designs. Further, for a given number of spare processors, our designs have close-to-optimal reconfigurabilities that are much better than those of previous probabilistic designs.
Keywords
This publication has 21 references indexed in Scilit:
- An automorphic approach to the design of fault-tolerant multiprocessorsPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2003
- Fast polylog-time reconfiguration of structurally fault-tolerant multiprocessorsPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2002
- Node covering, error correcting codes and multiprocessors with very high average fault tolerancePublished by Institute of Electrical and Electronics Engineers (IEEE) ,2002
- A Local-Sparing Design Methodology for Fault-Tolerant MultiprocessorsaaThis research was supported in part by the Office of Naval Research under Contract N00014-85K0531 and N00014-90J1860, and in part by NSF Grants MIP-9210049 and MIP-9200526.Computers & Mathematics with Applications, 1997
- Some practical issues in the design of fault-tolerant multiprocessorsIEEE Transactions on Computers, 1992
- Designing fault-tolerant systems using automorphismsJournal of Parallel and Distributed Computing, 1991
- On designing and reconfiguring k-fault-tolerant tree architecturesIEEE Transactions on Computers, 1990
- Failure correction techniques for large disk arraysPublished by Association for Computing Machinery (ACM) ,1989
- A Graph Model for Fault-Tolerant Computing SystemsIEEE Transactions on Computers, 1976
- An $n^{5/2} $ Algorithm for Maximum Matchings in Bipartite GraphsSIAM Journal on Computing, 1973