The broadcast comparison model for on-line fault diagnosis in multicomputer systems: theory and implementation
- 1 May 1999
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Computers
- Vol. 48 (5) , 470-493
- https://doi.org/10.1109/12.769431
Abstract
This paper describes a new comparison-based model for distributed fault diagnosis in multicomputer systems with a weak reliable broadcast capability. The classical problems of diagnosability and diagnosis are both considered under this broadcast comparison model. A characterization of diagnosable systems is given, which leads to a polynomial-time diagnosability algorithm. A polynomial-time diagnosis algorithm for $t$-diagnosable systems is also given. A variation of this algorithm, which allows dynamic fault occurrence and incomplete diagnostic information, has been implemented in the COmmon Spaceborne Multicomputer Operating System (COSMOS). Results produced using a simulator for the JPL MAX multicomputer system running COSMOS show that the algorithm diagnoses all fault situations with low latency and very little overhead. These simulations demonstrate the practicality of the proposed diagnosis model and algorithm for multicomputer systems having weak reliable broadcast. This includes systems with fault-tolerant hardware for broadcast, as well as those where reliable broadcast is implemented in software.Keywords
This publication has 25 references indexed in Scilit:
- Checkpoint/rollback in a distributed system using coarse-grained dataflowPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2002
- Practical application and implementation of distributed system-level diagnosis theoryPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2002
- On self-diagnosable multiprocessor systems: diagnosis by the comparison approachIEEE Transactions on Computers, 1992
- The Delta-4 approach to dependability in open distributed computing systemsPublished by Institute of Electrical and Electronics Engineers (IEEE) ,1988
- A Diagnosis Algorithm for the BGM System Level Fault ModelIEEE Transactions on Computers, 1984
- A Polynomial Time Algorithm For Fault DiagnosabilityPublished by Institute of Electrical and Electronics Engineers (IEEE) ,1984
- Fail-stop processorsACM Transactions on Computer Systems, 1983
- Schemes for fault-tolerant computing: A comparison of modularly redundant and t-diagnosable systemsInformation and Control, 1981
- A comparison connection assignment for diagnosis of multiprocessor systemsPublished by Association for Computing Machinery (ACM) ,1980
- On the Connection Assignment Problem of Diagnosable SystemsIEEE Transactions on Electronic Computers, 1967