Implementation of online distributed system-level diagnosis theory
- 1 May 1992
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Computers
- Vol. 41 (5), 616-626
- https://doi.org/10.1109/12.142688
Abstract
There has been significant theoretical research in the area of system-level diagnosis. This paper documents the first practical application and implementation of on-line distributed system-level diagnosis theory. Proven distributed diagnosis algorithms are shown to be impractical in real systems due to high resource requirements. A new distributed system-level diagnosis algorithm, called Adaptive DSD, is shown to minimize network resources and has resulted in a practical implementation. Adaptive DSD assumes a distributed network in which network nodes can test other nodes and determine them to be faulty or fault-free. Tests are issued from each node adaptively and depend on the fault situation of the network. Test result reports are generated from test results and forwarded between nodes in the network. Adaptive DSD is proven correct in that each fault-free node reaches an accurate independent diagnosis of the fault conditions of the remaining nodes. No restriction is placed on the number of faulty nodes; any fault situation with any number of faulty nodes is diagnosed correctly. The Adaptive DSD algorithm is implemented and currently monitors over 200 workstations in the Electrical and Computer Engineering Department at Carnegie Mellon University. The algorithm has executed continuously for the past year, even though no single workstation has remained fault-free over that period. Key results of this paper include: an overview of previous distributed system-level diagnosis algorithms, the specification of a new adaptive distributed system-level diagnosis algorithm, its comparison to previous centralized adaptive and distributed non-adaptive schemes, its application to an actual distributed network environment, and the experimentation within that environment.
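The adaptive testing and report forwarding described in the abstract can be illustrated with a minimal simulation. This is a sketch under stated assumptions, not the paper's specification: it assumes nodes are arranged in a logical ring, that each fault-free node tests its successors in order until it finds a fault-free one, and that a test by a fault-free node always reports the true status. The names `next_fault_free`, `run_testing_round`, and `diagnose` are hypothetical, and the `faulty` set simply stands in for real test outcomes.

```python
def next_fault_free(tester, faulty, n):
    """Test successors in ring order and return the first fault-free node.

    Assumption: a test is modelled as a lookup in the `faulty` set.
    Returns None when every other node is faulty.
    """
    for step in range(1, n):
        candidate = (tester + step) % n
        if candidate not in faulty:
            return candidate
    return None


def run_testing_round(n, faulty):
    """Each fault-free node adaptively tests until it finds a fault-free node."""
    return {
        node: next_fault_free(node, faulty, n)
        for node in range(n)
        if node not in faulty
    }


def diagnose(start, tested_up, n):
    """Build one node's independent diagnosis from forwarded test reports.

    `tested_up[x]` is the fault-free node that x reported testing.
    Following these reports around the ring visits every fault-free node;
    any node never visited is diagnosed as faulty.
    """
    state = {node: "faulty" for node in range(n)}
    state[start] = "fault-free"
    current = tested_up[start]
    while current is not None and current != start:
        state[current] = "fault-free"
        current = tested_up[current]
    return state


if __name__ == "__main__":
    reports = run_testing_round(6, faulty={1, 2, 4})
    print(reports)
    print(diagnose(0, reports, 6))
```

In this sketch the number of faulty nodes is unrestricted: even with a single fault-free node, that node still produces a diagnosis marking every other node as faulty.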