Efficient distributed diagnosis in the presence of random faults

30 December 2002

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

No. 07313071,p. 462-469
https://doi.org/10.1109/ftcs.1993.627349

Abstract

A probabilistic setting for distributed system-level diagnosis is studied. Processors in a multiprocessor system fail independently with constant probability 0 < p < 1/2. They can test their neighbors and a fault-free processor has probability 0 < q < 1 of detecting a fault of a failed processor in a single test. The aim of distributed diagnosis is for every fault-free processor to know the status of every other processor. This is achieved by communicating test results throughout the system. Faulty processors can behave arbitrarily, even maliciously, both as testers and as message transmitters. The author proposes a diagnosis scheme using O(n) time, O(n log n) tests, O(n/sup 2/) message bits and working correctly with probability converging to 1, as n grows. It is shown that each of these three characteristics has the lowest possible order of magnitude.

Keywords

This publication has 11 references indexed in Scilit:

Almost certain diagnosis for intermittently faulty systems
Published by Institute of Electrical and Electronics Engineers (IEEE) ,2003
A probabilistic method for fault diagnosis of multiprocessor systems
Published by Institute of Electrical and Electronics Engineers (IEEE) ,2003
On minimizing testing rounds for fault identification
Published by Institute of Electrical and Electronics Engineers (IEEE) ,2003
Distributed probabilistic fault diagnosis for multiprocessor systems
Published by Institute of Electrical and Electronics Engineers (IEEE) ,2002
A guided tour of chernoff bounds
Information Processing Letters, 1990
Almost Sure Fault Tolerance in Random Graphs
SIAM Journal on Computing, 1987
The Comparison Approach to Multiprocessor Fault Diagnosis
IEEE Transactions on Computers, 1987
Distributed fault-tolerance for large multiprocessor systems
Published by Association for Computing Machinery (ACM) ,1980
On Models for Diagnosable Systems and Probabilistic Fault Diagnosis
IEEE Transactions on Computers, 1976
On the Connection Assignment Problem of Diagnosable Systems
IEEE Transactions on Electronic Computers, 1967