Abstract
A probabilistic setting for distributed system-level diagnosis is studied. Processors in a multiprocessor system fail independently with constant probability 0 < p < 1/2. They can test their neighbors and a fault-free processor has probability 0 < q < 1 of detecting a fault of a failed processor in a single test. The aim of distributed diagnosis is for every fault-free processor to know the status of every other processor. This is achieved by communicating test results throughout the system. Faulty processors can behave arbitrarily, even maliciously, both as testers and as message transmitters. The author proposes a diagnosis scheme using O(n) time, O(n log n) tests, O(n/sup 2/) message bits and working correctly with probability converging to 1, as n grows. It is shown that each of these three characteristics has the lowest possible order of magnitude.

This publication has 11 references indexed in Scilit: