Group membership failure detection: a simple protocol and its probabilistic analysis

1 September 1999

journal article
Published by IOP Publishing in Distributed Systems Engineering

Vol. 6 (3), 95-102
https://doi.org/10.1088/0967-1846/6/3/301

Abstract

A group membership failure (in short, a group failure) occurs when one of the group members crashes. A group failure detection protocol must inform all non-crashed members of the group that a group failure has occurred. Ideally, such a protocol should be live (if a process crashes, then the group failure must be detected) and safe (if a group failure is claimed, then at least one process has crashed). In unreliable asynchronous distributed systems, however, a process cannot obtain an accurate view of the system state. Consequently, no group failure detection protocol can be both safe and live in all runs of an asynchronous distributed system. This paper analyses a group failure detection protocol whose design naturally ensures liveness. We show that, by appropriately tuning some of its duration-related parameters, the safety property can be guaranteed with a probability as close to one as desired. The analysis shows that, in real distributed systems, failure detection can be achieved with a negligible probability of wrong suspicions.
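The trade-off described in the abstract can be illustrated with a small sketch. It is not the protocol or the analysis from the paper, whose details are not given in this record. It assumes a hypothetical timeout-based detector in which a live member is wrongly suspected when one of its messages is delayed beyond the timeout, and it assumes, purely for illustration, that message delays are exponentially distributed. The function names and the delay model are assumptions.

```python
import math


def wrong_suspicion_probability(timeout: float, mean_delay: float) -> float:
    """Probability that one message from a live member arrives after `timeout`.

    Assumed model (not from the paper): delays are exponentially
    distributed with mean `mean_delay`, so P(delay > t) = exp(-t / mean).
    """
    if timeout < 0 or mean_delay <= 0:
        raise ValueError("timeout must be >= 0 and mean_delay must be > 0")
    return math.exp(-timeout / mean_delay)


def timeout_for_target(target: float, mean_delay: float) -> float:
    """Smallest timeout whose wrong-suspicion probability is at most `target`.

    Inverts the assumed exponential tail: t = -mean * ln(target).
    """
    if not 0 < target < 1 or mean_delay <= 0:
        raise ValueError("target must be in (0, 1) and mean_delay must be > 0")
    return -mean_delay * math.log(target)


if __name__ == "__main__":
    mean = 0.05  # hypothetical mean message delay, in seconds
    for target in (1e-3, 1e-6, 1e-9):
        t = timeout_for_target(target, mean)
        print(f"target={target:g}  timeout={t:.3f}s")
```

In this toy model liveness comes for free: a crashed member sends nothing, so any finite timeout eventually expires. Safety holds only probabilistically, and lengthening the timeout pushes the probability of a wrong suspicion as close to zero as desired, at the cost of slower detection.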
