Perfect failure detection in timed asynchronous systems

19 February 2003

journal article
Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Computers

Vol. 52 (2), 99-112
https://doi.org/10.1109/tc.2003.1176979

Abstract

Perfect failure detectors can correctly decide whether a computer has crashed. However, a perfect failure detector cannot be implemented in a purely asynchronous system. We show how to enforce perfect failure detection in timed asynchronous systems with hardware watchdogs. The system model rests on two main assumptions: 1) each computer can measure time intervals with a known maximum error and 2) each computer has a watchdog that crashes the computer unless the watchdog is periodically updated. We have implemented a system that satisfies both assumptions using a combination of off-the-shelf software and hardware. We also show that, in some systems, a hardware watchdog is not necessary to implement a perfect failure detector for process crash failures.
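
The interplay of the two assumptions can be illustrated with a small Python sketch. It is not the protocol from the article, only a simplified lease-style illustration under stated assumptions: the "known maximum error" is modeled as a bounded clock drift rate `rho`, the monitored computer re-arms its watchdog only for a fixed `lease` measured from the moment it sent its last request, and the observer measures from the moment it received that request. The names `observer_wait`, `Watchdog`, and `Observer` are hypothetical.

```python
from dataclasses import dataclass


def observer_wait(lease: float, rho: float) -> float:
    """Local-clock duration the observer waits before declaring a crash.

    Assumption: every clock drifts from real time by a rate of at most rho.
    A lease measured on the monitored computer then lasts at most
    lease / (1 - rho) in real time, and the observer's own clock may run
    fast by a factor of up to (1 + rho).
    """
    if not 0 <= rho < 1:
        raise ValueError("rho must be in [0, 1)")
    return lease * (1 + rho) / (1 - rho)


@dataclass
class Watchdog:
    """Toy model of a hardware watchdog; times are local clock readings."""
    deadline: float

    def rearm(self, request_sent_at: float, lease: float) -> None:
        # The deadline counts from when the request was SENT, not from when
        # the grant arrived, so message delay cannot stretch the lease.
        self.deadline = max(self.deadline, request_sent_at + lease)

    def has_fired(self, now: float) -> bool:
        # Once the deadline passes without a re-arm, the computer is crashed.
        return now >= self.deadline


@dataclass
class Observer:
    """Toy observer that grants leases and decides when a crash is certain."""
    lease: float
    rho: float
    last_request_received_at: float

    def on_request(self, now: float) -> None:
        # Receiving a request happens after it was sent, so measuring from
        # here is conservative with respect to the watchdog deadline.
        self.last_request_received_at = now

    def may_declare_crashed(self, now: float) -> bool:
        elapsed = now - self.last_request_received_at
        return elapsed >= observer_wait(self.lease, self.rho)
```

In this sketch the observer never suspects a running computer: by the time `may_declare_crashed` returns true, the watchdog deadline tied to the last granted lease has passed in real time, so the computer has either renewed its lease with the observer or been crashed by its watchdog.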
