A fault detection service for wide area distributed computations
- 27 November 2002
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- No. 10828907, pp. 268-278
- https://doi.org/10.1109/hpdc.1998.709981
Abstract
The potential for faults in distributed computing systems is a significant complicating factor for application developers. While a variety of techniques exist for detecting and correcting faults, implementing these techniques in a particular context can be difficult. Hence, we propose a fault detection service designed to be incorporated, in a modular fashion, into distributed computing systems, tools, or applications. This service uses well-known techniques based on unreliable fault detectors to detect and report component failure, while allowing the user to trade off timeliness of reporting against false positive rates. We describe the architecture of this service, report on experimental results that quantify its cost and accuracy, and describe its use in two applications: monitoring the status of system components of the GUSTO computational grid testbed, and as part of the NetSolve network-enabled numerical solver.
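The trade-off between timeliness and false positives mentioned in the abstract can be illustrated with a minimal heartbeat-timeout sketch. This is not the service's actual implementation, which the abstract does not detail; the class `HeartbeatDetector`, the helper `replay`, the component names, and all arrival times are hypothetical and chosen only for illustration. The sketch assumes a component is suspected once no heartbeat has been seen for longer than a user-chosen timeout.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatDetector:
    """Hypothetical unreliable detector: suspects a component once no
    heartbeat has been recorded for more than `timeout` seconds."""
    timeout: float
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, component: str, now: float) -> None:
        # Record the most recent heartbeat time for this component.
        self.last_seen[component] = now

    def suspected(self, now: float) -> List[str]:
        # A suspicion may be wrong (a delayed heartbeat) and may be
        # withdrawn later when a new heartbeat arrives.
        return sorted(c for c, t in self.last_seen.items()
                      if now - t > self.timeout)


def replay(timeout: float, arrivals: Dict[str, List[float]],
           check_at: float) -> List[str]:
    """Feed all heartbeats that arrived up to `check_at` (times in
    ascending order) and return the components suspected at that time."""
    detector = HeartbeatDetector(timeout)
    for component, times in arrivals.items():
        for t in times:
            if t <= check_at:
                detector.heartbeat(component, t)
    return detector.suspected(check_at)


# Made-up data: "a" is alive but one heartbeat is delayed (gap 1.0 -> 3.5);
# "b" really stops sending after t = 1.0.
ARRIVALS = {"a": [0.0, 1.0, 3.5, 4.5, 5.5], "b": [0.0, 1.0]}

if __name__ == "__main__":
    for timeout in (2.0, 4.0):
        for check_at in (3.2, 5.6):
            print(timeout, check_at, replay(timeout, ARRIVALS, check_at))
```

With the short timeout, the failed component "b" is reported early, but the live component "a" is also suspected until its delayed heartbeat arrives. With the longer timeout, "a" is never wrongly suspected, at the cost of reporting "b" later.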