TFT: a software system for application-transparent fault tolerance
- 27 November 2002
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 128-137
- https://doi.org/10.1109/ftcs.1998.689462
Abstract
An important objective of software fault tolerant systems should be to provide a fault-tolerance infrastructure in a manner that minimizes the effort required by the application developer. In the limit, the objective is to provide fault tolerance transparently to the application. TFT, the work presented in this paper, provides transparent fault-tolerance at a higher interface than prior solutions. TFT coordinates replicas at the system call interface, interposing a supervisor agent between the application and the operating system. Moving the replica coordination to this interface allows uncorrelated faults within the operating system and below to be tolerated and also admits the possibility of online operating system and hardware upgrades. To accomplish its task, TFT must enforce a deterministic computation above the system call interface. The potential sources of non-determinism addressed include non-deterministic system calls, delivery of asynchronous events, and the representation of operating system abstractions that differ between replicas.
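The idea of interposing a supervisor between the application and the operating system can be illustrated with a minimal sketch. It is not taken from the paper: the `Supervisor` class, the `call` method, the primary/backup roles, and the in-memory queue standing in for a replica link are all hypothetical, and forwarding results from one replica to the other is only one plausible way to make a non-deterministic call look deterministic.

```python
import random
import time
from collections import deque


class Supervisor:
    """Hypothetical supervisor agent between an application and the OS."""

    def __init__(self, role, channel):
        self.role = role        # "primary" or "backup" (assumed roles)
        self.channel = channel  # FIFO standing in for a replica-to-replica link

    def call(self, name, fn, *args):
        """Run a non-deterministic call so both replicas see the same result."""
        if self.role == "primary":
            result = fn(*args)                   # actually perform the call
            self.channel.append((name, result))  # forward the result to the backup
            return result
        # Backup: do not perform the call; reuse the primary's logged result.
        logged_name, result = self.channel.popleft()
        if logged_name != name:
            raise RuntimeError(f"replicas diverged: {logged_name!r} != {name!r}")
        return result


def application(sup):
    """Unmodified application logic; every OS interaction goes through sup."""
    now = sup.call("time", time.time)
    nonce = sup.call("random", random.random)
    return (now, nonce)


channel = deque()
primary_out = application(Supervisor("primary", channel))
backup_out = application(Supervisor("backup", channel))
```

In this sketch the backup never executes `time.time` or `random.random` itself, so both replicas compute identical outputs even though the underlying calls would return different values on each run. Asynchronous events and differing operating system abstractions, the other sources of non-determinism named in the abstract, are not modeled here.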
This publication has 10 references indexed in Scilit:
- Tradeoffs when integrating multiple software components into a highly available application. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- Hypervisor-based fault tolerance. ACM Transactions on Computer Systems, 1996
- Manetho: transparent rollback-recovery with low overhead, limited rollback, and fast output commit. IEEE Transactions on Computers, 1992
- Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Computing Surveys, 1990
- Fault tolerance under UNIX. ACM Transactions on Computer Systems, 1989
- Fail-stop processors. ACM Transactions on Computer Systems, 1983
- The LOCUS distributed operating system. Published by Association for Computing Machinery (ACM), 1983
- A NonStop kernel. Published by Association for Computing Machinery (ACM), 1981
- Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 1978
- FTMP—A highly reliable fault-tolerant multiprocessor for aircraft. Proceedings of the IEEE, 1978