Abstract
An important objective of software fault-tolerant systems should be to provide a fault-tolerance infrastructure in a manner that minimizes the effort required of the application developer. In the limit, the objective is to provide fault tolerance transparently to the application. TFT, the work presented in this paper, provides transparent fault tolerance at a higher interface than prior solutions. TFT coordinates replicas at the system call interface, interposing a supervisor agent between the application and the operating system. Moving replica coordination to this interface allows uncorrelated faults within the operating system and below to be tolerated, and also admits the possibility of online operating system and hardware upgrades. To accomplish its task, TFT must enforce a deterministic computation above the system call interface. The potential sources of non-determinism addressed include non-deterministic system calls, delivery of asynchronous events, and the representation of operating system abstractions that differ between replicas.
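To make the idea of coordinating replicas at the system call interface concrete, the following is a minimal sketch and not TFT's actual implementation. It assumes a simple primary/backup arrangement in which the primary's supervisor executes each non-deterministic call and forwards the result, and the backup's supervisor returns the forwarded result instead of calling its own operating system. The names `Supervisor`, `syscall`, `application`, and `run_replicas` are hypothetical, and an in-memory queue stands in for the link between replicas.

```python
import os
import time
from collections import deque


class Supervisor:
    """Hypothetical supervisor agent sitting between the application and the OS."""

    def __init__(self, role, channel):
        self.role = role        # "primary" or "backup" (assumed roles)
        self.channel = channel  # in-memory stand-in for the replica link

    def syscall(self, name, fn, *args):
        if self.role == "primary":
            # The primary really performs the call and forwards the result.
            result = fn(*args)
            self.channel.append((name, result))
            return result
        # The backup does not call the OS; it replays the primary's result.
        logged_name, result = self.channel.popleft()
        if logged_name != name:
            raise RuntimeError(f"replicas diverged: {logged_name!r} vs {name!r}")
        return result


def application(sup):
    """Unmodified application logic; every OS interaction goes through sup."""
    now = sup.syscall("time", time.time)          # non-deterministic call
    nonce = sup.syscall("urandom", os.urandom, 8)  # non-deterministic call
    return (now, nonce)


def run_replicas():
    channel = deque()
    primary_out = application(Supervisor("primary", channel))
    backup_out = application(Supervisor("backup", channel))
    return primary_out, backup_out, len(channel)


if __name__ == "__main__":
    p, b, pending = run_replicas()
    print("replicas agree:", p == b, "| unconsumed results:", pending)
```

Because the backup consumes the primary's results rather than reading its own clock or entropy source, both replicas observe the same values above the interposed interface. Asynchronous event delivery and differing operating system abstractions, which the abstract also lists, are not modeled in this sketch.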
