Supporting nondeterministic execution in fault-tolerant systems
- 23 December 2002
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 250-259
- https://doi.org/10.1109/ftcs.1996.534611
Abstract
We present a technique to track nondeterminism resulting from asynchronous events and multithreading in log-based rollback-recovery protocols. This technique relies on using a software counter to compute the number of instructions between nondeterministic events in normal operation. Should a failure occur, the instruction counts are used to force the replay of these events at the same execution points. The execution of the application thus can be replayed to recreate the pre-failure state, while accommodating uncontrolled nondeterminism during normal operation. Implementation on a DEC Alpha processor shows that this support has a low overhead, typically less than 6% increase in running time for the applications we studied.
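
The idea of logging instruction counts and replaying events at the same execution points can be illustrated with a toy model. The sketch below is not the paper's implementation: the loop index merely stands in for the software instruction counter, and every name in it (`Event`, `step`, `handle`, `run_normal`, `replay`) is hypothetical. It assumes the application is deterministic between events, so that redelivering each event at its logged count reproduces the pre-failure state.

```python
import random
from dataclasses import dataclass


@dataclass
class Event:
    count: int    # instructions executed when the event was delivered
    payload: int  # data carried by the asynchronous event


def step(state: int, i: int) -> int:
    """One deterministic 'instruction' of the toy application."""
    return (state * 31 + i) % 1_000_003


def handle(state: int, payload: int) -> int:
    """Event handler; its effect on the state depends on when it runs."""
    return state ^ payload


def run_normal(n_instructions: int, rng: random.Random) -> tuple[int, list[Event]]:
    """Normal operation: events arrive at unpredictable points and are logged."""
    state = 0
    log: list[Event] = []
    for count in range(n_instructions):
        if rng.random() < 0.05:  # an asynchronous event happens to arrive here
            event = Event(count, rng.randrange(1 << 16))
            log.append(event)    # record (instruction count, payload)
            state = handle(state, event.payload)
        state = step(state, count)
    return state, log


def replay(n_instructions: int, log: list[Event]) -> int:
    """Recovery: redeliver each logged event at the same instruction count."""
    state = 0
    pending = list(log)
    for count in range(n_instructions):
        while pending and pending[0].count == count:
            state = handle(state, pending.pop(0).payload)
        state = step(state, count)
    return state


if __name__ == "__main__":
    final_state, event_log = run_normal(1000, random.Random())
    # Replaying from the log recreates the state reached before the "failure".
    assert replay(1000, event_log) == final_state
```

In this model the only information that must survive a failure is the event log; the random arrival times themselves never need to be reproduced, because the logged counts pin each event to its original execution point.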