How fail-stop are faulty programs?
- 27 November 2002
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 240-249
- https://doi.org/10.1109/ftcs.1998.689475
Abstract
Most fault-tolerant systems are designed to stop faulty programs before they write permanent data or communicate with other processes. This property (halt-on-failure) forms the core of the fail-stop model. Unfortunately, little experimental data exists on whether program failures follow the fail-stop model. This paper describes a tool, based on the SimOS complete-machine simulator, that can trace how faults propagate through memory, disk, and functions. Using this tool on the Postgres database system, we conduct a controlled experiment to measure how often faulty programs violate the fail-stop model. We find that a significant number of faults (7%) violate the fail-stop model by writing incorrect data to stable storage before halting. We then apply Postgres' transaction mechanism to undo recent changes before a crash and find that transactions reduce fail-stop violations by a factor of 3.