Recursive restartability: turning the reboot sledgehammer into a scalpel
- 25 August 2005
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
Abstract
Even after decades of software engineering research, complex computer systems still fail, primarily due to nondeterministic bugs that are typically resolved by rebooting. Conceding that Heisenbugs will remain a fact of life, we propose a systematic investigation of restarts as "high availability medicine." In this paper we show how recursive restartability (RR), the ability of a system to gracefully tolerate restarts at multiple levels, improves fault tolerance, reduces time-to-repair, and enables system designers to build flexible, highly available software infrastructures. Using several examples of widely deployed software systems, we identify properties that are required of RR systems and outline an agenda for turning the recursive restartability philosophy into a practical software structuring tool. Finally, we describe infrastructural support for RR systems, along with initial ideas on how to analyze and benchmark such systems.