Improving the reliability of commodity operating systems
- 2 February 2005
- journal article
- Published by Association for Computing Machinery (ACM) in ACM Transactions on Computer Systems
- Vol. 23 (1), 77-110
- https://doi.org/10.1145/1047915.1047919
Abstract
Despite decades of research in extensible operating system technology, extensions such as device drivers remain a significant cause of system failures. In Windows XP, for example, drivers account for 85% of recently reported failures.
This article describes Nooks, a reliability subsystem that seeks to greatly enhance operating system (OS) reliability by isolating the OS from driver failures. The Nooks approach is practical: rather than guaranteeing complete fault tolerance through a new (and incompatible) OS or driver architecture, our goal is to prevent the vast majority of driver-caused crashes with little or no change to the existing driver and system code. Nooks isolates drivers within lightweight protection domains inside the kernel address space, where hardware and software prevent them from corrupting the kernel. Nooks also tracks a driver's use of kernel resources to facilitate automatic cleanup during recovery.
To prove the viability of our approach, we implemented Nooks in the Linux operating system and used it to fault-isolate several device drivers. Our results show that Nooks offers a substantial increase in the reliability of operating systems, catching and quickly recovering from many faults that would otherwise crash the system. Under a wide range and number of fault conditions, we show that Nooks recovers automatically from 99% of the faults that otherwise cause Linux to crash.
While Nooks was designed for drivers, our techniques generalize to other kernel extensions. We demonstrate this by isolating a kernel-mode file system and an in-kernel Internet service. Overall, because Nooks supports existing C-language extensions, runs on a commodity operating system and hardware, and enables automated recovery, it represents a substantial step beyond the specialized architectures and type-safe languages required by previous efforts directed at safe extensibility.