Cheap recovery
- 1 February 2005
- journal article
- Published by Association for Computing Machinery (ACM) in ACM Transactions on Storage
- Vol. 1 (1), 38-70
- https://doi.org/10.1145/1044956.1044959
Abstract
Cluster hash tables (CHTs) are key components of many large-scale Internet services due to their highly scalable performance and the prevalence of the type of data they store. Another advantage of CHTs is that they can be designed to be as self-managing as a cluster of stateless servers. One key to achieving this extreme manageability is reboot-based recovery that is predictably fast and has modest impact on system performance and availability. This "cheap" recovery mechanism simplifies management in two ways. First, it simplifies failure detection by lowering the cost of acting on false positives. This enables one to use statistical techniques to turn hard-to-catch failures, such as node degradation, into failure, followed by recovery. Second, cheap recovery simplifies capacity planning by recasting repartitioning as failure plus recovery to achieve zero-downtime incremental scaling. These low-cost recovery and scaling mechanisms make it possible for the system to be continuously self-adjusting, a key property of self-managing systems.