Reliability mechanisms for very large storage systems
- 27 August 2003
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
Abstract
Reliability and availability are increasingly important in large-scale storage systems built from thousands of individual storage devices. Large systems must survive the failure of individual components; in systems with thousands of disks, even infrequent failures are likely in some device. We focus on two types of errors: nonrecoverable read errors and drive failures. We discuss mechanisms for detecting and recovering from such errors, introducing improved techniques for detecting errors in disk reads and fast recovery from disk failure. We show that simple RAID cannot guarantee sufficient reliability; our analysis examines the tradeoffs among other schemes between system availability and storage efficiency. Based on our data, we believe that two-way mirroring should be sufficient for most large storage systems. For those that need very high reliability, we recommend either three-way mirroring or mirroring combined with RAID.
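The abstract's observation that even infrequent failures become likely somewhere in a system of thousands of disks can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes device failures are independent, and both the function name `prob_any_failure` and the per-device failure probability of 1% are hypothetical values chosen only for illustration.

```python
def prob_any_failure(per_device_prob: float, num_devices: int) -> float:
    """Probability that at least one of num_devices devices fails.

    Assumes failures are independent and equally likely on every device,
    which is a simplification made for this illustration.
    """
    return 1.0 - (1.0 - per_device_prob) ** num_devices


if __name__ == "__main__":
    # Hypothetical per-device failure probability over some fixed period.
    p = 0.01
    for n in (1, 100, 1000, 10000):
        chance = prob_any_failure(p, n)
        print(f"{n:>6} devices: P(at least one failure) = {chance:.4f}")
```

Under these assumptions the chance of at least one failure rises quickly with the number of devices, which is the motivation the abstract gives for redundancy schemes such as mirroring.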