Coding for high availability of a distributed-parallel storage system
- 1 January 1998
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Parallel and Distributed Systems
- Vol. 9 (12), 1237-1252
- https://doi.org/10.1109/71.737699
Abstract
We have developed a distributed parallel storage system that employs the aggregate bandwidth of multiple data servers connected by a high-speed wide-area network to achieve scalability and high data throughput. This paper studies different schemes to enhance the reliability and availability of such network-based distributed storage systems. The general approach of this paper employs "erasure" error-correcting codes that can be used to reconstruct missing information caused by hardware, software, or human faults. The paper describes the approach and develops optimized algorithms for the encoding and decoding operations. Moreover, the paper presents techniques for reducing the communication and computation overhead incurred while reconstructing missing data from the redundant information. These techniques include clustering, multidimensional coding, and the full two-dimensional parity schemes. The paper considers trade-offs between redundancy, fault tolerance, and complexity of error recovery.
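As a rough illustration of the erasure-coding idea mentioned in the abstract, the sketch below uses the simplest such code: a single XOR parity block over a group of equal-sized data blocks. It is not the paper's algorithm. The function names (`xor_blocks`, `encode`, `reconstruct`) and the convention that a lost block is represented by `None` are assumptions made for this example. Each list position can be thought of as a block held by a different data server.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a non-empty list of equal-length blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def encode(data_blocks):
    """Append one parity block (hypothetical layout: parity stored last)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(stripe):
    """Rebuild a stripe in which at most one block is missing (None).

    With single parity, the missing block equals the XOR of all survivors,
    whether the lost block held data or the parity itself.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one missing block")
    repaired = list(stripe)
    if missing:
        survivors = [block for block in stripe if block is not None]
        repaired[missing[0]] = xor_blocks(survivors)
    return repaired


if __name__ == "__main__":
    stripe = encode([b"abcd", b"efgh", b"ijkl"])
    damaged = list(stripe)
    damaged[1] = None  # simulate one unavailable server
    assert reconstruct(damaged) == stripe
```

Note that repairing one block requires reading every surviving block in the group, which hints at why the abstract is concerned with the communication and computation overhead of reconstruction and with the trade-off between redundancy and fault tolerance.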
This publication has 11 references indexed in Scilit:
- Using high speed networks to enable distributed parallel image server systems. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- Intrusion tolerance in distributed computing systems. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- The MAGIC project: from vision to reality. IEEE Network, 1996
- EVENODD: an efficient scheme for tolerating double disk failures in RAID architectures. IEEE Transactions on Computers, 1995
- Distributed parallel data storage systems. Published by Association for Computing Machinery (ACM), 1994
- Reliable broadband communication using a burst erasure correcting code. Published by Association for Computing Machinery (ACM), 1990
- Coda: a highly available file system for a distributed workstation environment. IEEE Transactions on Computers, 1990
- Efficient dispersal of information for security, load balancing, and fault tolerance. Journal of the ACM, 1989
- A case for redundant arrays of inexpensive disks (RAID). Published by Association for Computing Machinery (ACM), 1988
- Error-Correction Coding for Digital Communications. Published by Springer Nature, 1981