Principles of fault tolerance
- 23 December 2002
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- Vol. 1, pp. 18-25
- https://doi.org/10.1109/apec.1996.500416
Abstract
The demand for continuously available electronic systems increases every day. Transaction processing, communications systems, and critical processes all require nonstop, fault tolerant operation. Creating a fault tolerant or highly available system can be achieved by following four basic principles: redundancy, fault isolation, fault detection and annunciation, and on-line repair. This paper is a tutorial that presents those four principles after reviewing some fundamentals of reliability and availability. It concludes with an expanded discussion on implementing redundancy. Special considerations for high availability and fault tolerance in distributed power systems are highlighted.
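The abstract does not reproduce the paper's formulas. As a rough illustration of the availability fundamentals and the redundancy principle it mentions, the sketch below uses the standard textbook relations: steady-state availability as MTBF / (MTBF + MTTR), and the availability of a redundant group that needs at least k of n modules working. The function names, the assumption of identical and independent modules, and the binomial model are illustrative assumptions, not taken from the paper.

```python
from math import comb


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one repairable module.

    Textbook relation A = MTBF / (MTBF + MTTR); assumed here, not quoted
    from the paper.
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


def k_of_n_availability(k: int, n: int, module_availability: float) -> float:
    """Probability that at least k of n modules are up.

    Assumes identical modules that fail and are repaired independently
    (a simplification; shared fault paths would violate it).
    """
    a = module_availability
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Hypothetical power module: 100,000 h MTBF, 4 h to swap on-line.
    a = availability(100_000, 4)
    # Load needs 3 modules; compare no redundancy (3 of 3) with N+1 (3 of 4).
    print("single module:", a)
    print("3 of 3       :", k_of_n_availability(3, 3, a))
    print("3 of 4 (N+1) :", k_of_n_availability(3, 4, a))
```

Under these assumptions, adding one spare module raises group availability only because a failed module can be detected and replaced while the others carry the load, which is why redundancy is listed alongside fault isolation, fault detection, and on-line repair.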
This publication has 5 references indexed in Scilit:
- An intelligent, fault tolerant, high power, distributed power system for massively parallel processing computers. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- Fault tolerant hot-pluggable power system design. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- Power system design for massive parallel computer systems. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- Fault tolerance in distributed power systems. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- Availability, MTBF and MTTR for Repairable M out of N System. IEEE Transactions on Reliability, 1981