Software-based replication for fault tolerance
- 1 April 1997
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in Computer
- Vol. 30 (4), 68-74
- https://doi.org/10.1109/2.585156
Abstract
Developers of early distributed systems took a simplistic approach to providing fault tolerance: They just used another copy of the same hardware as a backup. Later, others developed replication software to work on off-the-shelf hardware. Since neither of these methods is especially economical, a logical course is to take it one step further and eliminate the extra hardware altogether. Fully software-based replication relies on sophisticated techniques to keep track of server communications and ensure the consistency of information across several server replicas. How do you know that each server shares the same view of the data or program semantics? What happens if a server replica crashes? How do you make sure that a system processes invocations in the correct order? These are all problems that a replication technique has to handle. The authors describe two fundamental techniques, primary-backup and active replication, and illustrate how they handle these problems. At this point, both have advantages and disadvantages that depend on the application. The authors also propose that group communication provides a sufficient framework for implementing software-based replication. The concept of static and dynamic groups proves useful in thinking about how to implement replication techniques. Replication techniques can also use total-order and view-synchronous multicast primitives from group communication.
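As a rough illustration of the two techniques named in the abstract, the following Python sketch contrasts them on a toy in-memory key-value service. It is not the authors' protocol: the names `Replica`, `primary_backup_invoke`, and `active_invoke` are hypothetical, crashes are modeled by an `alive` flag, and a single Python list stands in for the order that a total-order multicast primitive would deliver in a real system.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A deterministic key-value state machine (hypothetical, for illustration)."""
    name: str
    state: dict = field(default_factory=dict)
    alive: bool = True

    def apply(self, op):
        # An operation is a (key, value) write; applying it is deterministic.
        key, value = op
        self.state[key] = value
        return ("ok", key, value)


def primary_backup_invoke(replicas, op):
    """Primary-backup: only the primary executes, then it updates the backups."""
    live = [r for r in replicas if r.alive]
    if not live:
        raise RuntimeError("no live replica")
    # Assumption: the first live replica acts as primary, so a crashed
    # primary is replaced by the next live one on the following invocation.
    primary, backups = live[0], live[1:]
    reply = primary.apply(op)
    for backup in backups:
        # Backups receive the resulting state; they do not re-execute the op.
        backup.state = dict(primary.state)
    return reply


def active_invoke(replicas, ordered_ops):
    """Active replication: every live replica executes every op in the same order."""
    replies = []
    for op in ordered_ops:  # the list order stands in for total-order delivery
        per_replica = [r.apply(op) for r in replicas if r.alive]
        if not per_replica:
            raise RuntimeError("no live replica")
        # Deterministic replicas return identical replies; keep the first one.
        replies.append(per_replica[0])
    return replies
```

In this sketch the primary-backup path survives a primary crash only because the next live replica already holds the latest state, while the active path needs no failover step but depends on every replica being deterministic and seeing the same order of invocations.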