Software-based replication for fault tolerance

1 April 1997

journal article
Published by Institute of Electrical and Electronics Engineers (IEEE) in Computer

Vol. 30 (4), 68-74
https://doi.org/10.1109/2.585156

Abstract

Developers of early distributed systems took a simplistic approach to providing fault tolerance: They just used another copy of the same hardware as a backup. Later, others developed replication software to work on off-the-shelf hardware. Since neither of these methods is especially economical, a logical course is to take it one step further and eliminate the extra hardware altogether. Fully software-based replication relies on sophisticated techniques to keep track of server communications and ensure the consistency of information across several server replicas. How do you know that each server shares the same view of the data or program semantics? What happens if a server replica crashes? How do you make sure that a system processes invocations in the correct order? These are all problems that a replication technique has to handle. The authors describe two fundamental techniques, primary-backup and active replication, and illustrate how they handle these problems. At this point, both have advantages and disadvantages that depend on the application. The authors also propose that group communication provides a sufficient framework for implementing software-based replication. The concept of static and dynamic groups proves useful in thinking about how to implement replication techniques. Replication techniques can also use total-order and view-synchronous multicast primitives from group communication.
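As an illustration of the ordering problem the abstract raises, the following minimal sketch shows active replication over a total-order multicast. It is not taken from the article: the `Replica` and `Sequencer` classes, the single-register state, and the `add`/`mul` operations are all hypothetical names chosen for the example, and the sequencer is an in-process stand-in for a real group communication layer (no crashes, no network). It only demonstrates that deterministic replicas applying the same invocations in the same order reach the same state, while replicas that see different orders can diverge.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical deterministic state machine holding one integer register."""
    value: int = 0
    log: list = field(default_factory=list)

    def apply(self, op):
        # Each operation is a (kind, argument) pair; processing is deterministic.
        kind, arg = op
        if kind == "add":
            self.value += arg
        elif kind == "mul":
            self.value *= arg
        else:
            raise ValueError(f"unknown operation: {kind}")
        self.log.append(op)


class Sequencer:
    """Hypothetical stand-in for a total-order multicast primitive.

    Assumption: one in-process sequencer assigns sequence numbers and
    delivers every invocation to all replicas in that order.
    """

    def __init__(self, replicas):
        self.replicas = replicas
        self.next_seq = 0

    def multicast(self, op):
        seq = self.next_seq
        self.next_seq += 1
        for replica in self.replicas:
            replica.apply(op)
        return seq


# Active replication: every replica processes every invocation.
replicas = [Replica() for _ in range(3)]
group = Sequencer(replicas)
for op in [("add", 2), ("mul", 5), ("add", 1)]:
    group.multicast(op)
states = [r.value for r in replicas]

# Without an agreed order, two replicas receiving the same invocations
# in different orders end up in different states.
a, b = Replica(), Replica()
for op in [("add", 2), ("mul", 5)]:
    a.apply(op)
for op in [("mul", 5), ("add", 2)]:
    b.apply(op)

print(states, a.value, b.value)
```

In this sketch the three replicas agree because delivery order is fixed by the sequencer; the last two replicas disagree because addition and multiplication do not commute. A primary-backup scheme would instead have only the primary process invocations and forward state updates to the backups.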
