Resilient Distributed Computing
- 1 May 1984
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Software Engineering
- Vol. SE-10 (3), 257-268
- https://doi.org/10.1109/tse.1984.5010234
Abstract
A control abstraction called atomic action is a powerful general mechanism for ensuring consistent behavior of a system in spite of failures of individual computations running in the system, and in spite of system crashes. However, because of the "all-or-nothing" property of atomic actions, a significant amount of work might be abandoned needlessly when an internal error is encountered. This paper discusses how implementation of resilient distributed systems can be supported using a combination of nested atomic actions and stable checkpoints. Nested atomic actions form a tree structure. When an internal atomic action terminates, its results are not made permanent until the outermost atomic action commits, but they survive local node failures. Each subtree of atomic actions is recoverable individually. A checkpoint is established in stable storage as part of a remote request so that results of such a request can be reclaimed if the requesting node fails in the meantime. The paper shows how remote procedure call primitives with "at-most-once" semantics and recovery blocks can be built with these mechanisms.
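The tree structure of nested atomic actions described in the abstract can be illustrated with a minimal in-memory sketch. It is not the paper's implementation: the class `AtomicAction`, its methods, and the dictionary-backed `store` are hypothetical names chosen for illustration, and the sketch does not model stable storage, checkpoints, or node failures. It only shows that a committed inner action hands its results to its parent, that an aborted subtree is discarded on its own, and that nothing becomes permanent until the outermost action commits.

```python
class AtomicAction:
    """Hypothetical nested atomic action over a dict-backed store (illustrative only)."""

    def __init__(self, store, parent=None):
        self.store = store      # the "permanent" state, written only by the outermost commit
        self.parent = parent    # None for the outermost action
        self.writes = {}        # tentative results of this action

    def child(self):
        """Start a nested action whose results stay tentative inside this one."""
        return AtomicAction(self.store, parent=self)

    def read(self, key):
        """Read the nearest tentative value up the tree, else the permanent one."""
        action = self
        while action is not None:
            if key in action.writes:
                return action.writes[key]
            action = action.parent
        return self.store.get(key)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        """Inner commit: pass results to the parent. Outermost commit: make them permanent."""
        target = self.store if self.parent is None else self.parent.writes
        target.update(self.writes)
        self.writes = {}

    def abort(self):
        """Discard only this action's tentative results; the parent is unaffected."""
        self.writes = {}


store = {"balance": 100}
top = AtomicAction(store)

ok = top.child()
ok.write("balance", 150)
ok.commit()                       # visible to `top`, not yet permanent

bad = top.child()
bad.write("balance", -1)
bad.abort()                       # only this subtree's work is lost

before_commit = store["balance"]  # permanent state is still untouched
top.commit()
after_commit = store["balance"]   # results of the committed subtree are now permanent
```

In this sketch the abort of `bad` leaves the work done by `ok` intact, which is the property the abstract contrasts with the plain "all-or-nothing" behavior of a single flat atomic action.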
This publication has 20 references indexed in Scilit:
- Recovery Blocks in Action: A System Supporting High Reliability. Published by Springer Nature, 1985
- End-to-end arguments in system design. ACM Transactions on Computer Systems, 1984
- Guardians and Actions: Linguistic Support for Robust, Distributed Programs. ACM Transactions on Programming Languages and Systems, 1983
- Implementing atomic actions on decentralized data. ACM Transactions on Computer Systems, 1983
- Transactions and consistency in distributed database systems. ACM Transactions on Database Systems, 1982
- Performing remote operations efficiently on a local computer network. Communications of the ACM, 1982
- The Recovery Manager of the System R Database Manager. ACM Computing Surveys, 1981
- Concurrency Control in Distributed Database Systems. ACM Computing Surveys, 1981
- A transaction model. Published by Springer Nature, 1980
- The notions of consistency and predicate locks in a database system. Communications of the ACM, 1976