Distributed Recovery in Fault-Tolerant Multiprocessor Networks
- 1 October 1986
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Computers
- Vol. C-35 (10), 871-879
- https://doi.org/10.1109/tc.1986.1676678
Abstract
A methodology for characterizing dynamic distributed recovery in fault-tolerant multiprocessor systems is developed using graph theory. Distributed recovery, which is intended for systems with no central supervisor, depends on the cooperation of a set of processors to execute the recovery function, since each processor is assumed to have only a limited amount of information about the system as a whole. Facility graphs, whose nodes denote the system components (processors) and whose edges denote interconnections between components, are used to represent multiprocessor systems and error conditions. A general distributed recovery strategy R, which allows global recovery to be achieved via a sequence of local actions, is given. R recovers the system in several steps in which different nodes successively act as the local supervisor. R is specialized for two important classes of systems: loop networks and tree networks. For each of these cases, fault-tolerant designs and their associated distributed recovery strategies, which allow recovery from up to k faults within a specified number of steps, are presented.
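The idea of global recovery through a sequence of local actions can be illustrated with a small sketch for a loop network. This is not the paper's strategy R, whose details the abstract does not give. It assumes a hypothetical design in which each node has forward links to its next k+1 neighbors, and in which a node only inspects its own links when choosing a new successor. The names `build_loop`, `local_recover`, and `recover` are invented for this illustration.

```python
from typing import Dict, List, Optional, Set

def build_loop(n: int, k: int) -> Dict[int, List[int]]:
    """Facility graph (assumed design): node i links forward to i+1 .. i+k+1 (mod n)."""
    return {i: [(i + d) % n for d in range(1, k + 2)] for i in range(n)}

def local_recover(node: int, links: Dict[int, List[int]], faulty: Set[int]) -> Optional[int]:
    """Local action: the node picks the nearest non-faulty successor among its own links."""
    for succ in links[node]:
        if succ not in faulty:
            return succ
    return None  # every link of this node leads to a faulty node

def recover(links: Dict[int, List[int]], faulty: Set[int]) -> Optional[List[int]]:
    """Rebuild a logical loop over the healthy nodes, one local action at a time.

    Returns the healthy nodes in loop order, or None if the loop cannot be closed.
    """
    healthy = [v for v in links if v not in faulty]
    if not healthy:
        return None
    ring = [healthy[0]]
    # Each iteration hands the supervisor role to the node chosen by the previous one.
    for _ in range(len(healthy)):
        nxt = local_recover(ring[-1], links, faulty)
        if nxt is None:
            return None
        if nxt == ring[0]:
            return ring if len(ring) == len(healthy) else None
        ring.append(nxt)
    return None

if __name__ == "__main__":
    links = build_loop(n=8, k=2)
    print(recover(links, faulty={2, 3}))     # two adjacent faults are bypassed
    print(recover(links, faulty={2, 3, 4}))  # three adjacent faults exceed k = 2
```

In this sketch, up to k consecutive faulty nodes can be bypassed because the longest forward link spans k+1 positions; a run of k+1 adjacent faults leaves the preceding node with no healthy successor, and recovery fails.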