Application-transparent process-level error recovery for multicomputers
- 7 January 2003
- proceedings article
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- Vol. 1, 296-305
- https://doi.org/10.1109/hicss.1989.47170
Abstract
A multicomputer system consisting of hundreds of processors interconnected by point-to-point links can achieve high performance for many important applications. We propose a new application-transparent, process-level, distributed error recovery scheme for multicomputers. Checkpointing is initiated by timers at intervals determined by the needs of the application. Checkpointing and recovery involve only as much of the system as is necessary: a set of interacting processes. Processes which are not part of the interacting set do not participate in checkpointing or recovery and continue to do useful work. Several checkpoint and/or recovery sessions may be active simultaneously. The scheme does not require significant overhead during normal operation since it is not necessary to make message transmission atomic, acknowledge each message, or transmit check bits with each packet. We discuss variations of our technique using packet-switching or virtual circuits, and compare our scheme to previously published techniques.
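The abstract does not spell out how the set of interacting processes is determined. As a rough illustration only, the sketch below assumes the set is the group of processes connected to the checkpoint initiator through messages exchanged since the last checkpoint. The function name `interacting_set`, the `(sender, receiver)` message log, and the process labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict, deque


def interacting_set(initiator, messages):
    """Return the processes linked to `initiator` by message exchange.

    `messages` is an assumed log of (sender, receiver) pairs recorded
    since the last checkpoint; this is an illustration, not the
    paper's actual protocol.
    """
    # Build an undirected communication graph from the message log.
    neighbors = defaultdict(set)
    for sender, receiver in messages:
        neighbors[sender].add(receiver)
        neighbors[receiver].add(sender)

    # Breadth-first search from the initiator collects every process
    # that has interacted with it, directly or transitively.
    seen = {initiator}
    queue = deque([initiator])
    while queue:
        process = queue.popleft()
        for peer in neighbors[process]:
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen


# Hypothetical log: A, B, C have interacted; D and E form a separate group.
messages = [("A", "B"), ("B", "C"), ("D", "E")]
print(sorted(interacting_set("A", messages)))  # ['A', 'B', 'C']
```

Under this assumption, a checkpoint started by A involves only A, B, and C, while D and E keep running, which matches the abstract's point that processes outside the interacting set continue to do useful work.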
This publication has 6 references indexed in Scilit:
- Optimistic recovery in distributed systems. ACM Transactions on Computer Systems, 1985
- Distributed snapshots. ACM Transactions on Computer Systems, 1985
- The cosmic cube. Communications of the ACM, 1985
- Termination detection for diffusing computations. Information Processing Letters, 1980
- Reliability Issues in Computing System Design. ACM Computing Surveys, 1978
- Communication In X-TREE, A Modular Multiprocessor System. Published by Association for Computing Machinery (ACM), 1978