Using logging and asynchronous checkpointing to implement recoverable distributed shared memory

30 December 2002

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 58-67
https://doi.org/10.1109/reldis.1993.393473

Abstract

Distributed shared memory provides a useful paradigm for developing distributed applications. As the number of processors in a system and the running time of distributed applications increase, so does the likelihood of processor failure. To keep long-running applications from being adversely affected by such failures, a method is needed for recovering processes in a distributed shared memory environment that minimizes both lost work and the cost of recovery. This paper presents a technique for recoverable distributed shared memory that combines asynchronous process checkpoints with logging of the pages accessed by read operations on the shared address space. The scheme supports independent process recovery without forcing operational processes to roll back during recovery. The method is particularly useful in environments where taking process checkpoints is expensive.
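
The idea of combining independent checkpoints with a log of pages read can be illustrated with a small sketch. It is not the protocol from the paper: the class names (SharedMemory, RecoverableProcess), the single-integer page contents, and the in-memory stand-ins for stable storage are all assumptions made for illustration, and writes by the recovering process are ignored for brevity. The sketch only shows why logging the values a process read lets it replay from its last checkpoint without asking other processes to roll back.

```python
import copy


class SharedMemory:
    """Toy page store standing in for the DSM (hypothetical)."""

    def __init__(self):
        self.pages = {}

    def read(self, page_id):
        return self.pages.get(page_id, 0)

    def write(self, page_id, value):
        self.pages[page_id] = value


class RecoverableProcess:
    """Process that checkpoints on its own schedule and logs pages it reads."""

    def __init__(self, dsm):
        self.dsm = dsm
        self.state = {"acc": 0, "step": 0}            # volatile state
        self.checkpoint = copy.deepcopy(self.state)   # assumed to survive a crash
        self.read_log = []                            # assumed to survive a crash

    def take_checkpoint(self):
        # Asynchronous in the sense that no other process is coordinated with.
        self.checkpoint = copy.deepcopy(self.state)
        self.read_log.clear()  # reads before the checkpoint are no longer needed

    def step(self, page_id, replay_value=None):
        if replay_value is None:
            value = self.dsm.read(page_id)
            self.read_log.append((page_id, value))  # log the page contents read
        else:
            value = replay_value  # during recovery, use the logged contents
        self.state["acc"] += value
        self.state["step"] += 1

    def crash(self):
        self.state = None  # volatile state is lost

    def recover(self):
        # Restore the checkpoint, then replay logged reads in order.
        self.state = copy.deepcopy(self.checkpoint)
        for page_id, value in list(self.read_log):
            self.step(page_id, replay_value=value)


def run_demo():
    dsm = SharedMemory()
    proc = RecoverableProcess(dsm)
    dsm.write(1, 10)
    proc.step(1)
    proc.take_checkpoint()
    dsm.write(1, 20)
    proc.step(1)
    before = dict(proc.state)
    dsm.write(1, 99)  # another process moves on; it is never rolled back
    proc.crash()
    proc.recover()
    return before, proc.state, dsm.read(1)
```

In run_demo, the page is overwritten by another writer before the crash, yet the recovered process reaches the same state it had before failing because it replays the logged value rather than re-reading the page. The shared memory itself is left untouched by recovery.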
