Low-cost checkpointing and failure recovery in mobile computing systems
- 1 October 1996
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Parallel and Distributed Systems
- Vol. 7 (10), 1035-1048
- https://doi.org/10.1109/71.539735
Abstract
A mobile computing system consists of mobile and stationary nodes, connected to each other by a communication network. The presence of mobile nodes in the system places constraints on the permissible energy consumption and available communication bandwidth. To minimize the lost computation during recovery from node failures, periodic collection of a consistent snapshot of the system (checkpoint) is required. Locating mobile nodes contributes to the checkpointing and recovery costs. Synchronous snapshot collection algorithms, designed for static networks, either force every node in the system to take a new local snapshot, or block the underlying computation during snapshot collection. Hence, they are not suitable for mobile computing systems. If nodes take their local checkpoints independently in an uncoordinated manner, each node may have to store multiple local checkpoints in stable storage. This is not suitable for mobile nodes as they have small memory. This paper presents a synchronous snapshot collection algorithm for mobile systems that neither forces every node to take a local snapshot, nor blocks the underlying computation during snapshot collection. If a node initiates snapshot collection, local snapshots of only those nodes that have directly or transitively affected the initiator since their last snapshots need to be taken. We prove that the global snapshot collection terminates within a finite time of its invocation and the collected global snapshot is consistent. We also propose a minimal rollback/recovery algorithm in which the computation at a node is rolled back only if it depends on operations that have been undone due to the failure of node(s). Both the algorithms have low communication and storage overheads and meet the low energy consumption and low bandwidth constraints of mobile computing systems.
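The two selection rules in the abstract (which nodes must take a local snapshot, and which nodes must roll back) can be illustrated with a small centralized sketch. This is not the paper's algorithm, which is distributed and message-driven; it only computes the sets of nodes involved, under a simplifying assumption: a hypothetical map `depends_on[p]` lists the nodes whose messages `p` has received since the last checkpoints. The function names and the five-node example are illustrative and do not come from the paper.

```python
from collections import deque


def reachable(starts, edges):
    """Return every node reachable from `starts` by following `edges` (BFS)."""
    seen = set(starts)
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def nodes_to_checkpoint(initiator, depends_on):
    """Nodes that directly or transitively affected the initiator.

    Assumption: depends_on[p] is the set of nodes whose messages p has
    received since the last checkpoints (hypothetical bookkeeping).
    """
    return reachable([initiator], depends_on)


def nodes_to_rollback(failed, depends_on):
    """Failed nodes plus every node that transitively depends on them."""
    # Invert the dependency map: affects[q] = nodes that received from q.
    affects = {}
    for receiver, senders in depends_on.items():
        for sender in senders:
            affects.setdefault(sender, set()).add(receiver)
    return reachable(list(failed), affects)


# Illustrative system: A received from B, B from C, D from A; E is isolated.
depends_on = {"A": {"B"}, "B": {"C"}, "C": set(), "D": {"A"}, "E": set()}

print(sorted(nodes_to_checkpoint("A", depends_on)))  # ['A', 'B', 'C']
print(sorted(nodes_to_rollback({"B"}, depends_on)))  # ['A', 'B', 'D']
```

In this example, a snapshot initiated by A leaves D and E undisturbed, and a failure of B leaves C and E running, which mirrors the abstract's point that neither operation needs to involve every node.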