Abstract
Checkpointing and rollback recovery are essential for long-running parallel applications. After a transient fault or system crash, the affected application programs can recover from a consistent set of checkpoints saved earlier instead of restarting from the very beginning. For applications requiring transparent fault tolerance, log-based recovery can usually achieve a better recoverable state, at the cost of message logging in addition to checkpointing. This paper presents a simple scheme that reduces message logging overhead using local dependency information. Communication trace-driven simulation of several parallel applications is used to evaluate the benefits of the proposed scheme for real applications.
