Minimizing the Network Overhead of Checkpointing in Cycle-harvesting Cluster Environments

1 September 2005

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

No. 15525244, p. 1-10
https://doi.org/10.1109/clustr.2005.347074

Abstract

Cycle-harvesting systems such as Condor have been developed to make desktop machines in a local area (which are often similar to clusters in hardware configuration) available as a compute platform. To provide a dual-use capability, opportunistic jobs harvesting cycles from the desktop must be checkpointed before the desktop resources are reclaimed by their owners and the job is evacuated. In this paper, we investigate a new system for computing efficient checkpoint schedules in cycle-harvesting environments. Our system records the historical availability of each resource and fits a statistical model to the observations. Because checkpoints must often traverse the network (i.e., the desktop hosts do not provide sufficient persistent storage for checkpoints), we combine this model with predictions of network performance to the storage site to compute a checkpoint schedule. When an application is initiated on a particular resource, the system uses the computed distribution to parameterize a Markov state-transition model for the application's execution, evaluates the expected time and network overhead as a function of the checkpoint interval, and numerically optimizes with respect to time. We report on the implementation and performance of this system using the Condor cycle-harvesting environment at the University of Wisconsin. We also evaluate the efficiencies we achieve for a variety of network overheads using trace-based simulation. Finally, we validate our simulations against the observed performance with Condor. Our results indicate that while the choice of model distribution has a relatively small but positive effect on time efficiency, it has a substantial impact on network utilization.
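
The optimization step described in the abstract (evaluate expected time as a function of the checkpoint interval, then minimize numerically) can be illustrated with a minimal sketch. This is not the paper's Markov model: it assumes exponentially distributed availability with rate `lam`, a fixed checkpoint transfer time `C`, and a failure-free recovery time `R`. All function names and parameter values here are hypothetical, and the network estimate further assumes that an interrupted checkpoint transfer moves no data.

```python
import math


def expected_segment_time(T, C, R, lam):
    """Expected wall-clock seconds to complete T seconds of work plus a
    C-second checkpoint. ASSUMPTION: evictions arrive as a Poisson process
    with rate lam, and each eviction costs a failure-free recovery of R."""
    return (1.0 / lam + R) * math.expm1(lam * (T + C))


def overhead_ratio(T, C, R, lam):
    """Expected wall-clock time per second of useful work (always > 1)."""
    return expected_segment_time(T, C, R, lam) / T


def network_mb_per_work_second(T, C, lam, size_mb):
    """Rough network volume per second of useful work. ASSUMPTION: one
    checkpoint of size_mb is written per completed segment and one is read
    back per eviction; partial transfers are ignored."""
    expected_transfers = math.exp(lam * (T + C))  # 1 write + expected reads
    return size_mb * expected_transfers / T


def golden_section_minimize(f, lo, hi, tol=1e-3):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return (a + b) / 2.0


def optimal_interval(C, R, lam):
    """Checkpoint interval that minimizes expected time per unit of work."""
    return golden_section_minimize(
        lambda T: overhead_ratio(T, C, R, lam), 1.0, 10.0 / lam
    )


if __name__ == "__main__":
    # Hypothetical values: 60 s checkpoint, 60 s recovery, 10 h mean availability.
    C, R, lam = 60.0, 60.0, 1.0 / 36000.0
    T_opt = optimal_interval(C, R, lam)
    print("time-optimal interval (s):", round(T_opt))
    print("overhead ratio:", overhead_ratio(T_opt, C, R, lam))
    print("MB per work-second (500 MB image):",
          network_mb_per_work_second(T_opt, C, lam, 500.0))
```

Under these assumptions the time-optimal interval lands close to Young's first-order approximation sqrt(2C/lam), while the network volume keeps falling as the interval grows past that point. This shows in miniature how a schedule that is nearly equivalent in time efficiency can differ noticeably in network utilization.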
