Minimizing the Network Overhead of Checkpointing in Cycle-harvesting Cluster Environments
- 1 September 2005
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- No. 15525244, p. 1-10
- https://doi.org/10.1109/clustr.2005.347074
Abstract
Cycle-harvesting systems such as Condor have been developed to make desktop machines in a local area (which are often similar to clusters in hardware configuration) available as a compute platform. To provide a dual-use capability, opportunistic jobs harvesting cycles from the desktop must be checkpointed before the desktop resources are reclaimed by their owners and the job is evacuated. In this paper, we investigate a new system for computing efficient checkpoint schedules in cycle-harvesting environments. Our system records the historical availability of each resource and fits a statistical model to the observations. Because checkpoints must often traverse the network (i.e., the desktop hosts do not provide sufficient persistent storage for checkpoints), we combine this model with predictions of network performance to the storage site to compute a checkpoint schedule. When an application is initiated on a particular resource, the system uses the computed distribution to parameterize a Markov state-transition model for the application's execution, evaluates the expected time and network overhead as a function of the checkpoint interval, and numerically optimizes with respect to time. We report on the implementation and performance of this system using the Condor cycle-harvesting environment at the University of Wisconsin. We also evaluate the efficiencies we achieve for a variety of network overheads using trace-based simulation. Finally, we validate our simulations against the observed performance with Condor. Our results indicate that while the choice of model distribution has a relatively small but positive effect on time efficiency, it has a substantial impact on network utilization.
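The idea of evaluating expected time as a function of the checkpoint interval and optimizing it numerically can be illustrated with a much simpler stand-in than the Markov model the abstract describes. The sketch below is not the paper's method: it assumes exponentially distributed availability with rate `failure_rate`, a fixed checkpoint cost `ckpt_cost` (for example, checkpoint size divided by predicted bandwidth), and a fixed `restart_cost` that is itself not interrupted. All function and parameter names are hypothetical.

```python
import math


def expected_segment_time(work, ckpt_cost, restart_cost, failure_rate):
    """Expected wall-clock seconds to complete `work` seconds of computation
    followed by one checkpoint, assuming exponential availability (assumption)
    and a fixed restart cost after every eviction."""
    tau = work + ckpt_cost
    return (1.0 / failure_rate + restart_cost) * math.expm1(failure_rate * tau)


def overhead_ratio(work, ckpt_cost, restart_cost, failure_rate):
    """Expected wall-clock time per second of useful work (1.0 is ideal)."""
    return expected_segment_time(work, ckpt_cost, restart_cost, failure_rate) / work


def best_interval(ckpt_cost, restart_cost, failure_rate, tol=1e-3):
    """Golden-section search for the work interval minimizing overhead_ratio.

    Under this simplified model the ratio is unimodal in `work` and its
    minimizer lies below the mean availability 1 / failure_rate.
    """
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 1e-6 / failure_rate, 1.0 / failure_rate
    while hi - lo > tol:
        a = hi - inv_phi * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        if overhead_ratio(a, ckpt_cost, restart_cost, failure_rate) < \
                overhead_ratio(b, ckpt_cost, restart_cost, failure_rate):
            hi = b  # minimum lies in [lo, b]
        else:
            lo = a  # minimum lies in [a, hi]
    return (lo + hi) / 2.0


def checkpoint_bytes_per_work_second(work, ckpt_bytes):
    """Rough network load: one checkpoint of `ckpt_bytes` per `work` seconds.
    Ignores bytes moved during recovery (simplifying assumption)."""
    return ckpt_bytes / work


# Illustrative values only: 2-hour mean availability, 60 s checkpoint and restart.
interval = best_interval(ckpt_cost=60.0, restart_cost=60.0, failure_rate=1.0 / 7200.0)
print(f"checkpoint every {interval:.0f} s of work")
```

A longer interval lowers the bytes sent per second of useful work but raises the work lost on eviction, which is the time-versus-network trade-off the abstract refers to. Swapping the exponential assumption for another fitted distribution would change `expected_segment_time` but leave the search loop unchanged.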