On the Execution of Large Batch Programs in Unreliable Computing Systems

Abstract
The execution of a long-running batch program imposes severe reliability constraints on a computing system: a failure is more likely to occur during a long execution, and once it occurs, it destroys all the processing performed thus far. This paper studies the execution delay and the machine resources consumed in running large batch programs in a computing environment interrupted by failures. The effect of checkpoints and their optimal insertion are also considered. The results are applicable to an arbitrary law of failure.
