Improved disk-drive failure warnings

Abstract
Improved methods are proposed for disk-drive failure prediction. The SMART (self monitoring and reporting technology) failure prediction system is currently implemented in disk-drives. Its purpose is to predict the near-term failure of an individual hard disk-drive, and issue a backup warning to prevent data loss. Two experimental tests of SMART show only moderate accuracy at low false-alarm rates. (A rate of 0.2% of total drives per year implies that 20% of drive returns would be good drives, relative to /spl ap/1% annual failure rate of drives). This requirement for very low false-alarm rates is well known in medical diagnostic tests for rare diseases, and methodology used there suggests ways to improve SMART. Two improved SMART algorithms are proposed. They use the SMART internal drive attribute measurements in present drives. The present warning-algorithm based on maximum error thresholds is replaced by distribution-free statistical hypothesis tests. These improved algorithms are computationally simple enough to be implemented in drive microprocessor firmware code. They require only integer sort operations to put several hundred attribute values in rank order. Some tens of these ranks are added up and the SMART warning is issued if the sum exceeds a prestored limit. These new algorithms were tested on 3744 drives of 2 models. They gave 3-4 times higher correct prediction accuracy than error thresholds on will-fail drives, at 0.2% false-alarm rate. The highest accuracies achievable are modest (40%-60%). Care was taken to test will-fail drive prediction accuracy on data independent of the algorithm design data. Additional work is needed to verify and apply these algorithms in actual drive design. They can also be useful in drive failure analysis engineering. It might be possible to screen drives in manufacturing using SMART attributes. Marginal drives might be detected before substantial final test time is invested in them, thereby decreasing manufacturing cost, and possibly decreasing overall field failure rates.

This publication has 9 references indexed in Scilit: