Derivation and Calibration of a Transient Error Reliability Model
- 1 July 1982
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Computers
- Vol. C-31 (7), 658-671
- https://doi.org/10.1109/tc.1982.1676063
Abstract
In this paper, a new modeling methodology to characterize failure processes in digital computers due to hardware transients is presented. The basic assumption made is that system sensitivity to hardware transient errors is a function of critical resource usage. The failure rate of a given resource is approximated by a deterministic function of time, depending on the average workload of that resource, plus a Gaussian process. The probability density function of the time to failure obtained under this assumption has a decreasing hazard function, explaining why decreasing hazard function densities such as the Weibull fit experimental data so well. Data on transient errors obtained from several systems are analyzed. Statistical tests confirm the good fit between decreasing hazard distributions and actual data. Finally, models of common fault-tolerant redundant structures are developed using decreasing hazard function distributions. The analysis indicates significant differences between reliability predictions based on the exponential distribution and those based on decreasing hazard function distributions. Reliability differences of 0.2 and factors greater than 2 in Mission Time Improvement are seen in model results. System designers should be aware of these differences.
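The contrast the abstract draws between exponential and decreasing-hazard predictions can be illustrated with a minimal Python sketch. It is not the paper's model or data: the Weibull shape and scale values are hypothetical, the exponential rate is simply matched to the same mean time to failure, and the 2-of-3 majority-voting structure (independent, identical modules, perfect voter) is an assumed example of a redundant structure.

```python
import math


def weibull_reliability(t, shape, scale):
    """Weibull reliability R(t) = exp(-(t/scale)^shape)."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """Weibull hazard h(t) = (shape/scale) * (t/scale)^(shape-1).

    For shape < 1 this decreases with t (a decreasing hazard function).
    """
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def weibull_mean(shape, scale):
    """Mean time to failure of a Weibull distribution."""
    return scale * math.gamma(1.0 + 1.0 / shape)


def exponential_reliability(t, rate):
    """Exponential reliability R(t) = exp(-rate * t) (constant hazard)."""
    return math.exp(-rate * t)


def two_of_three_reliability(r):
    """Reliability of a 2-of-3 voting structure built from modules of
    reliability r, assuming independent failures and a perfect voter."""
    return 3.0 * r ** 2 - 2.0 * r ** 3


# Hypothetical parameters chosen only for illustration (not from the paper).
SHAPE = 0.5    # shape < 1 gives a decreasing hazard
SCALE = 100.0  # hours

# Exponential model matched to the same mean time to failure.
MEAN_TTF = weibull_mean(SHAPE, SCALE)
RATE = 1.0 / MEAN_TTF

if __name__ == "__main__":
    print(f"mean time to failure: {MEAN_TTF:.1f} h")
    for t in (10.0, 50.0, 200.0):
        r_w = weibull_reliability(t, SHAPE, SCALE)
        r_e = exponential_reliability(t, RATE)
        print(
            f"t={t:6.1f} h  hazard={weibull_hazard(t, SHAPE, SCALE):.5f}  "
            f"single: weibull={r_w:.3f} exp={r_e:.3f}  "
            f"2-of-3: weibull={two_of_three_reliability(r_w):.3f} "
            f"exp={two_of_three_reliability(r_e):.3f}"
        )
```

Even with matched means, the two models give different reliabilities at a given mission time, and the redundant structure changes the size of that gap. The magnitudes printed here depend entirely on the assumed parameters and should not be read as the paper's results.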