Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer
- 5 December 2005
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Device and Materials Reliability
- Vol. 5 (3), 329-335
- https://doi.org/10.1109/tdmr.2005.855685
Abstract
Early in the deployment of the Advanced Simulation and Computing (ASC) Q supercomputer, a higher-than-expected number of single-node failures was observed. The elevated rate of single-node failures was hypothesized to be caused primarily by fatal soft errors, i.e., board-level cache (B-cache) tag (BTAG) parity errors caused by cosmic-ray-induced neutrons that led to node crashes. A series of experiments was undertaken at the Los Alamos Neutron Science Center (LANSCE) to ascertain whether fatal soft errors were indeed the primary cause of the elevated rate of single-node failures. Observed failure data from Q are consistent with the results from some of these experiments. Mitigation strategies have been developed, and scientists successfully use Q for large computations in the presence of fatal soft errors and other single-node failures.