Software-implemented EDAC protection against SEUs
- 1 September 2000
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Reliability
- Vol. 49 (3) , 273-284
- https://doi.org/10.1109/24.914544
Abstract
In many computer systems, the contents of memory are protected by an error detection and correction (EDAC) code. Bit-flips caused by single event upsets (SEUs) are a well-known problem in memory chips; EDAC codes have been an effective solution to this problem. These codes are usually implemented in hardware using extra memory bits and encoding/decoding circuitry. In systems where EDAC hardware is not available, the reliability of the system can be improved by providing protection through software. Codes and techniques that can be used for software implementation of EDAC are discussed and compared. The implementation requirements and issues are discussed, and some solutions are presented. The paper discusses in detail how system-level and chip-level structures relate to multiple error correction. A simple solution is presented to make the EDAC scheme independent of these structures. The technique in this paper was implemented and used effectively in an actual space experiment. We have observed that SEUs corrupt the operating system or programs of a computer system that does not have any EDAC for memory, forcing the system to be reset frequently. Protecting the entire memory (code and data) might not be practical in software. However, this paper demonstrates that software-implemented EDAC is a low-cost solution that provides protection for code segments and can appreciably enhance the system availability in a low-radiation space environment.
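To make the idea of software-implemented EDAC concrete, here is a minimal sketch of one way it could look. The abstract does not say which code the authors selected, so the choice of a Hamming(12,8) single-error-correcting code per byte is an assumption made purely for illustration. The periodic scrub pass and all function names (`encode`, `decode`, `correct`, `scrub`) are likewise hypothetical and not taken from the paper.

```python
"""Illustrative software EDAC: Hamming(12,8) per byte plus a scrub pass.

Assumption: this is NOT the scheme from the paper, only a textbook example.
"""

PARITY_POSITIONS = (1, 2, 4, 8)               # power-of-two positions hold check bits
DATA_POSITIONS = (3, 5, 6, 7, 9, 10, 11, 12)  # remaining positions hold the data byte


def syndrome(word):
    """XOR together the 1-based positions of all set bits; 0 means no error seen."""
    s = 0
    for pos in range(1, 13):
        if (word >> (pos - 1)) & 1:
            s ^= pos
    return s


def encode(byte):
    """Encode one 8-bit value as a 12-bit codeword (position i is stored in bit i-1)."""
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (byte >> i) & 1:
            word |= 1 << (pos - 1)
    # Set the check bits so that the syndrome of the finished codeword is zero.
    s = syndrome(word)
    for p in PARITY_POSITIONS:
        if s & p:
            word |= 1 << (p - 1)
    return word


def decode(word):
    """Extract the 8 data bits from a (presumed correct) codeword."""
    byte = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (word >> (pos - 1)) & 1:
            byte |= 1 << i
    return byte


def correct(word):
    """Return (codeword, changed). A single flipped bit is repaired in place."""
    s = syndrome(word)
    if s == 0:
        return word, False
    if s > 12:
        # The syndrome points outside the codeword: more than one bit flipped.
        raise ValueError("uncorrectable multi-bit error")
    return word ^ (1 << (s - 1)), True


def scrub(memory):
    """Walk a list of codewords, write back repaired words, count the repairs."""
    corrected = 0
    for addr, word in enumerate(memory):
        fixed, changed = correct(word)
        if changed:
            memory[addr] = fixed
            corrected += 1
    return corrected


# Usage: protect four bytes, simulate one SEU, then scrub.
memory = [encode(b) for b in b"EDAC"]
memory[2] ^= 1 << 6                      # flip one stored bit
assert scrub(memory) == 1
assert bytes(decode(w) for w in memory) == b"EDAC"
```

This code repairs only one flipped bit per codeword. Two flips in the same codeword can be miscorrected, so in a sketch like this the scrub interval has to be short compared with the expected time between upsets.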