Commercial fault tolerance: a tale of two systems
- 4 October 2004
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Dependable and Secure Computing
- Vol. 1 (1), 87-96
- https://doi.org/10.1109/tdsc.2004.4
Abstract
This paper compares and contrasts the design philosophies and implementations of two computer system families: the IBM S/360 and its evolution to the current zSeries line, and the Tandem (now HP) NonStop® Server. Both systems have a long history; the initial IBM S/360 machines were shipped in 1964, and the Tandem NonStop System was first shipped in 1976. They were aimed at similar markets, what would today be called enterprise-class applications. The requirement for the original S/360 line was for very high availability; the requirement for the NonStop platform was for single fault tolerance against unplanned outages. Since their initial shipments, availability expectations for both platforms have continued to rise and the system designers and developers have been challenged to keep up. There were and still are many similarities in the design philosophies of the two lines, including the use of redundant components and extensive error checking. The primary difference is that the S/360-zSeries focus has been on localized retry and restore to keep processors functioning as long as possible, while the NonStop developers have based systems on a loosely coupled multiprocessor design that supports a "fail-fast" philosophy implemented through a combination of hardware and software, with workload being actively taken over by another resource when one fails.