HPC-Colony
- 1 April 2006
- journal article
- Published by Association for Computing Machinery (ACM) in ACM SIGOPS Operating Systems Review
- Vol. 40 (2), 43-49
- https://doi.org/10.1145/1131322.1131334
Abstract
Traditional full-featured operating systems are known to have properties that limit the scalability of distributed memory parallel programs, the most common programming paradigm utilized in high end computing. Furthermore, as processor counts increase with the most capable systems, the necessary activity to manage the system becomes more of a burden. To make a general purpose operating system scale to such levels, new technology is required for parallel resource management and global system management (including fault management). In this paper, we describe the shortcomings of full-featured operating systems and runtime systems and discuss an approach to scale such systems to one hundred thousand processors with both scalable parallel application performance and efficient system management.