HPC-Colony

1 April 2006

journal article
Published by Association for Computing Machinery (ACM) in ACM SIGOPS Operating Systems Review

Vol. 40 (2), 43-49
https://doi.org/10.1145/1131322.1131334

Abstract

Traditional full-featured operating systems are known to have properties that limit the scalability of distributed-memory parallel programs, the most common programming paradigm used in high-end computing. Furthermore, as processor counts increase on the most capable systems, the activity needed to manage the system becomes more of a burden. To make a general-purpose operating system scale to such levels, new technology is required for parallel resource management and global system management (including fault management). In this paper, we describe the shortcomings of full-featured operating systems and runtime systems and discuss an approach to scale such systems to one hundred thousand processors with both scalable parallel application performance and efficient system management.
