Architectural support for system software on large-scale clusters

1 January 2004

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

vol. 1, pp. 519-528
https://doi.org/10.1109/icpp.2004.1327962

Abstract

Scalable management of distributed resources is one of the major challenges in the deployment of large-scale clusters. Management includes transparent fault tolerance, efficient allocation of resources, and support for all the needs of parallel computing: parallel I/O, deterministic behavior, and responsiveness. Meeting these requirements with commodity hardware and operating systems is difficult because they were not designed to support global management of a large-scale system. We propose a small set of hardware mechanisms in the cluster interconnect to facilitate the implementation of a simple yet powerful global operating system. This system, inspired by concepts from the BSP and SIMD computational models, allows commodity clusters to grow to thousands of nodes while still retaining the usability and responsiveness of a single-node workstation. Our results on a software prototype show that it is possible to implement efficient and scalable system software using the proposed set of mechanisms.
