Architectural support for system software on large-scale clusters
- 1 January 2004
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 519-528 vol.1
- https://doi.org/10.1109/icpp.2004.1327962
Abstract
Scalable management of distributed resources is one of the major challenges in deployment of large-scale clusters. Management includes transparent fault tolerance, efficient allocation of resources, and support for all the needs of parallel computing: parallel I/O, deterministic behavior, and responsiveness. Meeting these requirements with commodity hardware and operating systems is difficult because they were not designed to support global management of a large-scale system. We propose a small set of hardware mechanisms in the cluster interconnect to facilitate the implementation of a simple yet powerful global operating system. This system, inspired by concepts from the BSP and SIMD computational models, allows commodity clusters to grow to thousands of nodes while still retaining the usability and responsiveness of the single-node workstation. Our results on a software prototype show that it is possible to implement efficient and scalable system software using the proposed set of mechanisms.
This publication has 12 references indexed in Scilit:
- The Case of the Missing Supercomputer Performance. Published by Association for Computing Machinery (ACM), 2003
- Gang scheduling for highly efficient, distributed multiprocessor systems. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- Generalized communicators in the Message Passing Interface. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- The Quadrics network: high-performance clustering technology. IEEE Micro, 2002
- BProc. Published by Association for Computing Machinery (ACM), 2002
- Broadcast/Multicast over Myrinet Using NIC-Assisted Multidestination Messages. Published by Springer Nature, 2000
- GLUnix: a global layer unix for a network of workstations. Software: Practice and Experience, 1998
- LogP: towards a realistic model of parallel computation. Published by Association for Computing Machinery (ACM), 1993
- Gang scheduling performance benefits for fine-grain synchronization. Journal of Parallel and Distributed Computing, 1992
- A bridging model for parallel computation. Communications of the ACM, 1990