Scaling application performance on a cache-coherent multiprocessor

Abstract
Hardware-coherent, distributed shared address space systems are increasingly successful at moderate scale. However, it is unclear whether, or with how much difficulty, the performance of a load-store shared address space programming model scales to large processor counts on real applications. We examine this question using an aggressive case-study machine, the SGI Origin2000, with up to 128 processors. We show for the first time that scalable performance can indeed be achieved in this programming model on a wide range of applications, including challenging kernels such as FFT. However, this does not come easily, even for applications considered to be highly optimized already, and it is very often not simply a matter of increasing problem size. Rather, substantial further application restructuring is often needed, and it is usually quite algorithmic in nature. We examine how these restructurings compare with those needed for performance portability to shared virtual memory on clusters, comment on common programming guidelines for performance portability and scalability, and discuss how the programming difficulty compares with that of explicit message passing. We also examine where applications spend their time on this large machine, the impact of the special hardware features the machine provides, and the impact of mapping to the network topology.