Tolerating latency in multiprocessors through compiler-inserted prefetching

1 February 1998

journal article
Published by Association for Computing Machinery (ACM) in ACM Transactions on Computer Systems

Vol. 16 (1), 55-92
https://doi.org/10.1145/273011.273021

Abstract

The large latency of memory accesses in large-scale shared-memory multiprocessors is a key obstacle to achieving high processor utilization. Software-controlled prefetching is a technique for tolerating memory latency by explicitly executing instructions to move data close to the processor before the data are actually needed. To minimize the burden on the programmer, compiler support is needed to automatically insert prefetch instructions into the code. A key challenge when inserting prefetches is ensuring that the overheads of prefetching do not outweigh the benefits. While previous studies have demonstrated the effectiveness of hand-inserted prefetching in multiprocessor applications, the benefit of compiler-inserted prefetching in practice has remained an open question. This article proposes and evaluates a new compiler algorithm for inserting prefetches into multiprocessor code. The proposed algorithm attempts to minimize overheads by only issuing prefetches for references that are predicted to suffer cache misses. The algorithm can prefetch both dense-matrix and sparse-matrix codes, thus covering a large fraction of scientific applications. We have implemented our algorithm in the SUIF (Stanford University Intermediate Format) optimizing compiler. The results of our detailed architectural simulations demonstrate that compiler-inserted prefetching can improve the speed of some parallel applications by as much as a factor of two.
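
The idea of issuing prefetches only for references predicted to miss can be illustrated with a small hand-written sketch. It is not the article's algorithm or the output of the SUIF compiler. It only shows the general shape of the transformation for a unit-stride array traversal, where roughly one access per cache line is expected to miss. The 64-byte line size, the prefetch distance of four lines, and the function names are assumptions chosen for illustration, and `__builtin_prefetch` is the GCC/Clang builtin rather than anything described in the abstract.

```c
#include <stddef.h>

/* Assumed parameters (not taken from the article). */
#define CACHE_LINE_BYTES     64
#define DOUBLES_PER_LINE     (CACHE_LINE_BYTES / sizeof(double))
#define PREFETCH_LINES_AHEAD 4

/* Baseline traversal with no prefetching. */
double sum_plain(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/*
 * Selective prefetching: the loop is split into cache-line-sized
 * chunks so that one prefetch is issued per chunk instead of one per
 * element. Only the first access to each line is expected to miss,
 * so the remaining accesses carry no prefetch overhead.
 * The array is not assumed to be line-aligned, so "one prefetch per
 * line" is approximate. The first PREFETCH_LINES_AHEAD chunks are not
 * prefetched; a fuller version would add a prologue for them.
 */
double sum_prefetched(const double *a, size_t n)
{
    const size_t step  = DOUBLES_PER_LINE;
    const size_t ahead = PREFETCH_LINES_AHEAD * step;
    double s = 0.0;
    size_t i = 0;

    for (; i + step <= n; i += step) {
        /* Read prefetch (rw = 0) with high temporal locality (3). */
        if (i + ahead < n)
            __builtin_prefetch(&a[i + ahead], 0, 3);
        for (size_t j = 0; j < step; j++)
            s += a[i + j];
    }

    /* Remainder elements that do not fill a whole chunk. */
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

Both functions add the elements in the same order, so they return the same value; the prefetch is only a hint and does not change program semantics. Whether the prefetched version is actually faster depends on the memory system and on how well the assumed distance matches the real miss latency.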
