Tolerating latency in multiprocessors through compiler-inserted prefetching
- 1 February 1998
- journal article
- Published by Association for Computing Machinery (ACM) in ACM Transactions on Computer Systems
- Vol. 16 (1) , 55-92
- https://doi.org/10.1145/273011.273021
Abstract
The large latency of memory accesses in large-scale shared-memory multiprocessors is a key obstacle to achieving high processor utilization. Software-controlled prefetching is a technique for tolerating memory latency by explicitly executing instructions to move data close to the processor before the data are actually needed. To minimize the burden on the programmer, compiler support is needed to automatically insert prefetch instructions into the code. A key challenge when inserting prefetches is ensuring that the overheads of prefetching do not outweigh the benefits. While previous studies have demonstrated the effectiveness of hand-inserted prefetching in multiprocessor applications, the benefit of compiler-inserted prefetching in practice has remained an open question. This article proposes and evaluates a new compiler algorithm for inserting prefetches into multiprocessor code. The proposed algorithm attempts to minimize overheads by only issuing prefetches for references that are predicted to suffer cache misses. The algorithm can prefetch both dense-matrix and sparse-matrix codes, thus covering a large fraction of scientific applications. We have implemented our algorithm in the SUIF (Stanford University Intermediate Format) optimizing compiler. The results of our detailed architectural simulations demonstrate that compiler-inserted prefetching can improve the speed of some parallel applications by as much as a factor of two.
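To make the idea of issuing prefetches only for references predicted to miss more concrete, the following is a minimal hand-written sketch in C. It is not the article's algorithm or SUIF output. It assumes 64-byte cache lines (8 doubles per line), a hypothetical prefetch distance of 4 lines, and the GCC/Clang builtin `__builtin_prefetch`. The names `LINE_DOUBLES`, `PREFETCH_AHEAD`, `sum_plain`, and `sum_prefetched` are illustrative only. For a unit-stride array traversal, only the first access to each cache line is expected to miss, so the sketch issues one prefetch per line instead of one per element.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed, illustrative parameters (not taken from the article). */
#define LINE_DOUBLES   8  /* doubles per cache line, assuming 64-byte lines */
#define PREFETCH_AHEAD 4  /* hypothetical prefetch distance, in cache lines */

/* Baseline: plain unit-stride sum with no prefetching. */
double sum_plain(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Selective prefetching: one prefetch per cache line, issued
 * PREFETCH_AHEAD lines before the data are used. The array is not
 * assumed to be line-aligned, so "one per line" is approximate. */
double sum_prefetched(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    const size_t dist = (size_t)LINE_DOUBLES * PREFETCH_AHEAD;
    double s = 0.0;
    size_t i = 0;

    /* Main loop, unrolled by one cache line. It runs only while the
     * prefetch target (and the line it starts) stays inside the array. */
    for (; i + dist + LINE_DOUBLES <= n; i += LINE_DOUBLES) {
        __builtin_prefetch(&a[i + dist], 0, 3); /* read access, keep in cache */
        for (size_t j = 0; j < LINE_DOUBLES; j++)
            s += a[i + j];
    }

    /* Epilogue: remaining elements, with no further prefetches issued. */
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

Both functions add the elements in the same order, so they return the same value. Any difference would show up only in running time, and that depends on the machine, the chosen prefetch distance, and whether the hardware already prefetches sequential streams.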