Cacheminer: A runtime approach to exploit cache locality on SMP

1 April 2000

journal article
Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Parallel and Distributed Systems

Vol. 11 (4), 357-374
https://doi.org/10.1109/71.850833

Abstract

Exploiting cache locality of parallel programs at runtime is a complementary approach to compiler optimization. This is particularly important for applications with dynamic memory-access patterns. We propose a memory-layout-oriented technique to exploit cache locality of parallel loops at runtime on Symmetric Multiprocessor (SMP) systems. Guided by application-dependent and target-architecture-dependent hints, our system, called Cacheminer, reorganizes and partitions a parallel loop using the memory-access space of its execution. Through effective runtime transformations, our system maximizes the data reuse in each partitioned data region assigned to a cache, and minimizes the data sharing among the partitioned data regions assigned to all caches. The executions of tasks in the partitions are scheduled in an adaptive and locality-preserving way to minimize the execution time of programs by trading off load balance and locality. We have implemented the Cacheminer runtime library on two commercial SMP servers and a SimOS-simulated SMP. Our simulation and measurement results show that our runtime approach achieves performance comparable to compiler optimizations for programs with regular computation and memory-access patterns, whose load balance and cache locality can be well optimized by tiling and other program transformations. However, our experimental results show that our approach significantly improves memory performance for applications with irregular computation and dynamic memory-access patterns. These types of programs are usually hard to optimize with static compiler optimizations.

This publication has 17 references indexed in Scilit:

Using the SimOS machine simulator to study complex computer systems
ACM Transactions on Modeling and Computer Simulation, 1997
Thread scheduling for cache locality
Published by Association for Computing Machinery (ACM), 1996
A quantitative analysis of loop nest locality
Published by Association for Computing Machinery (ACM), 1996
Improving data locality with loop transformations
ACM Transactions on Programming Languages and Systems, 1996
Using processor affinity in loop scheduling on shared-memory multiprocessors
IEEE Transactions on Parallel and Distributed Systems, 1994
Locality and Loop Scheduling on NUMA Multiprocessors
Published by Institute of Electrical and Electronics Engineers (IEEE), 1993
Global optimizations for parallelism and locality on scalable parallel machines
Published by Association for Computing Machinery (ACM), 1993
Data locality and load balancing in COOL
Published by Association for Computing Machinery (ACM), 1993
The cache performance and optimizations of blocked algorithms
Published by Association for Computing Machinery (ACM), 1991
Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers
IEEE Transactions on Computers, 1987