Cacheminer: A runtime approach to exploit cache locality on SMP
- 1 April 2000
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Parallel and Distributed Systems
- Vol. 11 (4), 357-374
- https://doi.org/10.1109/71.850833
Abstract
Exploiting the cache locality of parallel programs at runtime complements compiler optimization. This is particularly important for applications with dynamic memory-access patterns. We propose a memory-layout-oriented technique to exploit the cache locality of parallel loops at runtime on Symmetric Multiprocessor (SMP) systems. Guided by application-dependent and target-architecture-dependent hints, our system, called Cacheminer, reorganizes and partitions a parallel loop using the memory-access space of its execution. Through effective runtime transformations, our system maximizes data reuse within each partitioned data region assigned to a cache, and minimizes data sharing among the partitioned data regions assigned to all caches. The tasks in the partitions are scheduled in an adaptive, locality-preserving way that minimizes program execution time by trading off load balance against locality. We have implemented the Cacheminer runtime library on two commercial SMP servers and a SimOS-simulated SMP. Our simulation and measurement results show that, for programs with regular computation and memory-access patterns, whose load balance and cache locality can be well optimized by tiling and other program transformations, our runtime approach achieves performance comparable to that of compiler optimizations. For applications with irregular computation and dynamic memory-access patterns, however, our experimental results show that our approach significantly improves memory performance. Programs of this type are usually hard to optimize with static compiler optimizations.
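The partition-then-schedule idea described in the abstract can be illustrated with a minimal sketch. This is not the Cacheminer library or its algorithm: the function names `partition_by_region` and `next_chunk`, the fixed `region_size`, and the "take from the most loaded queue" rule are all assumptions chosen for illustration. The sketch groups loop iterations by the memory region they touch, gives each processor a contiguous run of regions, and lets an idle processor take a chunk from another queue only once its own work is exhausted.

```python
from collections import defaultdict, deque


def partition_by_region(accesses, region_size, num_procs):
    """Group iterations by the memory region they access (hypothetical helper).

    accesses[i] is the address touched by loop iteration i. Iterations that
    fall in the same region are kept together so they can reuse cached data;
    contiguous regions are dealt to the same processor's queue.
    """
    regions = defaultdict(list)
    for iteration, addr in enumerate(accesses):
        regions[addr // region_size].append(iteration)

    ordered = sorted(regions)
    queues = [deque() for _ in range(num_procs)]
    per_proc = -(-len(ordered) // num_procs)  # ceiling division
    for idx, region in enumerate(ordered):
        queues[idx // per_proc].append(regions[region])
    return queues


def next_chunk(queues, proc):
    """Return the next chunk of iterations for processor `proc`, or None.

    A processor drains its own queue first (preserving locality). Only when
    it is idle does it take a chunk from the tail of the most loaded queue
    (restoring load balance). This policy is an assumption of the sketch.
    """
    if queues[proc]:
        return queues[proc].popleft()
    victim = max(range(len(queues)), key=lambda p: len(queues[p]))
    if queues[victim]:
        return queues[victim].pop()
    return None


if __name__ == "__main__":
    accesses = [0, 130, 5, 64, 70, 200, 8, 129]
    queues = partition_by_region(accesses, region_size=64, num_procs=2)
    print([list(q) for q in queues])
    print(next_chunk(queues, 0))
```

In this toy run, iterations 0, 2, and 6 all touch the first 64-byte region and therefore land in one chunk on processor 0, while an idle processor 0 would later take the last chunk queued for processor 1.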
This publication has 17 references indexed in Scilit:
- Using the SimOS machine simulator to study complex computer systems. ACM Transactions on Modeling and Computer Simulation, 1997
- Thread scheduling for cache locality. Published by Association for Computing Machinery (ACM), 1996
- A quantitative analysis of loop nest locality. Published by Association for Computing Machinery (ACM), 1996
- Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems, 1996
- Using processor affinity in loop scheduling on shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1994
- Locality and Loop Scheduling on NUMA Multiprocessors. Published by Institute of Electrical and Electronics Engineers (IEEE), 1993
- Global optimizations for parallelism and locality on scalable parallel machines. Published by Association for Computing Machinery (ACM), 1993
- Data locality and load balancing in COOL. Published by Association for Computing Machinery (ACM), 1993
- The cache performance and optimizations of blocked algorithms. Published by Association for Computing Machinery (ACM), 1991
- Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers. IEEE Transactions on Computers, 1987