Trade-offs in loop transformations

Abstract

Nowadays, multimedia systems deal with huge amounts of memory accesses and large memory footprints. To alleviate the impact of these accesses and reduce the memory footprint, high-level memory exploration and optimization techniques have been proposed. These techniques try to more efficiently utilize the memory hierarchy. An important step in these optimization techniques are loop transformations (LT). They have a crucial effect on later data memory footprint optimization steps and code generation. However, the state-of-the-art work has focused only on individual objectives. The main one in literature involves improving the locality of data accesses, and thus reducing the data memory footprint. It does not consider the trade-offs in the LT step in relation to successive optimization steps. Therefore, it is not globally efficient in mapping the application on the target platform. In this article we will discuss several trade-offs during the loop transformations. To our knowledge, we are the first ones considering these global trade-offs. Previous work always gave mostly one solution, having the best locality and thus the optimized memory footprint, even though some research in two-dimensional trade-offs in this area exists as well. We start from this state-of-the-art solution with minimal footprint. We show that by sacrificing the footprint, we can obtain gains in data reuse (crucial for energy reduction) and reduce the control-flow complexity. We demonstrate our approach on a real-life application, namely the QSDPCM video coder. At the end, we show that considering trade-offs for this application leads to 16% energy reduction in a two-layer memory subsystem and 10% cycle reduction on the ARM platform.

Keywords

This publication has 26 references indexed in Scilit:

Array recovery and high-level transformations for DSP applications
ACM Transactions on Embedded Computing Systems, 2003
A layout-conscious iteration space transformation technique
IEEE Transactions on Computers, 2001
Energy-aware runtime scheduling for embedded-multiprocessor SOCs
IEEE Design & Test of Computers, 2001
Affine-by-Statement Scheduling of Uniform and Affine Loop Nests over Parametric Domains
Journal of Parallel and Distributed Computing, 1995
Portable video-on-demand in wireless communication
Proceedings of the IEEE, 1995
Automatic program parallelization
Proceedings of the IEEE, 1993
A practical algorithm for exact array dependence analysis
Communications of the ACM, 1992
Architecture-driven synthesis techniques for VLSI implementation of DSP algorithms
Proceedings of the IEEE, 1990
Compiler optimizations for enhancing parallelism and their impact on architecture design
IEEE Transactions on Computers, 1988
Automatic translation of FORTRAN programs to vector form
ACM Transactions on Programming Languages and Systems, 1987