A GPGPU compiler for memory optimization and parallelism management

5 June 2010

proceedings article
Published by Association for Computing Machinery (ACM)

Vol. 45 (6), 86-97
https://doi.org/10.1145/1806596.1806606

Abstract

This paper presents a novel optimizing compiler for general purpose computation on graphics processing units (GPGPU). It addresses two major challenges of developing high performance GPGPU programs: effective utilization of the GPU memory hierarchy and judicious management of parallelism. The input to our compiler is a naive GPU kernel function, which is functionally correct but written without any consideration for performance optimization. The compiler analyzes the code, identifies its memory access patterns, and generates both the optimized kernel and the kernel invocation parameters. Our optimization process includes vectorization and memory coalescing for memory bandwidth enhancement, tiling and unrolling for data reuse and parallelism management, and thread block remapping or address-offset insertion for partition-camping elimination. Experiments on a set of scientific and media processing algorithms show that our optimized code achieves very high performance, either superior or very close to that of the highly fine-tuned NVIDIA CUBLAS 2.2 library, with speedups of up to 128 times over the naive versions. Another distinguishing feature of our compiler is the understandability of the optimized code, which is useful for performance analysis and algorithm refinement.
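To make the memory coalescing idea concrete, the following is a minimal sketch in Python rather than GPU code, and it is not the paper's analysis. It assumes a deliberately simplified model in which the 16 threads of a half-warp issue one request together and the cost is the number of distinct aligned 64-byte segments they touch. The constants `HALF_WARP`, `SEGMENT_BYTES`, and `WORD_BYTES` and the helper names are illustrative assumptions; real coalescing rules depend on the GPU generation.

```python
# Simplified coalescing model (assumption, not the paper's algorithm):
# a half-warp's accesses are cheapest when they fall in one aligned segment.
HALF_WARP = 16       # threads issuing a memory request together (assumed)
SEGMENT_BYTES = 64   # aligned segment size for 4-byte words (assumed)
WORD_BYTES = 4       # size of one float element


def segments_touched(index_of, half_warp=HALF_WARP):
    """Count the distinct aligned segments touched by one half-warp.

    index_of maps a thread id to the element index that thread reads.
    """
    return len({index_of(tid) * WORD_BYTES // SEGMENT_BYTES
                for tid in range(half_warp)})


def row_major_index(row, col, width):
    """Element index of (row, col) in a row-major matrix."""
    return row * width + col


WIDTH = 1024  # hypothetical matrix width

# Consecutive threads read consecutive elements of one row.
coalesced = segments_touched(lambda tid: row_major_index(0, tid, WIDTH))

# Consecutive threads read down one column, WIDTH elements apart.
strided = segments_touched(lambda tid: row_major_index(tid, 0, WIDTH))

print(coalesced, strided)  # 1 16
```

Under this model the row-wise pattern needs one segment while the column-wise pattern needs sixteen, which is the kind of gap an access-pattern analysis would aim to close, for example by staging data through on-chip memory.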
