A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction

14 March 2010

conference paper
Published by Association for Computing Machinery (ACM)

p. 51-61
https://doi.org/10.1145/1735688.1735698

Abstract

Programmers for GPGPU face a rapidly changing substrate of programming abstractions, execution models, and hardware implementations. It has been established, through numerous demonstrations for particular conjunctions of application kernel, programming language, and GPU hardware instance, that it is possible to achieve significant improvements in price/performance and energy/performance over general-purpose processors. But these demonstrations are each the result of significant dedicated programmer labor, which is likely to be duplicated for each new GPU hardware architecture to achieve performance portability. This paper discusses the implementation, in the R-Stream compiler, of a source-to-source mapping pathway from a high-level, textbook-style algorithm expression method in ANSI C to multi-GPGPU accelerated computers. The compiler performs hierarchical decomposition and parallelization of the algorithm across the host and multiple GPGPUs, and within each GPU. The semantic transformations are expressed within the polyhedral model, including optimization of integrated parallelization, locality, and contiguity tradeoffs. Hierarchical tiling is performed. Communication and synchronization operations at multiple levels are generated automatically. The resulting mapping is currently emitted in the CUDA programming language. The GPU backend adds to the range of hardware and accelerator targets for R-Stream and indicates the potential for performance portability of single sources across multiple hardware targets.
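As a rough illustration of what "textbook-style" input and "hierarchical tiling" can mean, the sketch below shows a plain ANSI C matrix multiply next to a hand-written two-level tiled version of the same loop nest. It is not R-Stream output and does not reproduce the paper's mapping. The function names, the problem size N, and the tile sizes T_OUTER and T_INNER are all hypothetical, and the association of the outer tile with one GPU and the inner tile with one thread block is only an analogy. Plain C is used instead of CUDA so the example runs on any host.

```c
#include <assert.h>

#define N 64        /* problem size (assumed, chosen for illustration) */
#define T_OUTER 32  /* hypothetical outer tile edge, e.g. the share of one GPU */
#define T_INNER 8   /* hypothetical inner tile edge, e.g. one thread block */

/* The tiled loops below have no remainder handling, so require exact division. */
static_assert(N % T_OUTER == 0, "N must be a multiple of T_OUTER");
static_assert(T_OUTER % T_INNER == 0, "T_OUTER must be a multiple of T_INNER");

/* Textbook-style form: the kind of loop nest a programmer would write by hand. */
void matmul_textbook(float (*A)[N], float (*B)[N], float (*C)[N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < N; k++)
                acc += A[i][k] * B[k][j];
            C[i][j] = acc;
        }
}

/* Two-level tiling of the (i, j) iteration space. The k loop is left
 * untouched, so every C[i][j] accumulates in the same order as above. */
void matmul_tiled(float (*A)[N], float (*B)[N], float (*C)[N]) {
    for (int io = 0; io < N; io += T_OUTER)                  /* outer tiles */
        for (int jo = 0; jo < N; jo += T_OUTER)
            for (int ii = io; ii < io + T_OUTER; ii += T_INNER)   /* inner tiles */
                for (int ji = jo; ji < jo + T_OUTER; ji += T_INNER)
                    for (int i = ii; i < ii + T_INNER; i++)       /* points in a tile */
                        for (int j = ji; j < ji + T_INNER; j++) {
                            float acc = 0.0f;
                            for (int k = 0; k < N; k++)
                                acc += A[i][k] * B[k][j];
                            C[i][j] = acc;
                        }
}

/* Returns 1 if both versions produce identical results on a small
 * integer-valued test input, 0 otherwise. */
int tiled_matches_textbook(void) {
    static float A[N][N], B[N][N], C1[N][N], C2[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (float)((i + j) % 7);
            B[i][j] = (float)((3 * i + j) % 5);
        }
    matmul_textbook(A, B, C1);
    matmul_tiled(A, B, C2);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (C1[i][j] != C2[i][j])
                return 0;
    return 1;
}
```

Calling tiled_matches_textbook() checks that the restructured loop nest computes the same matrix as the textbook form. In a real multi-GPU mapping the tile loops would be distributed across devices and thread blocks, with data transfers and synchronization generated around them; none of that is modeled here.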
