Divergence Analysis and Optimizations
- 1 October 2011
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- Vol. 27 (1089795X) , 320-329
- https://doi.org/10.1109/pact.2011.63
Abstract
The growing interest in GPU programming has brought renewed attention to the Single Instruction Multiple Data (SIMD) execution model. SIMD machines give application developers a tremendous computational power, however, the model also brings restrictions. In particular, processing elements (PEs) execute in lock-step, and may lose performance due to divergences caused by conditional branches. In face of divergences, some PEs execute, while others wait, this alternation ending when they reach a synchronization point. In this paper we introduce divergence analysis, a static analysis that determines which program variables will have the same values for every PE. This analysis is useful in three different ways: it improves the translation of SIMD code to non-SIMD CPUs, it helps developers to manually improve their SIMD applications, and it also guides the compiler in the optimization of SIMD programs. We demonstrate this last point by introducing branch fusion, a new compiler optimization that identifies, via a gene sequencing algorithm, chains of similarities between divergent program paths, and weaves these paths together as much as possible. Our implementation has been accepted in the Ocelot open-source CUDA compiler, and is publicly available. We have tested it on many industrial-strength GPU benchmarks, including Rodinia and the Nvidia's SDK. Our divergence analysis has a 34% false-positive rate, compared to the results of a dynamic profiler. Our automatic optimization adds a 3% speed-up onto parallel quick sort, a heavily optimized benchmark. Our manual optimizations extend this number to over 10%.Keywords
This publication has 31 references indexed in Scilit:
- On-the-fly elimination of dynamic irregularities for GPU computingPublished by Association for Computing Machinery (ACM) ,2011
- Understanding throughput-oriented architecturesCommunications of the ACM, 2010
- Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUsPublished by Association for Computing Machinery (ACM) ,2010
- LarrabeeACM Transactions on Graphics, 2008
- Introducing Control Flow into Vectorized CodePublished by Institute of Electrical and Electronics Engineers (IEEE) ,2007
- Dynamic Warp Formation and Scheduling for Efficient GPU Control FlowPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2007
- Identification of common molecular subsequencesPublished by Elsevier ,2004
- Formal specification of parallel SIMD executionTheoretical Computer Science, 1996
- The program dependence graph and its use in optimizationACM Transactions on Programming Languages and Systems, 1987
- Some Computer Organizations and Their EffectivenessIEEE Transactions on Computers, 1972