Analyzing CUDA workloads using a detailed GPU simulator
- 1 April 2009
- Conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
Abstract
Modern graphics processing units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight into designing tomorrow's manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA's CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set. For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices including choice of interconnect topology, use of caches, design of memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that for the applications we study, performance is more sensitive to interconnect bisection bandwidth than to latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.
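The abstract refers to CUDA's programming model and to memory request coalescing hardware as one of the microarchitecture design choices studied. The sketch below is not taken from the paper; it is a minimal, hypothetical CUDA example (kernel names, sizes, and the stride value are illustrative assumptions) contrasting a coalesced global-memory access pattern with a strided one, the kind of behavior such coalescing hardware and the paper's simulator are concerned with.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Coalesced: consecutive threads in a warp read consecutive 32-bit words,
// so the hardware can merge them into a small number of memory requests.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads touch addresses 'stride' elements apart,
// defeating request coalescing and multiplying memory traffic.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * stride) % n;  // scatter the warp's accesses across memory
    if (i < n) out[i] = in[j];
}

int main() {
    const int n = 1 << 20;  // illustrative problem size
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    copy_coalesced<<<grid, block>>>(in, out, n);
    copy_strided<<<grid, block>>>(in, out, n, 32);
    cudaDeviceSynchronize();

    printf("last CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Under the assumptions above, the first kernel's per-warp accesses can be merged into a few wide memory transactions, while the second generates roughly one transaction per thread; this is the sort of contrast in memory-system pressure that the paper's coalescing and thread-concurrency observations speak to.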