Efficient gather and scatter operations on graphics processors

10 November 2007

proceedings article
Published by Association for Computing Machinery (ACM)

p. 1-12
https://doi.org/10.1145/1362622.1362684

Abstract

Gather and scatter are two fundamental data-parallel operations, where a large number of data items are read (gathered) from or are written (scattered) to given locations. In this paper, we study these two operations on graphics processing units (GPUs). With superior computing power and high memory bandwidth, GPUs have become a commodity multiprocessor platform for general-purpose high-performance computing. However, due to the random access nature of gather and scatter, a naive implementation of the two operations suffers from a low utilization of the memory bandwidth and consequently a long, unhidden memory latency. Additionally, the architectural details of the GPUs, in particular, the memory hierarchy design, are unclear to the programmers. Therefore, we design multi-pass gather and scatter operations to improve their data access locality, and develop a performance model to help understand and optimize these two operations. We have evaluated our algorithms in sorting, hashing, and the sparse matrix-vector multiplication in comparison with their optimized CPU counterparts. Our results show that these optimizations yield 2-4X improvement on the GPU bandwidth utilization and 30-50% improvement on the response time. Overall, our optimized GPU implementations are 2-7X faster than their optimized CPU counterparts. (c) 2007 ACM

Keywords

Funding Information

Research Grants Council, University Grants Committee, Hong Kong (DAG05/06/EG11617206)

This publication has 14 references indexed in Scilit:

A Survey of General‐Purpose Computation on Graphics Hardware
Computer Graphics Forum, 2007
LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware
Published by Institute of Electrical and Electronics Engineers (IEEE) ,2005
Understanding the efficiency of GPU algorithms for matrix-matrix multiplication
Published by Association for Computing Machinery (ACM) ,2004
UberFlow
Published by Association for Computing Machinery (ACM) ,2004
Fast computation of database operations using graphics processors
Published by Association for Computing Machinery (ACM) ,2004
Sparse matrix solvers on the GPU
ACM Transactions on Graphics, 2003
Fast matrix multiplies using graphics hardware
Published by Association for Computing Machinery (ACM) ,2001
External memory algorithms and data structures
ACM Computing Surveys, 2001
Radix sort for vector multiprocessors
Published by Association for Computing Machinery (ACM) ,1991
The input/output complexity of sorting and related problems
Communications of the ACM, 1988