Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures
- 5 June 2012
- Journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Computers
- Vol. 61 (12), 1724–1736
- https://doi.org/10.1109/tc.2012.132
Abstract
As technology is reaching physical limits, reducing power consumption is a key issue on our path to sustained performance. In this paper, we study fundamental tradeoffs and limits in efficiency (as measured in energy per operation) that can be achieved for an important class of kernels, namely the level-3 Basic Linear Algebra Subprograms (BLAS). It is well accepted that specialization is the key to efficiency. This paper establishes a baseline by studying GEneral Matrix-matrix Multiplication (GEMM) on a variety of custom and general-purpose CPU and GPU architectures. Our analysis shows that orders-of-magnitude improvements in efficiency are possible with relatively simple customizations and fine-tuning of memory hierarchy configurations. We argue that these customizations can be generalized to perform other representative linear algebra operations. In addition to exposing the sources of inefficiency in current CPUs and GPUs, our results show that our prototype Linear Algebra Processor (LAP) implementing Double-precision GEMM (DGEMM) can achieve 600 GFLOPS while consuming less than 25 W in standard 45 nm technology, which is up to 50× more energy efficient than cutting-edge CPUs.
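For context, the headline numbers imply the energy-per-operation figure the paper uses as its efficiency metric. A back-of-the-envelope check, taking the quoted 600 GFLOPS and the 25 W bound at face value:

```latex
\frac{E}{\mathrm{op}} = \frac{P}{\text{throughput}}
                      \le \frac{25\ \mathrm{W}}{600 \times 10^{9}\ \mathrm{FLOP/s}}
                      \approx 41.7\ \mathrm{pJ\ per\ double\text{-}precision\ FLOP}
```

Equivalently, this is roughly 24 GFLOPS/W; the "up to 50×" comparison then places the cutting-edge CPUs of the era at well under 1 GFLOPS/W, which is consistent with contemporaneous reports.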