A Massively Parallel Coprocessor for Convolutional Neural Networks
- 1 July 2009
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- No. 10636862, p. 53-60
- https://doi.org/10.1109/asap.2009.25
Abstract
We present a massively parallel coprocessor for accelerating Convolutional Neural Networks (CNNs), a class of important machine learning algorithms. The coprocessor functional units, consisting of parallel 2D convolution primitives and programmable units performing sub-sampling and non-linear functions specific to CNNs, implement a "meta-operator" to which a CNN may be compiled. The coprocessor is serviced by distributed off-chip memory banks with large data bandwidth. As a key feature, we use low-precision data and further increase the effective memory bandwidth by packing multiple words in every memory operation, and we leverage the algorithm's simple data access patterns to use off-chip memory as a scratchpad for intermediate data, which is critical for CNNs. A CNN is mapped to the coprocessor hardware primitives with instructions to transfer data between the memory and the coprocessor. We have implemented a prototype of the CNN coprocessor on an off-the-shelf PCI FPGA card with a single Xilinx Virtex5 LX330T FPGA and 4 DDR2 memory banks totaling 1 GB. The coprocessor prototype can process at the rate of 3.4 billion multiply-accumulates per second (GMACs) for CNN forward propagation, a speed that is 31x faster than a software implementation on a 2.2 GHz AMD Opteron processor. For a complete face recognition application with the CNN on the coprocessor and the rest of the image processing tasks on the host, the prototype is 6-10x faster, depending on the host-coprocessor bandwidth.
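The "meta-operator" described in the abstract (2D convolutions followed by sub-sampling and a non-linear function) can be illustrated with a minimal software sketch. This is not the authors' hardware design or instruction set. The function names, the use of average pooling for sub-sampling, the choice of tanh as the non-linearity, and the order of the stages are all assumptions made for illustration.

```python
import math

def conv2d_valid(image, kernel):
    """Valid-mode 2D sliding-window multiply-accumulate over one input plane."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(ow)]
            for r in range(oh)]

def subsample(plane, s):
    """Average non-overlapping s x s blocks (assumed sub-sampling rule)."""
    oh, ow = len(plane) // s, len(plane[0]) // s
    return [[sum(plane[r * s + i][c * s + j]
                 for i in range(s) for j in range(s)) / (s * s)
             for c in range(ow)]
            for r in range(oh)]

def meta_operator(inputs, kernels, bias=0.0, s=2):
    """Hypothetical meta-operator: sum the per-input convolutions, add a bias,
    sub-sample, then apply tanh (assumed non-linearity and stage order)."""
    planes = [conv2d_valid(x, k) for x, k in zip(inputs, kernels)]
    h, w = len(planes[0]), len(planes[0][0])
    acc = [[bias + sum(p[r][c] for p in planes) for c in range(w)]
           for r in range(h)]
    return [[math.tanh(v) for v in row] for row in subsample(acc, s)]

def mac_count(inputs, kernels):
    """Multiply-accumulates needed by the convolution stage alone."""
    total = 0
    for x, k in zip(inputs, kernels):
        kh, kw = len(k), len(k[0])
        total += (len(x) - kh + 1) * (len(x[0]) - kw + 1) * kh * kw
    return total

# Two 5x5 input planes, each convolved with a 2x2 kernel.
inputs = [[[1.0] * 5 for _ in range(5)] for _ in range(2)]
kernels = [[[0.25, 0.25], [0.25, 0.25]] for _ in range(2)]
out = meta_operator(inputs, kernels)
macs = mac_count(inputs, kernels)
```

In this sketch the convolution stage dominates the arithmetic, which is consistent with the abstract quoting throughput in multiply-accumulates per second. The independent per-plane convolutions are the part a parallel coprocessor would execute concurrently.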