Tuning Strassen's Matrix Multiplication for Memory Efficiency
- 1 January 1998
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
Abstract
Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement the algorithm efficiently on a modern machine with a hierarchical memory system. We report on an implementation of this algorithm that uses several unconventional techniques to make the algorithm memory-friendly. First, the algorithm internally uses a non- standard array layout known as Morton order that is based on a quad-tree decomposition of the matrix. Second, we dynamically select the recursion truncation point to minimize padding without affecting the performance of the algorithm, which we can do by virtue of the cache behavior of the Morton ordering. Each technique is critical for performance, and their combination as done in our code multiplies their effectiveness. Performance comparisons of our implementation with that of competing implementations show that our implementation often outperforms the alternative techniques (up to 25%). However, we also observe wide variability across platforms and across matrix sizes, indicating that at this time, no single implementation is a clear choice for all platforms or matrix sizes. We also note that the time required to convert matrices to/from Morton order is a noticeable amount of execution time (5% to 15%). Eliminating this overhead further reduces our execution time.Keywords
This publication has 18 references indexed in Scilit:
- Dynamic partitioning of non-uniform structured workloads with spacefilling curvesIEEE Transactions on Parallel and Distributed Systems, 1996
- GEMMW: A Portable Level 3 BLAS Winograd Variant of Strassen's Matrix-Matrix Multiply AlgorithmJournal of Computational Physics, 1994
- The cache performance and optimizations of blocked algorithmsPublished by Association for Computing Machinery (ACM) ,1991
- Costs of quadtree representation of nondense matricesJournal of Parallel and Distributed Computing, 1990
- A set of level 3 basic linear algebra subprogramsACM Transactions on Mathematical Software, 1990
- Evaluating associativity in CPU cachesIEEE Transactions on Computers, 1989
- Experiments with quadtree representation of matricesPublished by Springer Nature ,1989
- Extra High Speed Matrix Multiplication on the Cray-2SIAM Journal on Scientific and Statistical Computing, 1988
- On memory requirements of Strassen's algorithmsPublished by Springer Nature ,1976
- Gaussian elimination is not optimalNumerische Mathematik, 1969