Anatomy of high-performance matrix multiplication

Abstract
We present the basic principles that underlie the high-performance implementation of the matrix-matrix multiplication that is part of the widely used GotoBLAS library. Design decisions are justified by successively refining a model of architectures with multilevel memories. A simple but effective algorithm for executing this operation results. Implementations on a broad selection of architectures are shown to achieve near-peak performance.
Funding Information
  • Advanced Cyberinfrastructure (ACI-0305163, CCF-0342369, CCF-0540926)
  • Lawrence Livermore National Laboratory, Office of Science (B546489)
  • Division of Computing and Communication Foundations (ACI-0305163, CCF-0342369, CCF-0540926)