Is Search Really Necessary to Generate High-Performance BLAS?
- 27 June 2005
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in Proceedings of the IEEE
- Vol. 93 (2) , 358-386
- https://doi.org/10.1109/jproc.2004.840444
Abstract
A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and loop unrolling factors. Traditional compilers use simple analytical models to compute these values. In contrast, library generators like ATLAS use global search over the space of parameter values by generating programs with many different combinations of parameter values, and running them on the actual hardware to determine which values give the best performance. It is widely believed that traditional model-driven optimization cannot compete with search-based empirical optimization because tractable analytical models cannot capture all the complexities of modern high-performance architectures, but few quantitative comparisons have been done to date. To make such a comparison, we replaced the global search engine in ATLAS with a model-driven optimization engine and measured the relative performance of the code produced by the two systems on a variety of architectures. Since both systems use the same code generator, any differences in the performance of the code produced by the two systems can come only from differences in optimization parameter values. Our experiments show that model-driven optimization can be surprisingly effective and can generate code with performance comparable to that of code generated by ATLAS using global search.Keywords
This publication has 23 references indexed in Scilit:
- SPIRAL: Code Generation for DSP TransformsProceedings of the IEEE, 2005
- Minimizing development and maintenance costs in supporting persistently optimized BLASSoftware: Practice and Experience, 2005
- Exact analysis of the cache behavior of nested loopsPublished by Association for Computing Machinery (ACM) ,2001
- Counting solutions to linear and nonlinear constraints through Ehrhart polynomialsPublished by Association for Computing Machinery (ACM) ,1996
- Tile size selection using cache organization and data layoutPublished by Association for Computing Machinery (ACM) ,1995
- Unifying data and control transformations for distributed shared-memory machinesPublished by Association for Computing Machinery (ACM) ,1995
- Engineering and Scientific Subroutine Library Release 3 for IBM ES/3090 vector multiprocessors [Technical note]IBM Systems Journal, 1989
- Advanced compiler optimizations for supercomputersCommunications of the ACM, 1986
- Evaluation techniques for storage hierarchiesIBM Systems Journal, 1970
- Organizing matrices and matrix operations for paged memory systemsCommunications of the ACM, 1969