Parallel Block Matrix Factorizations on the Shared-Memory Multiprocessor IBM 3090 VF/600J

Abstract
Efficient parallel block algorithms for the LU factorization with partial pivoting, the Cholesky factorization, and the QR factorization are presented; they are transportable over a range of parallel MIMD architectures. Parallel implementations of different block algorithms that use optimized uniprocessor level-3 BLAS are compared with the corresponding routines of LAPACK (under development). In LAPACK, parallelism is mainly invoked implicitly, by replacing calls to uniprocessor level-3 kernels with calls to parallel level-3 kernels, which maintains portability. By parallelizing explicitly at the block level, however, it is possible to overlap and pipeline different matrix-matrix operations and thereby gain some performance. Theoretical models give upper bounds on the best possible speedup of the explicitly and implicitly parallel block algorithms on the target machine.
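
To make the notion of a block algorithm concrete, the following is a minimal sketch of a generic right-looking blocked Cholesky factorization. It is not the authors' implementation: the function name `blocked_cholesky` and the block-size parameter `nb` are hypothetical, and plain Python loops stand in for the optimized level-3 BLAS kernels that a real code would call. The three steps per block column correspond roughly to a diagonal-block factorization, a triangular solve, and a symmetric rank-k update.

```python
import math


def blocked_cholesky(A, nb):
    """Return the lower Cholesky factor L of a symmetric positive definite
    matrix A (list of lists), processing nb columns at a time.

    Illustrative sketch only: each loop nest stands in for a kernel that an
    optimized code would delegate to level-2/level-3 BLAS.
    """
    n = len(A)
    A = [row[:] for row in A]  # work on a copy; only the lower triangle is used
    for k in range(0, n, nb):
        e = min(k + nb, n)  # current block column is k..e-1

        # Step 1: unblocked factorization of the diagonal block A[k:e, k:e].
        for j in range(k, e):
            A[j][j] = math.sqrt(A[j][j])
            for i in range(j + 1, e):
                A[i][j] /= A[j][j]
            for c in range(j + 1, e):
                for i in range(c, e):
                    A[i][c] -= A[i][j] * A[c][j]

        # Step 2: panel solve X * L_kk^T = A[e:n, k:e] (a triangular solve).
        # Each row i is independent of the others.
        for i in range(e, n):
            for j in range(k, e):
                s = A[i][j] - sum(A[i][p] * A[j][p] for p in range(k, j))
                A[i][j] = s / A[j][j]

        # Step 3: trailing update A[e:n, e:n] -= X * X^T (a rank-k update),
        # lower triangle only. Each entry (i, j) is independent of the others.
        for i in range(e, n):
            for j in range(e, i + 1):
                A[i][j] -= sum(A[i][p] * A[j][p] for p in range(k, e))

    # Clear the strictly upper triangle so that the result is exactly L.
    for i in range(n):
        for j in range(i + 1, n):
            A[i][j] = 0.0
    return A
```

In this sketch, the independent rows of step 2 and the independent entries of step 3 are the kind of work that an implicitly parallel code would leave to a parallel kernel. An explicitly parallel code could instead schedule them as separate block tasks, for example starting the next diagonal-block factorization before the whole trailing update has finished.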
