We present several ideas for the development of sequential and parallel dense linear algebra software. The algorithms of Linpack and Eispack and later LAPACK and ScaLAPACK have stood the test of time in terms of robustness and accuracy. We focus on producing high performance versions of these algorithms. Our main results use the Algorithms and Architecture Approach. It will be seen that these ideas affect both Architecture and Compiler Design. The paper briefly discusses the following topics:

1. The Linear Transform Approach as a general way to produce traditional algorithms.

2. The underlying role ofMatrixMultiplication.

3. The use of matrix partitioning to describe traditional algorithms.

4. The two standard Data Structures of Dense Linear Algebra hurt performance.

5. The Packed Data Format has poor performance relative to Standard Full Format.

6. Standard full format meshes well with Industry Standards.

7. Standard full format waste half the storage for Triangular matrices.

8. A novel hybrid full format for Triangularmatrices.

9. The LAPACK and Level 3 BLAS approach has a basic flaw.

10. The ScaLAPACK and PBLAS approach has the same basic flaw.

11. New Block-based Data Structures for Matrices remove the flaw.

12. Industry Standards relative to Dense Linear Algebra Software.

13. Industry Standards relative to Distributed Memory Computing

14. The role of concatenating submatrices to facilitate hardware streaming.

15. A need for Architecture to enlarge the Floating Point Register File.

We describe a new result by showing that representing a matrix A as a collection of square blocks can reduce the amount of data reformating required by LAPACK factorization algorithms from O(n^{3}) to O(n^{2}). New results are presented for rectangular full packed format which is a standard full format array using minimal storage for representing a symmetric or triangular matrix. The LAPACK library contains some 125 times two (full and packed storage ) routines for symmetric or triangular matrices. Equivalent routines, written for this new format, usually consist of just calls to Level 3 BLAS and existing LAPACK full format routines. Hence they are trivial to produce. We give several examples for Cholesky factorization and an example for both triangular inverse and Cholesky inverse. Finally, we introduce two new distributed memory near minimal storage algorithms using square block packed format for Cholesky factorization. We only give performance results for Square Block Format Cholesky, on an IBM Power 3, and Rectangular Full Format, on an IBM Power 4.

By:* Fred G. Gustavson*

Published in: RC23715 in 2005

**LIMITED DISTRIBUTION NOTICE:**

**This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). I have read and understand this notice and am a member of the scientific community outside or inside of IBM seeking a single copy only.**

Questions about this service can be mailed to reports@us.ibm.com .