Basic Linear Algebra Subprograms

BLAS implementations typically take advantage of special floating-point hardware such as vector registers or SIMD instructions.

Intel MKL is a freeware[7] and proprietary[8] vendor library optimized for x86 and x86-64, with a performance emphasis on Intel processors.

The LINPACK benchmarks rely heavily on the BLAS routine gemm for their performance measurements.

Many numerical software applications use BLAS-compatible libraries to do linear algebra computations, including LAPACK, LINPACK, Armadillo, GNU Octave, Mathematica,[10] MATLAB,[11] NumPy,[12] R, Julia and Lisp-Stat.

These libraries would contain subroutines for common high-level mathematical operations such as root finding, matrix inversion, and solving systems of equations.

The most prominent numerical programming library was IBM's Scientific Subroutine Package (SSP).[13]

These subroutine libraries allowed programmers to concentrate on their specific problems and avoid re-implementing well-known algorithms.

The library routines would also be better than average implementations; matrix algorithms, for example, might use full pivoting to get better numerical accuracy.

A specification for these kernel operations using scalars and vectors, the level-1 Basic Linear Algebra Subroutines (BLAS), was published in 1979.
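
As an illustration, the short sketch below calls two Level-1 routines (axpy and a dot product) through the CBLAS C interface, the C binding provided by most BLAS implementations; the library name in the compile comment is an assumption that depends on which implementation is installed.

```c
/* Minimal Level-1 BLAS sketch using the CBLAS C interface.
 * Computes y <- alpha*x + y (daxpy) and then a dot product (ddot).
 * Link against an installed BLAS, e.g. cc example.c -lopenblas (assumed). */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    double x[3] = {1.0, 2.0, 3.0};
    double y[3] = {4.0, 5.0, 6.0};

    /* y <- 2*x + y, with unit strides (incx = incy = 1) */
    cblas_daxpy(3, 2.0, x, 1, y, 1);

    /* dot <- x . y */
    double dot = cblas_ddot(3, x, 1, y, 1);

    printf("y = [%g %g %g], x.y = %g\n", y[0], y[1], y[2], dot);
    return 0;
}
```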

"[20] This level, formally published in 1990,[19] contains matrix-matrix operations, including a "general matrix multiplication" (gemm), of the form where A and B can optionally be transposed or hermitian-conjugated inside the routine, and all three matrices may be strided.

Due to the ubiquity of matrix multiplications in many scientific applications, including for the implementation of the rest of Level 3 BLAS,[21] and because faster algorithms exist beyond the obvious repetition of matrix-vector multiplication, gemm is a prime target of optimization for BLAS implementers.

This is one of the motivations for including the β parameter, so the results of previous blocks can be accumulated.
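
For example, a product can be computed in panels along the k-dimension, passing β = 0 on the first panel and β = 1 afterwards so that each partial product is accumulated into C. The helper below is a hypothetical sketch of this pattern (row-major storage, assumed panel width kb), not a routine from any particular BLAS.

```c
/* Hypothetical sketch: compute C <- A*B panel by panel along the
 * k-dimension. The first panel overwrites C (beta = 0); every later
 * panel accumulates into it (beta = 1). Row-major storage assumed. */
#include <stddef.h>
#include <cblas.h>

void gemm_by_panels(int m, int n, int k, int kb,
                    const double *A, const double *B, double *C) {
    for (int p = 0; p < k; p += kb) {
        int kc = (k - p < kb) ? (k - p) : kb;   /* width of this panel */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, kc,
                    1.0,
                    A + p, k,                   /* columns p..p+kc-1 of A */
                    B + (size_t)p * n, n,       /* rows p..p+kc-1 of B    */
                    (p == 0) ? 0.0 : 1.0,       /* beta: overwrite, then accumulate */
                    C, n);
    }
}
```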

More recently, implementations by Kazushige Goto have shown that blocking only for the L2 cache, combined with carefully amortizing the cost of copying data to contiguous memory to reduce TLB misses, is superior to ATLAS.
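
The sketch below illustrates the idea only; it is not Goto's code, the block sizes MC and KC are assumed tuning parameters, and it calls cblas_dgemm where a real implementation would use a hand-written inner kernel and also pack panels of B. It does show the key step: a block of A sized for the cache is packed once into contiguous memory and then reused against all of B, so the cost of the copy is amortized over many floating-point operations.

```c
/* Illustrative sketch of the packing idea only (not Goto's code):
 * an MC x KC block of A is copied once into a contiguous buffer so it
 * stays resident in cache and causes few TLB misses, then reused against
 * all of B before the next block is packed.
 * Computes C <- C + A*B, row-major storage assumed. */
#include <stdlib.h>
#include <string.h>
#include <cblas.h>

enum { MC = 256, KC = 256 };   /* assumed, machine-dependent block sizes */

void blocked_gemm(int m, int n, int k,
                  const double *A, const double *B, double *C) {
    double *Apack = malloc((size_t)MC * KC * sizeof *Apack);

    for (int pc = 0; pc < k; pc += KC) {
        int kc = (k - pc < KC) ? (k - pc) : KC;
        for (int ic = 0; ic < m; ic += MC) {
            int mc = (m - ic < MC) ? (m - ic) : MC;

            /* Pack the mc x kc block of A into contiguous memory. */
            for (int i = 0; i < mc; ++i)
                memcpy(Apack + (size_t)i * kc,
                       A + (size_t)(ic + i) * k + pc,
                       (size_t)kc * sizeof *Apack);

            /* Multiply the packed block by the matching panel of B,
             * accumulating into the corresponding rows of C (beta = 1). */
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        mc, n, kc,
                        1.0, Apack, kc,
                        B + (size_t)pc * n, n,
                        1.0, C + (size_t)ic * n, n);
        }
    }
    free(Apack);
}
```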

The traditional BLAS functions have also been ported to architectures that support large amounts of parallelism, such as GPUs.[51]

On these architectures, the traditional BLAS functions typically provide good performance for large matrices.

However, when computing, e.g., the matrix-matrix products of many small matrices with the GEMM routine, these architectures show significant performance losses.

To address this, a batched version of the BLAS functions has been specified.[52] Taking the GEMM routine from above as an example, the batched version performs the following computation simultaneously for many matrices:

C[k] ← α A[k] B[k] + β C[k]   for all k

Often, this operation is implemented for a strided batched memory layout, in which the matrices of the batch are stored concatenated in the arrays A, B and C.
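
The sketch below spells out this layout and the batched semantics with an ordinary loop of cblas_dgemm calls (the helper name, row-major storage and stride choices are assumptions); optimized batched BLAS routines perform the same computation but schedule all the small products together.

```c
/* Sketch of the strided batched layout: matrix i of each operand starts
 * at A + i*strideA, B + i*strideB, C + i*strideC, with the matrices of a
 * batch stored back to back. A plain loop over cblas_dgemm shows the
 * semantics only. Row-major storage assumed. */
#include <stddef.h>
#include <cblas.h>

void dgemm_strided_batched(int m, int n, int k, int batch_count,
                           double alpha, const double *A, const double *B,
                           double beta, double *C) {
    size_t strideA = (size_t)m * k;
    size_t strideB = (size_t)k * n;
    size_t strideC = (size_t)m * n;

    for (int i = 0; i < batch_count; ++i) {
        /* C[i] <- alpha * A[i] * B[i] + beta * C[i] */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k,
                    alpha, A + i * strideA, k,
                           B + i * strideB, n,
                    beta,  C + i * strideC, n);
    }
}
```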

In exponential integrators, for example, the matrix exponentiation, the computationally expensive part of the integration, can be computed in parallel for all time steps by using batched BLAS functions.[53]
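
As a hypothetical illustration (not taken from the cited work), the sketch below approximates exp(h[t]·A) for every time step t with a truncated Taylor series. Each series term requires one independent n × n matrix product per time step, which is exactly the shape of work a batched GEMM handles; a plain loop of cblas_dgemm calls stands in for the batched routine here.

```c
/* Hypothetical sketch: approximate E[t] = exp(h[t]*A) for all time steps
 * with a truncated Taylor series. At every order j the update of each
 * time step is an independent n x n matrix product, so the inner loop
 * over t maps directly onto a batched GEMM. Row-major storage assumed. */
#include <stdlib.h>
#include <string.h>
#include <cblas.h>

/* A is n x n; E holds nsteps matrices of size n x n stored back to back. */
void batched_expm_taylor(int n, int nsteps, int order,
                         const double *A, const double *h, double *E) {
    size_t nn = (size_t)n * n;
    double *P = malloc(nsteps * nn * sizeof *P);      /* current Taylor term per step */
    double *Pnext = malloc(nn * sizeof *Pnext);

    /* Initialize E[t] = I and P[t] = I for every time step t. */
    memset(E, 0, nsteps * nn * sizeof *E);
    for (int t = 0; t < nsteps; ++t)
        for (int i = 0; i < n; ++i)
            E[t * nn + (size_t)i * n + i] = 1.0;
    memcpy(P, E, nsteps * nn * sizeof *P);

    for (int j = 1; j <= order; ++j) {
        for (int t = 0; t < nsteps; ++t) {            /* independent -> batchable */
            /* P[t] <- (h[t]/j) * A * P[t], i.e. the next Taylor term */
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        n, n, n,
                        h[t] / j, A, n, P + t * nn, n,
                        0.0, Pnext, n);
            memcpy(P + t * nn, Pnext, nn * sizeof *Pnext);
            /* E[t] <- E[t] + P[t] */
            cblas_daxpy((int)nn, 1.0, P + t * nn, 1, E + t * nn, 1);
        }
    }
    free(Pnext);
    free(P);
}
```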