[1] Many different algorithms have been designed for multiplying matrices on different types of hardware, including parallel and distributed systems, where the computational work is spread over multiple processors (perhaps over a network).
[6] The three loops in iterative matrix multiplication can be arbitrarily swapped with each other without an effect on correctness or asymptotic running time.
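The following Python sketch (function name and the use of plain nested lists are chosen for illustration, not taken from the cited sources) shows the iterative algorithm; the three nested loops may be reordered without changing the result:

def matmul_iterative(A, B):
    # Multiply an n x m matrix A by an m x p matrix B with three nested loops.
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    # The i, j, k loops below may be permuted (e.g. i-k-j) without affecting
    # correctness or the Theta(n^3) operation count; only the memory access
    # pattern, and hence the cache behaviour, changes.
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C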
When n > M/b, every iteration of the inner loop (a simultaneous sweep through a row of A and a column of B) incurs a cache miss when accessing an element of B.
As of 2010, the speed of memories compared to that of processors is such that the cache misses, rather than the actual calculations, dominate the running time for sizable matrices.
[7] The optimal variant of the iterative algorithm for A and B in row-major layout is a tiled version, where the matrix is implicitly divided into square tiles of size √M by √M.[7][8] In the idealized cache model, this algorithm incurs only Θ(n^3/(b√M)) cache misses; the divisor b√M amounts to several orders of magnitude on modern machines, so that the actual calculations dominate the running time rather than the cache misses.
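A hedged sketch of this tiled (blocked) variant in Python; the tile size parameter T plays the role of √M and would in practice be chosen so that a few T × T tiles fit in cache (the function name and default value are assumptions for illustration):

def matmul_tiled(A, B, T=64):
    # Multiply square n x n matrices A and B by sweeping over T x T tiles,
    # so each tile of A, B and C stays in fast memory while it is in use.
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, T):
        for jj in range(0, n, T):
            for kk in range(0, n, T):
                # Multiply the tile of A at (ii, kk) by the tile of B at
                # (kk, jj) and accumulate into the tile of C at (ii, jj).
                for i in range(ii, min(ii + T, n)):
                    for k in range(kk, min(kk + T, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + T, n)):
                            C[i][j] += a * B[k][j]
    return C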
This relies on the block partitioning
A = [A11, A12; A21, A22], B = [B11, B12; B21, B22], C = [C11, C12; C21, C22],
which works for all square matrices whose dimensions are powers of two, i.e., the shapes are 2^n × 2^n for some n. The matrix product is now
C11 = A11B11 + A12B21, C12 = A11B12 + A12B22,
C21 = A21B11 + A22B21, C22 = A21B12 + A22B22,
which consists of eight multiplications of pairs of submatrices, followed by an addition step.
Application of the master theorem for divide-and-conquer recurrences shows this recursion to have the solution Θ(n^3), the same as the iterative algorithm.
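A sketch of this recursive block algorithm in Python, assuming square matrices whose size is a power of two (the function and helper names are illustrative, not from the source):

def matmul_recursive(A, B):
    # Divide-and-conquer multiplication of n x n matrices, n a power of two.
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    # Partition A and B into four h x h blocks each.
    A11 = [row[:h] for row in A[:h]]; A12 = [row[h:] for row in A[:h]]
    A21 = [row[:h] for row in A[h:]]; A22 = [row[h:] for row in A[h:]]
    B11 = [row[:h] for row in B[:h]]; B12 = [row[h:] for row in B[:h]]
    B21 = [row[:h] for row in B[h:]]; B22 = [row[h:] for row in B[h:]]

    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

    # Eight recursive multiplications followed by four block additions.
    C11 = add(matmul_recursive(A11, B11), matmul_recursive(A12, B21))
    C12 = add(matmul_recursive(A11, B12), matmul_recursive(A12, B22))
    C21 = add(matmul_recursive(A21, B11), matmul_recursive(A22, B21))
    C22 = add(matmul_recursive(A21, B12), matmul_recursive(A22, B22))
    # Reassemble the four result blocks into C.
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom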
[11] Strassen's algorithm, which reduces the eight block multiplications to seven and thereby runs in O(n^(log_2 7)) ≈ O(n^2.807) time, is very useful for large matrices over exact domains such as finite fields, where numerical stability is not an issue.
Since Strassen's algorithm is actually used in practical numerical software and computer algebra systems, improving on the constants hidden in the big O notation has its merits.
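For concreteness, a sketch of one level of Strassen's 2×2 block scheme in Python, using seven multiplications and 18 additions/subtractions (the function name is an assumption; the entries here are scalars for brevity, whereas in the full algorithm they are square submatrices and the recursion continues):

def strassen_2x2(A, B):
    # One level of Strassen's scheme on a 2 x 2 matrix (or 2 x 2 block matrix,
    # if + and * are read as block addition and block multiplication).
    # Seven multiplications instead of the eight of the naive block method.
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    c11 = m1 + m4 - m5 + m7
    c12 = m3 + m5
    c21 = m2 + m4
    c22 = m1 - m2 + m3 + m6
    return [[c11, c12], [c21, c22]]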
In 1976 Probert[16] showed that such an algorithm requires at least 15 additions (including subtractions); however, a hidden assumption was that the blocks and the 2×2-block matrix are represented in the same basis.
Karstadt and Schwartz computed in different bases and traded 3 additions for less expensive basis transformations.
[17] This basis-change trick was later applied to more general decompositions than 2×2-block matrices, improving the leading constant of their run times.
It is an open question in theoretical computer science how well Strassen's algorithm can be improved in terms of asymptotic complexity.
[19][20] Victor Pan proposed so-called feasible sub-cubic matrix multiplication algorithms with an exponent slightly above 2.77, but in return with a much smaller hidden constant coefficient.
The best "practical" algorithm found (an explicit low-rank decomposition of a matrix multiplication tensor) runs in O(n^2.778) time.
[24] Finding low-rank decompositions of such tensors (and beyond) is NP-hard; the optimal way to multiply even 3×3 matrices remains unknown, even over a commutative field.
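A brief sketch, in standard notation not drawn from the cited sources, of why such low-rank decompositions matter: if the matrix multiplication tensor ⟨n,n,n⟩ admits a rank-R decomposition, applying the corresponding bilinear algorithm recursively multiplies N × N matrices using O(N^(log_n R)) arithmetic operations:

\langle n,n,n\rangle \;=\; \sum_{r=1}^{R} u_r \otimes v_r \otimes w_r
\quad\Longrightarrow\quad
N \times N \text{ matrix multiplication in } O\!\left(N^{\log_n R}\right) \text{ arithmetic operations.}

Strassen's algorithm is the case n = 2, R = 7, giving the exponent log_2 7 ≈ 2.807.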
[24] On 4×4 matrices, AlphaTensor unexpectedly discovered a solution with 47 multiplication steps, an improvement over the 49 required with Strassen's algorithm of 1969, albeit restricted to mod 2 arithmetic.
Based on the surprising discovery that such improvements exist, other researchers were quickly able to find a similar independent 4×4 algorithm, and separately tweaked DeepMind's 96-step 5×5 algorithm down to 95 steps in mod 2 arithmetic and to 97[25] in normal arithmetic.
Exploiting the full parallelism of the problem, one obtains an algorithm that can be expressed in fork–join style pseudocode:[27] a procedure multiply(C, A, B) recursively forks the eight quadrant products (four accumulated into C and four into a temporary matrix T), joins, and then calls a procedure add(C, T) that adds T into C element-wise. Here, fork is a keyword that signals a computation may be run in parallel with the rest of the function call, while join waits for all previously "forked" computations to complete.
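A hedged Python sketch of this fork–join structure using concurrent.futures, where submit plays the role of fork and waiting on the futures plays the role of join (the names, the one-level recursion, and the thread-based executor are illustrative assumptions; for pure-Python arithmetic a thread pool does not yield real speedup because of the GIL):

from concurrent.futures import ThreadPoolExecutor

def _serial_matmul(A, B):
    # Plain triple-loop multiply, used here as the sequential kernel.
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            a = A[i][k]
            for j in range(p):
                C[i][j] += a * B[k][j]
    return C

def parallel_multiply(A, B):
    # One level of the fork-join scheme on n x n matrices (n even):
    # fork the eight quadrant products, join, then add the temporary into C.
    n = len(A)
    h = n // 2
    A11 = [r[:h] for r in A[:h]]; A12 = [r[h:] for r in A[:h]]
    A21 = [r[:h] for r in A[h:]]; A22 = [r[h:] for r in A[h:]]
    B11 = [r[:h] for r in B[:h]]; B12 = [r[h:] for r in B[:h]]
    B21 = [r[:h] for r in B[h:]]; B22 = [r[h:] for r in B[h:]]

    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

    with ThreadPoolExecutor() as pool:
        # "fork": the eight submatrix products may run concurrently.
        futures = [pool.submit(_serial_matmul, X, Y) for X, Y in [
            (A11, B11), (A11, B12), (A21, B11), (A21, B12),   # -> blocks of C
            (A12, B21), (A12, B22), (A22, B21), (A22, B22)]]  # -> blocks of T
        # "join": wait for all forked computations to complete.
        C11, C12, C21, C22, T11, T12, T21, T22 = [f.result() for f in futures]
    # add(C, T): element-wise addition of the temporary into the result.
    top = [r1 + r2 for r1, r2 in zip(add(C11, T11), add(C12, T12))]
    bottom = [r1 + r2 for r1, r2 in zip(add(C21, T21), add(C22, T22))]
    return top + bottom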
[28] The naïve algorithm is then used over the block matrices, computing products of submatrices entirely in fast memory.
This reduces communication bandwidth to O(n^3/√M), which is asymptotically optimal (for algorithms performing Ω(n^3) computation).
[32] On modern distributed computing environments such as MapReduce, specialized multiplication algorithms have been developed.
[34] The standard array is inefficient because the data from the two matrices does not arrive simultaneously and it must be padded with zeroes.
[36] The cross-wired mesh array may be seen as a special case of a non-planar (i.e. multilayered) processing structure.