Typically, the matrix is assumed to be stored in row-major or column-major order (i.e., contiguous rows or columns, respectively, arranged consecutively).
The problem of non-square in-place transposition has been studied since at least the late 1950s, and a number of algorithms are known, including several that attempt to optimize locality for cache, out-of-core, or similar memory-related contexts.
On a computer, one can often avoid explicitly transposing a matrix in memory by simply accessing the same data in a different order.
For example, software libraries for linear algebra, such as BLAS, typically provide options to specify that certain matrices are to be interpreted in transposed order to avoid data movement.
However, there remain a number of circumstances in which it is necessary or desirable to physically reorder a matrix in memory to its transposed ordering.
If repeated operations need to be performed on the columns, for example in a fast Fourier transform algorithm (e.g. Frigo & Johnson, 2005), transposing the matrix in memory (to make the columns contiguous) may improve performance by increasing memory locality.
Since these situations normally coincide with the case of very large matrices (which exceed the cache size), performing the transposition in-place with minimal additional storage becomes desirable.
For example, consider the 2×4 matrix

    11 12 13 14
    21 22 23 24

which is stored in computer memory (in row-major order) as the sequence (11, 12, 13, 14, 21, 22, 23, 24). If we transpose this, we obtain the 4×2 matrix

    11 21
    12 22
    13 23
    14 24

which is stored in computer memory as the sequence (11, 21, 12, 22, 13, 23, 14, 24).
In the following, we assume that the N×M matrix is stored in row-major order with zero-based indices.
This means that the (n,m) element, for n = 0,...,N−1 and m = 0,...,M−1, is stored at an address a = Mn + m (plus some offset in memory, which we ignore).
In the transposed M×N matrix, the corresponding (m,n) element is stored at the address a' = Nm + n, again in row-major order.
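To make these formulas concrete, the following small C program (a toy illustration of ours, not drawn from any published implementation) prints where each element of the 2×4 example above must move:

    #include <stdio.h>

    int main(void) {
        enum { N = 2, M = 4 };           /* an N x M matrix, row-major */
        int A[N * M] = {11, 12, 13, 14,  /* row 0 */
                        21, 22, 23, 24}; /* row 1 */
        for (int n = 0; n < N; n++)
            for (int m = 0; m < M; m++) {
                int a  = M * n + m;  /* address of (n,m) before transposing */
                int ap = N * m + n;  /* address of (m,n) after transposing  */
                printf("element %d moves from address %d to %d\n", A[a], a, ap);
            }
        return 0;
    }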
Equivalently, for 0 ≤ a < MN − 1 the transposition sends address a to Na mod (MN − 1), while the last address MN − 1 is always fixed. An address is therefore a fixed point exactly when (N − 1)a is divisible by MN − 1, which happens for gcd(N − 1, M − 1) + 1 addresses in total (counting the last). If N − 1 and M − 1 are coprime, the only two fixed points are the upper-left and lower-right corners of the matrix.
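A quick check of this count (our own sketch, using the a → Na mod (MN − 1) description above) for a few small shapes:

    #include <stdio.h>

    /* greatest common divisor, for the predicted fixed-point count */
    static int gcd(int x, int y) { return y ? gcd(y, x % y) : x; }

    int main(void) {
        int shapes[][2] = {{2, 4}, {3, 5}, {4, 6}};
        for (int i = 0; i < 3; i++) {
            int N = shapes[i][0], M = shapes[i][1], fixed = 0;
            for (int a = 0; a < N * M; a++)
                if (a == N * M - 1 || N * a % (N * M - 1) == a)
                    fixed++;
            printf("%dx%d: %d fixed points (expected %d)\n",
                   N, M, fixed, gcd(N - 1, M - 1) + 1);
        }
        return 0;
    }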
The following briefly summarizes the published algorithms to perform in-place matrix transposition.
For a square N×N matrix, in-place transposition is easy, because all of the cycles have length 1 (elements on the diagonal) or 2 (the upper triangle is swapped with the lower triangle). Pseudocode to accomplish this (assuming zero-based array indices) is:

    for n = 0 to N − 2
        for m = n + 1 to N − 1
            swap A(n,m) with A(m,n)

This type of implementation, while simple, can exhibit poor performance due to poor cache-line utilization, especially when N is a power of two (due to cache-line conflicts in a CPU cache with limited associativity).
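A direct C rendering of that pseudocode (a minimal sketch of ours, with the matrix stored row-major in a flat array):

    /* In-place transpose of a square n x n row-major matrix:
     * swap each above-diagonal element with its mirror image. */
    void transpose_square(double *a, int n) {
        for (int i = 0; i < n - 1; i++)
            for (int j = i + 1; j < n; j++) {
                double t = a[i * n + j];
                a[i * n + j] = a[j * n + i];
                a[j * n + i] = t;
            }
    }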
One solution to improve the cache utilization is to "block" the algorithm to operate on several numbers at once, in blocks given by the cache-line size; unfortunately, this means that the algorithm depends on the size of the cache line (it is "cache-aware"), and on a modern computer with multiple levels of cache it requires multiple levels of machine-dependent blocking.
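One way such blocking might look in C (our sketch; the tile size B is an assumed tunable, standing in for the machine-dependent cache-line parameter):

    /* Cache-aware square transpose: swap B x B tiles so that each
     * cache line loaded from either triangle is reused ~B times. */
    #define B 32  /* tile size; would be tuned to the cache line/size */

    void transpose_square_blocked(double *a, int n) {
        for (int ii = 0; ii < n; ii += B)
            for (int jj = ii; jj < n; jj += B)
                for (int i = ii; i < ii + B && i < n; i++) {
                    /* within a diagonal tile, stay above the diagonal */
                    int j0 = (ii == jj) ? i + 1 : jj;
                    for (int j = j0; j < jj + B && j < n; j++) {
                        double t = a[i * n + j];
                        a[i * n + j] = a[j * n + i];
                        a[j * n + i] = t;
                    }
                }
    }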
An alternative is to divide the matrix recursively, for example into four submatrices: the two diagonal blocks are transposed recursively in place, while the two off-diagonal blocks are swapped with each other and transposed. (When N is sufficiently small, the simple algorithm above is used as a base case, as naively recursing all the way down to N=1 would have excessive function-call overhead.)
This is a cache-oblivious algorithm, in the sense that it can exploit the cache line without the cache-line size being an explicit parameter.
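A possible C sketch of this recursive scheme (ours; BASE is an assumed base-case cutoff, deliberately not tied to any cache parameter):

    #define BASE 16  /* base-case size; an assumed cutoff */

    /* Swap-and-transpose the rb x cb block at (r,c) with its mirror
     * block at (c,r), recursively splitting the longer dimension. */
    static void swap_rec(double *a, int n, int r, int c, int rb, int cb) {
        if (rb <= BASE && cb <= BASE) {
            for (int i = 0; i < rb; i++)
                for (int j = 0; j < cb; j++) {
                    double t = a[(r + i) * n + (c + j)];
                    a[(r + i) * n + (c + j)] = a[(c + j) * n + (r + i)];
                    a[(c + j) * n + (r + i)] = t;
                }
        } else if (rb >= cb) {
            swap_rec(a, n, r, c, rb / 2, cb);
            swap_rec(a, n, r + rb / 2, c, rb - rb / 2, cb);
        } else {
            swap_rec(a, n, r, c, rb, cb / 2);
            swap_rec(a, n, r, c + cb / 2, rb, cb - cb / 2);
        }
    }

    /* Cache-oblivious in-place transpose of the size x size diagonal
     * sub-block with upper-left corner (off, off); call with (a,n,0,n). */
    void transpose_rec(double *a, int n, int off, int size) {
        if (size <= BASE) {  /* base case: the simple algorithm above */
            for (int i = off; i < off + size; i++)
                for (int j = i + 1; j < off + size; j++) {
                    double t = a[i * n + j];
                    a[i * n + j] = a[j * n + i];
                    a[j * n + i] = t;
                }
            return;
        }
        int h = size / 2;
        transpose_rec(a, n, off, h);               /* diagonal block */
        transpose_rec(a, n, off + h, size - h);    /* diagonal block */
        swap_rec(a, n, off, off + h, h, size - h); /* off-diagonal   */
    }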
The most direct non-square algorithms simply follow the cycles of the permutation described above, moving each element exactly once and requiring little auxiliary storage beyond some bookkeeping of which elements have been moved. However, they also involve a considerable amount of arithmetic to compute the cycles, and require heavily non-consecutive memory accesses, since the adjacent elements of the cycles differ by multiplicative factors of N, as discussed above.
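A sketch of such a cycle-following transpose in C (ours; the byte-per-element "moved" array is one convenient bookkeeping choice, though published algorithms often get by with far less extra storage):

    #include <stdint.h>
    #include <stdlib.h>

    /* In-place transpose of an n x m row-major matrix by following the
     * cycles of the permutation a -> n*a mod (n*m - 1). */
    void transpose_cycles(double *a, int64_t n, int64_t m) {
        int64_t size = n * m;
        if (size <= 2) return;
        unsigned char *moved = calloc(size, 1); /* bookkeeping flags */
        for (int64_t start = 1; start < size - 1; start++) {
            if (moved[start]) continue;
            double carried = a[start]; /* value being relocated */
            int64_t cur = start;
            do {                       /* rotate one whole cycle */
                int64_t next = n * cur % (size - 1);
                double t = a[next];
                a[next] = carried;     /* drop carried value at its home */
                carried = t;           /* pick up the displaced value    */
                moved[next] = 1;
                cur = next;
            } while (cur != start);
        }
        free(moved);
    }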
Several algorithms have therefore been designed to achieve greater memory locality at the cost of greater total data movement. That is, they may move each data element more than once, but they involve more consecutive memory access (greater spatial locality), which can improve performance on modern CPUs that rely on caches, as well as on SIMD architectures optimized for processing consecutive data blocks.
The oldest context in which the spatial locality of transposition seems to have been studied is for out-of-core operation (by Alltop, 1975), where the matrix is too large to fit into main memory ("core").
Another algorithm for non-coprime dimensions, involving multiple subsidiary transpositions, was described by Catanzaro et al. (2014).
Frigo & Johnson (2005) describe the adaptation of these algorithms to use cache-oblivious techniques for general-purpose CPUs relying on cache lines to exploit spatial locality.
Work on out-of-core matrix transposition, where the matrix does not fit in main memory and must largely reside on a hard disk, has focused mostly on the N = M square-matrix case, with some exceptions (e.g. Alltop, 1975).