Loop nest optimization

After loop tiling is applied using 2 * 2 blocks, the code looks like: The original loop iteration space is n by n. The accessed chunk of array a[i, j] is also n by n. When n is too large and the cache size of the machine is too small, the accessed array elements in one loop iteration (for example, i = 1, j = 1 to n) may cross cache lines, causing cache misses.

Explicit blocking requires choosing a tile size based on these factors.

By contrast, cache-oblivious algorithms are designed to make efficient use of cache without explicit blocking.

Many large mathematical operations on computers end up spending much of their time doing matrix multiplication.

By carrying four accumulators simultaneously, this code can keep a single floating point adder with a latency of 4 busy nearly all the time (problem #1).

This code has had both the i and j iterations blocked by a factor of two and had both the resulting two-iteration inner loops completely unrolled.

As a result, the code above will run slower on the 2.8 GHz Pentium 4 than on the 166 MHz Y-MP!

A machine with a longer floating-point add latency or with multiple adders would require more accumulators to run in parallel.

The loop requires registers to hold both the accumulators and the loaded and reused A and B values.

A 3x3 block requires 13, which will not work on a machine with just 8 floating point registers in the ISA.

With this code, ib can be set to any desired parameter, and the number of loads of the B matrix will be reduced by that factor.

Blocking the k loop means that the C array will be loaded and stored N/kb times, for a total of ⁠

So long as the machine's memory system will keep up with the floating point unit and the code will run at maximum performance.

The above code examples do not show the details of dealing with values of N which are not multiples of the blocking factors.

Compilers which do loop nest optimization emit code to clean up the edges of the computation.

The above loop will only achieve 80% of peak flops on the example system when blocked for the 16KB L1 cache size.