Computer software is said to exhibit scalable locality[1] if it can continue to make use of processors that outpace their memory systems as it solves ever larger problems.
The term is the high-performance uniprocessor analog of scalable parallelism, which refers to software for which ever larger numbers of processors can be employed to solve ever larger problems.
Here, we could achieve any compute balance we desire simply by choosing a large enough T. However, when N is large, this code will still not exhibit good cache reuse, due to poor locality of reference: by the time new(1,1) is needed in the second assignment, or touched again in the second time step's execution of the first assignment, the cache line holding new(1,1) will have been overwritten with some other part of one of the arrays.
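The listing being discussed is not reproduced here; the following C sketch only illustrates the general shape of such a time-stepped relaxation. The names T, N, A and new follow the prose, while the two-dimensional shape and the five-point stencil are assumptions based on the reference to new(1,1).

    /* Hypothetical sketch of the time-stepped relaxation under discussion.
     * "new" is spelled new_ here only because new is a C++ keyword; in
     * practice N would be much larger than shown. */
    #define N 1024
    #define T 100

    static double A[N][N], new_[N][N];

    void relax(void)
    {
        for (int t = 0; t < T; t++) {
            /* first assignment: average the neighbours into new_ */
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    new_[i][j] = 0.25 * (A[i-1][j] + A[i+1][j]
                                       + A[i][j-1] + A[i][j+1]);

            /* second assignment: copy back into A.  By the time new_[1][1]
             * is read here (or touched again in the next time step's first
             * assignment), the sweeps above have walked over all of A and
             * new_, so for large N its cache line has long been evicted. */
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    A[i][j] = new_[i][j];
        }
    }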
To produce a very high degree of locality, for example a compute balance of 500 (as would be needed to run this code efficiently with an array that does not fit in RAM and is relegated to virtual memory), we must re-use values across time steps.
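One way to obtain such reuse is to tile the iteration space across the time loop. The sketch below is not taken from the cited compilers and, to stay short, applies time skewing to a one-dimensional three-point stencil rather than the code above: the skewed axis j = i + t is swept in tiles of width B, so each tile carries a working set of roughly B + T elements per array (small enough to stay in cache) through all T time steps, instead of sweeping the whole array once per step. The names N, T and B are illustrative.

    #include <stdio.h>

    #define N  100000   /* assumed spatial size */
    #define T  500      /* time steps of reuse carried through each tile */
    #define B  2000     /* tile width along the skewed axis (tunable) */

    static double plane[2][N + 1];   /* double buffer: plane[t%2] holds time t */

    int main(void)
    {
        /* initial condition at t = 0, plus fixed boundary values in both planes */
        for (int i = 0; i <= N; i++)
            plane[0][i] = plane[1][i] = (i == 0 || i == N) ? 1.0 : 0.0;

        /* skewed index j = i + t runs from 2 to (N-1)+T; skewing by one per
         * time step makes this rectangular tiling of the (t, j) space legal */
        for (int jj = 2; jj <= N - 1 + T; jj += B) {        /* tile loop  */
            for (int t = 1; t <= T; t++) {                   /* time loop  */
                int jhi = jj + B - 1;
                if (jhi > N - 1 + T) jhi = N - 1 + T;
                for (int j = jj; j <= jhi; j++) {            /* space loop */
                    int i = j - t;                           /* unskew     */
                    if (i < 1 || i > N - 1) continue;        /* off-grid   */
                    plane[t % 2][i] = (plane[(t - 1) % 2][i - 1]
                                     + plane[(t - 1) % 2][i]
                                     + plane[(t - 1) % 2][i + 1]) / 3.0;
                }
            }
        }

        printf("A(N/2) after %d steps: %g\n", T, plane[T % 2][N / 2]);
        return 0;
    }

Each element loaded into cache by a tile is now used on the order of T times before eviction, so the compute balance grows with T while the memory traffic per tile stays roughly proportional to B + T.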
Optimization across time steps has been explored in a number of research compilers; see work by Wonnacott,[1][2] by Song and Li,[3] or by Sadayappan et al.[4] for details of some approaches to time-tiling.