Cache-oblivious distribution sort

It is similar to quicksort, but it is a cache-oblivious algorithm, designed for a setting where the number of elements to sort is too large to fit in a cache where operations are done.

items on a machine with cache of size

This number of memory transfers has been shown to be asymptotically optimal for comparison sorts.

This distribution sort also achieves the asymptotically optimal runtime complexity of

Distribution sort operates on a contiguous array of

To sort the elements, it performs the following:

The distribution step algorithm maintains two invariants.

Initially, the algorithm starts with one empty bucket with pivot

The split is done by performing the linear time median finding algorithm, and partitioning based on this median.

At the end of the distribution step, all elements are in the buckets, and the two invariants will still hold.

To accomplish this, each subarray and bucket will have a state associated with it.

The state of a subarray consists of an index next of the next element to be read from the subarray, and a bucket number bnum indicating which bucket index the element should be copied to.

(Note that when we split a bucket, we have to increment all bnum values of all subarrays whose bnum value is greater than the index of the bucket that is split.)

Consider the follow basic strategy: iterate through each subarray, attempting to copy over its element at position next.

Otherwise, increment bnum until a bucket whose pivot is large enough is found.

Though this correctly distributes all elements, it does not exhibit a good cache performance.

Instead, the distribution step is performed in a recursive divide-and-conquer.

It requires as a precondition that each subarray r in the range

Pseudocode for the implementation of distribute is shown below: The base case, where m=1, has a call to the subroutine copy_elems.

In this base case, all elements from subarray i that belong to bucket j are added at once.