Radix sort

Radix sort dates back as far as 1887 to the work of Herman Hollerith on tabulating machines.[2]

The first memory-efficient computer algorithm for this sorting method was developed in 1954 at MIT by Harold H. Seward.

Computerized radix sorts had previously been dismissed as impractical because of the perceived need for variable allocation of buckets of unknown size.

Seward's innovation was to use a linear scan to determine the required bucket sizes and offsets beforehand, allowing for a single static allocation of auxiliary memory.

This linear scan is closely related to Seward's other algorithm, counting sort.
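Seward's count-then-offset scan can be sketched as follows. This is a minimal illustration, not his original implementation; `counting_sort_by_digit` is a hypothetical name, and the base is assumed to be a power of two so that a digit can be extracted with a shift and a modulus:

```python
def counting_sort_by_digit(a, base, shift):
    """Stable counting sort of non-negative integers by one digit.

    One linear scan tallies the bucket sizes, a prefix sum turns the
    counts into starting offsets, and the output can then be written
    into a single statically allocated buffer.
    """
    counts = [0] * base
    for x in a:                      # linear scan: determine bucket sizes
        counts[(x >> shift) % base] += 1
    offsets = [0] * base
    for d in range(1, base):         # prefix sum: bucket start offsets
        offsets[d] = offsets[d - 1] + counts[d - 1]
    out = [0] * len(a)               # single static allocation
    for x in a:                      # stable placement into buckets
        d = (x >> shift) % base
        out[offsets[d]] = x
        offsets[d] += 1
    return out
```

Because keys are placed in bucket order while preserving their scan order within each bucket, the pass is stable, which is what lets repeated digit passes build a full radix sort.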

In the modern era, radix sorts are most commonly applied to collections of binary strings and integers.

MSD sorts are not necessarily stable, so the original relative order of duplicate keys is not guaranteed to be maintained.

Other than the traversal order, MSD and LSD sorts differ in their handling of variable length input.

Although it is always possible to pre-determine the bucket boundaries using counts, some implementations opt to use dynamic memory allocation instead.

Optimized radix sorts can be very fast when working in a domain that suits them.

Large key sizes can hinder LSD implementations when the induced number of passes becomes the bottleneck.

MSD radix sort can be implemented as a stable algorithm, but doing so requires a memory buffer of the same size as the input array.

Thus, equal elements will be placed in the memory buffer in the same order they were in the input array.

After the sort by the last digit has been completed, the output buffer is checked to see if it is the original input array; if it is not, a single copy back is performed.
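The buffer-swapping scheme is perhaps easiest to see in an LSD formulation. The sketch below (hypothetical name `radix_sort`, assuming non-negative integer keys and 8-bit digits) alternates the input array and an equal-sized buffer between digit passes, and copies back only if the final pass left the result in the buffer:

```python
def radix_sort(a, base=256):
    """Stable LSD radix sort over non-negative integers.

    The array and an equal-sized auxiliary buffer swap roles on each
    pass, so no per-pass copying is needed; at most one copy happens
    at the very end.
    """
    if not a:
        return a
    src, dst = a, [0] * len(a)
    m, shift = max(a), 0
    while m >> shift:                # one pass per 8-bit digit
        counts = [0] * base
        for x in src:
            counts[(x >> shift) % base] += 1
        pos = [0] * base
        for d in range(1, base):     # prefix sum: bucket offsets
            pos[d] = pos[d - 1] + counts[d - 1]
        for x in src:                # stable placement
            d = (x >> shift) % base
            dst[pos[d]] = x
            pos[d] += 1
        src, dst = dst, src          # swap buffer roles
        shift += 8                   # base is assumed to be 2**8
    if src is not a:                 # result ended in the buffer:
        a[:] = src                   # perform the single final copy
    return a
```

With an even number of passes the result lands back in the input array and the final copy is skipped entirely.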

At the top level of recursion, the opportunity for parallelism lies in the counting sort portion of the algorithm.

Counting is highly parallel, amenable to the parallel_reduce pattern, and splits the work well across multiple cores until the memory bandwidth limit is reached.

For random inputs, all bins would be nearly equally populated, making a large amount of parallelism available.
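The parallel counting step might be sketched as follows. The name `parallel_counts` is hypothetical; threads are used here only to illustrate the map-then-reduce shape, since in CPython the global interpreter lock prevents true core-level parallelism for pure-Python counting loops:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_counts(a, base, shift, workers=4):
    """Histogram one digit of the keys in parallel_reduce style.

    Each worker tallies a contiguous chunk independently (the "map"),
    and the per-chunk histograms are then summed (the "reduce").
    """
    def count(chunk):
        c = [0] * base
        for x in chunk:
            c[(x >> shift) % base] += 1
        return c

    step = -(-len(a) // workers)     # ceiling division: chunk size
    chunks = [a[i:i + step] for i in range(0, len(a), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count, chunks))
    total = [0] * base
    for c in partials:               # reduce: merge partial histograms
        for d, n in enumerate(c):
            total[d] += n
    return total
```

On random input each of the `base` buckets receives roughly `len(a) / base` keys, so the recursive sub-sorts on the buckets also divide the remaining work evenly.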

The fastest known PRAM sorts were described in 1991 by David M. W. Powers: a parallelized quicksort that can operate in O(log(n)) time on a CRCW-PRAM with n processors by performing partitioning implicitly, and a radix sort that operates using the same trick in O(k) time, where k is the maximum key length.[15]

However, neither the PRAM architecture nor a single sequential processor can actually be built in a way that will scale without the number of constant fan-out gate delays per cycle increasing as O(log(n)). In effect, a pipelined version of Batcher's bitonic mergesort and the O(log(n)) PRAM sorts are all O(log²(n)) in terms of clock cycles, with Powers acknowledging that Batcher's sort would have a lower constant in terms of gate delays than his parallel quicksort and radix sort, or Cole's merge sort, for a keylength-independent sorting network of O(n log²(n)).

An IBM card sorter performing a radix sort on a large set of punched cards. Cards are fed into a hopper below the operator's chin and are sorted into one of the machine's 13 output baskets, based on the data punched into one column on the cards. The crank near the input hopper is used to move the read head to the next column as the sort progresses. The rack in back holds cards from the previous sorting pass.