Conventional divide and conquer sorting algorithms partition the array into sub-intervals or buckets.
However, if the array is non-uniformly distributed, the performance of these sorting algorithms can degrade significantly.
Samplesort addresses this issue by selecting a sample of size s from the n-element sequence and determining the range of the buckets by sorting the sample and choosing p − 1 elements (with p − 1 < s) from the result.
These elements (called splitters) then divide the array into p approximately equal-sized buckets.
Where quicksort partitions its input into two parts at each step, based on a single value called the pivot, samplesort instead takes a larger sample from its input and divides its data into buckets accordingly.[5]
In the following sketch, A is the unsorted data, k is the oversampling factor (discussed later), and p is the number of buckets, so that p − 1 splitters are chosen.
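A minimal sequential C++ sketch under these conventions follows; the base-case rule, the degenerate-sample fallback, and the helper names are illustrative assumptions, not part of the original algorithm description.

    #include <algorithm>
    #include <iterator>
    #include <random>
    #include <vector>

    // Sequential samplesort sketch: draw an oversampled random sample,
    // pick every k-th sorted sample element as a splitter, scatter the
    // input into p buckets, and recurse on each bucket.
    std::vector<int> sampleSort(std::vector<int> A, int k, int p) {
        if (A.size() <= static_cast<std::size_t>(p) * k) {
            std::sort(A.begin(), A.end());   // small input: sort directly
            return A;
        }
        // Step 1: sample of size (p-1)*k, sorted; splitters are every
        // k-th element of the sorted sample.
        std::mt19937 rng{12345};
        std::vector<int> S;
        std::sample(A.begin(), A.end(), std::back_inserter(S),
                    static_cast<std::size_t>(p - 1) * k, rng);
        std::sort(S.begin(), S.end());
        std::vector<int> splitters;
        for (int i = 1; i < p; ++i) splitters.push_back(S[i * k - 1]);
        if (splitters.front() == splitters.back()) {
            std::sort(A.begin(), A.end());   // degenerate sample (many
            return A;                        // duplicates): fall back
        }
        // Step 2: each element goes into the bucket of the first
        // splitter that is greater than it.
        std::vector<std::vector<int>> buckets(p);
        for (int a : A) {
            std::size_t j = std::upper_bound(splitters.begin(),
                                             splitters.end(), a)
                            - splitters.begin();
            buckets[j].push_back(a);
        }
        // Step 3: sort buckets recursively and concatenate the results.
        std::vector<int> result;
        for (auto& b : buckets) {
            std::vector<int> sorted = sampleSort(std::move(b), k, p);
            result.insert(result.end(), sorted.begin(), sorted.end());
        }
        return result;
    }

In this sketch, a larger oversampling factor k makes the splitters, and hence the bucket sizes, more even, at the cost of a more expensive sampling step.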
The number of comparisons performed by this algorithm approaches the information-theoretic optimum log2(n!) for large input sequences.
In experiments conducted by Frazer and McKellar, the algorithm needed 15% fewer comparisons than quicksort.[6][4][7]
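Here log2(n!) is the information-theoretic lower bound on the number of comparisons any comparison-based sorting algorithm must perform; by Stirling's approximation,

    \log_2(n!) = n \log_2 n - n \log_2 e + O(\log n) \approx n \log_2 n .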
Due to the variable number of splitters (in contrast to the single pivot in quicksort), Samplesort is very well suited and intuitive for parallelization and scaling.
Samplesort is efficient in parallel systems because, with well-chosen splitters, each processor receives a bucket of approximately the same size, about n/p elements.
Once the sample has been sorted and the splitters have been distributed, each processor places its own input data into local buckets.
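As a rough sketch of this step (not the article's code: the helper name localBucketing, the use of std::thread, and the stripe layout are assumptions), each of p threads classifies its own stripe of the input into p thread-private buckets using the shared splitters:

    #include <algorithm>
    #include <thread>
    #include <vector>

    // Each thread owns local[tid], so no synchronization is needed while
    // classifying; afterwards, processor i would gather and sort the
    // i-th bucket of every thread.
    void localBucketing(const std::vector<int>& A,
                        const std::vector<int>& splitters, int p,
                        std::vector<std::vector<std::vector<int>>>& local) {
        local.assign(p, std::vector<std::vector<int>>(p));
        const std::size_t stripe = (A.size() + p - 1) / p;
        std::vector<std::thread> threads;
        for (int tid = 0; tid < p; ++tid) {
            threads.emplace_back([&, tid] {
                const std::size_t lo = tid * stripe;
                const std::size_t hi = std::min(A.size(), lo + stripe);
                for (std::size_t i = lo; i < hi; ++i) {
                    // Bucket index: first splitter greater than A[i].
                    std::size_t j = std::upper_bound(splitters.begin(),
                                                     splitters.end(), A[i])
                                    - splitters.begin();
                    local[tid][j].push_back(A[i]);
                }
            });
        }
        for (std::thread& th : threads) th.join();
    }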
Experiments performed in the early 1990s on Connection Machine supercomputers showed samplesort to be particularly good at sorting large datasets on these machines, because it incurs little interprocessor communication overhead.[9]
As described above, the samplesort algorithm splits the elements according to the selected splitters.
An efficient implementation strategy is proposed in the paper "Super Scalar Sample Sort".
It uses two arrays of size n (the original array containing the input data and a temporary one), so it is not an in-place algorithm.
In each recursion step, the data is copied from one array to the other in a partitioned fashion.
The splitters are placed in an implicit, balanced binary search tree t, laid out in an array like a binary heap (the root at index 1, with the children of node j at indices 2j and 2j + 1). Given the search tree t, the algorithm calculates the bucket number j of an element a_i by descending one level per step, computing j := 2j + (a_i > t_j), where the comparison result is treated as 0 or 1.
Since the comparison result is used arithmetically rather than in a conditional branch, no branch mispredictions occur; such mispredictions would otherwise slow down the classification significantly.
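A minimal sketch of this classification step, assuming p is a power of two, the p − 1 splitters are stored at t[1..p−1] in heap order as described above, and the function name bucketOf is illustrative:

    #include <vector>

    // Branchless classification: descend the implicit splitter tree,
    // folding each comparison result (0 or 1) into the node index.
    // Assumes t.size() == p with t[0] unused.
    int bucketOf(int a, const std::vector<int>& t, int p) {
        int j = 1;
        for (int step = p; step > 1; step /= 2) {  // log2(p) levels
            j = 2 * j + (a > t[j]);                // no conditional jump
        }
        return j - p;  // leaves p..2p-1 map to buckets 0..p-1
    }

Because the trip count is fixed, a compiler can fully unroll the loop, and the comparison can be compiled to a conditional move instead of a branch.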
To partition the elements of the sequence and put them into the array efficiently, the algorithm needs to know the sizes of the buckets in advance.
A naive way to obtain these sizes is to classify every element twice: once to count the bucket sizes and once to move the elements into place.
To avoid this doubling of comparisons, Super Scalar Sample Sort uses an additional array of size n, the oracle, which records the bucket index of every element during the counting pass; the placement pass then reads the stored index instead of classifying the element again.
Since each entry of this array only needs log2(p) bits, these costs are small compared to the space of the input array.
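A sketch of the resulting two-pass partitioning, reusing the hypothetical bucketOf helper above; tmp stands in for the temporary array mentioned earlier and must have the same size as A:

    #include <cstddef>
    #include <vector>

    // Pass 1 classifies every element once, recording its bucket in the
    // oracle and counting bucket sizes; a prefix sum turns the counts
    // into start offsets; pass 2 places elements by reading the oracle,
    // with no second classification.
    void partition(const std::vector<int>& A, std::vector<int>& tmp,
                   const std::vector<int>& t, int p) {
        const std::size_t n = A.size();
        std::vector<int> oracle(n);
        std::vector<std::size_t> offset(p + 1, 0);
        for (std::size_t i = 0; i < n; ++i) {
            oracle[i] = bucketOf(A[i], t, p);
            offset[oracle[i] + 1]++;
        }
        for (int b = 0; b < p; ++b)
            offset[b + 1] += offset[b];   // offset[b] = start of bucket b
        for (std::size_t i = 0; i < n; ++i)
            tmp[offset[oracle[i]]++] = A[i];
    }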
A key disadvantage of the efficient Samplesort implementation shown above is that it is not in-place and requires a second temporary array of the same size as the input sequence during sorting.
However, its in-place successor, presented as "In-place Parallel Super Scalar Samplesort" (IPS⁴o), removes this drawback: it performs up to three times faster than other state-of-the-art in-place competitors and up to 1.5 times faster than other state-of-the-art sequential competitors.
In its local classification phase, the input array is split into stripes of blocks of equal size, one stripe per processor, and each processor allocates a buffer block for each bucket.
Each processor then scans its stripe and moves the elements into the buffer of the corresponding bucket; when a buffer runs full, it is written back into the processor's stripe.
In the subsequent block permutation phase, a prefix sum operation is first performed to calculate the boundaries of the buckets.
To limit work contention, each processor is assigned a different primary bucket and two swap buffers, each holding one block.
In each step, if both swap buffers are empty, the processor decrements the read pointer of its primary bucket and reads the block at that position into one of the swap buffers.
Otherwise, the block remaining in the swap buffers has to be inserted into its destination bucket: the processor determines the destination from the block's contents, swaps the block with the one at the destination bucket's write pointer, and advances that pointer.
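To make the permutation concrete, the following is a heavily simplified sequential sketch: it follows displacement cycles with a single swap buffer, whereas the parallel algorithm uses two swap buffers per processor and atomically updated read and write pointers; the type Block and the function name permuteBlocks are illustrative assumptions.

    #include <utility>
    #include <vector>

    // Each block already knows its destination bucket (determined during
    // classification). A prefix sum over the block counts yields the
    // bucket boundaries; blocks are then rearranged in place so that
    // bucket b occupies [start[b], start[b+1]).
    struct Block { int bucket; /* element payload elided */ };

    void permuteBlocks(std::vector<Block>& blocks, int p) {
        std::vector<int> start(p + 1, 0);
        for (const Block& blk : blocks) start[blk.bucket + 1]++;
        for (int b = 0; b < p; ++b) start[b + 1] += start[b];

        std::vector<int> next(start.begin(), start.end() - 1); // write ptrs
        for (int b = 0; b < p; ++b) {
            while (next[b] < start[b + 1]) {
                Block tmp = blocks[next[b]];   // read into the swap buffer
                // Follow the displacement cycle, dropping each evicted
                // block at its own bucket's write pointer, until a block
                // destined for bucket b comes back.
                while (tmp.bucket != b) {
                    std::swap(tmp, blocks[next[tmp.bucket]++]);
                }
                blocks[next[b]++] = tmp;
            }
        }
    }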