Polyphase merge sort

In general, for an even number N of working files, each iteration of a balanced merge sort reduces run count by a factor of N/2, while for an odd number N of working files, each iteration reduces the run count by an average factor of √(N2−1)/4 = √N2−1/2.

For N < 8 working files, a polyphase merge sort achieves a higher effective run count reduction factor by unevenly distributing sorted runs between N−1 working files (explained in next section).

The initial distribution is set up so that only one input working file is emptied at a time, except for the final merge iteration which merges N−1 single runs (of varying size, this is explained next) from the N−1 input working files to the single output file, resulting in a single sorted run, the sorted dataset.

For a dataset consisting of 57 runs of 1 record each, then after the initial distribution, polyphase merge sort moves 232 records during the 6 iterations it takes to sort the dataset, for an overall reduction factor of 2.70 (this is explained in more detail later).

[3] It is easiest to look at the polyphase merge starting from its ending conditions and working backwards.

Notice that file 2 is not completely consumed—it has one run left to match the final merge (above).

Looking at the perfect 3 file case, the number of runs for merged working backwards: 1, 1, 2, 3, 5, ... reveals a Fibonacci sequence.

For everything to work out optimally, the last merge phase should have exactly one run on each input file.

In practice, the input file will not have the exact number of runs needed for a perfect distribution.

Another approach is to emulate dummy runs as needed during the merge operations.

[4] "Optimal" distribution algorithms require knowing the number of runs in advance.

Otherwise, in the more common case where the number of runs is not known in advance, "near optimal" distribution algorithms are used.

[5] If the number of runs is known in advance, only a partial distribution is needed before starting the merge phases.

To explain how the reduction factor is related to sort performance, the reduction factor equations are: Using the run move count equation for the above examples: Here is a table of effective reduction factors for polyphase and balanced merge sort listed by number of files, based on actual sorts of a few million records.