Merge algorithm

This can be done by copying the sub-arrays into a temporary array, then applying the merge algorithm above.

[1] The allocation of a temporary array can be avoided, but at the expense of speed and programming ease.

[6] It can be improved by storing the lists in a priority queue (min-heap) keyed by their first element: Searching for the next smallest element to be output (find-min) and restoring heap order can now be done in O(log k) time (more specifically, 2⌊log k⌋ comparisons[6]), and the full problem can be solved in O(n log k) time (approximately 2n⌊log k⌋ comparisons).

The following pseudocode demonstrates this algorithm in a parallel divide-and-conquer style (adapted from Cormen et al.[7]: 800 ).

The algorithm operates by splitting either A or B, whichever is larger, into (nearly) equal halves.

Hybrid approach, where serial algorithm is used for recursion base case has been shown to perform well in practice [8] The work performed by the algorithm for two arrays holding a total of n elements, i.e., the running time of a serial version of it, is O(n).

This is optimal since n elements need to be copied into C. To calculate the span of the algorithm, it is necessary to derive a Recurrence relation.

cost of the Binary Search, we obtain this recurrence as an upper bound:

, meaning that it takes that much time on an ideal machine with an unbounded number of processors.

There are also algorithms that introduce parallelism within a single instance of merging of two sorted lists.

These can be used in field-programmable gate arrays (FPGAs), specialized sorting circuits, as well as in modern processors with single-instruction multiple-data (SIMD) instructions.

Existing parallel algorithms are based on modifications of the merge part of either the bitonic sorter or odd-even mergesort.

[9] In 2018, Saitoh M. et al. introduced MMS [10] for FPGAs, which focused on removing a multi-cycle feedback datapath that prevented efficient pipelining in hardware.

Also in 2018, Papaphilippou P. et al. introduced FLiMS [9] that improved the hardware utilization and performance by only requiring

pipeline stages of P/2 compare-and-swap units to merge with a parallelism of P elements per FPGA cycle.

Some computer languages provide built-in or library support for merging sorted collections.

The type of the elements merged must support the less-than (<) operator, or it must be provided with a custom comparator.