[3][4][5] Abstractly, a prefix sum requires only a binary associative operator ⊕, making it useful for many applications from calculating well-separated pair decompositions of points to string processing.
An inclusive scan includes input xi when computing output yi (i.e.,
In the latter case, implementations either leave y0 undefined or accept a separate "x−1" value with which to seed the scan.
[8] The following table lists examples of the inclusive and exclusive scan functions provided by a few programming languages and libraries: The directive-based OpenMP parallel programming model supports both inclusive and exclusive scan support beginning with Version 5.0.
Hillis and Steele present the following parallel prefix sum algorithm:[9] In the above, the notation
[3][11][12] If the input sequence has n steps, then the recursion continues to a depth of O(log n), which is also the bound on the parallel running time of this algorithm.
[13] The idea of building in hardware a functional unit dedicated to computing multi-parameter prefix-sum was patented by Uzi Vishkin.
More specifically, multiple algorithms exist which are adapted for platforms working on shared memory as well as algorithms which are well suited for platforms using distributed memory, relying on message passing as the only form of interprocess communication.
A version of this algorithm is implemented in the Multi-Core Standard Template Library (MCSTL),[15][16] a parallel implementation of the C++ standard template library which provides adapted versions for parallel computing of various algorithms.
Improvement: In case that the number of blocks are too much that makes the serial step time-consuming by deploying a single processor, the Hillis and Steele algorithm can be used to accelerate the second phase.
The Hypercube Prefix Sum Algorithm[17] is well adapted for distributed memory platforms and works with the exchange of messages between the processing elements.
processor elements (PEs) participating in the algorithm equal to the number of corners in a d-dimensional hypercube.
Throughout the algorithm, each PE is seen as a corner in a hypothetical hyper cube with knowledge of the total prefix sum
of two adjacent PEs in different hyper cubes can be exchanged in both directions in one communication step, this means
The infix numeration ensures that for any given PEj, the indices of all nodes reachable by its left subtree
This allows for the following reasoning: Note the distinction between subtree-local and total prefix sums.
which is favourable for large message sizes n. The algorithm can further be optimised by making use of full-duplex or telephone model communication and overlapping the upward and the downward phase.
[1] List ranking, the problem of transforming a linked list into an array that represents the same sequence of items, can be viewed as computing a prefix sum on the sequence 1, 1, 1, ... and then mapping each item to the array position given by its prefix sum value; by combining list ranking, prefix sums, and Euler tours, many important problems on trees may be solved by efficient parallel algorithms.
By using a circuit that performs the operations of the parallel prefix sum algorithm, it is possible to design an adder that uses O(n) logic gates and O(log n) time steps.
By means of a sorting network, a set of parallel memory access requests can be ordered into a sequence such that accesses to the same cell are contiguous within the sequence; scan operations can then be used to determine which of the accesses succeed in writing to their requested cells, and to distribute the results of memory read operations to multiple processors that request the same result.
[24] In the construction of Gray codes, sequences of binary values with the property that consecutive sequence values differ from each other in a single bit position, a number n can be converted into the Gray code value at position n of the sequence simply by taking the exclusive or of n and n/2 (the number formed by shifting n right by a single bit position).
A prefix sum of this type may be performed efficiently using the bitwise Boolean operations available on modern computers, by computing the exclusive or of x with each of the numbers formed by shifting x to the left by a number of bits that is a power of two.
In particular, it can be used to compute the divided difference coefficients of the Newton form of the interpolation polynomial.
[26] This prefix based approach can also be used to obtain the generalized divided differences for (confluent) Hermite interpolation as well as for parallel algorithms for Vandermonde systems.
This allows parallel prefix algorithms to be applied to compute the filtering and smoothing solutions.
[30][31] Here, the idea is that we can define an associative operator for a combination of conditional value functions (conditioned on the end-point), and the prefix sums of this operator give solutions to the Bellman equations or HJB equations.
The algorithms uses an array of weights representing the amount of work required for each item.
After the prefix sum is calculated, the work item i is sent for processing to the processor unit with the number [ prefixSumValuei/ totalWork / numberOfProcessors ].
[32] Graphically this corresponds to an operation where the amount of work in each item is represented by the length of a linear segment, all segments are sequentially placed onto a line and the result cut into number of pieces, corresponding to the number of the processors.
[33] Below is a lookup table of quarter squares with the remainder discarded for the digits 0 through 18; this allows for the multiplication of numbers up to 9×9.