In computer science, the reduction operator[1] is a type of operator that is commonly used in parallel programming to reduce the elements of an array into a single result.
[2][3][4] The reduction of sets of elements is an integral part of programming models such as MapReduce, where a reduction operator is applied (mapped) to all elements before they are reduced.
Many reduction operators can be used for broadcasting to distribute data to all processors.
A reduction operator stores the result of the partial tasks in a private copy of the variable; the partial results are then combined into a final result, which has to be stored at a specified root processor at the end of the execution.
An optimal sequential linear-time algorithm for reduction applies the operator successively from front to back, always replacing two vectors with the result of the operation applied to their corresponding elements, thus creating an instance that has one vector less.
Using a binary tree reduction would allow 4 cores to combine disjoint pairs of operands simultaneously, halving the number of partial results in every parallel step.
The commutativity of the reduction operator would be important if there were a master core distributing work to several processors, since then the results could arrive back at the master core in any order.
However, note that matrix multiplication is associative, and therefore the result would be correct as long as the proper ordering were enforced, as in the binary tree reduction technique.
The two models have different implications for the time complexity; therefore, two algorithms will be shown.
This algorithm represents a widely spread method to handle inputs where the number of processing units is a power of two.
In every iteration, half of the processing units become inactive and do not contribute to further computations.
The figure shows a visualization of the algorithm using addition as the operator. The input values are not preserved, as the fields are overwritten and reused for previously evaluated expressions.
For an Allreduce operation the result has to be distributed to all processing units, which can be done by appending a broadcast from the root.
The efficiency suffers because half of the active processing units become inactive after each step, so only a shrinking fraction of the units performs useful work in later steps.
In contrast to the PRAM-algorithm, in the distributed memory model, memory is not shared between processing units and data has to be exchanged explicitly between processing units.
The only difference between the distributed algorithm and the PRAM version is the inclusion of explicit communication primitives; the operating principle stays the same.
A simple analysis for the algorithm uses the BSP-model and incorporates the startup time needed to initiate a communication and the time needed to transmit a byte.
For distributed memory models, it can make sense to use pipelined communication.
Usually, linear pipelines split data or tasks into smaller pieces and process them in stages.
In contrast to the binomial tree algorithms, the pipelined algorithm uses the fact that the vectors are separable, so the operator can be evaluated for single elements.[10] It is important to note that the send and receive operations have to be executed concurrently for the algorithm to work.
The associated animation shows an execution of the algorithm on vectors of size four with five processing units.
It takes p − 1 steps until the last processing unit receives its first element, and m − 1 additional steps until all m elements of the result have arrived there.
Because the startup latency of a message has a fixed value, it is possible to logically group elements of a vector together and reduce the number of messages sent, amortizing the startup cost over larger blocks.
Reduction is one of the main collective operations implemented in the Message Passing Interface, where the performance of the algorithm used is important and is evaluated constantly for different use cases.
[11] Operators can be used as parameters for MPI_Reduce and MPI_Allreduce, with the difference that the result is available at one (root) processing unit or at all of them, respectively.
OpenMP offers a reduction clause for describing how the results from parallel operations are collected together.
[12] MapReduce relies heavily on efficient reduction algorithms to process big data sets, even on huge clusters.
[13][14] Some parallel sorting algorithms use reductions to be able to handle very big data sets.