Collective operation

Collective operations are building blocks for interaction patterns, that are often used in SPMD algorithms in the parallel programming context.

A realization of the collective operations is provided by the Message Passing Interface[1] (MPI).

The broadcast pattern[3] is used to distribute data from one processing unit to all processing units, which is often needed in SPMD parallel programs to dispense input or global values.

One possibility is to utilize a binomial tree structure with the requirement that

The packets are then broadcast one after another, so that data is distributed fast in the communication network.

Pipelined broadcast on balanced binary tree is possible in

The reduce pattern[4] is used to collect data or partial results from different processing units and to combine them into a global result by a chosen operator.

Some algorithms require a commutative operator with a neutral element.

For pipelining on binary trees the message must be representable as a vector of smaller object for component-wise reduction.

Pipelined reduce on a balanced binary tree is possible in

For long messages a corresponding implementation is suitable, whereas for short messages, the latency can be reduced by using a hypercube (Hypercube (communication pattern) § All-Gather/ All-Reduce) topology, if

All-reduce can also be implemented with a butterfly algorithm and achieve optimal latency and bandwidth.

All-reduce implemented with a butterfly algorithm achieves the same asymptotic runtime.

The prefix-sum or scan operation[7] is used to collect data or partial results from different processing units and to compute intermediate results by an operator, which are stored on those processing units.

must be at least associative, whereas some algorithms require also a commutative operator and a neutral element.

In the case of the so-called exclusive prefix sum, processing unit

For long messages, the hypercube (Hypercube (communication pattern) § Prefix sum, Prefix sum § Distributed memory: Hypercube algorithm) topology is not suitable, since all processing units are active in every step and therefore pipelining can't be used.

Prefix-sum on a binary tree can be implemented with an upward and downward phase.

In the upward phase reduction is performed, while the downward phase is similar to broadcast, where the prefix sums are computed by sending different data to the left and right children.

Pipelined prefix sum on a binary tree is possible in

Barrier is thus used to achieve global synchronization in distributed computing.

Compare this to reduce where message size is a constant for operators like

It differs from broadcast, in that it does not send the same message to all processing units.

Instead it splits the message and delivers one part of it to each processing unit.

Assuming we have a fully connected network, the best possible runtime for all-to-all is in

.This table[12] gives an overview over the best known asymptotic runtimes, assuming we have free choice of network topology.

For each operation, the optimal algorithm can depend on the input sizes

For example, broadcast for short messages is best implemented using a binomial tree whereas for long messages a pipelined communication on a balanced binary tree is optimal.

Sanders, Peter; Mehlhorn, Kurt; Dietzfelbinger, Martin; Dementiev, Roman (2019).

Sequential and Parallel Algorithms and Data Structures - The Basic Toolbox.

There are three squares vertically aligned on the left and three squares vertically aligned on the right. A dotted line connects the high left and high right square. Two solid lines connect the high left square and the middle and low right square. The letter a is written in the high left square and in all right squares.
Information flow of Broadcast operation performed on three nodes.
There are three squares vertically aligned on the left and three squares vertically aligned on the right. A circle with the letter f inside is placed between the two columns. Three solid lines connect the circle with the left three squares. One solid line connects the circle and the high right square. The letters a, b and c are written in the left squares from high to low. The letter alpha is written in the top right square.
Information flow of Reduce operation performed on three nodes. f is the associative operator and α is the result of the reduction.
There are three squares vertically aligned on the left and three squares vertically aligned on the right. A circle with the letter f inside is placed between the two columns. Three solid lines connect the circle with the left three squares. One solid line connects the circle and the high right square. The letters a, b and c are written in the left squares from high to low. The letter alpha is written in the top right square.
Information flow of All-Reduce operation performed on three nodes. f is the associative operator and α is the result of the reduction.
There are three squares vertically aligned on the left and three rectangles vertically aligned on the right. A circle with the word scan inside is placed between the two columns. Three solid lines connect the circle with the left three squares. Three solid lines connect the circle with the three right square. The letters a, b and c are written in the left squares from high to low. In the high right square the letter a is written. In the mid right square the term a plus b is written. In the low right square the term a plus b plus c is written.
Information flow of Prefix-Sum/Scan operation performed on three nodes. The operator + can be any associative operator.
There are three squares vertically aligned on the left and three rectangles vertically aligned on the right. A dotted line connects the high left square with the high right rectangle. Two solid lines connect the mid and low left squares with the high right rectangle. The letters a, b and c are written in the left squares from high to low. The letters a, b and c are written in the high right rectangle in a row.
Information flow of Gather operation performed on three nodes.
There are three squares vertically aligned on the left and three rectangles vertically aligned on the right. Three dotted lines connect the high left square with the high right rectangle, the mid left square with the mid right rectangle and the low left square with the low right rectangle. Two solid lines connect the mid and low left squares with the high right rectangle. Two solid lines connect the high and low left squares with the mid right rectangle. Two solid lines connect the high and mid left squares with the low right rectangle. The letters a, b and c are written in the left squares from high to low. The letters a, b and c are written in all right rectangles in a row.
Information flow of All-Gather operation performed on three nodes.
There are three rectangles vertically aligned on the left and three squares vertically aligned on the right. A dotted line connects the high left rectangle with the high right square. Two solid lines connect the high left rectangle with the mid and low right squares. The letters c, b and a are written in the high left rectangle in a row. The letters a, b and c are written in the right right squares from high to low.
Information flow of Scatter operation performed on three nodes.
There are three rectangles vertically aligned on the left and three rectangles vertically aligned on the right. The rectangles are three time higher as wide. The terms a1, a2 and a3 are written in the high left rectangle one below the other. The terms b1, b2 and b3 are written in the mid left rectangle one below the other. The terms c1, c2 and c3 are written in the low left rectangle one below the other. The terms a1, b1 and c1 are written in the high right rectangle one below the other. The terms a2, b2 and c2 are written in the mid right rectangle one below the other. The terms a3, b3 and c3 are written in the low right rectangle one below the other. A dotted line connects a1 from the high left rectangle and a1 from the high right rectangle. A dotted line connects b2 from the mid left rectangle and b2 from the mid right rectangle. A dotted line connects c3 from the low left rectangle and c3 from the low right rectangle. Solid lines connect the other corresponding terms between the left and right rectangles.
Information flow of All-to-All operation performed on three nodes. Letters indicate nodes and numbers indicate information items.