All-to-all (parallel pattern)

In parallel computing, all-to-all (also known as index operation or total exchange) is a collective operation, where each processor sends an individual message to every other processor.

The number of communication rounds and the overall communication volume are measures to evaluate the quality of an all-to-all algorithm.

Optimum for both these measures can not be achieved simultaneously.

[1] Depending on the network topology (fully connected, hypercube, ring), different all-to-all algorithms are required.

The way the data is routed through the network depends on its underlying topology.

We take a look at all-to-all algorithms for common network topologies.

A hypercube is a network topology, where two processors share a link, if the hamming distance of their indices is one.

The idea of an all-to-all algorithm is to combine messages belonging to the same subcube, and then distribute them.

An all-to-all algorithm in a ring topology is very intuitive.

Initially a processor sends a message of size m(p-1) to one of its neighbors.

When a processor receives a message, it extracts the part that belongs to it and forwards the remainder of the message to the next neighbor.

After (p-1) communication rounds, every message is distributed to its destination.

This way, messages arrive earlier at their destination.

An all-to-all algorithm in a mesh consists of two communication phases.

Messages are in the same group, if their destined processors share the same row.

After another all-to-all operation, this time in respect to columns, each processor ends up with its messages.

Additionally, time for the local rearrangement of messages adds to the overall runtime of the algorithm.

A trivial algorithm, is to send (p-1) asynchronous messages into the network for each processor.

The performance of this algorithm is poor, which is due to congestion arising because of the bisection width of the network.

[3] More sophisticated algorithms combine messages to reduce the number of send operations and try to control congestion.

It is faster to send messages directly to their destination.

In case p is odd, one processor is idle in each iteration.

Depending on the underlying topology of the network, one approach might be superior to the other.

The exclusive or approach is superior, when performing pairwise one-to-one routings in a hypercube or fat-tree.