Bulk synchronous parallel

In the BSP model, communication and synchronization are not taken for granted: quantifying the requisite synchronization and communication is an important part of analyzing a BSP algorithm.

The BSP model was developed by Leslie Valiant of Harvard University during the 1980s.[1]

Between 1990 and 1992, Leslie Valiant and Bill McColl of Oxford University worked on ideas for a distributed-memory BSP programming model, in Princeton and at Harvard.

Between 1992 and 1997, McColl led a large research team at Oxford that developed various BSP programming libraries, languages and tools, as well as numerous massively parallel BSP algorithms, including many early examples of high-performance communication-avoiding parallel algorithms[2] and recursive "immortal" parallel algorithms that achieve the best possible performance and optimal parametric tradeoffs.[3]

With interest and momentum growing, McColl then led a group from Oxford, Harvard, Florida, Princeton, Bell Labs, Columbia and Utrecht that developed and published the BSPlib standard for BSP programming in 1996.[5]

In 2017, McColl developed a major new extension of the BSP model that provides fault tolerance and tail tolerance for large-scale parallel computations in AI, analytics and high-performance computing (HPC).[6]

A BSP computer consists of the following: components capable of processing and/or local memory transactions (i.e., processors); a network that routes messages between pairs of such components; and a hardware facility that allows for the synchronization of all or a subset of components.[7]

This is commonly interpreted as a set of processors that may follow different threads of computation, with each processor equipped with fast local memory and interconnected by a communication network.

BSP algorithms rely heavily on the third feature; a computation proceeds in a series of global supersteps. A superstep consists of three components: concurrent computation, in which every participating processor performs local computations using only values stored in its fast local memory; communication, in which the processes exchange data among themselves; and barrier synchronization, in which each process, on reaching the barrier, waits until all other processes have reached it. The computation and communication actions do not have to be ordered in time.
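The superstep structure can be sketched with threads sharing a barrier. This is an illustrative simulation (the names `superstep`, `inbox`, and the ring communication pattern are my own, not part of any BSP library):

```python
import threading

P = 4                               # number of simulated processors
barrier = threading.Barrier(P)      # stands in for the hardware barrier facility
inbox = [[] for _ in range(P)]      # messages visible after the barrier
data = [1, 2, 3, 4]
results = [0] * P

def superstep(pid):
    # 1. Concurrent computation: local work on local data only.
    local = data[pid] * 2
    # 2. Communication: a one-sided "put" into the right neighbour's inbox.
    inbox[(pid + 1) % P].append(local)
    # 3. Barrier synchronization: wait until every process arrives;
    #    only then is the communicated data safe to read.
    barrier.wait()
    results[pid] = sum(inbox[pid])

threads = [threading.Thread(target=superstep, args=(i,)) for i in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the barrier, each process sees the value put by its left neighbour, so `results` is `[8, 2, 4, 6]`: within a superstep, computation and communication may interleave freely, and only the barrier imposes a global order.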

Communication typically takes the form of the one-sided PUT and GET remote direct memory access (RDMA) calls rather than paired two-sided send and receive message-passing calls.

Systems based on two-sided communication include this synchronization cost implicitly for every message sent.

The barrier synchronization method relies on the BSP computer's hardware facility.

In Valiant's original paper, this facility periodically checks whether the end of the current superstep has been reached globally.[1]

The BSP model is also well suited to automatic memory management for distributed-memory computing through over-decomposition of the problem and oversubscription of the processors.

This strategy can be shown statistically to lead to almost perfect load balancing, both of work and communication.
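The statistical effect of over-decomposition can be illustrated by randomly placing many small tasks on a few processors. This is a toy simulation (not any particular runtime's scheduler); with many more tasks than processors, the most-loaded processor ends up close to the average:

```python
import random
random.seed(0)  # fixed seed so the toy run is reproducible

def max_over_mean(tasks, procs):
    """Randomly place `tasks` unit-work items on `procs` processors and
    return the ratio of the heaviest load to the average load (1.0 = perfect)."""
    load = [0] * procs
    for _ in range(tasks):
        load[random.randrange(procs)] += 1
    return max(load) / (tasks / procs)

coarse = max_over_mean(16, 8)     # barely decomposed: imbalance likely
fine = max_over_mean(16000, 8)    # heavily over-decomposed: near-perfect balance
```

As the decomposition gets finer, `fine` approaches 1.0 while `coarse` typically does not, which is the statistical load-balancing effect described above.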

In many parallel programming systems, communications are considered at the level of individual actions, such as sending and receiving a message or memory-to-memory transfer.

This is difficult to work with since there are many simultaneous communication actions in a parallel program, and their interactions are typically complex.

In particular, it is difficult to say much about the time any single communication action will take to complete.

The BSP model considers communication actions en masse.

This has the effect that an upper bound on the time taken to communicate a set of data can be given.

The maximum number of incoming or outgoing messages for a superstep is denoted by h.

The ability of a communication network to deliver data is captured by a parameter g, defined such that it takes time hg for a processor to deliver h messages of size 1.
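The standard BSP cost of a superstep combines these parameters with the local work w and the barrier synchronization cost l, as w + hg + l taken over the slowest processor. A small sketch (the function name and the numeric values are illustrative):

```python
def superstep_cost(work, h_in, h_out, g, l):
    """Standard BSP cost of one superstep: the slowest processor's local
    work w, plus h*g for communication (h = largest number of messages
    entering or leaving any one processor), plus the barrier cost l."""
    w = max(work)                                   # slowest local computation
    h = max(max(h_in), max(h_out))                  # largest communication degree
    return w + h * g + l

# Illustrative values: 4 processors; g and l are machine parameters.
cost = superstep_cost(work=[100, 90, 80, 100],
                      h_in=[5, 3, 2, 4], h_out=[4, 5, 1, 2],
                      g=4, l=50)
# w = 100, h = 5, so cost = 100 + 5*4 + 50 = 170
```

The cost of a whole algorithm is then the sum of its superstep costs, which is what makes the communication bound "en masse" rather than per-message.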

The one-sided communication of the BSP model requires barrier synchronization.

Barriers also permit novel forms of fault tolerance.

There is a large body of literature on removing synchronization points from existing algorithms in the context of BSP computing and beyond.

This drives the cost of global synchronization, compared to the minimally required latency of communication, to zero.[8]

Yet this minimal latency is expected to increase further on future supercomputer architectures and network interconnects; the BSP model, along with other models for parallel computation, will require adaptation to cope with this trend.

Interest in BSP has soared, with Google adopting it as a major technology for graph analytics at massive scale via Pregel and MapReduce.

The model has also been used in the creation of a number of new programming languages and interfaces, such as Bulk Synchronous Parallel ML (BSML), BSPlib, Apache Hama,[9] and Pregel.[12]

Modern implementations include BSPonMPI[13] (which simulates BSP on top of the Message Passing Interface) and MulticoreBSP[14][15] (a novel implementation targeting modern shared-memory architectures).

MulticoreBSP for C is especially notable for its capability of starting nested BSP runs, thus allowing for explicit Multi-BSP programming.

Figure: A BSP superstep. Processes lack a linear order and may be mapped to processors in any way.