It does so efficiently in terms of execution time, memory usage, and inter-processor communication.
[2] It is employed in the scheduler for the Cilk programming language,[3] the Java fork/join framework,[4] the .NET Task Parallel Library,[5] and the Rust Tokio runtime.
Forks produce multiple logically parallel computations, variously called "threads"[2] or "strands".
[9][note 1] As an example, consider the trivial fork–join program sketched below in Cilk-like syntax. The function call f(1, 2) gives rise to a computation graph in which the sequential pieces of computation are the edges and the fork and join points are the nodes.
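The following sketch shows the kind of program meant here; only the top-level call f(1, 2) is taken from the text, while the helper functions g and h and their bodies are illustrative:

    function f(a, b):
        c ← fork g(a)
        d ← h(b)
        join
        return c + d

    function g(a):
        return a × 2

    function h(a):
        b ← fork g(a)
        c ← a + 1
        join
        return b + c

Each fork starts a computation that may run in parallel with the rest of the caller, and each join waits for the forked work to finish before its result is used.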
The work of a scheduler, now, is to assign the computations (edges) to processors in a way that makes the entire computation run to completion in the correct order (as constrained by the join nodes), preferably as fast as possible.
The randomized version of the work stealing algorithm presented by Blumofe and Leiserson maintains several threads of execution and schedules these onto a fixed set of processors, each of which keeps a deque (double-ended queue) of threads that are ready to run. A processor executes one thread at a time: when that thread spawns a new thread, the parent is pushed onto the bottom of the processor's own deque and the processor switches to the new thread; when the current thread stalls or dies, the processor pops a replacement from the bottom of its own deque. Any processor that becomes idle, because its own deque is empty, starts the actual process of work stealing, which means the following: it picks another processor uniformly at random and, if that processor's deque is non-empty, removes the topmost thread from that deque and starts executing it; otherwise it repeats the attempt with a new randomly chosen victim. Note that, in the rule for spawn, Blumofe and Leiserson suggest that the "parent" thread execute its new thread, as if performing a function call (in the C-like program f(x); g(y);, the function call to f completes before the call to g is performed). This strategy is known as "continuation stealing", because it is the parent's continuation that remains in the deque and can be stolen; the alternative, in which the newly spawned child is put on the deque while the parent continues to run, is known as "child stealing".
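In rough pseudocode, in the same style as the example above, the behaviour of a single processor under these rules can be sketched as follows (termination detection and the re-enabling of stalled threads are omitted):

    // executed by every processor; "deque" denotes this processor's own deque
    loop:
        if a current thread is assigned:
            run it until it spawns, stalls, or dies
            on spawn:           push the current thread on the bottom of deque and switch to the new thread
            on stall or death:  pop a thread from the bottom of deque, if there is one
        else:
            // idle: attempt to steal work
            victim ← another processor chosen uniformly at random
            if victim's deque is non-empty:
                remove the topmost thread from victim's deque and start executing it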
[8] Child stealing is used by Threading Building Blocks, Microsoft's Task Parallel Library and OpenMP, although the latter gives the programmer control over which strategy is used.
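The distinction can be made concrete with the two-statement fragment used above (illustrative only):

    fork f(x)
    g(y)
    // Continuation stealing: the processor runs f(x) first; the parent's continuation,
    // the pending call to g(y), is what sits in the deque and may be stolen.
    // Child stealing: the spawned call f(x) is put in the work queue for possible
    // stealing, and the processor proceeds immediately with g(y).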
The randomized variant due to Blumofe and Leiserson executes a parallel computation in expected time T1/P + O(T∞) on P processors, where T1 is the work, the time the computation needs on a single processor, and T∞ is the span, the time it needs on a machine with an unlimited number of processors.
[10] A localized variant, in which a processor attempts to steal back its own work whenever it is free, has also been analyzed theoretically and practically.
In terms of space, a computation scheduled this way uses at most S1·P stack space across the processors, where S1 is the stack usage of the same computation on a single processor,[2] fitting the authors' own earlier definition of space efficiency.
[13] This bound requires continuation stealing; in a child-stealing scheduler it does not hold, as can be seen from the example sketched below, a loop that forks n calls to f and then joins.[8] In a child-stealing implementation, all "forked" calls to f are put in a work queue that thus grows to size n, which can be made arbitrarily large.
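A minimal sketch of such a loop, in the same pseudocode style as the example above (the loop bound n and the function f are placeholders):

    for i = 1 to n:
        fork f(i)
    join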
[14][15] A variant of work stealing has been devised for the multiprogrammed setting, in which the number of processors available to a computation may change during its execution; it runs a computation in expected time O(T1/PA + T∞·P/PA), where PA is the average number of processors actually allotted to the computation while it runs.
[16] The multiprogramming work stealer differs from the traditional version in two respects: its deques are non-blocking, because relying on locks is unsafe when the operating system can preempt the worker holding a lock at any moment, and before every steal attempt an idle worker yields its processor back to the operating system to avoid starving other jobs. Attempts to improve on the multiprogramming work stealer have focused on cache locality issues[12] and improved queue data structures.
[17] Several scheduling algorithms for dynamically multithreaded computations compete with work stealing.
Besides the traditional work sharing approach, there is a scheduler called parallel depth-first (PDF) that improves on the space bounds of work stealing,[18] as well as giving better performance in some situations where the cores of a chip multiprocessor share a cache.