Loop-level parallelism

Where a sequential program will iterate over the data structure and operate on indices one at a time, a program exploiting loop-level parallelism will use multiple threads or processes which operate on some or all of the indices at the same time.

Such parallelism provides a speedup in the overall execution time of the program, typically in line with Amdahl's law.

However, many algorithms are designed to run sequentially, and fail when parallel processes race due to dependence within the code.

Sequential algorithms are sometimes applicable to parallel contexts with slight modification.

Consider the following code operating on a list L of length n. Each iteration of the loop takes the value from the current index of L, and increments it by 10.
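The original listing is not reproduced here; a minimal sketch of such a loop, assuming L is an int array of length n (the C form is an assumption, not the source's code), looks like this:

    void add_ten(int *L, int n) {
        for (int i = 0; i < n; ++i) {
            L[i] = L[i] + 10;   /* S1: each iteration touches only its own index */
        }
    }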

Now, consider a system with p processors where p > n. If a single iteration (statement S1) takes time T(S1), sequential execution takes n * T(S1); if n threads run in parallel, one per iteration, the time to execute all n steps is reduced to T(S1). Less simple cases, however, can produce inconsistent, i.e. non-serializable, outcomes.

Consider the following loop operating on the same list L. Each iteration sets the current index to be the value of the previous plus ten.
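Again as a sketch under the same assumptions:

    void chain_add_ten(int *L, int n) {
        for (int i = 1; i < n; ++i) {
            L[i] = L[i - 1] + 10;   /* S1: reads the value written by the previous iteration */
        }
    }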

With multiple threads, process scheduling and other considerations make it impossible to guarantee that an iteration executes only after the iteration it depends on has completed.

Serializability can be restored by adding synchronization to preserve the dependence on previous iterations.

Anti-dependence and output dependence can be dealt with by giving each process its own copy of variables (known as privatization).[1]
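As a minimal sketch (the loop and variable names are assumptions, not from the source), privatizing a shared temporary might look like this:

    /* Shared temporary x creates anti- and output dependences between
       iterations, even though the iterations' results are independent. */
    void shared_temp(const int *a, int *b, int n) {
        int x;
        for (int i = 0; i < n; ++i) {
            x = a[i] * a[i];    /* every iteration writes the same x */
            b[i] = x + 1;       /* ... and then reads it back        */
        }
    }

    /* Privatized version: each iteration gets its own copy of x, removing
       the false dependences so the iterations can run in parallel. */
    void private_temp(const int *a, int *b, int n) {
        for (int i = 0; i < n; ++i) {
            int x = a[i] * a[i];
            b[i] = x + 1;
        }
    }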

As illustrated in the sketch below, S2 ->T S3 denotes a true (flow) dependence from S2 to S3, because S2 writes to the variable a, which S3 then reads.
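A hypothetical fragment matching that description (the concrete values are illustrative only):

    void true_dependence(void) {
        int a, b;        /* S1 */
        a = 2;           /* S2: writes a              */
        b = a + 40;      /* S3: reads a, so S2 ->T S3 */
        (void)b;
    }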

Likewise, S2 ->A S3 denotes an anti-dependence from S2 to S3, because S2 reads from the variable b before S3 writes to it.
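And a hypothetical fragment for the anti-dependence (values again illustrative):

    void anti_dependence(void) {
        int a, b = 40;   /* S1 */
        a = b - 38;      /* S2: reads b                */
        b = -1;          /* S3: writes b, so S2 ->A S3 */
        (void)a; (void)b;
    }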

When a loop contains only loop-independent dependences, each iteration may be treated as a block and performed in parallel without other synchronization efforts.

In the following example code, used for swapping the values of two arrays of length n, there is a loop-independent dependence of S1 ->T S3.
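Since the original listing is not reproduced here, the following sketch reconstructs it under assumptions; only the statement labels S1-S3 and the S1 ->T S3 dependence come from the text:

    void swap_arrays(int *a, int *b, int n) {
        for (int i = 0; i < n; ++i) {
            int tmp = a[i];   /* S1: writes tmp                                     */
            a[i] = b[i];      /* S2                                                 */
            b[i] = tmp;       /* S3: reads tmp written by S1 in the same iteration, */
        }                     /*     hence the loop-independent S1 ->T S3           */
    }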

There are several approaches to parallelizing loops, including distributed loops, DOALL parallelism, DOACROSS parallelism, and DOPIPE parallelism, discussed below; each implementation varies slightly in how threads synchronize, if at all.

In addition, parallel tasks must somehow be mapped to a process.

Research has shown that load-balancing can be better achieved through some dynamic allocation algorithms than when done statically.[4]

The process of parallelizing a sequential program can be broken down into a series of discrete steps: decomposing the program into tasks, assigning tasks to processes, orchestrating data access, communication, and synchronization among the processes, and mapping processes to processors.

When a loop carries a dependence, one way to parallelize it is to distribute the loop into several different loops: statements that are not dependent on each other are separated so that the distributed loops can be executed in parallel, as sketched below.
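As a sketch (the statements and array names are assumptions, not from the source), loop distribution might look like this:

    /* Original loop: S1 carries a dependence across iterations through a[i-1],
       while S2 does not depend on S1. */
    void fused_loop(int *a, const int *b, int *c, const int *d, int n) {
        for (int i = 1; i < n; ++i) {
            a[i] = a[i - 1] + b[i];   /* S1 */
            c[i] = c[i] + d[i];       /* S2 */
        }
    }

    /* Distributed form: the two loops no longer share any statements, so they
       can be handed to different threads or processors and run concurrently. */
    void distributed_loops(int *a, const int *b, int *c, const int *d, int n) {
        for (int i = 1; i < n; ++i)
            a[i] = a[i - 1] + b[i];   /* loop 1: still sequential internally (S1) */
        for (int i = 1; i < n; ++i)
            c[i] = c[i] + d[i];       /* loop 2: independent iterations (S2)      */
    }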

Executed sequentially, the original loop takes time n * (T(S1) + T(S2)). Because the two statements are now split into two different loops that can run concurrently, the execution time is reduced to roughly n * max(T(S1), T(S2)).

DOALL parallelism exists when statements within a loop can be executed independently (situations where there is no loop-carried dependence).
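A minimal sketch of such a loop (the statement is assumed, not quoted from the source):

    void doall_loop(int *a, const int *b, const int *c, int n) {
        for (int i = 0; i < n; ++i) {
            a[i] = b[i] + c[i];   /* S1: no loop-carried dependence */
        }
    }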

Because DOALL parallelism exists when all iterations are independent, speed-up may be achieved by executing all iterations in parallel, which gives an execution time of T(S1), the cost of a single iteration, rather than the sequential n * T(S1).

The following example, using a simplified pseudo code, shows how a loop might be parallelized to execute each iteration independently.[5]
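The article's pseudo code is not reproduced here; as one possible concrete realization (OpenMP is an assumption, not the original notation), the DOALL loop above can be parallelized like this:

    /* Compile with an OpenMP-capable compiler, e.g. gcc -fopenmp. */
    void doall_parallel(int *a, const int *b, const int *c, int n) {
        #pragma omp parallel for     /* iterations are divided among the available threads */
        for (int i = 0; i < n; ++i) {
            a[i] = b[i] + c[i];
        }
    }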

In DOACROSS parallelism, iterations of a loop that contains a loop-carried dependence are parallelized by extracting the calculations that can be performed independently of previous iterations and running them in parallel; synchronization exists to enforce the loop-carried dependence.
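The loop discussed below might look like the following sketch (array names assumed):

    void dependent_loop(int *a, const int *b, int n) {
        for (int i = 1; i < n; ++i) {
            a[i] = a[i - 1] + b[i] + 1;   /* reads a[i-1], written by the previous iteration */
        }
    }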

Each loop iteration performs two actions: calculating the value a[i-1] + b[i] + 1, and then assigning that value to a[i]. The calculation and the assignment can be decomposed into two statements, S1 and S2. The first statement, int tmp = b[i] + 1;, has no loop-carried dependence, so the loop can be parallelized by computing the temporary value in parallel and synchronizing only the assignment to a[i].
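A DOACROSS sketch of that decomposition follows. The post/wait-style synchronization is modelled with a shared atomic counter, and the per-iteration body is assumed to be executed by whichever thread owns iteration i; none of these names come from the source.

    #include <stdatomic.h>

    static atomic_int ready = 0;   /* index of the last element of a[] written; a[0] is assumed initialized */

    void doacross_body(int *a, const int *b, int i) {
        int tmp = b[i] + 1;                       /* S1: no loop-carried dependence, runs in parallel */
        while (atomic_load(&ready) < i - 1) { }   /* wait until iteration i-1 has written a[i-1]      */
        a[i] = a[i - 1] + tmp;                    /* S2: the serialized part                          */
        atomic_store(&ready, i);                  /* post: a[i] is now available to iteration i+1     */
    }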

DOPIPE parallelism implements pipelined parallelism for loops with a loop-carried dependence: the statements of each iteration are distributed over multiple, synchronized loops, each forming a stage of the pipeline.
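Under similar assumptions, a DOPIPE sketch splits the statements into two stages that run as separate, synchronized loops, e.g. on two processors:

    #include <stdatomic.h>

    static atomic_int produced = 0;   /* highest index of a[] completed by stage 1 */

    /* Stage 1 (e.g. processor 1): keeps the loop-carried dependence. */
    void stage1(int *a, const int *b, int n) {
        for (int i = 1; i < n; ++i) {
            a[i] = a[i - 1] + b[i];       /* S1 */
            atomic_store(&produced, i);   /* signal that a[i] is ready */
        }
    }

    /* Stage 2 (e.g. processor 2): consumes a[i] as soon as stage 1 produces it. */
    void stage2(int *c, const int *a, int n) {
        for (int i = 1; i < n; ++i) {
            while (atomic_load(&produced) < i) { }   /* wait for stage 1 */
            c[i] = c[i] + a[i];                      /* S2 */
        }
    }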