Automatic vectorization

Automatic vectorization, in parallel computing, is a special case of automatic parallelization, where a computer program is converted from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which processes one operation on multiple pairs of operands at once.

For example, modern conventional computers, including specialized supercomputers, typically have vector operations that perform several operations simultaneously, such as four additions at once (via SIMD or SPMD hardware). However, in most programming languages one typically writes loops that sequentially perform additions of many numbers.
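
As a minimal sketch of that contrast, the C fragment below writes the same element-wise addition twice: once as the usual sequential loop, and once with the body unrolled by four, which is the shape a vectorizer can map onto a single four-lane vector addition. The function names `add_scalar` and `add_by_four` are hypothetical, and the second version assumes that `n` is a multiple of 4.

```c
#include <stddef.h>

/* Scalar form: one addition per loop iteration. */
void add_scalar(int *c, const int *a, const int *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Unrolled by four: the four independent additions in the body are what a
   single vector add instruction would compute at once.
   Assumption of this sketch: n is a multiple of 4. */
void add_by_four(int *c, const int *a, const int *b, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
}
```

Both functions produce the same result; only the grouping of the work differs.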

Early computers usually had one logic unit, which executed one instruction on one pair of operands at a time.

Because modern hardware can perform many operations at once, many optimizing compilers perform automatic vectorization, where parts of sequential programs are transformed into parallel operations.

However, these transformations must be done safely, to ensure that the dependences between all statements remain true to the original program.

The correct vector instruction must be chosen based on the size and behavior of the internal integers.

Floating-point precision must be kept as well, unless IEEE-754 compliance is turned off, in which case operations will be faster but the results may vary slightly.

Suppose the vector size is the same as 4 ints. Using the dependency graph of the loop, the optimizer can then cluster the strongly connected components (SCC) and separate vectorizable statements from the rest.
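
The separation step can be sketched in C as follows. This is an illustration under stated assumptions, not the output of any particular compiler: the function names `fused` and `distributed`, the constant `VEC`, and the requirement that `a`, `b` and `c` do not overlap are all hypothetical. Statement S1 depends on its own previous iteration, so it forms a cycle in the dependency graph and stays scalar; statement S2 has no loop-carried dependence and can be processed four ints at a time.

```c
#include <stddef.h>

enum { VEC = 4 };  /* assumed vector width: 4 ints */

/* Original loop with both statements in one body. */
void fused(int *a, int *c, const int *b, size_t n) {
    for (size_t i = 1; i < n; i++) {
        a[i] = a[i - 1] + b[i];   /* S1: recurrence, not vectorizable */
        c[i] = b[i] * 2;          /* S2: independent across iterations */
    }
}

/* After separating the two components into their own loops. */
void distributed(int *a, int *c, const int *b, size_t n) {
    /* S1 keeps its sequential order. */
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];

    /* S2 in groups of VEC elements; each group stands in for one
       vector multiply in this plain-C model. */
    size_t i = 1;
    for (; i + VEC <= n; i += VEC)
        for (size_t k = 0; k < VEC; k++)
            c[i + k] = b[i + k] * 2;

    /* Scalar remainder for the elements that do not fill a group. */
    for (; i < n; i++)
        c[i] = b[i] * 2;
}
```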

A loop whose arrays and bounds are fully known to the compiler can easily be vectorized at compile time, as it doesn't have any dependence on external parameters.
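
A minimal sketch of such a case is given below. The wrapper function `add_five_local` and its return value are hypothetical, added only so the result can be observed; the relevant part is the second loop, where `a` and `b` are distinct local arrays and the trip count is a compile-time constant.

```c
enum { N = 128 };

/* Returns the sum of a[] so that the computation has a visible result. */
int add_five_local(void) {
    int a[N];
    int b[N];
    int sum = 0;

    for (int i = 0; i < N; i++)   /* initialize b */
        b[i] = i;

    /* a and b are separate local arrays and N is a constant, so the
       compiler can prove that the iterations are independent. */
    for (int i = 0; i < N; i++)
        a[i] = b[i] + 5;

    for (int i = 0; i < N; i++)
        sum += a[i];
    return sum;
}
```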

When the arrays a and b are instead passed in as pointers, a quick run-time check on the addresses of both, plus the loop iteration space (128), is enough to tell whether the arrays overlap, thus revealing any dependencies.
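
One way such a check might look is sketched below. The helper `disjoint`, the function `add_five` and its return code are hypothetical names for this illustration, and comparing addresses through `uintptr_t` assumes a flat address space, which holds on mainstream platforms but is not guaranteed by the C standard. The check is conservative: any overlap at all sends execution down the scalar path.

```c
#include <stdint.h>

enum { N = 128 };

/* Returns 1 when the ranges [a, a+N) and [b, b+N) do not overlap. */
static int disjoint(const int *a, const int *b) {
    uintptr_t pa = (uintptr_t)a;
    uintptr_t pb = (uintptr_t)b;
    uintptr_t bytes = N * sizeof(int);
    return pa + bytes <= pb || pb + bytes <= pa;
}

/* Returns 1 if the vectorizable path was taken, 0 for the scalar path. */
int add_five(int *a, const int *b) {
    if (disjoint(a, b)) {
        /* No overlap, so no dependence between iterations: process
           groups of 4, each standing in for one vector operation. */
        for (int i = 0; i < N; i += 4)
            for (int k = 0; k < 4; k++)
                a[i + k] = b[i + k] + 5;
        return 1;
    }
    /* Possible overlap: keep the original sequential order. */
    for (int i = 0; i < N; i++)
        a[i] = b[i] + 5;
    return 0;
}
```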

Loop-level automatic vectorization, the technique used for conventional vector machines, tries to find and exploit SIMD parallelism at the loop level.

Basic-block-level automatic vectorization, a relatively new technique, specifically targets modern SIMD architectures with short vector lengths.
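
To illustrate the kind of code this approach looks for, the sketch below shows four isomorphic statements on adjacent elements with no loop around them, followed by a packed equivalent. The `v4` type and its helper functions are hypothetical stand-ins that model a 4-lane vector register in plain C; a real compiler would emit vector instructions instead.

```c
/* Straight-line code: the same operation on four adjacent elements. */
void blend4_scalar(int out[4], const int p[4], const int q[4]) {
    out[0] = p[0] * 3 + q[0];
    out[1] = p[1] * 3 + q[1];
    out[2] = p[2] * 3 + q[2];
    out[3] = p[3] * 3 + q[3];
}

/* Hypothetical model of a 4-lane vector value and its operations. */
typedef struct { int lane[4]; } v4;

static v4 v4_load(const int *p) {
    v4 r;
    for (int k = 0; k < 4; k++) r.lane[k] = p[k];
    return r;
}

static v4 v4_splat(int s) {
    v4 r;
    for (int k = 0; k < 4; k++) r.lane[k] = s;
    return r;
}

static v4 v4_mul(v4 a, v4 b) {
    v4 r;
    for (int k = 0; k < 4; k++) r.lane[k] = a.lane[k] * b.lane[k];
    return r;
}

static v4 v4_add(v4 a, v4 b) {
    v4 r;
    for (int k = 0; k < 4; k++) r.lane[k] = a.lane[k] + b.lane[k];
    return r;
}

static void v4_store(int *p, v4 v) {
    for (int k = 0; k < 4; k++) p[k] = v.lane[k];
}

/* Packed form: the four statements become one multiply and one add. */
void blend4_packed(int out[4], const int p[4], const int q[4]) {
    v4 t = v4_mul(v4_load(p), v4_splat(3));
    v4_store(out, v4_add(t, v4_load(q)));
}
```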

The presence of if-statements in the loop body requires the execution of instructions in all control paths to merge the multiple values of a variable.
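
A minimal sketch of this merging, often called if-conversion, is shown below in scalar C. The function names are hypothetical, and the mask trick assumes two's-complement integers. In the converted form both control paths are evaluated for every element, and a per-element mask selects which value is kept; this is the form that maps onto vector compare and select operations.

```c
#include <stddef.h>

/* Scalar baseline: only one side of the branch runs per element. */
void branchy(int *out, const int *a, const int *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (a[i] > 0)
            out[i] = a[i] + b[i];
        else
            out[i] = a[i] - b[i];
    }
}

/* If-converted form: both paths are computed, then merged by a mask. */
void if_converted(int *out, const int *a, const int *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int then_val = a[i] + b[i];
        int else_val = a[i] - b[i];
        int mask = -(a[i] > 0);   /* all ones if true, zero if false */
        out[i] = (then_val & mask) | (else_val & ~mask);
    }
}
```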

The more complex the control flow becomes and the more instructions are bypassed in the scalar code, the larger the vectorization overhead becomes.

Consider an example where the outer branch in the scalar baseline is always taken, bypassing most instructions in the loop body.
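
The cost difference can be made concrete with a small counting sketch. The functions below are hypothetical and written in scalar C; each returns how many times the inner work was evaluated. When `a[i] > 0` is false for every element, the scalar baseline bypasses the body entirely, while the if-converted form still evaluates it for every element.

```c
#include <stddef.h>

/* Scalar baseline with nested control flow.
   Returns the number of times the inner work was executed. */
size_t scalar_nested(int *out, const int *a, const int *b, size_t n) {
    size_t work = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > 0) {               /* outer test guards the whole body */
            work++;
            if (b[i] > 0)
                out[i] = a[i] * b[i];
            else
                out[i] = a[i];
        }
    }
    return work;
}

/* If-converted form: every path is evaluated for every element, and the
   conditions only choose which value ends up stored. */
size_t if_converted_nested(int *out, const int *a, const int *b, size_t n) {
    size_t work = 0;
    for (size_t i = 0; i < n; i++) {
        int outer = a[i] > 0;
        int inner = b[i] > 0;
        int prod = a[i] * b[i];           /* computed regardless */
        int val = inner ? prod : a[i];    /* select for the inner branch */
        work++;
        out[i] = outer ? val : out[i];    /* select for the outer branch */
    }
    return work;
}
```

With eight elements that all fail the outer test, `scalar_nested` reports 0 units of inner work and `if_converted_nested` reports 8, although both leave `out` unchanged.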

In most C and C++ compilers, it is possible to use intrinsic functions to vectorize manually, at the expense of programmer effort and maintainability.
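
As a hedged illustration of what manual vectorization looks like, the sketch below adds two int arrays using x86 SSE2 intrinsics from `<emmintrin.h>`. The function name `add_arrays` is hypothetical. The sketch assumes 32-bit `int`, and it relies on the `__SSE2__` macro that GCC and Clang define when SSE2 is enabled; on other targets or compilers only the scalar loop is compiled.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

void add_arrays(int *c, const int *a, const int *b, size_t n) {
    size_t i = 0;
#if defined(__SSE2__)
    /* Four 32-bit additions per iteration, using unaligned loads and
       stores so that the arrays need no special alignment. */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_add_epi32(va, vb));
    }
#endif
    /* Scalar loop: handles the remainder, or everything without SSE2. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

The extra loads, stores, casts and remainder handling show where the added effort and maintenance cost come from, compared with the plain loop that the compiler could vectorize on its own.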