Loop unrolling

On modern processors, loop unrolling is often counterproductive, as the increased code size can cause more cache misses; cf.

[3] To eliminate this computational overhead, loops can be re-written as a repeated sequence of similar independent statements.

If an optimizing compiler or assembler is able to pre-calculate offsets to each individually referenced array variable, these can be built into the machine code instructions directly, therefore requiring no additional arithmetic operations at run time.

If this part of the program is to be optimized, and the overhead of the loop requires significant resources compared to those for the delete(x) function, unwinding can be used to speed it up.

Afterwards, only 20% of the jumps and conditional branches need to be taken, and represents, over many iterations, a potentially significant decrease in the loop administration overhead.

Computer programs easily track the combinations, but programmers find this repetition boring and make mistakes.

Even better, the "tweaked" pseudocode example, that may be performed automatically by some optimizing compilers, eliminating unconditional jumps altogether.

This flexibility is one of the advantages of just-in-time techniques versus static or manual optimization in the context of loop unrolling.

Assembly language programmers (including optimizing compiler writers) are also able to benefit from the technique of dynamic loop unrolling, using a method similar to that used for efficient branch tables.

This example is for IBM/360 or Z/Architecture assemblers and assumes a field of 100 bytes (at offset zero) is to be copied from array FROM to array TO—both having 50 entries with element lengths of 256 bytes each.In this example, approximately 202 instructions would be required with a "conventional" loop (50 iterations), whereas the above dynamic code would require only about 89 instructions (or a saving of approximately 56%).

The following example demonstrates dynamic loop unrolling for a simple program written in C. Unlike the assembler example above, pointer/index arithmetic is still generated by the compiler in this example because a variable (i) is still used to address the array element.