Branch predictors play a critical role in achieving high performance in many modern pipelined microprocessor architectures.
The branch predictor attempts to avoid the time wasted waiting for a conditional jump to resolve by guessing whether the jump is most likely to be taken or not taken.
Modern microprocessors tend to have quite long pipelines, so the misprediction delay is between 10 and 20 clock cycles.
As a result, making a pipeline longer increases the need for a more advanced branch predictor.
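As a hypothetical illustration: if one instruction in five is a branch, 5% of those branches are mispredicted, and the misprediction delay is 15 cycles, mispredictions add roughly 0.2 × 0.05 × 15 = 0.15 stall cycles per instruction, a cost that grows in direct proportion to the misprediction delay.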
Only when the branch or jump is evaluated and found to be taken does the instruction pointer get set to a non-sequential address.
Both CPUs evaluate branches in the decode stage and have a single-cycle instruction fetch, so the instruction immediately after a branch is always fetched before the branch is resolved.
Both architectures define branch delay slots in order to make use of these fetched instructions.
[1] Using a random or pseudorandom bit (a pure guess) would guarantee every branch a 50% correct prediction rate, which cannot be improved (or worsened) by reordering instructions.
Assuming, for simplicity, a uniform distribution of branch targets, 0.5, 1.5, and 3.5 of the fetched instructions are discarded, respectively.
The discarded instructions at the branch and destination lines add up to nearly a complete fetch cycle, even for a single-cycle next-line predictor.
The original, non-MMX Intel Pentium processor uses a saturating counter, though with an imperfect implementation.
[8] On the SPEC'89 benchmarks, very large bimodal predictors saturate at 93.5% correct, once every branch maps to a unique counter.
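A minimal sketch of such a bimodal predictor, assuming a direct-mapped table of 2-bit saturating counters indexed by the low bits of the branch address (the table size and names below are illustrative, not taken from any particular processor):

```c
#include <stdbool.h>
#include <stdint.h>

#define BIMODAL_ENTRIES 4096               /* illustrative table size (power of two) */

static uint8_t counters[BIMODAL_ENTRIES];  /* 2-bit counters: 0-1 predict not taken, 2-3 predict taken */

/* Predict a branch from the low bits of its address. */
bool bimodal_predict(uint32_t pc)
{
    return counters[pc % BIMODAL_ENTRIES] >= 2;
}

/* Update the saturating counter with the actual outcome. */
void bimodal_update(uint32_t pc, bool taken)
{
    uint8_t *c = &counters[pc % BIMODAL_ENTRIES];
    if (taken) {
        if (*c < 3) (*c)++;                /* saturate at "strongly taken" */
    } else {
        if (*c > 0) (*c)--;                /* saturate at "strongly not taken" */
    }
}
```

Because two of the four counter states predict each direction, a single surprising outcome does not immediately flip a well-trained counter.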
In this case, entry number 00 in the pattern history table will go to state "strongly taken", indicating that after two zeroes comes a one.
[8] The advantage of the two-level adaptive predictor is that it can quickly learn to predict an arbitrary repetitive pattern.
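A sketch of a two-level adaptive predictor along these lines, assuming a 2-bit history register for a single branch that selects one 2-bit saturating counter in a pattern history table (the sizes are illustrative); with this history length, the repeating outcome sequence 001001001... drives entry 00 toward "strongly taken", as described above:

```c
#include <stdbool.h>
#include <stdint.h>

#define HISTORY_BITS 2                       /* n-bit branch history register (illustrative) */
#define PHT_ENTRIES  (1u << HISTORY_BITS)    /* one 2-bit counter per history pattern */

static uint8_t  pht[PHT_ENTRIES];            /* pattern history table of 2-bit counters */
static uint32_t history;                     /* recent outcomes of this branch, 1 bit each */

/* Predict using the counter selected by the last HISTORY_BITS outcomes. */
bool twolevel_predict(void)
{
    return pht[history] >= 2;
}

/* Shift the actual outcome into the history and train the selected counter. */
void twolevel_update(bool taken)
{
    uint8_t *c = &pht[history];
    if (taken)  { if (*c < 3) (*c)++; }
    else        { if (*c > 0) (*c)--; }
    history = ((history << 1) | (taken ? 1 : 0)) & (PHT_ENTRIES - 1);
}
```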
[citation needed] A two-level branch predictor where the second level is replaced with a neural network has been proposed.
[14] A local branch predictor has a separate history buffer for each conditional jump instruction.
[citation needed] Predictors like gshare use multiple table entries to track the behavior of any particular branch.
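A sketch of a gshare-style global predictor, assuming the global history register is XORed with the low bits of the branch address to index a shared table of 2-bit saturating counters (sizes are illustrative); because the index depends on the history, a single branch can occupy several table entries:

```c
#include <stdbool.h>
#include <stdint.h>

#define GSHARE_BITS    12                       /* index width (illustrative) */
#define GSHARE_ENTRIES (1u << GSHARE_BITS)

static uint8_t  gshare_pht[GSHARE_ENTRIES];     /* shared 2-bit counters */
static uint32_t global_history;                 /* outcomes of the most recent branches */

static uint32_t gshare_index(uint32_t pc)
{
    /* XOR folds the global history into the branch address. */
    return (pc ^ global_history) & (GSHARE_ENTRIES - 1);
}

bool gshare_predict(uint32_t pc)
{
    return gshare_pht[gshare_index(pc)] >= 2;
}

void gshare_update(uint32_t pc, bool taken)
{
    uint8_t *c = &gshare_pht[gshare_index(pc)];
    if (taken)  { if (*c < 3) (*c)++; }
    else        { if (*c > 0) (*c)--; }
    global_history = ((global_history << 1) | (taken ? 1 : 0)) & (GSHARE_ENTRIES - 1);
}
```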
[17][18] Newer processors from Intel[19] and AMD[20] can predict indirect branches by using a two-level adaptive predictor.
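One possible arrangement, sketched below purely for illustration (the table size, hashing, and history folding are assumptions, not any vendor's documented design), is a target cache indexed by the branch address combined with recent path history, so the same indirect branch can store different predicted targets for different histories:

```c
#include <stdint.h>

#define TARGET_ENTRIES 1024                   /* illustrative table size */

static uint32_t target_cache[TARGET_ENTRIES]; /* last target seen for each (address, history) index */
static uint32_t path_history;                 /* low bits of recent targets, folded together */

static uint32_t target_index(uint32_t pc)
{
    return (pc ^ path_history) & (TARGET_ENTRIES - 1);
}

/* Predict the target of an indirect branch from its address and recent path. */
uint32_t indirect_predict(uint32_t pc)
{
    return target_cache[target_index(pc)];
}

/* Record the actual target and fold it into the path history. */
void indirect_update(uint32_t pc, uint32_t actual_target)
{
    target_cache[target_index(pc)] = actual_target;
    path_history = ((path_history << 4) ^ actual_target) & (TARGET_ENTRIES - 1);
}
```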
[27] The main advantage of the neural predictor is its ability to exploit long histories while requiring only linear resource growth.
Even with high-speed arithmetic tricks, the computation latency is high relative to the clock period of many modern microarchitectures.
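A sketch of a perceptron-style neural predictor, assuming one signed weight per global-history bit plus a bias, a prediction taken from the sign of their dot product with the history (+1 for taken, -1 for not taken), and training only on a misprediction or a low-magnitude output; the history length, table size, and threshold below are illustrative. The summation over all history weights is the source of the latency concern noted above.

```c
#include <stdbool.h>
#include <stdint.h>

#define HIST_LEN        16                  /* history bits per perceptron (illustrative) */
#define NUM_PERCEPTRONS 256
#define THRESHOLD       30                  /* training threshold (illustrative) */

static int8_t weights[NUM_PERCEPTRONS][HIST_LEN + 1];  /* [0] is the bias weight */
static int8_t ghist[HIST_LEN];                         /* +1 = taken, -1 = not taken */

static int perceptron_output(uint32_t pc)
{
    int8_t *w = weights[pc % NUM_PERCEPTRONS];
    int y = w[0];                           /* bias */
    for (int i = 0; i < HIST_LEN; i++)
        y += w[i + 1] * ghist[i];           /* dot product with the history */
    return y;
}

bool perceptron_predict(uint32_t pc)
{
    return perceptron_output(pc) >= 0;
}

void perceptron_update(uint32_t pc, bool taken)
{
    int y = perceptron_output(pc);
    int t = taken ? 1 : -1;
    int8_t *w = weights[pc % NUM_PERCEPTRONS];

    /* Train on a misprediction or when the output magnitude is small. */
    if ((y >= 0) != taken || (y < 0 ? -y : y) <= THRESHOLD) {
        if (w[0] + t <= 127 && w[0] + t >= -128) w[0] += t;
        for (int i = 0; i < HIST_LEN; i++) {
            int nw = w[i + 1] + t * ghist[i];
            if (nw <= 127 && nw >= -128) w[i + 1] = (int8_t)nw;
        }
    }

    /* Shift the actual outcome into the global history. */
    for (int i = HIST_LEN - 1; i > 0; i--)
        ghist[i] = ghist[i - 1];
    ghist[0] = (int8_t)t;
}
```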
Many other researchers developed this concept (A. Seznec, M. Monchiero, D. Tarjan & K. Skadron, V. Desmet, Akkary et al., K. Aasaraai, Michael Black, etc.).
[34] The Stretch designers had considered static hint bits in the branch instructions early in the project but decided against them.
Two-bit predictors were introduced by Tom McWilliams and Curt Widdoes in 1977 for the Lawrence Livermore National Lab S-1 supercomputer and independently by Jim Smith in 1979 at CDC.
[35] Microprogrammed processors, popular from the 1960s to the 1980s and beyond, took multiple cycles per instruction, and generally did not require branch prediction.
The B4900 branch prediction history state is stored back into the in-memory instructions during program execution.
Because they use branch delay slots, fetch just one instruction per cycle, and execute in order, there is no performance loss.
Branch prediction became more important with the introduction of pipelined superscalar processors like the Intel Pentium, DEC Alpha 21064, the MIPS R8000, and the IBM POWER series.
As a result, it has, in effect, very large base and choice predictor tables, and uses parity rather than ECC for instructions in the L2 cache.
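For illustration only (this is not the Alpha design itself), a choice predictor can be sketched as a table of 2-bit counters that learns, per index, which of two component predictors to trust; the sketch below reuses the bimodal and gshare sketches above as the components:

```c
#include <stdbool.h>
#include <stdint.h>

#define CHOICE_ENTRIES 4096                 /* illustrative table size */

static uint8_t choice[CHOICE_ENTRIES];      /* 2-bit counters: >= 2 means prefer the global component */

/* Component predictors from the sketches above. */
bool bimodal_predict(uint32_t pc);
bool gshare_predict(uint32_t pc);

bool tournament_predict(uint32_t pc)
{
    uint32_t i = pc % CHOICE_ENTRIES;
    return (choice[i] >= 2) ? gshare_predict(pc) : bimodal_predict(pc);
}

/* Train the chooser only when the two components disagree. */
void tournament_update_choice(uint32_t pc, bool taken)
{
    uint32_t i = pc % CHOICE_ENTRIES;
    bool local  = bimodal_predict(pc);
    bool global = gshare_predict(pc);
    if (local != global) {
        if (global == taken) { if (choice[i] < 3) choice[i]++; }
        else                 { if (choice[i] > 0) choice[i]--; }
    }
}
```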
The Alpha 21464[37] (EV8, cancelled late in design) had a minimum branch misprediction penalty of 14 cycles.
In 2018, a catastrophic security vulnerability called Spectre was made public by Google's Project Zero and other researchers.