The first machine to use out-of-order execution was the CDC 6600 (1964), designed by James E. Thornton, which uses a scoreboard to resolve conflicts between instructions.[5]
About two years later, the IBM System/360 Model 91 (1966) introduced register renaming with Tomasulo's algorithm,[6] which eliminates the false dependencies (WAW and WAR), making full out-of-order execution possible.[8]
Only the floating-point registers of the Model 91 are renamed, making it subject to the same WAW and WAR limitations as the CDC 6600 when running fixed-point calculations.
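As a rough illustration of what renaming buys (a minimal Python sketch, not the Model 91's actual hardware; the register names and the rename table below are hypothetical), mapping each architectural destination to a fresh physical register removes WAW and WAR hazards while preserving true read-after-write dependencies:

    # Minimal register-renaming sketch (illustrative only, not the Model 91's logic).
    # Each architectural destination gets a fresh physical register, so a later
    # write to the same architectural register (WAW) or a write after an earlier
    # read (WAR) no longer conflicts; only true RAW dependencies remain.
    next_phys = 0        # next free physical register number
    rename_map = {}      # architectural register -> current physical register

    def rename(op, dst, srcs):
        global next_phys
        phys_srcs = [rename_map.get(s, s) for s in srcs]  # keep RAW links to prior writers
        next_phys += 1
        phys_dst = f"p{next_phys}"                        # fresh destination breaks WAW/WAR
        rename_map[dst] = phys_dst
        return (op, phys_dst, phys_srcs)

    program = [
        ("mul", "r1", ["r2", "r3"]),  # writes r1
        ("add", "r4", ["r1", "r5"]),  # reads r1 (true dependency, preserved)
        ("sub", "r1", ["r6", "r7"]),  # rewrites r1: the WAW/WAR hazards on r1 disappear
    ]
    for instr in program:
        print(rename(*instr))

After renaming, the third instruction shares no register with the first two and can issue as soon as its own operands are ready.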
Both the Model 91 and the 6600 also suffer from imprecise exceptions, a problem that had to be solved before out-of-order execution could be applied generally and made practical outside supercomputers.
The CDC Cyber 990 (1984) implements precise interrupts by using a history buffer, which holds the old (overwritten) values of registers so that they can be restored when an exception requires instructions to be rolled back.[11]
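A minimal sketch of the history-buffer idea (illustrative only; the structures and names below are assumptions, not the Cyber 990's implementation): every write logs the value it overwrites, and an exception unwinds the log newest-first to recover a precise state.

    # Illustrative history-buffer sketch: each write records the old register
    # value; rolling back restores values newest-first, giving a precise state.
    regs = {"r1": 10, "r2": 20, "r3": 30}
    history = []                             # (register, old_value), oldest first

    def write(reg, value):
        history.append((reg, regs[reg]))     # save the value being overwritten
        regs[reg] = value

    def rollback(n):
        for reg, old in reversed(history[-n:]):   # undo the n youngest writes
            regs[reg] = old
        del history[-n:]

    write("r1", 11)    # three instructions complete, possibly out of order
    write("r2", 22)
    write("r3", 33)
    rollback(3)        # the oldest of them takes an exception: undo all three
    print(regs)        # back to {'r1': 10, 'r2': 20, 'r3': 30}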
In the 1980s, many early RISC microprocessors, like the Motorola 88100, had out-of-order writeback to the registers, resulting in imprecise exceptions.
However, the single-cycle execution of the most basic instructions greatly reduced the scope of the problem compared to the CDC 6600.
Smith also researched how to make different execution units operate more independently of each other and of memory, the front end, and branching.
In his 1984 paper, he argued that enforcing precise exceptions only on the integer/memory pipeline should be sufficient for many use cases, since it even permits virtual memory.
To further decouple the memory access from execution, each of the two pipelines was associated with two addressable queues that effectively performed limited register renaming.
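The effect can be sketched in software (a toy Python model simplified to a single queue; the names load_queue, access_stream, and execute_stream are assumptions, and real designs implement the queues in hardware): an access stream runs ahead performing loads into a queue, and an execute stream later consumes operands from it, so the two can slip relative to one another.

    # Toy decoupled access/execute model (names and sizes are assumptions).
    # The access stream issues loads ahead of time; the execute stream drains
    # the queue, so memory latency is hidden by the slack between the two.
    from collections import deque

    memory = {addr: addr * 2 for addr in range(8)}   # toy memory image
    load_queue = deque()                             # access -> execute data queue

    def access_stream(addresses):
        for addr in addresses:
            load_queue.append(memory[addr])          # "loads" run ahead of execution

    def execute_stream(count):
        total = 0
        for _ in range(count):
            total += load_queue.popleft()            # operands arrive via the queue
        return total

    access_stream(range(8))
    print(execute_stream(8))                         # 0 + 2 + ... + 14 = 56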
It uses a physical register file (i.e. a dynamically remapped file holding both uncommitted and committed values) instead of a reorder buffer. The ability to cancel instructions is needed only in the branch unit, which implements a history buffer (called a program counter stack by IBM) to undo changes to the count, link, and condition registers.[16]
The Unisys A19, part of the A-series of mainframes, was also released in 1991 and was claimed to feature out-of-order execution; one analyst called the A19's technology three to five years ahead of the competition.[17][18]
The first superscalar single-chip processors (Intel i960CA in 1989) used simple scoreboard scheduling, as the CDC 6600 had a quarter of a century earlier.
In 1992–1996, rapid advances in these techniques, enabled by increasing transistor counts, brought them down to personal computers.
While stores and branches were waiting to start execution, subsequent instructions of other types could keep flowing through all the pipeline stages, including writeback.[28]
In the fall of 1994, NexGen and IBM with Motorola brought the renaming of general-purpose registers to single-chip CPUs.
NexGen's Nx586 was the first x86 processor capable of out-of-order execution and featured a reordering distance of up to 14 micro-operations.
Soon afterwards, the Cyrix 6x86 and AMD K5 brought advanced reordering techniques into mainstream personal computers.
Later, large in-order processors focused on multithreaded performance, but eventually the SPARC T series and Xeon Phi switched to out-of-order execution, in 2011 and 2016 respectively.
The buffer's purpose is to partition the memory-access and execute functions of a program and to achieve high performance by exploiting the fine-grain parallelism between the two.
However, when the processor mispredicts a branch, the entire buffer may need to be flushed, wasting many clock cycles and reducing its effectiveness.[42]
Control-intensive code includes constructs such as nested branches, which occur frequently in operating system kernels.
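A branch misprediction is costly in such a design because everything the access side ran ahead and fetched belongs to the wrong path; the toy sketch below (again with assumed names, not any specific machine's recovery policy) simply discards the queue and counts the wasted work.

    # Illustrative misprediction recovery: the speculatively filled queue is
    # flushed, and all of the access stream's run-ahead work is wasted.
    from collections import deque

    prefetched = deque(range(12))        # operands loaded along the predicted path

    def flush_on_mispredict(queue):
        wasted = len(queue)              # every queued entry came from the wrong path
        queue.clear()
        return wasted

    print(flush_on_mispredict(prefetched))   # 12 entries of run-ahead work discarded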