Latency oriented processor architecture

These architectures generally aim to execute as many instructions as possible belonging to a single serial thread in a given window of time; however, the time to execute a single instruction completely, from the fetch to the retire stage, can vary from a few cycles to a few hundred cycles in some cases.

Register renaming[a] is used to effectively increase the register file size beyond that specified to programmers in the ISA, and to eliminate false dependencies.
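To make the idea concrete, here is a minimal Python sketch of renaming under assumed conventions: a toy three-operand instruction format and invented register names (r0..r3 architectural, p0.. physical). Real renaming hardware also recycles physical registers, which is omitted here.

```python
# Minimal register-renaming sketch (illustrative only).
# Each instruction is (dest, src1, src2) over architectural registers r0..r3.
# Every new destination is mapped to a fresh physical register, so two
# writes to the same architectural register no longer conflict (no WAW hazard).

def rename(program, num_arch_regs=4):
    rat = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}  # register alias table
    next_phys = num_arch_regs
    renamed = []
    for dest, src1, src2 in program:
        # Sources read the current mapping (true RAW dependencies are preserved).
        s1, s2 = rat[src1], rat[src2]
        # The destination gets a brand-new physical register (kills WAW/WAR hazards).
        rat[dest] = f"p{next_phys}"
        next_phys += 1
        renamed.append((rat[dest], s1, s2))
    return renamed

program = [
    ("r1", "r2", "r3"),  # r1 = r2 + r3
    ("r1", "r0", "r2"),  # r1 = r0 + r2  (reuses r1: a false WAW dependency)
]
print(rename(program))
# [('p4', 'p2', 'p3'), ('p5', 'p0', 'p2')] -> the writes now target different
# physical registers, so both instructions can be in flight at once.
```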

The less time the processor spends waiting for data to be fetched from memory, the fewer instructions sit idle in the pipeline, consuming resources while doing no useful work.
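This effect is commonly quantified with the standard average memory access time formula, AMAT = hit time + miss rate × miss penalty; the cycle counts in the sketch below are hypothetical.

```python
# Average memory access time: AMAT = hit_time + miss_rate * miss_penalty.
# The cycle counts here are hypothetical, chosen only to show how a better
# cache hit rate cuts the average time instructions wait on memory.

def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

print(amat(hit_time=1, miss_rate=0.05, miss_penalty=200))  # 11.0 cycles
print(amat(hit_time=1, miss_rate=0.01, miss_penalty=200))  # 3.0 cycles
```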

The instruction pipeline stalls completely if all of its internal buffers (for example, reservation stations) are filled to capacity.
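A toy model can illustrate this backpressure: in the hypothetical sketch below, a fixed-capacity buffer fills up whenever instructions wait longer for operands than new ones take to arrive, forcing the front end to stall. All parameters are invented for illustration.

```python
# Toy backpressure model (all parameters invented): a fixed-capacity
# buffer holds instructions until their operands are ready; whenever it is
# full, the fetch/issue stage must stall for that cycle.

import random

random.seed(0)
CAPACITY = 4
station = []   # ready-cycle of each in-flight instruction
stalled = 0

for cycle in range(100):
    station = [r for r in station if r > cycle]  # operands ready: leave buffer
    if len(station) < CAPACITY:
        station.append(cycle + random.randint(1, 10))  # issue one instruction
    else:
        stalled += 1  # buffer full: the pipeline front end stalls this cycle

print(f"front end stalled on {stalled} of 100 cycles")
```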

A major cause of pipeline stalls is control flow dependencies,[c] i.e. cases where the outcome of a branch instruction is not known in advance (which is usually the case).

To avoid stalling, the processor guesses (predicts) the branch outcome and speculatively executes instructions along the predicted path. If the guess turns out to be correct, those instructions are allowed to complete and write their results back to the register file/memory.

If the guess turns out to be incorrect, all speculative instructions are flushed from the pipeline and execution restarts along the program's actual correct path.
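A minimal sketch of this squash-on-misprediction behavior, with an invented pipeline representation and addresses:

```python
# Illustrative squash-on-misprediction (invented pipeline representation).
# Instructions fetched beyond an unresolved branch are marked speculative.

def resolve_branch(pipeline, guess_correct, correct_target):
    if guess_correct:
        # The guessed path was right: speculative work becomes ordinary work.
        return [{**slot, "speculative": False} for slot in pipeline]
    # Misprediction: flush everything speculative, refetch the real target.
    kept = [slot for slot in pipeline if not slot["speculative"]]
    kept.append({"pc": correct_target, "speculative": False})
    return kept

pipeline = [
    {"pc": 0x00, "speculative": False},
    {"pc": 0x04, "speculative": False},  # the branch itself
    {"pc": 0x08, "speculative": True},   # fetched along the predicted path
    {"pc": 0x0C, "speculative": True},
]
print(resolve_branch(pipeline, guess_correct=False, correct_target=0x40))
# The speculative entries are gone; fetch resumes at the correct target 0x40.
```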

By maintaining high prediction accuracy, the branch predictor allows the pipeline to significantly increase throughput for the executing thread.
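One standard way such accuracy is achieved is a two-bit saturating-counter predictor, a classic textbook scheme rather than anything specific to one architecture; a minimal sketch:

```python
# Two-bit saturating-counter branch predictor (a classic textbook scheme).
# States 0-1 predict "not taken", states 2-3 predict "taken"; each real
# outcome nudges the counter, so one anomalous outcome (e.g. a loop exit)
# does not immediately flip the prediction.

class TwoBitPredictor:
    def __init__(self):
        self.counter = 2  # start in the weakly-taken state

    def predict(self):
        return self.counter >= 2  # True means "predict taken"

    def update(self, taken):
        self.counter = min(3, self.counter + 1) if taken else max(0, self.counter - 1)

predictor = TwoBitPredictor()
outcomes = [True] * 9 + [False]  # a loop branch: taken 9 times, then exits
correct = 0
for taken in outcomes:
    correct += predictor.predict() == taken
    predictor.update(taken)
print(f"{correct}/{len(outcomes)} correct")  # 9/10 for this pattern
```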

All instructions are entered into a re-order buffer when they are fetched by the pipeline, and are allowed to retire (i.e. write back their results) in original program order so as to maintain correctness.
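A minimal sketch of in-order retirement from a re-order buffer, with invented instruction IDs: instructions may complete in any order, but results leave the buffer only from its head.

```python
# Re-order buffer sketch (invented instruction IDs): instructions may
# *complete* out of order, but they *retire* (write results back to the
# architectural state) strictly in original program order, from the head.

from collections import deque

rob = deque()  # entries kept in program order

def dispatch(instr_id):
    rob.append({"id": instr_id, "done": False})

def complete(instr_id):  # execution finished; the result is buffered
    for entry in rob:
        if entry["id"] == instr_id:
            entry["done"] = True

def retire():  # only the oldest finished instructions may write back
    retired = []
    while rob and rob[0]["done"]:
        retired.append(rob.popleft()["id"])
    return retired

for i in range(4):
    dispatch(i)
complete(2)                # instruction 2 finishes first...
print(retire())            # ...but nothing retires: 0 and 1 are still pending
complete(0); complete(1)
print(retire())            # [0, 1, 2] now retire together, in program order
```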

In contrast, a throughput oriented processor architecture[f] is designed to maximize the amount of 'useful work' done in a given window of time.

Throughput oriented architectures, on the other hand, usually have a multitude of processors with much smaller caches and simpler control logic.