Superscalar processor

The superscalar technique is traditionally associated with several identifying characteristics (within a given CPU): Seymour Cray's CDC 6600 from 1964, while not capable of issuing multiple instructions per cycle, is often cited as an early influence to modern superscalar processors for its ability to execute instructions simultaneously through multiple functional units.

The P5 Pentium was the first superscalar x86 processor; the Nx586, P6 Pentium Pro and AMD K5 were among the first designs which decode x86-instructions asynchronously into dynamic microcode-like micro-op sequences prior to actual execution on a superscalar microarchitecture; this opened up for dynamic scheduling of buffered partial instructions and enabled more parallelism to be extracted compared to the more rigid methods used in the simpler P5 Pentium; it also simplified speculative execution and allowed higher clock frequencies compared to designs such as the advanced Cyrix 6x86.

Superscalar CPU design emphasizes improving the instruction dispatcher accuracy and allowing it to keep the multiple execution units in use at all times.

If the dispatcher is ineffective at keeping all of these units fed with instructions, the performance of the system will be no better than that of a simpler, cheaper design.

Most modern superscalar CPUs also have logic to reorder the instructions to try to avoid pipeline stalls and increase parallel execution.

Available performance improvement from superscalar techniques is limited by three key areas: Existing binary executable programs have varying degrees of intrinsic parallelism.

No matter how advanced the semiconductor process or how fast the switching speed, this places a practical limit on how many instructions can be simultaneously dispatched.

Collectively the power consumption, complexity and gate delay costs limit the achievable superscalar speedup.

With VLIW, the burdensome task of dependency checking by hardware logic at run time is removed and delegated to the compiler.

SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures.