This makes the instruction in the delay slot execute out of order compared to its position in the original assembly language code.
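As a sketch of the effect, MIPS branches have a single delay slot, so the instruction written immediately after a branch executes whether or not the branch is taken (the registers and label here are illustrative):

        beq   $t0, $t1, target   # branch if $t0 == $t1
        addu  $s0, $s0, $s2      # delay slot: executes even when the branch is taken
        ...
    target:
        ...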
Modern processor designs generally do not use delay slots, and instead perform ever more complex forms of branch prediction.
In these systems, the CPU immediately moves on to what it believes will be the correct side of the branch, which eliminates the need for the code to specify some unrelated instruction to fill the slot, a choice that may not always be obvious at compile-time.
If the prediction is wrong, the processor must discard the speculative work and restart on the correct side, but this occurs rarely enough that the speedup from avoiding the delay slot easily makes up for the occasional wrong decision.
A central processing unit generally performs each machine code instruction using a four-step process: the instruction is first fetched from memory, then decoded to determine what needs to be performed, those actions are then executed, and finally, any results are written back to memory.
In early designs, each of these stages was performed in series, so that instructions took some multiple of the machine's clock cycle to complete.
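Pipelining overlaps these stages, so that while one instruction is being executed, the next can be decoded and a third fetched. A sketch of the timing for a four-stage pipeline (F = fetch, D = decode, E = execute, W = write-back):

    cycle:    1   2   3   4   5   6   7
    insn 1:   F   D   E   W
    insn 2:       F   D   E   W
    insn 3:           F   D   E   W
    insn 4:               F   D   E   W

Once the pipeline is full, one instruction completes on every cycle, even though each individual instruction still takes four cycles from fetch to write-back.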
The much simpler instruction set architecture (ISA) of the MOS 6502 allowed a two-stage pipeline to be included, which gave it performance that was about double that of the Z80 at any given clock speed.[4]
A major issue with the implementation of pipelines in early systems was that instructions had widely varying cycle counts.
For instance, the instruction to add two values would often be offered in multiple versions, or opcodes, which varied in where they read their data from.
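On the 6502, for example, the add-with-carry instruction ADC exists as several opcodes distinguished by addressing mode, each with its own cycle count; the timings below follow the published figures for the NMOS 6502:

    ADC #$10      ; immediate operand:  2 cycles
    ADC $10       ; zero-page address:  3 cycles
    ADC $1234     ; absolute address:   4 cycles
    ADC ($10),Y   ; indirect indexed:   5 cycles (+1 if a page boundary is crossed)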
When a branch is encountered, the processor cannot know which instruction to fetch next until the branch is resolved, so the pipeline must sit idle in the meantime. This is known as a "pipeline stall" or "bubble", and, depending on the number of branches in the code, can have a noticeable impact on overall performance.
The simplest way to fill the delay slot is with a NOP, which wastes the cycle; more advanced solutions would instead try to identify another instruction, typically nearby in the code, to place in the delay slot so that useful work would be accomplished.
Compilers generally have a limited "window" of code to examine and may not find a suitable instruction in that range, in which case they must fall back to a NOP.
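A minimal sketch of such a fill, in MIPS-style assembly with hypothetical registers: the add does not touch the register tested by the branch, so the compiler can hoist it into the slot instead of padding with a NOP.

        # before scheduling: the slot is padded with a NOP
        addu  $s0, $s0, $s1      # independent useful work
        beq   $t0, $zero, done
        nop                      # delay slot wasted

        # after scheduling: the add is moved into the slot
        beq   $t0, $zero, done
        addu  $s0, $s0, $s1      # delay slot now does useful work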
Another side effect is that debuggers need special handling for breakpoints set on instructions in a branch delay slot, as well as for single-stepping through them.
Software compatibility requirements dictate that an architecture may not change the number of delay slots from one generation to the next.[8]
The following example shows delayed branches in assembly language for the SHARC DSP, including a pair after the RTS instruction.
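In SHARC assembly the (DB) modifier marks a branch as delayed, and the two instructions after it execute before the branch takes effect; the listing below is a minimal sketch along those lines, with illustrative register choices rather than code from any real program:

        R0 = R4 + R8;
        JUMP fn (DB);       /* delayed branch: the next two instructions still execute */
        R1 = R1 + 1;        /* first delay slot */
        R2 = DM(I0, M0);    /* second delay slot */
        ...
    fn:
        R12 = R12 - R4;
        RTS (DB);           /* delayed return: the pair below runs before returning */
        R0 = R0 + R12;      /* first delay slot */
        R1 = R0;            /* second delay slot */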