The mechanisms for performing memory disambiguation, implemented using digital logic inside the microprocessor core, detect true dependencies between memory operations at execution time and allow the processor to recover when a dependence has been violated.
In the case of register operands, the dependence is static and easily determined by the microprocessor, because the source and destination registers are encoded explicitly in each instruction.
Consider the following code snippet, given in MIPS assembly. Assume that the scheduling logic will issue an instruction to the execution unit as soon as all of its register operands are ready.
Further, assume that registers $30 and $31 are ready: the values in $30 and $31 were computed a long time ago and have not changed.
Finally, assume that registers $30 and $31 hold the same value, and thus all the loads and stores in the snippet access the same memory word.
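The snippet itself does not appear here, so the following is a hypothetical MIPS-style sequence consistent with the stated assumptions; the specific instructions, register numbers, and offset are illustrative placeholders, not the original example:

```asm
div $26, $27, $28    # long-latency divide produces the store data
sw  $26, 0($30)      # store must wait for the slow divide to finish
lw  $8,  0($31)      # reads the same word, since $30 == $31: RAW through memory
```

Here the lw's register operand ($31) is ready long before the sw's data operand ($26), so an aggressive scheduler would issue the load first, even though executing it early would read the stale value the store has not yet written.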
Values from store instructions are not committed to the memory system (in modern microprocessors, the CPU cache) when they execute.
The processor must "throw away" any results from the bad-path, speculatively-executed instructions when it discovers the exception or branch misprediction.
Forcing stores to wait until branch directions and exceptions are known significantly reduces out-of-order aggressiveness and limits instruction-level parallelism (ILP) and performance.
Buffering stores until retirement avoids WAW and WAR dependencies but introduces a new problem: a load that executes while an older store to the same address is still buffered will read a stale value from the memory system, because the store's data has not yet been written to the cache.
To solve this problem, processors employ a technique called store-to-load forwarding using the store queue.
This technique allows loads to obtain correct data if their producer store has completed but not yet retired.
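As a minimal sketch of this idea (a toy model, not any real microarchitecture; the class and method names are invented for illustration):

```python
class StoreQueue:
    """Buffers speculative stores and forwards their data to younger loads."""

    def __init__(self, memory):
        self.memory = memory          # backing "cache": dict of addr -> value
        self.entries = []             # buffered stores in program order: (addr, value)

    def execute_store(self, addr, value):
        # A store that executes is buffered, NOT written to memory yet.
        self.entries.append((addr, value))

    def execute_load(self, addr):
        # Store-to-load forwarding: scan from youngest to oldest for a
        # completed-but-unretired store to the same address.
        for st_addr, st_value in reversed(self.entries):
            if st_addr == addr:
                return st_value       # forwarded from the store queue
        return self.memory.get(addr, 0)  # otherwise read the memory system

    def retire_oldest_store(self):
        # At retirement the store is known to be non-speculative: commit it.
        addr, value = self.entries.pop(0)
        self.memory[addr] = value


mem = {0x100: 1}
sq = StoreQueue(mem)
sq.execute_store(0x100, 42)        # buffered; mem still holds 1
print(sq.execute_load(0x100))      # 42, forwarded from the store queue
print(mem[0x100])                  # 1, store not yet committed
sq.retire_oldest_store()
print(mem[0x100])                  # 42 after retirement
```

Scanning youngest-to-oldest matters: the load must receive the value of the most recent older store to its address, not an arbitrary matching entry.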
When a store executes, it searches the load queue for younger loads with a matching address that have already executed. If such a matching load exists, it must have executed before the store and thus read an incorrect, old value from the memory system/store queue.
However, this load-queue search does not require any additional memory (cache) ports or create resource conflicts with other loads or stores that are executing.
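The load-queue search can be sketched as follows, under the simplifying assumption that each in-flight load is tagged with a program-order age (all names here are illustrative):

```python
def detect_raw_violation(load_queue, store_age, store_addr):
    """Return the loads that violated a RAW dependence with this store.

    load_queue: list of dicts {"age", "addr", "executed"} for in-flight loads.
    A violator is a load that is YOUNGER than the store in program order,
    has already executed, and read the same address.
    """
    return [ld for ld in load_queue
            if ld["age"] > store_age        # younger in program order...
            and ld["executed"]              # ...but already ran...
            and ld["addr"] == store_addr]   # ...and touched the same address


loads = [
    {"age": 5, "addr": 0x100, "executed": True},   # ran too early: violator
    {"age": 7, "addr": 0x200, "executed": True},   # different address: fine
    {"age": 9, "addr": 0x100, "executed": False},  # has not executed yet: fine
]
violators = detect_raw_violation(loads, store_age=3, store_addr=0x100)
print([ld["age"] for ld in violators])  # [5] -> flush from that load onward
```

In hardware this search is an associative (CAM) lookup across all load-queue entries, performed every time a store executes, rather than a sequential scan.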
An alternative technique is to re-execute loads at the retirement point: load instructions that have executed out-of-order access the memory system and read the value at their address a second time when they reach retirement, and a mismatch with the value obtained at execution indicates a RAW dependence violation.
Some recent work (Roth 2005) has shown ways to filter many loads from re-executing when it is known that no RAW dependence violation could have occurred; such filtering would reduce or eliminate the re-execution latency and resource contention.
A minor benefit of this scheme, compared to a load-queue search, is that it will not flag a RAW dependence violation and trigger a pipeline flush when a store that would otherwise have caused a violation (one whose address matches an in-flight load's address) writes a data value that matches the value already in the cache.
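The re-execution check amounts to a value comparison at retirement; a minimal sketch (a toy model, with function and variable names invented for illustration):

```python
def retire_load(memory, addr, value_seen_at_execute):
    """Re-execute a load at retirement and decide whether to flush.

    memory: dict of addr -> value standing in for the cache.
    Returns "retire" if the re-read value matches what the load originally
    obtained, or "flush" if a RAW dependence violation is detected.
    """
    value_now = memory.get(addr, 0)    # second cache access, at retirement
    if value_now != value_seen_at_execute:
        return "flush"                 # violation: squash and restart the load
    return "retire"                    # values match: the early result was fine


mem = {0x40: 7}
print(retire_load(mem, 0x40, 7))       # "retire": nothing intervened
mem[0x40] = 9                          # an older store committed a new value
print(retire_load(mem, 0x40, 7))       # "flush": the load read a stale 7
mem[0x40] = 7                          # a store wrote the SAME value back
print(retire_load(mem, 0x40, 7))       # "retire": value match, no flush needed
```

The last case shows the value-matching benefit described above: an address-based load-queue search would flag a violation there, while the value comparison correctly avoids the flush.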
CPUs that fully support out-of-order execution of loads and stores must be able to detect RAW dependence violations when they occur.
Another possibility is to effectively break loads and stores into two operations: address generation and cache access.
This scheme still allows for some "out-of-orderness" — the address generation operations for any in-flight loads and stores can execute out-of-order, and once addresses have been generated, the cache accesses for each load or store can happen in any order that respects the (now known) true dependences.
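This two-phase scheme can be sketched as a scheduling predicate over a program-ordered list of in-flight memory operations whose addresses become known as address generation completes (a simplified, conservative model; all names are illustrative):

```python
def may_access(ops, i):
    """Decide whether memory op i may perform its cache access yet.

    ops: program-ordered list of dicts {"kind", "addr", "done"}, where
    "addr" is None until that op's address generation has completed.
    Op i may access the cache only if no incomplete older op could alias
    it: any older op with an unknown address forces a conservative wait,
    and any known-aliasing pair involving a store must stay ordered.
    """
    op = ops[i]
    for older in ops[:i]:
        if older["done"]:
            continue                   # already accessed the cache: no hazard
        if older["addr"] is None:
            return False               # unknown address: conservatively wait
        if older["addr"] == op["addr"] and (
                older["kind"] == "store" or op["kind"] == "store"):
            return False               # known dependence through memory: wait
    return True


ops = [
    {"kind": "store", "addr": None,  "done": False},  # address not generated yet
    {"kind": "load",  "addr": 0x20,  "done": False},
]
print(may_access(ops, 1))              # False: older store's address unknown
ops[0]["addr"] = 0x30                  # address generation completes
print(may_access(ops, 1))              # True: addresses now known to differ
```

Note the asymmetry with full out-of-order designs: here a load simply waits behind any older store with an unknown or matching address, so no speculation and hence no violation recovery is needed.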