CPU cache

The cache entry will include the copied data as well as the requested memory location (called a tag).

When the processor needs to read or write a location in memory, it first checks for a corresponding entry in the cache.
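As a minimal sketch of that lookup (assuming, purely for illustration, a direct-mapped cache of 512 lines of 64 bytes each; real caches are hardware, and the names below are invented for the example):

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE 64   /* bytes per cache line (assumed) */
#define NUM_LINES 512  /* 512 x 64 B = 32 KiB of data (assumed) */

struct cache_line {
    bool     valid;             /* does this entry hold real data? */
    uint64_t tag;               /* upper address bits identifying the cached block */
    uint8_t  data[LINE_SIZE];   /* copy of the memory block */
};

static struct cache_line cache[NUM_LINES];

/* A read hits if the line selected by the index bits is valid and its
 * stored tag matches the tag bits of the requested address. */
bool cache_read(uint64_t addr, uint8_t *out)
{
    uint64_t offset = addr % LINE_SIZE;
    uint64_t index  = (addr / LINE_SIZE) % NUM_LINES;
    uint64_t tag    = (addr / LINE_SIZE) / NUM_LINES;

    struct cache_line *line = &cache[index];
    if (line->valid && line->tag == tag) {
        *out = line->data[offset];   /* hit: serve the data from the cache */
        return true;
    }
    return false;                    /* miss: the block must be fetched from memory */
}
```

On a miss, the fetched block would be written into the selected line together with its tag, and the valid bit set, so that later accesses to the same block hit.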

Thus, knowing how well the cache bridges the gap between processor speed and memory speed becomes important, especially in high-performance systems.

As CPUs become faster compared to main memory, stalls due to cache misses displace more potential computation; modern CPUs can execute hundreds of instructions in the time taken to fetch a single cache line from main memory.
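For a rough sense of scale, with illustrative numbers: a core clocked at 4 GHz that can issue four instructions per cycle and faces a 100 ns access to main memory spends about 400 cycles on a single miss that goes all the way to DRAM, which corresponds to roughly 1,600 instruction issue slots.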

However, increasing associativity beyond four does not improve the hit rate as much,[12] and further increases are generally made for other reasons (see virtual aliasing).

That cache entry can be read, and the processor can continue to work with that data before it finishes checking that the tag actually matches the requested address.

A subset of the tag, called a hint, can be used to pick just one of the possible cache entries mapping to the requested address.

This double cache indexing is called a "major location mapping", and its latency is equivalent to a direct-mapped access.

Extensive experiments in multicolumn cache design[17] show that the hit ratio to major locations is as high as 90%.

(The tag, flag and error correction code bits are not included in the size,[22] although they do affect the physical area of a cache.)
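As an illustration with assumed parameters: a 32 KiB direct-mapped cache with 64-byte lines and 48-bit physical addresses holds 512 lines, so an address splits into 6 offset bits, 9 index bits and 48 − 6 − 9 = 33 tag bits. The tag store is then 512 × 33 bits ≈ 2.1 KiB, plus at least a valid bit (and, for a write-back cache, a dirty bit) per line, none of which counts toward the quoted 32 KiB.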

Some systems also set a valid bit to "invalid" at other times, such as when multi-master bus snooping hardware in the cache of one processor hears an address broadcast from some other processor, and realizes that certain data blocks in the local cache are now stale and should be marked invalid.
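A minimal sketch of that invalidation step, reusing the illustrative direct-mapped layout from the earlier example (the names and sizes are assumptions, not any real coherence protocol):

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE 64
#define NUM_LINES 512

struct cache_line {
    bool     valid;
    uint64_t tag;
};

static struct cache_line cache[NUM_LINES];

/* Called when the snooping hardware observes another bus master writing
 * to 'addr': if the broadcast address maps to a valid local line with a
 * matching tag, that copy is now stale and its valid bit is cleared. */
void snoop_invalidate(uint64_t addr)
{
    uint64_t index = (addr / LINE_SIZE) % NUM_LINES;
    uint64_t tag   = (addr / LINE_SIZE) / NUM_LINES;

    if (cache[index].valid && cache[index].tag == tag)
        cache[index].valid = false;
}
```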

However, the latter approach does not help against the synonym problem, in which several cache lines end up storing data for the same physical address.

This issue may be solved by using non-overlapping memory layouts for different address spaces; otherwise, the cache (or a part of it) must be flushed when the mapping changes.

It has not been used recently, as the hardware cost of detecting and evicting virtual aliases has fallen and the software complexity and performance penalty of perfect page coloring has risen.

Also, during miss processing, the alternate ways of the cache line indexed have to be probed for virtual aliases and any matches evicted.

Effectively, the hardware maintains a simple permutation from virtual address to cache index, so that no content-addressable memory (CAM) is necessary to select the right one of the four ways fetched.

To understand the problem, consider a CPU with a 1 MiB physically indexed direct-mapped level-2 cache and 4 KiB virtual memory pages.
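Working the numbers for this assumed configuration: the cache spans 1 MiB / 4 KiB = 256 page-sized stretches of its index space, so physical page frames fall into 256 classes ("colors"). The low 12 address bits are fixed by the page offset, and the next 8 bits of the cache index come from the physical page number; two pages whose frames share a color map onto the same cache lines and can evict each other, while pages of different colors never conflict in this cache.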

The software page coloring technique has been used to effectively partition the shared last-level cache (LLC) in multicore processors.
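A minimal sketch of the idea, assuming 4 KiB pages and a 1 MiB, 16-way LLC (the constants and the page_color helper below are illustrative, not any particular kernel's interface):

```c
#include <stdint.h>

#define PAGE_SHIFT   12              /* 4 KiB pages (assumed) */
#define CACHE_SIZE   (1u << 20)      /* 1 MiB last-level cache (assumed) */
#define CACHE_ASSOC  16              /* 16-way set-associative (assumed) */
#define NUM_COLORS   (CACHE_SIZE / CACHE_ASSOC / (1u << PAGE_SHIFT))

/* The color of a physical page is the portion of the cache index that the
 * page frame number determines; frames of the same color contend for the
 * same cache sets. */
static unsigned page_color(uint64_t phys_addr)
{
    return (unsigned)((phys_addr >> PAGE_SHIFT) % NUM_COLORS);
}
```

An operating system partitioning the LLC this way would restrict each core or process to physical frames drawn from a disjoint subset of colors, so their working sets land in disjoint groups of cache sets.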

Examples of products incorporating L3 and L4 caches include the following:

Finally, at the other end of the memory hierarchy, the CPU register file itself can be considered the smallest, fastest cache in the system, with the special characteristic that it is scheduled in software, typically by a compiler as it allocates registers to hold values retrieved from main memory, for example during loop nest optimization.

There was also a set of 64 address "B" and 64 scalar data "T" registers that took longer to access, but were faster than main memory.

Typically, sharing the L1 cache is undesirable because the resulting increase in latency would make each core run considerably slower than a single-core chip.

It is, however, possible for a line in the data cache to have a PTE which is also in one of the TLBs—the operating system is responsible for keeping the TLBs coherent by flushing portions of them when the page tables in memory are updated.

Stores are not guaranteed to show up in the instruction stream until a program calls an operating system facility to ensure coherency.
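For example, code that generates machine instructions at run time typically copies the bytes through the data side and then asks for the instruction side to be made coherent before jumping to them. The sketch below uses the __builtin___clear_cache builtin provided by GCC and Clang, which on architectures without coherent instruction fetch expands to the appropriate cache-maintenance instructions or operating-system call; the install_code function and its arguments are illustrative.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy freshly generated machine code into 'buf' (assumed to be writable
 * and executable) and make the instruction fetch path see the new bytes. */
void install_code(void *buf, const uint8_t *code, size_t len)
{
    memcpy(buf, code, len);                       /* stores go through the data cache */
    __builtin___clear_cache((char *)buf,
                            (char *)buf + len);   /* GCC/Clang builtin: synchronize the
                                                     range so stale instructions are not
                                                     executed from the instruction cache */
}
```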

In computer engineering, a tag RAM is used to specify which of the possible memory locations is currently stored in a CPU cache.

A great deal of design effort, and often considerable power and silicon area, is expended to make the caches as fast as possible.

Some SPARC designs have improved the speed of their L1 caches by a few gate delays by collapsing the virtual address adder into the SRAM decoders.

As the x86 microprocessors reached clock rates of 20 MHz and above in the 386, small amounts of fast cache memory began to be featured in systems to improve performance.

The cache was constructed from more expensive, but significantly faster, SRAM memory cells, which at the time had latencies around 10–25 ns.[67][68]

The popularity of on-motherboard cache continued through the Pentium MMX era but was made obsolete by the introduction of SDRAM and the growing disparity between bus clock rates and CPU clock rates, which caused on-motherboard cache to be only slightly faster than main memory.

The Crystalwell[38] variants of Intel's Haswell CPUs, equipped with the GT3e version of Intel's integrated Iris Pro graphics, effectively feature 128 MiB of embedded DRAM (eDRAM) on the same package.

Motherboard of a NeXTcube computer (1990). At the lower edge of the image, left of center, is the Motorola 68040 CPU running at 25 MHz, with two separate 4 KiB level 1 caches on the chip, one for instructions and one for data. The board has no external L2 cache.
An illustration of different ways in which memory locations can be cached by particular cache locations
Memory hierarchy of an AMD Bulldozer server
Cache hierarchy of the K8 core in the AMD Athlon 64 CPU
Read path for a 2-way associative cache
Example of a motherboard with an i386 microprocessor (33 MHz), 64 KiB cache (25 ns; 8 chips in the bottom left corner), 2 MiB DRAM (70 ns; 8 SIMMs to the right of the cache), and a cache controller (Austek A38202; to the right of the processor)