Some variants bypass higher levels of the cache hierarchy, which is useful in a 'streaming' context for data that is traversed once, rather than held in the working set.
An example is the load last instruction in the PowerPC instruction set, which hints that the data will be used only once: the cache line in question may be pushed to the head of the eviction queue, whilst remaining usable for as long as it is still directly needed.
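As an illustration of the cache-bypassing idea, the sketch below uses the x86 SSE non-temporal store intrinsics, a comparable mechanism to the PowerPC hint described above; the choice of x86 intrinsics and the stream_copy function itself are illustrative assumptions rather than anything specified in this article.

```c
/* A minimal sketch of cache-bypassing ("non-temporal") stores using the
 * x86 SSE intrinsics from <xmmintrin.h>. Shown as a comparable mechanism
 * to the PowerPC hint above, not as that instruction itself. */
#include <xmmintrin.h>
#include <stddef.h>

/* Copy a large buffer that will be traversed only once, hinting that the
 * destination lines should not displace the existing working set.
 * dst must be 16-byte aligned for _mm_stream_ps. */
void stream_copy(float *dst, const float *src, size_t n)
{
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(&src[i]);
        /* write-combining store that avoids polluting the caches */
        _mm_stream_ps(&dst[i], v);
    }
    _mm_sfence();              /* make the streamed stores globally visible */
    for (; i < n; ++i)         /* scalar tail for any leftover elements */
        dst[i] = src[i];
}
```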
In recent times, cache control instructions have become less popular as increasingly advanced application processor designs from Intel and ARM devote more transistors to accelerating code written in traditional languages, e.g., performing automatic prefetch, with hardware that detects linear access patterns on the fly.
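For comparison, the kind of explicit software prefetch that such hardware makes less necessary can be expressed with the GCC/Clang __builtin_prefetch built-in; the function name and the look-ahead distance below are arbitrary choices for illustration.

```c
/* Explicit software prefetch via the GCC/Clang __builtin_prefetch
 * built-in. For a plain linear scan like this one, a modern hardware
 * stride prefetcher would typically make the hint redundant; it is
 * shown only to illustrate the mechanism. */
#include <stddef.h>

long sum_with_prefetch(const long *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + 16 < n)
            /* args: address, rw = 0 (read), locality = 1 (low temporal
             * locality); the distance of 16 elements ahead is a guess
             * that would need tuning per machine */
            __builtin_prefetch(&a[i + 16], 0, 1);
        total += a[i];
    }
    return total;
}
```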
However, the techniques may remain valid for throughput-oriented processors, which have a different throughput-versus-latency trade-off and may prefer to devote more area to execution units.
Explicitly managed scratchpad memories allow greater control over memory traffic and locality (as the working set is staged by explicit transfers) and eliminate the need for expensive cache coherency in a manycore machine.
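A minimal sketch of this explicit-transfer style follows, assuming a hypothetical DMA interface: dma_get and dma_wait are stand-ins for a platform API (such as the Cell SDK's MFC calls), not a real library, and the chunk size and function names are invented for the example. The working set is staged through two scratchpad-resident buffers so that the transfer of the next chunk overlaps with computation on the current one.

```c
/* Double-buffered streaming through a scratchpad, in the style of the
 * Cell SPE local store. dma_get() and dma_wait() are HYPOTHETICAL
 * stand-ins for a platform DMA API; process() is the user's kernel. */
#include <stddef.h>

#define CHUNK 1024  /* elements per scratchpad buffer (illustrative) */

extern void dma_get(void *local, const void *remote,
                    size_t bytes, int tag);            /* hypothetical */
extern void dma_wait(int tag);                         /* hypothetical */
extern void process(float *chunk, size_t n);

void stream_process(const float *remote, size_t n_chunks)
{
    static float buf[2][CHUNK];     /* two scratchpad-resident buffers */

    dma_get(buf[0], remote, sizeof buf[0], 0);     /* prime buffer 0 */
    for (size_t i = 0; i < n_chunks; ++i) {
        int cur = i & 1, nxt = cur ^ 1;
        if (i + 1 < n_chunks)       /* fetch the next chunk in parallel */
            dma_get(buf[nxt], remote + (i + 1) * CHUNK,
                    sizeof buf[nxt], nxt);
        dma_wait(cur);              /* ensure the current chunk arrived */
        process(buf[cur], CHUNK);   /* compute while the next transfers */
    }
}
```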
However, it is very hard to adapt programs written in traditional languages such as C and C++, which present the programmer with a uniform view of a large address space (an illusion simulated by caches), to such explicitly managed memory.
Vector processors, for example modern graphics processing units (GPUs) and the Xeon Phi, use massive parallelism to achieve high throughput whilst working around memory latency (reducing the need for prefetching).