Direct memory access

Without DMA, when the CPU is using programmed input/output, it is typically fully occupied for the entire duration of the read or write operation, and is thus unavailable to perform other work.
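For illustration, the following C sketch shows what a programmed-I/O read loop looks like; the device, its register addresses, and the status bit are hypothetical, invented only to make the point. The CPU itself polls and copies every word, so it can do nothing else for the duration of the transfer.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical memory-mapped device registers (illustrative addresses). */
    #define DEV_STATUS  ((volatile uint32_t *)0x40000000)  /* bit 0: data ready */
    #define DEV_DATA    ((volatile uint32_t *)0x40000004)  /* one word per read */

    /* Programmed I/O: the CPU itself polls the device and moves every word. */
    static void pio_read(uint32_t *dst, size_t nwords)
    {
        for (size_t i = 0; i < nwords; i++) {
            while ((*DEV_STATUS & 1u) == 0)
                ;                      /* busy-wait until the device has data */
            dst[i] = *DEV_DATA;        /* the CPU copies the word itself */
        }
    }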

To carry out an input, output or memory-to-memory operation, the host processor initializes the DMA controller with a count of the number of words to transfer, and the memory address to use.
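As an illustration, a host driver might program such a controller through memory-mapped registers along the following lines; the register layout and names here are hypothetical, chosen only to show the count/address/start sequence just described.

    #include <stdint.h>

    /* Hypothetical DMA controller registers (illustrative layout). */
    #define DMA_ADDR   ((volatile uint32_t *)0x40001000)  /* memory address to use */
    #define DMA_COUNT  ((volatile uint32_t *)0x40001004)  /* number of words */
    #define DMA_CTRL   ((volatile uint32_t *)0x40001008)  /* bit 0: start */
    #define DMA_STAT   ((volatile uint32_t *)0x4000100C)  /* bit 0: done */

    /* Initialize the controller with an address and word count, then start it.
     * The CPU is free to do other work until the controller signals
     * completion, typically via an interrupt (polled here for brevity). */
    static void dma_start_read(uint32_t phys_addr, uint32_t nwords)
    {
        *DMA_ADDR  = phys_addr;   /* where in memory to put the data */
        *DMA_COUNT = nwords;      /* how many words to transfer */
        *DMA_CTRL  = 0x1;         /* start the device-to-memory transfer */
    }

    static int dma_done(void)
    {
        return *DMA_STAT & 1u;    /* completion flag */
    }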

Some examples of buses using third-party DMA are PATA, USB (before USB4), and SATA; however, their host controllers use bus mastering.

However, in cycle stealing mode, control of the system bus is returned to the CPU via the bus grant (BG) handshake after each unit of data transfer; the controller then re-requests the bus for the next unit, so the transfer proceeds one "stolen" bus cycle at a time.

In transparent mode, the DMA controller transfers data only when the CPU is performing operations that do not use the system buses.

Imagine a CPU equipped with a cache and an external memory that can be accessed directly by devices using DMA. When the CPU accesses location X in memory, the current value is stored in the cache; with a write-back cache, subsequent writes to X update only the cached copy. If the cache is not flushed to memory before the next time a device reads X via DMA, the device will receive a stale value of X.

Similarly, if the cached copy of X is not invalidated when a device writes a new value to the memory, then the CPU will operate on a stale value of X.

The latter, software-managed approach, in which the operating system explicitly flushes or invalidates the affected cache lines around each DMA transfer, introduces some overhead to the DMA operation, as most hardware requires a loop to invalidate each cache line individually.
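A sketch of that per-line loop in C is shown below; the cache-line size and the invalidate primitive (here a hypothetical invalidate_dcache_line()) vary by architecture, and on real systems this work is done by OS cache-maintenance routines rather than written by hand.

    #include <stdint.h>
    #include <stddef.h>

    #define CACHE_LINE  64  /* typical line size; architecture-dependent */

    /* Hypothetical per-line invalidate primitive; a real implementation
     * would issue the architecture's cache-maintenance instruction. */
    extern void invalidate_dcache_line(void *addr);

    /* Before the CPU reads a buffer that an incoming DMA transfer has
     * written, every cache line covering the buffer must be invalidated
     * individually. */
    static void invalidate_dcache_range(void *buf, size_t len)
    {
        uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1);
        uintptr_t end = (uintptr_t)buf + len;

        for (; p < end; p += CACHE_LINE)
            invalidate_dcache_line((void *)p);
    }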

These limitations prevented the controller from being used as a general-purpose "Blitter", and consequently block memory moves in the PC, limited by the general PIO speed of the CPU, were very slow.

With the IBM PC/AT, the enhanced AT bus (more familiarly retronymed the Industry Standard Architecture, or ISA) added a second 8237 DMA controller, providing three additional channels (5–7; channel 4 is used as a cascade to the first 8237). As resource clashes arising from the XT's additional expandability over the original PC had highlighted, these extra channels were much needed.

For compatibility, the lower four DMA channels were still limited to 8-bit transfers only. Memory-to-memory transfers were now technically possible, since channel 0 was freed from having to handle DRAM refresh, but in practice they were of limited value because of the controller's low throughput compared to what the CPU could now achieve: a 16-bit, more optimised 80286 running at a minimum of 6 MHz, versus an 8-bit controller locked at 4.77 MHz.

Due to their lagging performance (a maximum 8-bit transfer capability of 1.6 MB/s at 5 MHz,[4] but no more than 0.9 MB/s in the PC/XT and 1.6 MB/s for 16-bit transfers in the AT, owing to ISA bus overheads and other interference such as memory refresh interruptions[1]), and the unavailability of any speed grades that would allow installation of direct replacements operating faster than the original PC's standard 4.77 MHz clock, these devices have been effectively obsolete since the late 1980s.

In particular, the advent of the 80386 processor in 1985, with its capacity for 32-bit transfers, made the original DMA controllers seem more of a performance millstone than a booster, as did further evolutions of the ISA bus (EISA) and replacements for it (MCA, VLB and PCI) with their own much higher-performance DMA subsystems (up to a maximum of 33 MB/s for EISA, 40 MB/s for MCA, and typically 133 MB/s for VLB/PCI). Indeed, great improvements in the efficiency of address calculation and block memory moves in Intel CPUs after the 80186 meant that PIO transfers even by the 16-bit-bus 286 and 386SX could still easily outstrip the 8237.

The pieces of legacy hardware that continued to use ISA DMA after 32-bit expansion buses became common were Sound Blaster cards, which needed to maintain full hardware compatibility with the Sound Blaster standard, and Super I/O devices on motherboards, which often integrated a built-in floppy disk controller, an IrDA infrared controller when FIR (fast infrared) mode is selected, and an IEEE 1284 parallel port controller when ECP mode is selected.

In cases where original 8237s or direct compatibles were still used, transfers to or from these devices may still be limited to the first 16 MB of main RAM regardless of the system's actual address space or amount of installed memory.

As an example of a DMA engine incorporated in a general-purpose CPU, some Intel Xeon chipsets include a DMA engine called I/O Acceleration Technology (I/OAT), which can offload memory copying from the main CPU, freeing it to do other work.
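Under Linux, engines of this kind (I/OAT among them) are exposed to kernel code through the dmaengine framework; the following is a minimal sketch of offloading one memory copy to any memcpy-capable channel under that framework, with error handling trimmed and completion polled only for brevity.

    #include <linux/dmaengine.h>
    #include <linux/errno.h>

    /* Sketch: offload one memory copy to any memcpy-capable DMA channel.
     * dst and src are DMA (bus) addresses already mapped for the device. */
    static int offload_memcpy(dma_addr_t dst, dma_addr_t src, size_t len)
    {
        dma_cap_mask_t mask;
        struct dma_chan *chan;
        struct dma_async_tx_descriptor *tx;
        dma_cookie_t cookie;

        dma_cap_zero(mask);
        dma_cap_set(DMA_MEMCPY, mask);
        chan = dma_request_channel(mask, NULL, NULL);  /* any capable engine */
        if (!chan)
            return -ENODEV;

        tx = dmaengine_prep_dma_memcpy(chan, dst, src, len, DMA_PREP_INTERRUPT);
        if (!tx) {
            dma_release_channel(chan);
            return -ENOMEM;
        }

        cookie = dmaengine_submit(tx);
        dma_async_issue_pending(chan);                 /* start the transfer */

        /* The CPU is free to do other work here; poll only for brevity. */
        while (dma_async_is_tx_complete(chan, cookie, NULL, NULL) != DMA_COMPLETE)
            cpu_relax();

        dma_release_channel(chan);
        return 0;
    }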

In 2006, Intel's Linux kernel developer Andrew Grover performed benchmarks using I/OAT to offload network traffic copies and found no more than 10% improvement in CPU utilization with receiving workloads.

As a result, DDIO reduces the overall I/O processing latency, allows processing of the I/O to be performed entirely in-cache, prevents the available RAM bandwidth/latency from becoming a performance bottleneck, and may lower power consumption by allowing RAM to remain longer in a low-power state.

A master interface can be used by the device to perform DMA transactions to/from system memory without heavily loading the CPU.

Internally, a multichannel DMA engine is usually present in the device to perform multiple concurrent scatter-gather operations as programmed by the software.
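A common shape for such programming is a chain of in-memory descriptors that the engine walks on its own. The descriptor layout below is hypothetical, but it illustrates how software describes a scatter-gather operation once and then lets a channel process it without further CPU involvement.

    #include <stdint.h>

    /* Hypothetical scatter-gather descriptor: one contiguous segment per
     * entry, linked into a chain that a DMA channel walks autonomously. */
    struct sg_desc {
        uint64_t src_addr;   /* bus address of the source segment */
        uint64_t dst_addr;   /* bus address of the destination segment */
        uint32_t length;     /* segment length in bytes */
        uint32_t flags;      /* e.g. bit 0: last descriptor, IRQ on done */
        uint64_t next;       /* bus address of the next descriptor, 0 if last */
    };

    /* Link an array of descriptors into a chain and mark the final entry.
     * The channel is then pointed at the first descriptor and started; it
     * follows the next pointers itself, transferring each segment in turn. */
    static void sg_chain_init(struct sg_desc *descs,
                              const uint64_t *desc_bus_addrs, int n)
    {
        for (int i = 0; i < n; i++) {
            descs[i].flags = 0;
            descs[i].next  = (i + 1 < n) ? desc_bus_addrs[i + 1] : 0;
        }
        descs[n - 1].flags |= 0x1;  /* last descriptor: raise completion IRQ */
    }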

Thus DMA acts as a primary means of data transfer among cores inside the Cell processor (in contrast to cache-coherent CMP architectures such as Intel's cancelled general-purpose GPU, Larrabee).

According to one experiment, the effective peak performance of DMA in Cell (at 3 GHz, under uniform traffic) reaches 200 GB per second.

Motherboard of a NeXTcube computer (1990). The two large integrated circuits below the middle of the image are the DMA controller (l.) and, unusually, an extra dedicated DMA controller (r.) for the magneto-optical disc used instead of a hard disk drive in the first series of this computer model.