Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state.
It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research has shown that the majority of one-off soft errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read or write to them.
[2] Hence, the error rates increase rapidly with rising altitude; for example, compared to sea level, the rate of neutron flux is 3.5 times higher at 1.5 km and 300 times higher at 10–12 km (the cruising altitude of commercial airplanes).
As an example, the spacecraft Cassini–Huygens, launched in 1997, contained two identical flight recorders, each with 2.5 gigabits of memory in the form of arrays of commercial DRAM chips.
Recent studies[5] show that single-event upsets due to cosmic radiation have been dropping dramatically with process geometry, and previous concerns over increasing bit cell error rates are unfounded.
[6] A memory error can have no consequences if it changes a bit which neither causes observable malfunctioning nor affects data used in calculations or saved.
[9][10] An example of a single-bit error that would be ignored by a system with no error-checking, would halt a machine with parity checking or be invisibly corrected by ECC: a single bit is stuck at 1 due to a faulty chip, or becomes changed to 1 due to background or cosmic radiation; a spreadsheet storing numbers in ASCII format is loaded, and the character "8" (decimal value 56 in the ASCII encoding) is stored in the byte that contains the stuck bit at its lowest bit position; then, a change is made to the spreadsheet and it is saved.
Chipkill ECC is a more effective version that also corrects for multiple bit errors, including the loss of an entire memory chip.
[11] Later, he included parity in the CDC 7600, which caused pundits to remark that "apparently a lot of farmers buy computers".
Implicitly, it is assumed that the failure of each bit in a word of memory is independent, resulting in improbability of two simultaneous errors.
This weakness is addressed by various technologies, including IBM's Chipkill, Sun Microsystems' Extended ECC, Hewlett-Packard's Chipspare, and Intel's Single Device Data Correction (SDDC).
DRAM memory may provide increased protection against soft errors by relying on error-correcting codes.
Some systems also "scrub" the memory, by periodically reading all addresses and writing back corrected versions if necessary to remove soft errors.
[14] Modern desktop and server CPUs integrate the EDAC circuit into the CPU,[22] even before the shift toward CPU-integrated memory controllers, which are related to the NUMA architecture.
[26][27] Many CPUs use error-correction codes in the on-chip cache, including the Intel Itanium, Xeon, Core and Pentium (since P6 microarchitecture)[28][29] processors, the AMD Athlon, Opteron, all Zen-[30] and Zen+-based[31] processors (EPYC, EPYC Embedded, Ryzen and Ryzen Threadripper), and the DEC Alpha 21264.
[34] However, modern systems integrate ECC testing into the CPU, generating no additional delay to memory accesses as long as no errors are detected.
[22][35][36] This is not the case for in-band ECC, which stores tables used for protection in a reserved region of main system memory,[37][38] supported by Intel for Chromebooks, which showed little impact on web browsing and productivity tasks, but caused up to a 25% reduction in gaming and video editing benchmarks.
[39] ECC supporting memory may contribute to additional power consumption due to error-correcting circuitry.