Computation of a cyclic redundancy check is derived from the mathematics of polynomial division, modulo two.
Division of this type is efficiently realised in hardware by a modified shift register,[1] and in software by a series of equivalent algorithms, starting with simple code close to the mathematics and becoming faster (and arguably more obfuscated[2]) through byte-wise parallelism and space–time tradeoffs.
Various CRC standards extend the polynomial division algorithm by specifying an initial shift register value, a final Exclusive-Or step and, most critically, a bit ordering (endianness).
As a result, the code seen in practice deviates confusingly from "pure" division,[2] and the register may shift left or right.
As an example of implementing polynomial division in hardware, suppose that we are trying to compute an 8-bit CRC of an 8-bit message made of the ASCII character "W", which is binary 01010111₂, decimal 87₁₀, or hexadecimal 57₁₆.
This is just like decimal long division, but even simpler because the only possible multiples at each step are 0 and 1, and the subtractions borrow "from infinity" instead of reducing the upper digits.
Note that this example code avoids the need to specify a bit-ordering convention by not using bytes; the input bitString is already in the form of a bit array, and the remainderPolynomial is manipulated in terms of polynomial operations; the multiplication by x is equivalent to shifting the register by one place, as the coefficients do not change in value but only move up to the next term of the polynomial.
The second problem could be solved by doing the last n iterations differently, but there is a more subtle optimization which is used universally, in both hardware and software implementations.
Because the XOR operation used to subtract the generator polynomial from the message is commutative and associative, it does not matter in what order the various inputs are combined into the remainderPolynomial.
And specifically, a given bit of the bitString does not need to be added to the remainderPolynomial until the very last instant when it is tested to determine whether to xor with the generatorPolynomial.
This eliminates the need to preload the remainderPolynomial with the first n bits of the message as well. This is the standard bit-at-a-time hardware CRC implementation, and is well worthy of study; once you understand why it computes exactly the same result as the first version, the remaining optimizations are quite straightforward.
In software, it is convenient to note that while one may delay the xor of each bit until the very last moment, it is also possible to do it earlier.
Here, we take the input in 8-bit bytes. This is usually the most compact software implementation, used in microcontrollers when space is at a premium over speed.
This is because a CRC's ability to detect burst errors is based on proximity in the message polynomial: if adjacent polynomial terms are not transmitted sequentially, a physical error burst of one length may be seen as a longer burst due to the rearrangement of bits.
On the other hand, floppy disks and most hard drives write the most significant bit of each byte first.
The lsbit-first CRC is slightly simpler to implement in software, so is somewhat more commonly seen, but many programmers find the msbit-first bit ordering easier to follow.
In either case, be sure to transmit the bytes of the CRC in the order that matches your chosen bit-ordering convention.
Faster software implementations process more than one bit of dividend per iteration, using lookup tables indexed by the highest-order coefficients of rem to memoize the per-bit division steps.
One popular technique is to use the bit-at-a-time code 256 times to generate the CRCs of the 256 possible 8-bit bytes.
[5] An alternate source is the W3C webpage on PNG, which includes an appendix with a short and simple table-driven implementation in C of CRC-32.
We wish to compute a CRC two bytes (16 bits) at a time, but the standard table-based approach would require an inconveniently large 65536-entry table.
In the part of the basic Sarwate algorithm where the previous CRC value is shifted by the size of the table lookup, the previous CRC value is shifted away entirely (what remains is all zero), so the XOR can be eliminated from the critical path.
If coded carefully (to avoid creating a false data dependency), half of the slice table loads can begin before the previous loop iteration has completed.
The result is enough work to keep the processor's memory subsystem continuously busy, which achieves maximum performance.
This technique is normally used in high-speed hardware implementations, but is practical in software for small or sparse CRC polynomials.
Thus, a 123-bit shift register can be advanced 8 bits per iteration using only two-input XOR gates, the fastest possible.
In fact, a few protocols use the CRC as the message delimiter, a technique called CRC-based framing.
But if leading 0 bits in the message being transmitted are significant, the inability of the basic CRC algorithm to detect a change in them is undesirable.
Note that a one-pass CRC generate/check will still produce a result of zero when the message is correct, regardless of the preset value.