Index of coincidence

What makes the IC especially useful is the fact that its value does not change if both texts are scrambled by the same single-alphabet substitution cipher, allowing a cryptanalyst to quickly detect that form of encryption.

We can express the index of coincidence IC for a given letter-frequency distribution as a summation:

$$\mathrm{IC} = \frac{\sum_{i=1}^{c} n_i (n_i - 1)}{N (N - 1) / c}$$

where N is the length of the text and n_1 through n_c are the frequencies (as integers) of the c letters of the alphabet (c = 26 for monocase English).

The sum of the n_i is necessarily N. The products n(n − 1) count the number of combinations of n elements taken two at a time.

(Actually this counts each pair twice; the extra factors of 2 occur in both numerator and denominator of the formula and thus cancel out.)
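The summation can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `index_of_coincidence` and the decision to ignore everything except the letters A–Z are assumptions made here.

```python
from collections import Counter


def index_of_coincidence(text: str, alphabet_size: int = 26) -> float:
    """Normalized I.C.: c * sum(n_i * (n_i - 1)) / (N * (N - 1))."""
    # Assumption: fold to upper case and keep only the letters A-Z.
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    n_total = len(letters)
    if n_total < 2:
        raise ValueError("need at least two letters to form a pair")
    counts = Counter(letters)
    # Each n * (n - 1) counts every matching pair twice; N * (N - 1)
    # does the same for all pairs, so the factors of 2 cancel.
    matching_pairs = sum(n * (n - 1) for n in counts.values())
    return alphabet_size * matching_pairs / (n_total * (n_total - 1))
```

With this normalization, a text consisting of a single repeated letter scores c (26 for English), and a text in which no letter repeats scores 0.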

The actual monographic IC for telegraphic English text is around 1.73, reflecting the unevenness of natural-language letter distributions.

Sometimes values are reported without the normalizing denominator, for example 0.067 = 1.73/26 for English; such values may be called κp ("kappa-plaintext") rather than IC, with κr ("kappa-random") used to denote the denominator 1/c (which is the expected coincidence rate for a uniform distribution of the same alphabet, 0.0385=1/26 for English).
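The relationship between the two conventions is a single division, which the following sketch checks against the figures quoted above. The helper names `kappa_random` and `ic_to_kappa` are illustrative choices, not established terminology.

```python
def kappa_random(alphabet_size: int = 26) -> float:
    """Expected coincidence rate for a uniform distribution: 1/c."""
    return 1.0 / alphabet_size


def ic_to_kappa(ic: float, alphabet_size: int = 26) -> float:
    """Convert a normalized I.C. into an unnormalized coincidence rate."""
    return ic * kappa_random(alphabet_size)


# Figures quoted in the text for English (c = 26).
print(round(ic_to_kappa(1.73), 3))   # kappa-plaintext, about 0.067
print(round(kappa_random(26), 4))    # kappa-random, about 0.0385
```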

For a repeating-key polyalphabetic cipher arranged into a matrix, the coincidence rate within each column will usually be highest when the width of the matrix is a multiple of the key length, and this fact can be used to determine the key length, which is the first step in cracking the system.

In effect, the new alphabet produced by the substitution is just a uniform renaming of the original character identities, which does not affect whether they match.
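A short sketch makes the renaming argument concrete: count the positions at which two texts agree, apply the same substitution alphabet to both, and count again. The sample messages and the keyboard-order substitution alphabet are hypothetical values chosen only for illustration.

```python
import string


def coincidences(text_a: str, text_b: str) -> int:
    """Number of positions at which the two texts hold the same letter."""
    return sum(1 for a, b in zip(text_a, text_b) if a == b)


def substitute(text: str, cipher_alphabet: str) -> str:
    """Apply a single-alphabet substitution to upper-case text."""
    table = str.maketrans(string.ascii_uppercase, cipher_alphabet)
    return text.translate(table)


# Hypothetical substitution alphabet (a permutation of A-Z).
CIPHER_ALPHABET = "QWERTYUIOPASDFGHJKLZXCVBNM"

first = "THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG"
second = "THELAZYDOGSLEEPSUNDERTHEQUICKBROWNFOX"

before = coincidences(first, second)
after = coincidences(
    substitute(first, CIPHER_ALPHABET),
    substitute(second, CIPHER_ALPHABET),
)
print(before == after)  # the substitution is a bijection, so counts agree
```

Because the substitution maps distinct letters to distinct letters, two positions match after encipherment exactly when they matched before.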

The same principle applies to real languages like English, because certain letters, like E, occur much more frequently than other letters—a fact which is used in frequency analysis of substitution ciphers.
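The effect of uneven frequencies can be illustrated numerically. If letters are drawn independently with probabilities p_i, two letters coincide with probability equal to the sum of the p_i squared, which equals 1/c for a uniform distribution and is larger for any other. The sketch below assumes that independence model, and the skewed three-letter distribution is a made-up example, not data for any real language.

```python
def expected_coincidence(probabilities: list[float]) -> float:
    """Chance that two independently drawn letters are identical."""
    return sum(p * p for p in probabilities)


uniform = [1 / 26] * 26
# Hypothetical skewed distribution over a three-letter alphabet.
skewed = [0.5, 0.25, 0.25]

print(expected_coincidence(uniform))  # equals 1/26
print(expected_coincidence(skewed))   # larger than 1/3, the uniform rate
```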

Nevertheless, this technique can be used effectively to identify when two texts are likely to contain meaningful information in the same language using the same alphabet, to discover periods for repeating keys, and to uncover many other kinds of nonrandom phenomena within or among ciphertexts.

Expected values have been tabulated for various languages.[6]

The above description is only an introduction to the use of the index of coincidence, which is related to the general concept of correlation.

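The number of cipher alphabets used in a polyalphabetic cipher may be estimated by dividing the expected bulge of the delta I.C. for a single alphabet by the observed bulge for the message, although in many cases (such as when a repeating key was used) better techniques are available.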
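As a rough sketch of this estimate, take "bulge" to mean the excess of the normalized I.C. over 1.0, the value for a uniform distribution. That reading is an interpretation adopted here, and the function name and the default of 1.73 for single-alphabet English are likewise assumptions.

```python
def estimate_alphabet_count(observed_ic: float,
                            single_alphabet_ic: float = 1.73) -> float:
    """Expected bulge for one alphabet divided by the observed bulge."""
    observed_bulge = observed_ic - 1.0
    if observed_bulge <= 0:
        raise ValueError("observed I.C. shows no bulge above 1.0")
    expected_bulge = single_alphabet_ic - 1.0
    return expected_bulge / observed_bulge
```

An observed I.C. equal to the single-alphabet value gives an estimate of one alphabet, and an observed bulge half as large gives an estimate of two. The result is only a coarse indication, particularly for short messages.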

As a practical illustration of the use of I.C., suppose that we have intercepted a ciphertext message transmitted in groups of five characters. (The grouping into five characters is just a telegraphic convention and has nothing to do with actual word lengths.)

Suspecting this to be an English plaintext encrypted using a Vigenère cipher with normal A–Z components and a short repeating keyword, we can consider the ciphertext "stacked" into some number of columns, for example seven. If the key size happens to have been the same as the assumed number of columns, then all the letters within a single column will have been enciphered using the same key letter. Each column is then, in effect, a simple Caesar cipher applied to a random selection of English plaintext characters.

On the other hand, if we have incorrectly guessed the key size (number of columns), the aggregate delta I.C. should be close to 1.00, the value expected for a uniform letter distribution.

If the actual size is five, we would expect a width of ten to also report a high I.C., since each of its columns also corresponds to a simple Caesar encipherment, and we confirm this.
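The scan over candidate widths can be sketched as follows. The function names are illustrative, and aggregating the columns by a simple mean of their individual I.C. values is an assumption; other weightings are possible.

```python
from collections import Counter


def delta_ic(letters, alphabet_size=26):
    """Normalized I.C. of one sequence of letters (0.0 if too short)."""
    n_total = len(letters)
    if n_total < 2:
        return 0.0
    counts = Counter(letters)
    pairs = sum(n * (n - 1) for n in counts.values())
    return alphabet_size * pairs / (n_total * (n_total - 1))


def scan_widths(ciphertext, max_width=10):
    """Map each candidate width to the mean I.C. of its columns."""
    # Assumption: only the letters A-Z carry information; spaces used
    # for telegraphic grouping are dropped.
    letters = [ch for ch in ciphertext.upper() if "A" <= ch <= "Z"]
    scores = {}
    for width in range(1, max_width + 1):
        # Column i holds every width-th letter starting at position i.
        columns = [letters[i::width] for i in range(width)]
        scores[width] = sum(delta_ic(col) for col in columns) / width
    return scores
```

Calling `scan_widths(ciphertext)` returns a dictionary of scores for widths one to ten. Widths equal to the key length or a multiple of it would be expected to stand out, while the other widths stay nearer to 1.0.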

"XX" are evidently "null" characters used to pad out the final group for transmission.

This entire procedure could easily be packaged into an automated algorithm for breaking such ciphers.

Due to normal statistical fluctuation, such an algorithm will occasionally make wrong choices, especially when analyzing short ciphertext messages.
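One possible shape for such an algorithm is sketched below. It is not a definitive implementation: picking the width with the highest mean column I.C., solving each column by a chi-squared comparison, and the approximate English letter percentages are all choices made here rather than details given above.

```python
from collections import Counter
import string

ALPHABET = string.ascii_uppercase

# Approximate English letter frequencies in percent, A to Z (commonly
# quoted figures; an assumption, not taken from the text above).
ENGLISH_PERCENT = [
    8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966,
    0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987,
    6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074,
]


def delta_ic(letters):
    """Normalized I.C. of a list of letters (0.0 if too short)."""
    n_total = len(letters)
    if n_total < 2:
        return 0.0
    counts = Counter(letters)
    pairs = sum(n * (n - 1) for n in counts.values())
    return 26 * pairs / (n_total * (n_total - 1))


def guess_key_length(letters, max_width=10):
    """Width with the highest mean column I.C.; ties go to the smallest."""
    def score(width):
        return sum(delta_ic(letters[i::width]) for i in range(width)) / width
    return max(range(1, max_width + 1), key=score)


def best_shift(column):
    """Caesar shift whose decryption best fits ENGLISH_PERCENT."""
    n_total = len(column)
    counts = Counter(column)

    def chi_squared(shift):
        total = 0.0
        for i, percent in enumerate(ENGLISH_PERCENT):
            expected = n_total * percent / 100
            # Plaintext letter i appears as letter (i + shift) mod 26.
            observed = counts.get(ALPHABET[(i + shift) % 26], 0)
            total += (observed - expected) ** 2 / expected
        return total

    return min(range(26), key=chi_squared)


def break_vigenere(ciphertext, max_width=10):
    """Return a (key, plaintext) guess for a Vigenere ciphertext."""
    letters = [ch for ch in ciphertext.upper() if ch in ALPHABET]
    if len(letters) < max_width:
        raise ValueError("ciphertext too short for the requested widths")
    width = guess_key_length(letters, max_width)
    shifts = [best_shift(letters[i::width]) for i in range(width)]
    key = "".join(ALPHABET[s] for s in shifts)
    plaintext = "".join(
        ALPHABET[(ALPHABET.index(ch) - shifts[i % width]) % 26]
        for i, ch in enumerate(letters)
    )
    return key, plaintext
```

If a multiple of the true key length happens to score highest, the recovered key is simply the keyword repeated, and the decryption is unaffected. On short messages either step can fail, which is the statistical fluctuation mentioned above.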