Normalized compression distance

Normalized compression distance (NCD) is a way of measuring the similarity between two objects, be they two documents, two letters, two emails, two music scores, two languages, two programs, two pictures, two systems, or two genomes, to name a few.

A reasonable definition for the similarity between two objects is how difficult it is to transform them into each other.

It can be used in information retrieval and data mining for cluster analysis.

For technical reasons, this transformation is formalized using the theoretical notion of Turing machines; the length of the shortest program that transforms the two objects into each other gives the information distance between them.

This information distance has been shown to be a metric (it satisfies the metric inequalities up to a logarithmic additive term) and to be universal: up to a constant additive term, it minorizes every computable distance, such as one computed from features.[1]

The information distance is absolute, but to express similarity we are more interested in relative distances.

The normalized information distance (NID), the information distance normalized by the complexity of the more complex of the two objects, has been shown to satisfy the basic requirements for a metric distance measure.
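For reference, a standard formulation of the NID (supplied here, since the formula itself does not appear in this excerpt) is:

```latex
\mathrm{NID}(x, y) \;=\; \frac{\max\{\, K(x \mid y),\; K(y \mid x) \,\}}{\max\{\, K(x),\; K(y) \,\}}
```

Here K(x) is the Kolmogorov complexity of x, and K(x|y) is the length of the shortest program that computes x given y; dividing by the larger complexity normalizes the distance to lie (essentially) between 0 and 1.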

Since the Kolmogorov complexities in the NID are uncomputable, in practice each is approximated by the length of the object compressed with a real-world compressor Z (for example "gzip", "bzip2", "PPMZ"), in order to make the NID easy to apply.[3] The result is the normalized compression distance

NCD_Z(x, y) = (Z(xy) - min{Z(x), Z(y)}) / max{Z(x), Z(y)},

where Z(x) denotes the compressed length of x under Z, and xy denotes the concatenation of x and y.
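As a minimal sketch of this approximation (using Python's built-in bz2 module as the compressor Z; the helper name `ncd` and the sample inputs are our own), the NCD of two byte strings can be computed as:

```python
import bz2
import hashlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, with bz2 standing in for the compressor Z."""
    cx = len(bz2.compress(x))
    cy = len(bz2.compress(y))
    cxy = len(bz2.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Two identical texts, and one deterministic pseudorandom (incompressible) blob.
a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox jumps over the lazy dog " * 20
c = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(32))

print(ncd(a, b))  # small: identical objects compress well together
print(ncd(a, c))  # close to 1: the objects share almost no information
```

Because real compressors are imperfect, the computed NCD can fall slightly outside [0, 1]; what matters in applications is the relative ordering of the distances.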

The normalized compression distance has been used to fully automatically reconstruct language and phylogenetic trees.[2][3] It can also be used for new applications of general clustering and classification of natural data in arbitrary domains,[3] for clustering of heterogeneous data,[3] and for anomaly detection across domains.[5]

The NID and NCD have been applied to numerous subjects, including music classification,[3] analysis of network traffic and clustering of computer worms and viruses,[6] authorship attribution,[7] gene expression dynamics,[8] predicting useful versus useless stem cells,[9] critical networks,[10] image registration,[11] and question-answer systems.[12]

Researchers from the data-mining community use NCD and its variants as "parameter-free, feature-free" data-mining tools.[5] One group has experimentally tested a closely related metric on a large variety of sequence benchmarks. Comparing their compression method with 51 major methods found in 7 major data-mining conferences over the past decade, they established the superiority of the compression method for clustering heterogeneous data and for anomaly detection, and its competitiveness in clustering data from a single domain.[15]

Related measures, such as the normalized relative compression (NRC), do not need to respect the symmetry and triangle-inequality properties of a distance.

Although the NCD and the NRC seem very similar, they address different questions.

Objects can also be given by name, like "the four-letter genome of a mouse," or "the text of 'War and Peace' by Tolstoy."

There are also objects that cannot be given literally, but only by name; these acquire their meaning from their contexts in the common background knowledge of humankind, like "home" or "red."

Using code-word lengths obtained from the page-hit counts returned by Google, and viewing Google as a compressor, the NCD formula yields a semantic distance that is useful for data mining, text comprehension, classification, and translation.

The associated NCD, called the normalized Google distance (NGD), can be rewritten as

NGD(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log N - min{log f(x), log f(y)}),

where f(x) is the number of pages containing the search term x, f(x, y) is the number of pages containing both x and y, as returned by Google, and N is the total number of pages searched by Google.
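As a small worked sketch of the NGD formula (the page-hit counts below are invented purely for illustration, and `ngd` is a helper name of our choosing):

```python
import math

def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google distance computed from page-hit counts."""
    lx, ly, lxy, ln = math.log(fx), math.log(fy), math.log(fxy), math.log(n)
    return (max(lx, ly) - lxy) / (ln - min(lx, ly))

# Invented counts: two common terms that frequently co-occur, out of n pages.
d = ngd(fx=8_000_000, fy=6_000_000, fxy=4_000_000, n=50_000_000_000)
print(round(d, 3))  # a small NGD, suggesting semantically related terms
```

Terms that always occur together have NGD 0, while terms that never co-occur have a large NGD; as with the NCD, real-world counts can push the value slightly outside [0, 1].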