Minimum description length

MDL methods learn through a data compression perspective and are sometimes described as mathematical applications of Occam's razor.

The MDL principle can be extended to other forms of inductive inference and learning, for example to estimation and sequential prediction, without explicitly identifying a single model of the data.

Prior to the advent of computer programming, generating short descriptions of observed data was the intellectual labor of scientific theorists.

If two scientists had a theoretic disagreement, they rarely could formally apply Occam's razor to choose between their theories.

Nevertheless, science advanced, with Occam's razor serving as an informal guide in deciding which model was best.

With the advent of formal languages and computer programming, Occam's razor was mathematically defined.

Occam's razor could then formally select the shortest program that reproduces the data, measured in bits of algorithmic information, as the best model.

To avoid confusion, note that there is nothing in the MDL principle that implies a machine produced the program embodying the model.

The MDL principle applies regardless of whether the description to be run on a computer is the product of humans, machines or any combination thereof.

The MDL principle requires only that the shortest description, when executed, produce the original data set without error.

In statistical MDL learning, such a description is frequently called a two-part code.

Learning occurs when an algorithm generates a shorter description of the same data set.

The theoretic minimum description length of a data set, called its Kolmogorov complexity, cannot, however, be computed.

Nevertheless, given two programs that output the dataset, the MDL principle selects the shorter of the two as embodying the best model.
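As a loose, non-essential illustration of this selection rule, the following Python sketch compares two lossless descriptions of the same data set, using an off-the-shelf compressor (zlib) as a stand-in for a model; a fully rigorous account would also charge for the bits needed to describe the decompressor itself.

```python
import zlib

# A highly regular data set, written out as raw bytes.
data = bytes(i % 7 for i in range(10_000))

# Two lossless descriptions of the same data: the raw bytes themselves,
# and the output of a general-purpose compressor standing in for a model.
raw_length = len(data)
compressed_length = len(zlib.compress(data, 9))

# The MDL principle selects whichever description is shorter; finding a
# shorter one means some regularity in the data has been captured.
print(raw_length, compressed_length, min(raw_length, compressed_length))
```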

Shortly before his death, Marvin Minsky came out strongly in favor of this line of research, saying:[4] "It seems to me that the most important discovery since Gödel was the discovery by Chaitin, Solomonoff and Kolmogorov of the concept called Algorithmic Probability which is a fundamental new theory of how to make predictions given a collection of experiences and this is a beautiful theory, everybody should learn it, but it’s got one problem, that is, that you cannot actually calculate what this theory predicts because it is too hard, it requires an infinite amount of work. Everybody should learn all about that and spend the rest of their lives working on it."

Any set of data can be represented by a string of symbols from a finite (say, binary) alphabet.

Over the past 40 years MDL has developed into a rich theory of statistical and machine learning procedures, with connections to Bayesian model selection and averaging, penalization methods such as the lasso and ridge regression, and so on; Grünwald and Roos (2020)[6] give an introduction covering the modern developments.

Usually, though, standard statistical methods assume that the general form of a model is fixed.

MDL's main strength is that it can also be used to compare model classes of different forms. The basic idea is then to consider the (lossless) two-stage code that encodes the data D by first encoding a hypothesis H from the set of considered hypotheses and then encoding D "with the help of" H; in the simplest context this just means "encoding the deviations of the data from the predictions made by" H. The hypothesis minimizing the total codelength L(H) + L(D | H) is then selected.
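As a concrete, purely illustrative sketch of such a two-stage code, take binary data and the class of Bernoulli (biased-coin) hypotheses: L(H) is the cost of writing down a candidate bias on a finite grid, and L(D | H) is the ideal codelength of the data under that bias. The grid size and function name below are arbitrary choices, and a more careful scheme would also encode the precision itself.

```python
import math

def two_part_codelength(n, k, precision_bits=8):
    # L(H) + L(D | H): spend precision_bits bits naming a candidate bias p
    # on a uniform grid, then encode the n-flip sequence with k heads using
    # the code that is optimal when the bias really is p.
    grid = 2 ** precision_bits
    best = float("inf")
    for i in range(1, grid):                      # skip p = 0 and p = 1
        p = i / grid
        l_data = -(k * math.log2(p) + (n - k) * math.log2(1 - p))
        best = min(best, precision_bits + l_data)
    return best

# Compare with the parameter-free fair-coin code, which always costs n bits.
print(two_part_codelength(1000, 510), 1000)
```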

There are various types of universal codes one could use, often giving similar lengths for long data sequences but differing for short ones.

The 'best' ones (in the sense that they have a minimax optimality property) are the normalized maximum likelihood (NML) or Shtarkov codes.

For exponential families of distributions, the Bayesian marginal-likelihood codes based on Jeffreys prior, with the parameter space suitably restricted, asymptotically coincide with the NML codes; this brings MDL theory into close contact with objective Bayes model selection, in which Jeffreys' prior is also sometimes adopted, albeit for different reasons.
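A brief sketch of the standard asymptotics behind this coincidence (not spelled out in this article): for a k-parameter exponential family with Fisher information matrix I(θ) on parameter space Θ, the parametric complexity of the NML code satisfies

\[
\log \mathrm{COMP}_n \;=\; \frac{k}{2}\,\log\frac{n}{2\pi} \;+\; \log\int_{\Theta}\sqrt{\det I(\theta)}\,d\theta \;+\; o(1),
\]

and the Bayesian marginal-likelihood code with Jeffreys prior attains the same expansion, the normalizing constant of Jeffreys' prior being exactly the integral of sqrt(det I(θ)) over Θ.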

In the standard illustration, a sequence of coin flips with a small excess of heads is encoded either with the fixed fair-coin code (one bit per flip) or with a universal code for the class of biased-coin models, and the universal code for the biased class turns out to be slightly longer. Therefore, the conclusion when following an MDL approach is inevitably that there is not enough evidence to support the hypothesis of the biased coin, even though the best element of the second model class provides a better fit to the data.
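A runnable sketch of this computation, assuming for concreteness 1,000 flips with 510 heads (the data counts and function name are illustrative assumptions, not taken from the text above):

```python
import math

def nml_codelength_bernoulli(n, k):
    # Codelength, in bits, of a sequence with k heads out of n flips under
    # the normalized maximum likelihood (Shtarkov) code for the class of
    # all biased-coin (Bernoulli) models.
    def log2_max_lik(j):
        # log2 of the maximized likelihood (j/n)**j * ((n-j)/n)**(n-j)
        out = 0.0
        if j > 0:
            out += j * math.log2(j / n)
        if j < n:
            out += (n - j) * math.log2((n - j) / n)
        return out

    # Parametric complexity: log2 of the sum, over all 2**n sequences
    # (grouped by head count j), of their maximized likelihoods.
    terms = [math.log2(math.comb(n, j)) + log2_max_lik(j) for j in range(n + 1)]
    m = max(terms)
    log2_comp = m + math.log2(sum(2.0 ** (t - m) for t in terms))
    return -log2_max_lik(k) + log2_comp

# The fair-coin code always costs exactly n bits; for 510 heads out of 1,000
# flips the NML codelength of the biased-coin class comes out slightly larger,
# which is why MDL retains the fair-coin hypothesis here.
print(1000, nml_codelength_bernoulli(1000, 510))
```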

Central to MDL theory is the one-to-one correspondence between code length functions and probability distributions (this follows from the Kraft–McMillan inequality).
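To make the correspondence concrete (the distribution below is an arbitrary illustrative example, not drawn from the article):

```python
import math

# A probability distribution over four symbols and the idealized
# codelengths L(x) = -log2 P(x) that correspond to it.
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
L = {x: -math.log2(p) for x, p in P.items()}       # 1, 2, 3, 3 bits

# Kraft–McMillan: these lengths are achievable by a prefix code precisely
# because the sum over x of 2**(-L(x)) is at most 1; conversely, the lengths
# of any uniquely decodable code define probabilities via P(x) = 2**(-L(x)).
print(L, sum(2 ** -l for l in L.values()))         # Kraft sum: 1.0
```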

An example is the Shtarkov normalized maximum likelihood code, which plays a central role in current MDL theory, but has no equivalent in Bayesian inference.

Rissanen[10] bases the mathematical underpinning of MDL on the Kolmogorov structure function.

According to the MDL philosophy, Bayesian methods should be dismissed if they are based on unsafe priors that would lead to poor results.

The priors that are acceptable from an MDL point of view also tend to be favored in so-called objective Bayesian analysis; there, however, the motivation is usually different.[11]

Rissanen's was not the first information-theoretic approach to learning; as early as 1968 Wallace and Boulton pioneered a related concept called minimum message length (MML).