No free lunch theorem

In mathematical folklore, the "no free lunch" (NFL) theorem (sometimes pluralized) of David Wolpert and William Macready alludes to the saying "no such thing as a free lunch", that is, there are no easy shortcuts to success.[1]

Wolpert had previously derived no free lunch theorems for machine learning (statistical inference).[2]

In 2005, Wolpert and Macready themselves indicated that the first theorem in their paper "state[s] that any two optimization algorithms are equivalent when their performance is averaged across all possible problems".[3]

[3] The "no free lunch" (NFL) theorem is an easily stated and easily understood consequence of theorems Wolpert and Macready actually prove.

Various investigators have extended the work of Wolpert and Macready substantively.

In terms of how the NFL theorem is used within its research area, "no free lunch" in search and optimization is a field dedicated to the mathematical analysis of data for statistical identity, particularly in search[4] and optimization.

If all histories are equally likely, then any prediction strategy will score the same, with an accuracy rate of 0.5.
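
To make this concrete, here is a minimal Python sketch of such an averaging argument. It assumes a two-day toy universe in which each day shows either a square or a triangle, and a prediction strategy that guesses the day-2 object from the day-1 observation; these specifics are illustrative assumptions, not details taken from the cited sources.

```python
from itertools import product

SYMBOLS = ("square", "triangle")

# All four equally likely two-day histories: (day-1 object, day-2 object).
histories = list(product(SYMBOLS, repeat=2))

# A prediction strategy maps the day-1 observation to a guess for day 2.
# Enumerate every deterministic strategy (there are 2**2 = 4 of them).
strategies = [dict(zip(SYMBOLS, guesses)) for guesses in product(SYMBOLS, repeat=2)]

for strategy in strategies:
    hits = sum(strategy[day1] == day2 for day1, day2 in histories)
    print(strategy, "accuracy:", hits / len(histories))  # always 0.5
```

Averaged over the equally likely histories, no strategy, including "always repeat day 1" and "always flip", does better than 0.5.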

In their paper, they state: "We have dubbed the associated results NFL theorems because they demonstrate that if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems."

Here, $P(d_m^y \mid f, m, a)$ is the conditional probability of obtaining a given sequence $d_m^y$ of $m$ cost values from algorithm $a$ iterated $m$ times on function $f$. Blind search means that at each step of the algorithm, the next element of the search space is chosen at random, with uniform probability distribution, from the elements that have not been visited previously.

In essence, this says that when all functions f are equally likely, the probability of observing an arbitrary sequence of m values in the course of optimization does not depend upon the algorithm.

In the analytic framework of Wolpert and Macready, performance is a function of the sequence of observed values (and not e.g. of wall-clock time), so it follows easily that all algorithms have identically distributed performance when objective functions are drawn uniformly at random, and also that all algorithms have identical mean performance.
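
As a hedged illustration of this claim (a toy construction, not the authors' own), the Python sketch below enumerates every objective function on a four-point search space with binary costs and compares a fixed-order search with an adaptive search: the observed value sequences occur with identical frequencies, so mean performance is identical as well.

```python
from itertools import product
from collections import Counter

X = range(4)   # search space {0, 1, 2, 3}
Y = (0, 1)     # possible cost values (lower is better)

def fixed_order_search(f):
    """Evaluate x = 0 and then x = 1, regardless of the values observed."""
    return (f[0], f[1])

def adaptive_search(f):
    """Evaluate x = 0 first; choose the second point based on the observed value."""
    first = f[0]
    return (first, f[3] if first == 0 else f[2])

def mean_best(dist):
    """Mean of the best (lowest) cost in each observed sequence, over all functions."""
    return sum(min(seq) * count for seq, count in dist.items()) / sum(dist.values())

dist_fixed, dist_adaptive = Counter(), Counter()
for values in product(Y, repeat=len(X)):      # all 2**4 = 16 objective functions
    f = dict(zip(X, values))
    dist_fixed[fixed_order_search(f)] += 1
    dist_adaptive[adaptive_search(f)] += 1

print(dist_fixed == dist_adaptive)                      # True: identical sequence distributions
print(mean_best(dist_fixed), mean_best(dist_adaptive))  # 0.25 0.25: identical mean performance
```

Because performance is computed from the sequence of observed values alone, any performance measure applied to these identically distributed sequences yields the same distribution for both algorithms.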

Theorem 2 establishes a similar, but "more subtle", NFL result for time-varying objective functions.

Rather, uniform randomness was used as a tool to compare the number of environments for which algorithm A outperforms algorithm B to the number of environments for which B outperforms A. NFL tells us that (appropriately weighted)[clarification needed] there are just as many environments in both of those sets.

In particular, there are just as many prior distributions (appropriately weighted) in which learning algorithm A beats B (on average) as vice versa.[citation needed]

This statement about sets of priors is what is most important about NFL, not the fact that any two algorithms perform equally for the single, specific prior distribution that assigns equal probability to all environments.
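
The counting claim can likewise be checked by brute force on a toy instance. The setup below is assumed for illustration: a three-point search space, three cost values, two evaluations per algorithm, and two arbitrarily chosen visit orders. The number of objective functions on which A finds a strictly better value than B equals the number on which B beats A.

```python
from itertools import product

X = (0, 1, 2)        # search space
Y = (0, 1, 2)        # possible cost values (lower is better)

def best_seen(f, visit_order):
    """Best (lowest) cost found after evaluating the first two points in visit_order."""
    return min(f[x] for x in visit_order[:2])

wins_a = wins_b = ties = 0
for values in product(Y, repeat=len(X)):      # all 3**3 = 27 objective functions
    f = dict(zip(X, values))
    a = best_seen(f, (0, 1))                  # algorithm A visits 0, then 1
    b = best_seen(f, (2, 1))                  # algorithm B visits 2, then 1
    if a < b:
        wins_a += 1
    elif b < a:
        wins_b += 1
    else:
        ties += 1

print(wins_a, wins_b, ties)                   # 5 5 17: A and B win on equally many functions
```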

Though NFL might seem to contradict results from other papers suggesting generalization of learning algorithms or search heuristics, it is important to understand the difference between the exact mathematical logic of NFL and its intuitive interpretation.[9]

To illustrate one of the counter-intuitive implications of NFL, suppose we fix two supervised learning algorithms, C and D. We then sample a target function f to produce a set of input-output pairs, d. The question is how we should choose whether to train C or D on d, in order to make predictions for the output associated with a point lying outside of d. It is common in almost all of science and statistics to answer this question – to choose between C and D – by running cross-validation on d with those two algorithms.

In other words, to decide whether to generalize from d with either C or D, we see which of them has better out-of-sample performance when tested within d. Since C and D are fixed, this use of cross-validation to choose between them is itself an algorithm, i.e., a way of generalizing from an arbitrary dataset.

Alternatively, we could choose between C and D based on which has worse out-of-sample performance within d. Again, since C and D are fixed, this use of anti-cross-validation is itself an algorithm.
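
A hedged sketch of this implication follows. The four-point input space, the binary labels, the two toy learners C and D, and leave-one-out as the cross-validation procedure are all assumptions made here for illustration, not details from the original paper. Averaged over all target functions, C, D, the cross-validation chooser, and the anti-cross-validation chooser all reach the same off-training-set accuracy of 0.5.

```python
from itertools import product
from statistics import mean

TRAIN_SIZE = 3        # inputs 0, 1, 2 form the training set d
TEST = 3              # the off-training-set input

def majority(labels):
    """Majority vote over 0/1 labels (ties broken toward 0)."""
    return 1 if sum(labels) * 2 > len(labels) else 0

def learner_c(train_labels):
    """C: predict the majority training label for every input."""
    return majority(train_labels)

def learner_d(train_labels):
    """D: predict the opposite of the majority training label."""
    return 1 - majority(train_labels)

def loo_score(learner, train_labels):
    """Leave-one-out cross-validation accuracy of a learner on the training labels."""
    hits = 0
    for i, held_out in enumerate(train_labels):
        rest = train_labels[:i] + train_labels[i + 1:]
        hits += int(learner(rest) == held_out)
    return hits / len(train_labels)

results = {"C": [], "D": [], "cross-validation": [], "anti-cross-validation": []}
for labels in product((0, 1), repeat=TRAIN_SIZE + 1):   # all 16 target functions f
    train_labels, truth = labels[:TRAIN_SIZE], labels[TEST]
    score_c = loo_score(learner_c, train_labels)
    score_d = loo_score(learner_d, train_labels)
    pick_cv = learner_c if score_c >= score_d else learner_d    # prefer better LOO score (ties -> C)
    pick_anti = learner_c if score_c < score_d else learner_d   # prefer worse LOO score (ties -> D)
    results["C"].append(int(learner_c(train_labels) == truth))
    results["D"].append(int(learner_d(train_labels) == truth))
    results["cross-validation"].append(int(pick_cv(train_labels) == truth))
    results["anti-cross-validation"].append(int(pick_anti(train_labels) == truth))

for name, hits in results.items():
    print(name, mean(hits))   # every rule averages 0.5 over all target functions
```

The reason is that, with target functions sampled uniformly, the off-training-set label is statistically independent of everything the selection rules can see.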

With no assumptions, no "meta-algorithm", such as the scientific method, performs better than random choice.[5][6][7]

If Occam's razor is correct, for example if sequences of lower Kolmogorov complexity are more probable than sequences of higher complexity, then (as is observed in real life) some algorithms, such as cross-validation, perform better on average on practical problems (when compared with random choice or with anti-cross-validation).[11]

However, there are major formal challenges in using arguments based on Kolmogorov complexity to establish properties of the real world, since Kolmogorov complexity is uncomputable and defined only up to an arbitrary additive constant.

Partly in recognition of these challenges, it has recently been argued that there are ways to circumvent the no free lunch theorems without invoking Turing machines, by using "meta-induction".[12][13]

Moreover, the Kolmogorov complexity of machine learning models can be upper bounded through compression of their data labeling, and it is possible to produce non-vacuous cross-domain generalization bounds via Kolmogorov complexity.