The Akaike information criterion (AIC) is an estimator of prediction error and thereby relative quality of statistical models for a given set of data.
AIC is defined as AIC = 2k − 2 ln(L̂), where k is the number of estimated parameters in the model and L̂ is the maximized value of the model's likelihood function. The penalty term 2k discourages overfitting, which is desirable because increasing the number of parameters almost always improves the goodness of the fit.
Suppose that the data is generated by some unknown process f. We consider two candidate models to represent f: g1 and g2.
We cannot choose with certainty, because we do not know f. Akaike (1974) showed, however, that we can estimate, via AIC, how much more (or less) information is lost by g1 than by g2.
The estimate, though, is only valid asymptotically; if the number of data points is small, then some correction is often necessary (see AICc, below).
Denote the AIC values of the candidate models by AIC1, …, AICR, and let AICmin be the smallest of those values. Then the quantity exp((AICmin − AICi)/2) can be interpreted as being proportional to the probability that the ith model minimizes the (estimated) information loss.
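As a minimal sketch, this relative likelihood can be computed as follows; the function name and the three AIC values are illustrative placeholders, not taken from any particular analysis.

```python
import math

def relative_likelihoods(aic_values):
    """For each candidate model i, compute exp((AIC_min - AIC_i) / 2),
    which is proportional to the probability that model i minimizes
    the estimated information loss."""
    aic_min = min(aic_values)
    return [math.exp((aic_min - a) / 2) for a in aic_values]

# Placeholder AIC values for three candidate models.
print(relative_likelihoods([100.0, 102.0, 110.0]))
# -> [1.0, 0.3678..., 0.0067...]
```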
Indeed, if all the models in the candidate set have the same number of parameters, then using AIC might at first appear to be very similar to using the likelihood-ratio test.
In particular, the likelihood-ratio test is valid only for nested models, whereas AIC (and AICc) has no such restriction.
As an example of a hypothesis test, consider the t-test to compare the means of two normally-distributed populations.
This third model would have the advantage of not making such assumptions, at the cost of an additional parameter and thus of a degree of freedom.
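As a rough, hedged illustration of such a comparison, the sketch below fits three normal models to two small placeholder samples by maximum likelihood and computes their AIC values: a single common distribution, different means with a common variance, and different means with different variances. The sample values, variable names, and the exact parameter counts (each estimated mean and variance counted as one parameter) are assumptions made for the example.

```python
import math

def normal_loglik(residual_ss, n):
    """Maximized Gaussian log-likelihood, using the MLE of the variance
    (residual sum of squares divided by n)."""
    sigma2 = residual_ss / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def aic(loglik, k):
    return 2 * k - 2 * loglik

def ss(sample, mean):
    return sum((v - mean) ** 2 for v in sample)

# Two small placeholder samples.
x = [4.1, 5.0, 4.7, 5.3, 4.9]
y = [5.8, 6.1, 5.5, 6.4, 6.0]
n1, n2 = len(x), len(y)
mx, my = sum(x) / n1, sum(y) / n2
m_all = (sum(x) + sum(y)) / (n1 + n2)

# Model 1: one common normal distribution (mean, variance -> k = 2).
aic1 = aic(normal_loglik(ss(x, m_all) + ss(y, m_all), n1 + n2), 2)
# Model 2: different means, common variance (k = 3).
aic2 = aic(normal_loglik(ss(x, mx) + ss(y, my), n1 + n2), 3)
# Model 3: different means and different variances (k = 4).
aic3 = aic(normal_loglik(ss(x, mx), n1) + normal_loglik(ss(y, my), n2), 4)

print(aic1, aic2, aic3)  # the smallest value indicates the preferred model
```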
Statistical inference is generally regarded as comprising hypothesis testing and estimation.
[10] In other words, AIC can be used to form a foundation of statistics that is distinct from both frequentism and Bayesianism.
Assuming that the model is univariate, is linear in its parameters, and has normally-distributed residuals (conditional upon regressors), the formula for AICc is AICc = AIC + (2k² + 2k)/(n − k − 1), where n denotes the sample size and k denotes the number of parameters.
Thus, AICc is essentially AIC with an extra penalty term for the number of parameters.
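A one-line helper makes the correction explicit; the function name is illustrative, with k the number of estimated parameters and n the sample size.

```python
def aicc(aic, k, n):
    """Small-sample corrected AIC: AIC plus an extra penalty that
    vanishes as the sample size n grows relative to the number of
    estimated parameters k."""
    return aic + (2 * k ** 2 + 2 * k) / (n - k - 1)

print(aicc(aic=100.0, k=4, n=30))  # 100 + 40/25 = 101.6
```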
That instigated the work of Hurvich & Tsai (1989), and several further papers by the same authors, which extended the situations in which AICc could be applied.
Indeed, minimizing AIC in a statistical model is effectively equivalent to maximizing entropy in a thermodynamic system; in other words, the information-theoretic approach in statistics is essentially applying the second law of thermodynamics.
Following is an illustration of how to deal with data transforms (adapted from Burnham & Anderson (2002, §2.11.3): "Investigators should be sure that all hypotheses are modeled using the same response variable").
To do that, we need to perform the relevant integration by substitution: thus, we need to multiply by the derivative of the (natural) logarithm function, which is 1/y.
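The sketch below shows one way to apply that adjustment when a competing model has been fitted to ln(y) rather than to y; the function and variable names are illustrative assumptions, not a fixed recipe.

```python
import math

def loglik_for_y_from_log_model(loglik_log_scale, y):
    """Convert the maximized log-likelihood of a model fitted to ln(y)
    into a log-likelihood for y itself.

    By the change-of-variables formula, the density for y equals the
    density for ln(y) times the Jacobian 1/y, so on the log scale we
    subtract sum(ln(y_i)).  Both models can then be compared via AIC
    on the same response variable y."""
    return loglik_log_scale - sum(math.log(v) for v in y)
```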
[27] The critical difference between AIC and BIC (and their variants) is the asymptotic property under well-specified and misspecified model classes.
If the goal is selection, inference, or interpretation, then BIC or leave-many-out cross-validation is preferred.
A comprehensive overview of AIC and other popular model selection methods is given by Ding et al. (2018).[30]
The formula for the Bayesian information criterion (BIC) is similar to the formula for AIC, but with a different penalty for the number of parameters: with AIC the penalty is 2k, whereas with BIC the penalty is k ln(n).
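The contrast between the two penalties can be made concrete with a small sketch; here loglik, k, and n are placeholders for the maximized log-likelihood, the number of estimated parameters, and the sample size.

```python
import math

def aic(loglik, k):
    """AIC: the penalty on the number of estimated parameters is 2k."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """BIC: the penalty is k * ln(n), so it grows with the sample size."""
    return math.log(n) * k - 2 * loglik
```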
The authors show that AIC/AICc can be derived in the same Bayesian framework as BIC, just by using different prior probabilities.
Additionally, the authors present a few simulation studies that suggest AICc tends to have practical/performance advantages over BIC.
[31][32][33] Proponents of AIC argue that this issue is negligible, because the "true model" is virtually never in the candidate set.
Vrieze presents a simulation study that allows the "true model" to be in the candidate set (unlike with virtually all real data).
The reason is that, for finite n, BIC can have a substantial risk of selecting a very bad model from the candidate set.
Yang additionally shows that the rate at which AIC converges to the optimum is, in a certain sense, the best possible.
With least squares fitting, the maximum likelihood estimate for the variance of a model's residual distribution is σ̂² = RSS/n, where the residual sum of squares is RSS = Σᵢ (yᵢ − ŷᵢ)². Then, the maximum value of the model's log-likelihood function is −(n/2) ln(σ̂²) + C (see Normal distribution#Log-likelihood), where C is a constant independent of the model, depending only on the particular data points; i.e. it does not change if the data do not change. The AIC of such a model is therefore 2k + n ln(σ̂²) − 2C, and since C is identical for all models fitted to the same data, it can be ignored when comparing them.
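Under those formulas, a least-squares AIC (omitting the constant C, which cancels when comparing models fitted to the same data) could be computed as in the following sketch; the function name is illustrative, and k is assumed to count the estimated residual variance as one of the parameters.

```python
import math

def aic_least_squares(rss, n, k):
    """AIC for a least-squares fit, omitting the additive constant C,
    which is the same for every model fitted to the same data.

    rss -- residual sum of squares
    n   -- number of data points
    k   -- number of estimated parameters (counting the residual
           variance as one of them)"""
    sigma2_mle = rss / n  # maximum likelihood estimate of the variance
    return 2 * k + n * math.log(sigma2_mle)

# Example: compare two fits to the same n = 50 data points.
print(aic_least_squares(rss=12.3, n=50, k=3))
print(aic_least_squares(rss=11.8, n=50, k=4))
```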
Leave-one-out cross-validation is asymptotically equivalent to AIC, for ordinary linear regression models.
[36] Mallows's Cp is equivalent to AIC in the case of (Gaussian) linear regression.