Likelihood function

In contrast, in Bayesian statistics, the estimate of interest is the converse of the likelihood, the so-called posterior probability of the parameter given the observed data, which is calculated via Bayes' rule.
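
In generic notation, with a prior density p(θ) assumed here for illustration, Bayes' rule expresses this posterior in terms of the likelihood and the prior:

\[ p(\theta \mid x) = \frac{\mathcal{L}(\theta \mid x)\, p(\theta)}{\int \mathcal{L}(\theta' \mid x)\, p(\theta')\, d\theta'} \;\propto\; \mathcal{L}(\theta \mid x)\, p(\theta). \]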

The likelihood function is usually defined differently for discrete and continuous probability distributions (a more general definition is discussed below).
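
Concretely, for a model with probability mass function P_θ in the discrete case or probability density function f_θ in the continuous case (notation assumed here for illustration), the usual definitions are, with the observation x held fixed and θ varying:

\[ \mathcal{L}(\theta \mid x) = P_\theta(X = x) \quad \text{(discrete case)}, \qquad \mathcal{L}(\theta \mid x) = f_\theta(x) \quad \text{(continuous case)}. \]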

The fact that the likelihood function can be defined in a way that includes contributions that are not commensurate (the density and the probability mass) arises from the way in which the likelihood function is defined up to a constant of proportionality, where this "constant" can change with the observation, but not with the parameter.
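
In symbols (notation assumed here for illustration), any function of the following form serves equally well as a likelihood function, where the positive factor α(x) may depend on the observation x but not on the parameter θ, and where f_θ may be either a density or a probability mass:

\[ \mathcal{L}(\theta \mid x) = \alpha(x)\, f_\theta(x), \qquad \alpha(x) > 0. \]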

These conditions are assumed in various proofs involving likelihood functions, and need to be verified in each particular application.

Mäkeläinen and co-authors prove this result using Morse theory while informally appealing to a mountain pass property.

That is, a model that does not meet these regularity conditions may or may not have a maximum likelihood estimator with the properties mentioned above.

Further, in the case of non-independently or non-identically distributed observations, additional properties may need to be assumed.

The asymptotic distribution of the log-likelihood ratio, considered as a test statistic, is given by Wilks' theorem.
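
In generic notation, and under regularity conditions of the kind discussed above, the theorem states that for nested hypotheses Θ₀ ⊂ Θ the likelihood-ratio statistic satisfies

\[ -2 \log \frac{\sup_{\theta \in \Theta_0} \mathcal{L}(\theta \mid x)}{\sup_{\theta \in \Theta} \mathcal{L}(\theta \mid x)} \;\xrightarrow{d}\; \chi^2_k, \qquad k = \dim \Theta - \dim \Theta_0. \]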

Since the actual value of the likelihood function depends on the sample, it is often convenient to work with a standardized measure.
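
One common standardized measure is the relative likelihood, which rescales by the value attained at the maximum likelihood estimate \(\hat\theta\), so that it always lies between 0 and 1:

\[ R(\theta) = \frac{\mathcal{L}(\theta \mid x)}{\mathcal{L}(\hat\theta \mid x)}. \]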

These approaches are also useful when a high-dimensional likelihood surface needs to be reduced to one or two parameters of interest so that it can be graphed.

When the value of the nuisance parameter that maximizes the likelihood can be determined explicitly as a function of the parameters of interest, concentration reduces the computational burden of the original maximization problem.

Since graphically the procedure of concentration is equivalent to slicing the likelihood surface along the ridge of values of the nuisance parameter that maximizes the likelihood function, the result of this procedure is also known as the profile likelihood.
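
As an illustration (a minimal sketch assuming a normal model, not taken from the original text), let the mean μ be the parameter of interest and the variance the nuisance parameter: the maximizing variance for a fixed μ is available in closed form, so the likelihood can be concentrated analytically.

    import numpy as np

    def profile_loglik_mu(mu, x):
        """Concentrated (profile) log-likelihood of mu for a normal sample:
        the nuisance variance is maximized out analytically for each mu."""
        n = len(x)
        sigma2_hat = np.mean((x - mu) ** 2)          # maximizing variance given mu
        return -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)

    rng = np.random.default_rng(0)
    x = rng.normal(loc=5.0, scale=2.0, size=100)      # simulated sample
    mu_grid = np.linspace(4.0, 6.0, 201)
    profile = np.array([profile_loglik_mu(m, x) for m in mu_grid])
    print("profile-likelihood estimate of mu:", mu_grid[profile.argmax()])

Plugging the maximizing value of the nuisance parameter back in for each value of the parameter of interest is exactly the slicing along the ridge described above.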

Sometimes we can remove the nuisance parameters by considering a likelihood based on only part of the information in the data, for example by using the set of ranks rather than the numerical values.

But for practical purposes it is more convenient to work with the log-likelihood function in maximum likelihood estimation, in particular since most common probability distributions—notably the exponential family—are only logarithmically concave,[34][35] and concavity of the objective function plays a key role in the maximization.

In addition to this mathematical convenience, the additivity of the log-likelihood across independent observations has an intuitive interpretation, often expressed as the "support" provided by the data.
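
For independent observations x₁, …, xₙ, writing ℓ for the log-likelihood, the joint likelihood factors into a product, so the per-observation supports simply add:

\[ \ell(\theta \mid x_1, \dots, x_n) = \sum_{i=1}^{n} \ell(\theta \mid x_i). \]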

Thus, the graph of the log-likelihood has a direct interpretation in the context of maximum likelihood estimation and likelihood-ratio tests.

The second derivative of the log-likelihood evaluated at its maximum, known as the Fisher information, determines the curvature of the likelihood surface,[40] and thus indicates the precision of the estimate.
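
In the one-parameter case, and in standard notation, the expected Fisher information and the resulting approximate standard error of the maximum likelihood estimate can be written as

\[ \mathcal{I}(\theta) = -\,\mathbb{E}\!\left[\frac{\partial^2 \ell(\theta \mid X)}{\partial \theta^2}\right], \qquad \operatorname{se}(\hat\theta) \approx \mathcal{I}(\hat\theta)^{-1/2}, \]

where the information is computed for the full sample.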

Each of these terms has an interpretation,[a] but simply switching from probability to likelihood and taking logarithms yields the sum:

\[ \ell(\theta \mid x) = \eta(\theta) \cdot T(x) - A(\eta(\theta)) + \log h(x), \]

where η(θ) is the natural parameter, T(x) the sufficient statistic, A the log-partition function, and h(x) the base measure.

Thus for example the maximum likelihood estimate can be computed by taking derivatives of the sufficient statistic T and the log-partition function A.
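
As a standard worked case (not spelled out in the original text), take the Poisson family: the natural parameter is η = log λ, the sufficient statistic is T(x) = x, and the log-partition function is A(η) = e^η. For a sample of size n, the term involving h does not depend on η and drops out on differentiation, so setting the derivative of the log-likelihood to zero gives

\[ \frac{d}{d\eta}\left( \eta \sum_{i=1}^{n} x_i - n e^{\eta} \right) = \sum_{i=1}^{n} x_i - n e^{\eta} = 0 \quad\Longrightarrow\quad \hat\lambda = e^{\hat\eta} = \bar{x}. \]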

Its formal use to refer to a specific function in mathematical statistics was proposed by Ronald Fisher,[43] in two research papers published in 1921[44] and 1922.

Whereas, however, in relation to psychological judgment, likelihood has some resemblance to probability, the two concepts are wholly distinct.

Knowing the population we can express our incomplete knowledge of, or expectation of, the sample in terms of probability; knowing the sample we can express our incomplete knowledge of the population in terms of likelihood.[47]

Fisher's invention of statistical likelihood was in reaction against an earlier form of reasoning called inverse probability.

A. W. F. Edwards (1972) established the axiomatic basis for use of the log-likelihood ratio as a measure of relative support for one hypothesis against another.

Due to the introduction of a probability structure on the parameter space or on the collection of models, it is possible for a parameter value or a statistical model to have a large likelihood value for given data and yet a low probability, or vice versa.

Specifically, the likelihood is calculated as the probability that would be assigned to the observed sample, assuming that the chosen model and the values of the several parameters θ give an accurate approximation of the frequency distribution of the population from which the observed sample was drawn.

Heuristically, it makes sense that a good choice of parameters is one that gives the sample actually observed the maximum possible post-hoc probability of having happened.

Wilks' theorem quantifies this heuristic rule by showing that twice the difference between the log-likelihood at the estimated parameter values and the log-likelihood at the population's "true" (but unknown) parameter values is asymptotically χ²-distributed.

Successive estimates from many independent samples will cluster together with the population's "true" set of parameter values hidden somewhere in their midst.

The χ² distribution given by Wilks' theorem converts the region's log-likelihood differences into the "confidence" that the population's "true" parameter set lies inside the region.
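
As a small numerical sketch (assuming a binomial model for coin tosses and using scipy, not part of the original text), a 95% likelihood-ratio confidence interval keeps every parameter value whose log-likelihood falls within half the χ²₁ critical value of the maximum:

    import numpy as np
    from scipy.stats import chi2

    def binom_loglik(p, k, n):
        """Binomial log-likelihood of heads-probability p (up to an additive constant)."""
        return k * np.log(p) + (n - k) * np.log(1 - p)

    k, n = 7, 10                                     # e.g. 7 heads in 10 tosses
    p_grid = np.linspace(1e-4, 1 - 1e-4, 10_000)
    ll = binom_loglik(p_grid, k, n)
    cutoff = ll.max() - 0.5 * chi2.ppf(0.95, df=1)   # Wilks: 2*(ll_max - ll) ~ chi-squared, 1 df
    inside = p_grid[ll >= cutoff]
    print(f"MLE: {k / n:.2f}, approximate 95% interval: ({inside.min():.3f}, {inside.max():.3f})")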

Figure 1.  The likelihood function (p², where p is the probability of heads) for the probability of a coin landing heads-up (without prior knowledge of the coin's fairness), given that we have observed HH.
Figure 2.  The likelihood function (p²(1 − p)) for the probability of a coin landing heads-up (without prior knowledge of the coin's fairness), given that we have observed HHT.
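
The two curves in the captions above can be reproduced directly; this short sketch (with p denoting the heads probability) also shows why the maxima sit at 1 and at 2/3:

    import numpy as np

    p = np.linspace(0.0, 1.0, 1001)       # candidate values for the heads probability
    lik_HH = p ** 2                        # likelihood after observing HH
    lik_HHT = p ** 2 * (1 - p)             # likelihood after observing HHT
    print("maximum after HH  at p =", p[lik_HH.argmax()])    # 1.0
    print("maximum after HHT at p =", p[lik_HHT.argmax()])   # 0.667 (approx. 2/3)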