Formally, it is the variance of the score, or the expected value of the observed information.
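As a concrete numerical illustration (a minimal sketch, assuming a Bernoulli(p) model that does not appear in the text above), the following compares the variance of the score and the expected observed information with the closed-form Fisher information 1/(p(1−p)):

```python
import numpy as np

# Minimal sketch (assumed Bernoulli(p) example): the Fisher information
# computed as the variance of the score and as the expected observed
# information, compared with the closed-form value 1 / (p * (1 - p)).
rng = np.random.default_rng(0)
p = 0.3
x = rng.binomial(1, p, size=1_000_000).astype(float)

# Score: d/dp log f(x; p) = x/p - (1 - x)/(1 - p)
score = x / p - (1 - x) / (1 - p)

# Observed information: -d^2/dp^2 log f(x; p) = x/p^2 + (1 - x)/(1 - p)^2
obs_info = x / p**2 + (1 - x) / (1 - p) ** 2

print("variance of the score        :", score.var())
print("expected observed information:", obs_info.mean())
print("closed form 1/(p(1-p))       :", 1 / (p * (1 - p)))
```

All three printed values should agree up to Monte Carlo error (about 4.76 for p = 0.3).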
The role of the Fisher information in the asymptotic theory of maximum-likelihood estimation was emphasized and explored by the statistician Sir Ronald Fisher (following some initial results by Francis Ysidro Edgeworth).
The Fisher information matrix is used to calculate the covariance matrices associated with maximum-likelihood estimates.
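Explicitly, the standard statement (a well-known result, added here for concreteness) is that under regularity conditions the maximum-likelihood estimator from n independent observations is asymptotically normal with covariance given by the inverse Fisher information:

\[
\sqrt{n}\,\bigl(\hat\theta_n - \theta_0\bigr) \;\xrightarrow{d}\; \mathcal{N}\!\bigl(0,\ \mathcal{I}(\theta_0)^{-1}\bigr),
\qquad\text{so that approximately}\qquad
\operatorname{Cov}\bigl(\hat\theta_n\bigr) \approx \frac{1}{n}\,\mathcal{I}(\theta_0)^{-1},
\]

where \(\mathcal{I}(\theta_0)\) is the per-observation Fisher information matrix.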
In Bayesian statistics, the Fisher information plays a role in the derivation of non-informative prior distributions according to Jeffreys' rule.[1] It also appears as the large-sample covariance of the posterior distribution, provided that the prior is sufficiently smooth (a result known as the Bernstein–von Mises theorem, which was anticipated by Laplace for exponential families).
If log f(x; θ) is twice differentiable with respect to θ, and under certain additional regularity conditions, then the Fisher information may also be written as[7]

\[
\mathcal{I}(\theta) = -\operatorname{E}\!\left[\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)\,\middle|\,\theta\right],
\]

since

\[
\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)
= \frac{\dfrac{\partial^2}{\partial\theta^2} f(X;\theta)}{f(X;\theta)}
- \left(\frac{\dfrac{\partial}{\partial\theta} f(X;\theta)}{f(X;\theta)}\right)^{\!2}
\]

and

\[
\operatorname{E}\!\left[\frac{\dfrac{\partial^2}{\partial\theta^2} f(X;\theta)}{f(X;\theta)}\,\middle|\,\theta\right]
= \frac{\partial^2}{\partial\theta^2}\int f(x;\theta)\,dx = 0.
\]

Thus, the Fisher information may be seen as the curvature of the support curve (the graph of the log-likelihood).
When the regularity conditions fail (for example, for a Uniform(0, θ) density, whose support depends on θ), even though the Fisher information can be computed from the definition, it will not have the properties it is typically assumed to have.
If there are n samples and the corresponding n distributions are statistically independent, then the Fisher information is necessarily the sum of the single-sample Fisher information values, one for each sample from its distribution.
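In symbols (a standard identity, stated here for concreteness): for independent X and Y,

\[
\mathcal{I}_{X,Y}(\theta) = \mathcal{I}_X(\theta) + \mathcal{I}_Y(\theta),
\]

and in particular, for n independent and identically distributed observations, the total information is n times the information in a single observation, \(\mathcal{I}_n(\theta) = n\,\mathcal{I}(\theta)\).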
Van Trees (1968) and Frieden (2004) provide the following method of deriving the Cramér–Rao bound, a result which describes use of the Fisher information.
By rearranging, the inequality tells us that

\[
\operatorname{Var}\bigl(\hat\theta\bigr) \ge \frac{1}{\mathcal{I}(\theta)}.
\]

In other words, the precision to which we can estimate θ is fundamentally limited by the Fisher information of the likelihood function.
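As an illustrative check (an assumed example, not from the text above), the following sketch estimates the variance of the sample-mean estimator of a Bernoulli parameter over many simulated datasets and compares it with the Cramér–Rao bound 1/(n I(p)) = p(1−p)/n, which this unbiased estimator attains:

```python
import numpy as np

# Assumed example: Monte Carlo check of the Cramér–Rao bound for the
# Bernoulli parameter p, estimated by the sample mean (the MLE).
rng = np.random.default_rng(1)
p, n, reps = 0.3, 50, 200_000

samples = rng.binomial(1, p, size=(reps, n))
estimates = samples.mean(axis=1)          # MLE of p for each replication

fisher_per_obs = 1 / (p * (1 - p))        # I(p) for one Bernoulli observation
crb = 1 / (n * fisher_per_obs)            # Cramér–Rao bound = p(1-p)/n

print("empirical variance of the MLE:", estimates.var())
print("Cramér–Rao bound 1/(n I(p))  :", crb)
```

Both printed numbers should be close to p(1−p)/n ≈ 0.0042.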
If the Fisher information matrix is positive definite, then it defines a Riemannian metric[11] on the N-dimensional parameter space.
[16] Orthogonal parameters are easy to deal with in the sense that their maximum likelihood estimates are asymptotically uncorrelated.
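Concretely, two parameters θ_i and θ_j are called orthogonal when the corresponding off-diagonal element of the Fisher information matrix vanishes,

\[
\bigl[\mathcal{I}(\theta)\bigr]_{i,j} = 0 ,
\]

so that the information matrix is block-diagonal with respect to the two parameters.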
[18] Examples of singular statistical models include the following: normal mixtures, binomial mixtures, multinomial mixtures, Bayesian networks, neural networks, radial basis functions, hidden Markov models, stochastic context-free grammars, reduced rank regressions, Boltzmann machines.
In machine learning, if a statistical model is devised so that it extracts hidden structure from a random phenomenon, then it naturally becomes singular.
For an N-variate multivariate normal distribution, X ~ N(μ(θ), Σ(θ)), the Fisher information matrix has elements

\[
\mathcal{I}_{m,n}
= \frac{\partial \mu^{\mathsf T}}{\partial \theta_m}\,\Sigma^{-1}\,\frac{\partial \mu}{\partial \theta_n}
+ \frac{1}{2}\operatorname{tr}\!\left(\Sigma^{-1}\frac{\partial \Sigma}{\partial \theta_m}\,\Sigma^{-1}\frac{\partial \Sigma}{\partial \theta_n}\right),
\]

where tr(·) denotes the trace of a square matrix. Note that a special, but very common, case is the one where the covariance matrix does not depend on the parameters, Σ(θ) = Σ, so that only the first (mean) term remains.
Another special case occurs when the mean and covariance depend on two different vector parameters, say, β and θ.
This is especially popular in the analysis of spatial data, which often uses a linear model with correlated residuals.
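As a sketch of this case (the specific linear model below is an assumed example): if the mean is linear in β, μ(β) = Xβ, and the residual covariance Σ(θ) is evaluated at a fixed θ, then the β-block of the Fisher information matrix reduces to the generalized-least-squares form

\[
\mathcal{I}(\beta) = X^{\mathsf T}\,\Sigma^{-1}\,X ,
\]

while the θ-block contains only the trace term above and the cross terms between β and θ vanish, since the mean does not depend on θ and the covariance does not depend on β.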
In this form, it is clear that the Fisher information matrix is a Riemannian metric, and varies correctly under a change of variables.
In information geometry, this is seen as a change of coordinates on a Riemannian manifold, and the intrinsic properties of curvature are unchanged under different parametrizations.
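Explicitly (a standard identity, added here for concreteness): if η = η(θ) is a smooth reparametrization with Jacobian J = ∂θ/∂η, then the information matrices in the two coordinate systems are related by

\[
\mathcal{I}_\eta(\eta) = J^{\mathsf T}\,\mathcal{I}_\theta\bigl(\theta(\eta)\bigr)\,J ,
\]

which is exactly the transformation law of a Riemannian metric under a change of coordinates.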
[27] In the thermodynamic context, the Fisher information matrix is directly related to the rate of change in the corresponding order parameters.[28] In particular, such relations identify second-order phase transitions via divergences of individual elements of the Fisher information matrix.
This is like how, of all bounded sets with a given volume, the sphere has the smallest surface area.
is the "derivative" of the volume of the effective support set, much like the Minkowski-Steiner formula.
Traditionally, statisticians have evaluated estimators and designs by considering some summary statistic of the covariance matrix (of an unbiased estimator), usually with positive real values (like the determinant or matrix trace).
Working with positive real numbers brings several advantages: if the estimator of a single parameter has a positive variance, then the variance and the Fisher information are both positive real numbers; hence they are members of the convex cone of nonnegative real numbers (whose nonzero members have reciprocals in this same cone). For several parameters, the covariance matrices and information matrices are elements of the convex cone of nonnegative-definite symmetric matrices in a partially ordered vector space, under the Loewner (Löwner) order. This cone is closed under matrix addition and inversion, as well as under the multiplication of positive real numbers and matrices.
In that case, X is typically the joint responses of many neurons representing a low dimensional variable θ (such as a stimulus parameter).
[34] The Fisher information is used in machine learning techniques such as elastic weight consolidation,[35] which reduces catastrophic forgetting in artificial neural networks.
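A minimal sketch of the idea behind that kind of penalty, assuming a diagonal Fisher approximation (the function name and setup below are illustrative, not the published implementation):

```python
import numpy as np

def ewc_penalty(params, old_params, fisher_diag, lam=1.0):
    """Quadratic penalty that anchors parameters important to an old task.

    Each coordinate is weighted by its (diagonal) Fisher information, so
    parameters that carried little information about the old task remain
    free to change, while informative ones are held near their old values.
    """
    params, old_params, fisher_diag = map(np.asarray, (params, old_params, fisher_diag))
    return 0.5 * lam * np.sum(fisher_diag * (params - old_params) ** 2)

# Illustrative usage: the total loss on a new task would be
#     total_loss = new_task_loss + ewc_penalty(theta, theta_old, F_diag, lam)
print(ewc_penalty([1.0, 2.0], [0.9, 1.5], [10.0, 0.1]))
```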
Fisher information can be used as an alternative to the Hessian of the loss function in second-order gradient descent network training.
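A minimal sketch of that idea, using a diagonal empirical Fisher estimate as a preconditioner in place of the Hessian (the tiny logistic-regression setup and all names below are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Tiny logistic-regression problem (assumed setup for illustration).
X = rng.normal(size=(512, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_w)))

w = np.zeros(3)
lr, eps = 0.5, 1e-8

for _ in range(300):
    p = 1 / (1 + np.exp(-X @ w))
    # Per-example gradients of the log-likelihood, shape (n, 3).
    per_example_grads = (y - p)[:, None] * X
    grad = per_example_grads.mean(axis=0)
    # Diagonal empirical Fisher: mean of squared per-example gradients.
    fisher_diag = (per_example_grads ** 2).mean(axis=0)
    # Fisher-preconditioned ascent step, using the Fisher diagonal where a
    # second-order method would use the Hessian of the loss.
    w = w + lr * grad / (fisher_diag + eps)

print("estimated weights:", np.round(w, 2))
print("true weights     :", true_w)
```

The estimated weights should land near the true ones, up to the sampling noise of 512 observations.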
Fisher information is related to relative entropy (Kullback–Leibler divergence). For two distributions in the same parametric family, write

\[
D(\theta,\theta') = \int f(x;\theta)\,\log\frac{f(x;\theta)}{f(x;\theta')}\,dx .
\]

For θ′ close to θ, one may expand this expression in a series up to second order:

\[
D(\theta,\theta') = \frac{1}{2}(\theta'-\theta)^{\mathsf T}
\left[\frac{\partial^2}{\partial\theta'_i\,\partial\theta'_j} D(\theta,\theta')\right]_{\theta'=\theta}
(\theta'-\theta) + o\!\left(\lVert\theta'-\theta\rVert^2\right).
\]

But the second-order derivative can be written as

\[
\left[\frac{\partial^2}{\partial\theta'_i\,\partial\theta'_j} D(\theta,\theta')\right]_{\theta'=\theta}
= \bigl[\mathcal{I}(\theta)\bigr]_{i,j}.
\]

Thus the Fisher information represents the curvature of the relative entropy of a conditional distribution with respect to its parameters.
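For example, for the unit-variance Gaussian location family f(x; θ) = N(θ, 1),

\[
D(\theta,\theta') = \frac{(\theta'-\theta)^2}{2},
\qquad
\left.\frac{\partial^2}{\partial\theta'^2} D(\theta,\theta')\right|_{\theta'=\theta} = 1 = \mathcal{I}(\theta),
\]

matching the Fisher information of the mean of a unit-variance normal distribution.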