In probability theory and statistics, variance is the expected value of the squared deviation from the mean of a random variable.
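In the usual notation, writing μ for the mean E[X], the definition and its standard computational form read:

```latex
\operatorname{Var}(X)
  = \operatorname{E}\!\left[(X - \mu)^{2}\right]
  = \operatorname{E}\!\left[X^{2}\right] - \bigl(\operatorname{E}[X]\bigr)^{2},
\qquad \mu = \operatorname{E}[X].
```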
A disadvantage of the variance for practical applications is that, unlike the standard deviation, its units differ from those of the random variable, which is why the standard deviation is more commonly reported as a measure of dispersion once the calculation is finished.
One concept of variance, as discussed above, is part of a theoretical probability distribution and is defined by an equation; the other is a characteristic of a set of observations.
There are multiple ways to calculate an estimate of the population variance, as discussed in the section below.
If an infinite number of observations are generated using a distribution, then the sample variance calculated from that infinite set will match the value calculated using the distribution's equation for variance.
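As a rough illustration (a sketch added here, not part of the original text), the following Python snippet draws increasingly many observations from a normal distribution whose theoretical variance is 4 and shows the sample variance settling toward that value; the distribution, seed, and sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Normal distribution with standard deviation 2, so the distribution's
# equation for variance gives 2**2 = 4.
for n in (100, 10_000, 1_000_000):
    draws = rng.normal(loc=1.0, scale=2.0, size=n)
    print(n, draws.var())   # sample variance approaches 4 as n grows
```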
This definition of variance encompasses random variables that are generated by processes that are discrete, continuous, neither, or mixed.
A fair six-sided die can be modeled as a discrete random variable, X, with outcomes 1 through 6, each with equal probability 1/6.
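For this example the variance can be computed directly from the definition; a short Python check using exact fractions gives E[X] = 7/2 and Var(X) = 35/12 ≈ 2.92.

```python
from fractions import Fraction

outcomes = range(1, 7)
p = Fraction(1, 6)                                     # each face has probability 1/6

mean = sum(p * x for x in outcomes)                    # E[X] = 7/2
variance = sum(p * (x - mean) ** 2 for x in outcomes)  # E[(X - E[X])^2] = 35/12

print(mean, variance, float(variance))                 # 7/2 35/12 2.9166...
```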
The population variance for a non-negative random variable can be expressed in terms of the cumulative distribution function F using

Var(X) = 2∫₀^∞ u(1 − F(u)) du − (∫₀^∞ (1 − F(u)) du)².
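As a numerical sanity check of this identity (assuming SciPy is available; the exponential distribution with rate 2, whose variance is known to be 1/4, is an arbitrary test case):

```python
import numpy as np
from scipy.integrate import quad

lam = 2.0                                  # rate of an exponential distribution
survival = lambda u: np.exp(-lam * u)      # 1 - F(u) for this distribution

second_moment = 2 * quad(lambda u: u * survival(u), 0, np.inf)[0]
first_moment = quad(survival, 0, np.inf)[0]

# Both printed values are approximately 0.25 = 1/lam**2.
print(second_moment - first_moment**2, 1 / lam**2)
```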
Using the linearity of the expectation operator and the assumption of independence (or uncorrelatedness) of X and Y, this further simplifies to Var(X + Y) = Var(X) + Var(Y).
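Written out, the simplification is the standard expansion, using E[XY] = E[X]E[Y] for uncorrelated X and Y:

```latex
\begin{aligned}
\operatorname{Var}(X + Y)
  &= \operatorname{E}\!\left[(X + Y)^{2}\right] - \bigl(\operatorname{E}[X + Y]\bigr)^{2} \\
  &= \operatorname{E}\!\left[X^{2}\right] + 2\operatorname{E}[X]\operatorname{E}[Y] + \operatorname{E}\!\left[Y^{2}\right]
     - \bigl(\operatorname{E}[X] + \operatorname{E}[Y]\bigr)^{2} \\
  &= \bigl(\operatorname{E}\!\left[X^{2}\right] - \operatorname{E}[X]^{2}\bigr)
     + \bigl(\operatorname{E}\!\left[Y^{2}\right] - \operatorname{E}[Y]^{2}\bigr)
   = \operatorname{Var}(X) + \operatorname{Var}(Y).
\end{aligned}
```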
Therefore, the variance of the mean of a large number of standardized variables is approximately equal to their average correlation.
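For n standardized (unit-variance) variables with average pairwise correlation ρ̄, the identity behind this statement is:

```latex
\operatorname{Var}\!\left(\overline{X}\right)
  = \frac{1}{n} + \frac{n-1}{n}\,\overline{\rho}
  \;\longrightarrow\; \overline{\rho}
  \qquad (n \to \infty).
```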
The delta method uses second-order Taylor expansions to approximate the variance of a function of one or more random variables: see Taylor expansions for the moments of functions of random variables.
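For orientation, the familiar first-order special case for a single variable (assuming f is differentiable at μ = E[X]) is:

```latex
\operatorname{Var}\!\bigl[f(X)\bigr] \approx \bigl(f'(\mu)\bigr)^{2}\,\operatorname{Var}(X),
\qquad \mu = \operatorname{E}[X].
```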
For example, if the quantity of interest is yesterday's rainfall over a region, the sample would be the set of actual measurements of yesterday's rainfall from available rain gauges within the geography of interest.
Four common values for the denominator are n, n − 1, n + 1, and n − 1.5: n is the simplest (the variance of the sample), n − 1 eliminates bias,[10] n + 1 minimizes mean squared error for the normal distribution,[11] and n − 1.5 mostly eliminates bias in the estimation of the standard deviation for the normal distribution.
Correcting for bias often makes this worse: one can always choose a scale factor that performs better than the corrected sample variance, though the optimal scale factor depends on the excess kurtosis of the population (see mean squared error: variance) and introduces bias.[15]
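As a sketch of how these denominators look in code (assuming NumPy; the data are arbitrary), the first two correspond to NumPy's ddof argument and the other two to a manual rescaling of the sum of squared deviations:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)
ss = np.sum((x - x.mean()) ** 2)   # sum of squared deviations

print(np.var(x))                   # denominator n      (ss / n)
print(np.var(x, ddof=1))           # denominator n - 1  (ss / (n - 1))
print(ss / (n + 1))                # denominator n + 1
print(ss / (n - 1.5))              # denominator n - 1.5
```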
Directly taking the variance of the sample data gives the average of the squared deviations:[16] the biased sample variance (1/n) Σi (Yi − Ȳ)², where Ȳ = (1/n) Σi Yi is the sample mean.
Their expected values can be evaluated by averaging over the ensemble of all possible samples {Yi} of size n from the population.
Either estimator may be simply referred to as the sample variance when the version can be determined by context.
The square root is a concave function and thus introduces negative bias (by Jensen's inequality) that depends on the distribution; consequently, the corrected sample standard deviation (using Bessel's correction) is biased.
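A small Monte Carlo illustration of this negative bias (arbitrary choices: a normal population with σ = 1, samples of size 5, 200,000 replications; for n = 5 the expected value of the corrected sample standard deviation is about 0.94σ):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n, reps = 1.0, 5, 200_000

samples = rng.normal(scale=sigma, size=(reps, n))
s = samples.std(axis=1, ddof=1)        # corrected sample standard deviation

# The average of s falls below sigma (about 0.94 here), even though
# s**2 averages to sigma**2, i.e. the variance estimate is unbiased.
print(s.mean(), (s**2).mean())
```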
The unbiased sample variance is a U-statistic for the function f(y1, y2) = (y1 − y2)²/2, meaning that it is obtained by averaging a 2-sample statistic over 2-element subsets of the population.
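A quick numerical check of this characterization, averaging f over all 2-element subsets of a sample and comparing with the usual n − 1 sample variance:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
y = rng.normal(size=12)

# Average f(y_i, y_j) = (y_i - y_j)^2 / 2 over all 2-element subsets.
pair_avg = np.mean([(a - b) ** 2 / 2 for a, b in combinations(y, 2)])

print(pair_avg, np.var(y, ddof=1))   # identical up to rounding
```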
In the case that the Yi are independent observations from a normal distribution, Cochran's theorem shows that the unbiased sample variance S² follows a scaled chi-squared distribution (see also: asymptotic properties and an elementary proof):[17] (n − 1)S²/σ² ~ χ²(n − 1), that is, a chi-squared distribution with n − 1 degrees of freedom.
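A simulation-style sanity check (arbitrary choices of σ, n, and replication count): the scaled statistic (n − 1)S²/σ² should have the chi-squared mean n − 1 and variance 2(n − 1).

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n, reps = 2.0, 10, 200_000

samples = rng.normal(scale=sigma, size=(reps, n))
stat = (n - 1) * samples.var(axis=1, ddof=1) / sigma**2

# A chi-squared distribution with n - 1 = 9 degrees of freedom has
# mean 9 and variance 18; the empirical moments land close to these.
print(stat.mean(), stat.var())
```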
If the conditions of the law of large numbers hold for the squared observations, S² is a consistent estimator of σ².
An asymptotically equivalent formula was given in Kenney and Keeping (1951:164), Rose and Smith (2002:264), and Weisstein (n.d.).[25]
The F-test of equality of variances and the chi-squared tests are adequate when the sample is normally distributed.
Resampling methods, which include the bootstrap and the jackknife, may be used to test the equality of variances.[26]
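A minimal sketch of one resampling approach, here a permutation relabelling of mean-centered observations with the absolute log ratio of sample variances as the test statistic; the data, the statistic, and the resample count are illustrative choices rather than a prescribed bootstrap or jackknife procedure.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(scale=1.0, size=30)       # illustrative group 1
y = rng.normal(scale=1.5, size=40)       # illustrative group 2

def log_var_ratio(a, b):
    return abs(np.log(np.var(a, ddof=1) / np.var(b, ddof=1)))

observed = log_var_ratio(x, y)

# Center each group so relabelling scrambles spread rather than location.
pooled = np.concatenate([x - x.mean(), y - y.mean()])

n_resamples, exceed = 10_000, 0
for _ in range(n_resamples):
    rng.shuffle(pooled)                  # randomly relabel the observations
    if log_var_ratio(pooled[:len(x)], pooled[len(x):]) >= observed:
        exceed += 1

print((exceed + 1) / (n_resamples + 1))  # approximate p-value
```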
The covariance matrix is related to the moment of inertia tensor for multivariate distributions.
This difference between moment of inertia in physics and in statistics is clear for points that are gathered along a line.
The term variance was first introduced by Ronald Fisher in his 1918 paper The Correlation Between Relatives on the Supposition of Mendelian Inheritance:[28] "The great body of available statistics show us that the deviations of a human measurement from its mean follow very closely the Normal Law of Errors, and, therefore, that the variability may be uniformly measured by the standard deviation corresponding to the square root of the mean square error. When there are two independent causes of variability capable of producing in an otherwise uniform population distributions with standard deviations σ1 and σ2, it is found that the distribution, when both causes act together, has a standard deviation √(σ1² + σ2²). It is therefore desirable in analysing the causes of variability to deal with the square of the standard deviation as the measure of variability. We shall term this quantity the Variance..."
The generalized variance can be shown to be related to the multidimensional scatter of points around their mean.
Another generalization of variance for vector-valued random variables, which results in a scalar value rather than in a matrix, is obtained by interpreting the deviation of the random variable from its mean as the squared Euclidean distance between the random variable and its mean, or, simply, as the scalar product of the vector X − μ with itself; this gives E[(X − μ)ᵀ(X − μ)] = tr(C), the trace of the covariance matrix C.
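A brief numerical illustration of both scalar summaries, the determinant (the generalized variance) and the trace (the expected squared Euclidean distance from the mean), for an arbitrary two-dimensional normal sample (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(5)
cov = np.array([[2.0, 0.6],
                [0.6, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=100_000)

C = np.cov(X, rowvar=False)         # estimated covariance matrix
print(np.linalg.det(C))             # generalized variance, det(C)
print(np.trace(C))                  # tr(C) ...
print(np.mean(np.sum((X - X.mean(axis=0)) ** 2, axis=1)))  # ... ≈ E[(X − μ)ᵀ(X − μ)]
```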