Algorithms for calculating variance play a major role in computational statistics.
A key difficulty in the design of good algorithms for this problem is that formulas for the variance may involve sums of squares, which can lead to numerical instability as well as to arithmetic overflow when dealing with large values.
A formula for calculating the variance of an entire population of size N is:

\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2 / N}{N}

Using Bessel's correction to calculate an unbiased estimate of the population variance from a finite sample of n observations, the formula is:

s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}

Therefore, a naïve algorithm to calculate the estimated variance accumulates the count, the sum of the values and the sum of their squares in a single pass, as in the sketch below. This algorithm can easily be adapted to compute the variance of a finite population: simply divide by n instead of n − 1 on the last line.
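A minimal Python sketch of the naïve algorithm (the function name naive_variance is an illustrative choice, not taken from the source):

```python
def naive_variance(data):
    # Accumulate the count, the sum and the sum of squares in one pass.
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in data:
        n += 1
        total += x
        total_sq += x * x
    # Sample variance with Bessel's correction; divide by n instead of
    # (n - 1) for the variance of a finite population.
    return (total_sq - total * total / n) / (n - 1)
```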
Because the two quantities subtracted in the last step can be nearly equal, catastrophic cancellation can leave the result with far less precision than the inherent precision of the floating-point arithmetic used to perform the computation, which is why numerically stable alternatives have been proposed.[3] This is particularly bad if the standard deviation is small relative to the mean.
The variance is invariant with respect to changes in a location parameter, a property which can be used to avoid the catastrophic cancellation in this formula.
With K any constant, the variance can equivalently be computed from the shifted values x_i − K:

s^2 = \frac{\sum_{i=1}^{n} (x_i - K)^2 - \left(\sum_{i=1}^{n} (x_i - K)\right)^2 / n}{n - 1}

The closer K is to the mean, the more accurate the result, but choosing any value inside the range of the samples guarantees the desired stability, and if the shifted values are small their sum of squares is not affected by overflow. In any case, the second term in the formula is always smaller than the first, so no catastrophic cancellation can occur. Taking K to be the first sample, the algorithm can be written in Python as in the sketch below.
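A sketch of this shifted-data computation, assuming the data are available as a Python sequence (the function name shifted_data_variance is illustrative):

```python
def shifted_data_variance(data):
    if len(data) < 2:
        return 0.0
    k = data[0]   # shift; any value inside the range of the data works
    n = 0
    ex = 0.0      # running sum of (x - k)
    ex2 = 0.0     # running sum of (x - k) squared
    for x in data:
        n += 1
        ex += x - k
        ex2 += (x - k) ** 2
    # Divide by n instead of (n - 1) for the variance of a finite population.
    return (ex2 - ex * ex / n) / (n - 1)
```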
This formula also facilitates an incremental computation, in which the running sums are updated as each new value arrives rather than over the whole data set at once. An alternative approach, using a different formula for the variance, first computes the sample mean,

\bar{x} = \frac{\sum_{j=1}^{n} x_j}{n}

and then computes the sum of the squares of the differences from the mean,

s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}

where s is the standard deviation. This "two-pass" algorithm is numerically stable if n is small.[1][4]
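A direct Python sketch of the two-pass computation (an illustration, not the article's own listing):

```python
def two_pass_variance(data):
    n = len(data)
    # First pass: compute the sample mean.
    mean = sum(data) / n
    # Second pass: sum the squared deviations from the mean.
    ss = sum((x - mean) ** 2 for x in data)
    # Use ss / n instead for the population variance.
    return ss / (n - 1)
```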
However, the results of both of these simple algorithms ("naïve" and "two-pass") can depend inordinately on the ordering of the data and can give poor results for very large data sets due to repeated roundoff error in the accumulation of the sums.
Techniques such as compensated summation can be used to combat this error to a degree.
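One way to do this, shown here as a sketch of Kahan (compensated) summation applied to the second pass, rather than as the article's own compensated algorithm:

```python
def two_pass_variance_compensated(data):
    n = len(data)
    mean = sum(data) / n

    # Kahan summation of the squared deviations: c carries the low-order
    # bits lost when each term is added to the running total.
    total = 0.0
    c = 0.0
    for x in data:
        y = (x - mean) ** 2 - c
        t = total + y
        c = (t - total) - y
        total = t
    return total / (n - 1)
```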
It is often useful to be able to compute the variance in a single pass, inspecting each value
only once; for example, when the data is being collected without enough storage to keep all the values, or when costs of memory access dominate those of computation.
The following formulas can be used to update the mean and (estimated) variance of the sequence for an additional element x_n, where \bar{x}_n denotes the sample mean of the first n samples, s_n^2 their sample variance and \sigma_n^2 their population variance:

\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}

s_n^2 = \frac{n-2}{n-1}\, s_{n-1}^2 + \frac{(x_n - \bar{x}_{n-1})^2}{n}, \qquad n > 1

\sigma_n^2 = \frac{(n-1)\,\sigma_{n-1}^2 + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)}{n}
These formulas suffer from numerical instability[citation needed], as they repeatedly subtract a small number from a big number which scales with n. A better quantity for updating is the sum of squares of differences from the current mean, here denoted M_{2,n}:

M_{2,n} = \sum_{i=1}^{n} (x_i - \bar{x}_n)^2

which can be updated with each new value and converted to the variance estimates via

M_{2,n} = M_{2,n-1} + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n), \qquad s_n^2 = \frac{M_{2,n}}{n-1}, \qquad \sigma_n^2 = \frac{M_{2,n}}{n}
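A minimal Python sketch of this online update (names such as online_variance and m2 are illustrative):

```python
def online_variance(data):
    # Single pass: keep the count, the running mean, and M2, the sum of
    # squared differences from the current mean.
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n          # division inside the loop
        m2 += delta * (x - mean)   # uses the updated mean
    if n < 2:
        return float("nan")
    # m2 / n gives the population variance instead.
    return m2 / (n - 1)
```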
This algorithm is much less prone to loss of precision due to catastrophic cancellation, but might not be as efficient because of the division operation inside the loop.
The parallel algorithm below illustrates how to merge multiple sets of statistics calculated online.
Given the counts, means and sums of squared differences from the mean (M_2) of two sets A and B, the merged statistics are

n_{AB} = n_A + n_B, \qquad \delta = \bar{x}_B - \bar{x}_A, \qquad \bar{x}_{AB} = \bar{x}_A + \delta \cdot \frac{n_B}{n_{AB}}, \qquad M_{2,AB} = M_{2,A} + M_{2,B} + \delta^2 \cdot \frac{n_A n_B}{n_{AB}}

This may be useful when, for example, multiple processing units may be assigned to discrete parts of the input.
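A sketch of the merge step in Python (parameter names are illustrative); each worker can run the online algorithm above on its own slice of the input, and the resulting (count, mean, M2) triples are then folded together pairwise:

```python
def parallel_variance(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    # Merge the (count, mean, M2) summaries of two disjoint data sets.
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    # The sample variance of the combined set is m2 / (n - 1).
    return n, mean, m2
```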
Chan's method for estimating the mean is numerically unstable when both sets are large and of similar size, because the numerical error in the difference \delta = \bar{x}_B - \bar{x}_A is not scaled down in the way that it is when a single element is added at a time; in such cases the combined mean can instead be computed directly as \bar{x}_{AB} = (n_A \bar{x}_A + n_B \bar{x}_B) / n_{AB}.
This can be generalized to allow parallelization with AVX, with GPUs, and computer clusters, and to covariance.[3] To see how the algorithms differ in practice, assume that all floating point operations use standard IEEE 754 double-precision arithmetic and consider the sample (4, 7, 13, 16): its mean is 10 and its unbiased variance estimate is 30, and both the naïve and the two-pass algorithm compute these values exactly. If a large constant offset such as 10^8 is added to every value, the variance is mathematically unchanged, but the naïve algorithm now loses precision to cancellation while the two-pass algorithm remains exact.
While this loss of precision may be tolerable and viewed as a minor flaw of the naïve algorithm, further increasing the offset makes the error catastrophic.
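A short, self-contained way to observe this effect (the exact printed values depend on the platform's double-precision rounding, so they are not reproduced here; statistics.variance from the standard library is used only as a careful reference implementation):

```python
import statistics

def naive_variance(data):
    n = len(data)
    s = sum(data)
    sq = sum(x * x for x in data)
    return (sq - s * s / n) / (n - 1)

base = [4.0, 7.0, 13.0, 16.0]   # exact sample variance is 30
for offset in (0.0, 1e8, 1e9):
    shifted = [x + offset for x in base]
    print(offset, naive_variance(shifted), statistics.variance(shifted))
```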
Terriberry[11] extends Chan's formulae to calculating the third and fourth central moments, needed for example when estimating skewness and kurtosis. Here the M_k are again the sums of k-th powers of differences from the mean, M_k = \sum_i (x_i - \bar{x})^k, giving the skewness and kurtosis as

g_1 = \frac{\sqrt{n}\, M_3}{M_2^{3/2}}, \qquad g_2 = \frac{n\, M_4}{M_2^2} - 3

In the incremental case of adding a single value, by preserving the intermediate value \delta / n, only one division operation is needed and the higher-order statistics can thus be calculated for little incremental cost.
An example of the online algorithm for kurtosis implemented as described is sketched below. Pébaÿ[12] further extends these results to arbitrary-order central moments, for the incremental and the pairwise cases, and subsequently Pébaÿ et al.[13] for weighted and compound moments.
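A Python sketch of such an online kurtosis computation (a standard incremental-moment formulation; variable names are illustrative and a non-degenerate sample with M_2 > 0 is assumed):

```python
def online_kurtosis(data):
    n = 0
    mean = 0.0
    m2 = 0.0   # sum of squared differences from the current mean
    m3 = 0.0   # third central sum
    m4 = 0.0   # fourth central sum
    for x in data:
        n1 = n
        n += 1
        delta = x - mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        mean += delta_n
        # Update the higher moments first: they use the old m2 and m3.
        m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
               + 6 * delta_n2 * m2 - 4 * delta_n * m3)
        m3 += term1 * delta_n * (n - 2) - 3 * delta_n * m2
        m2 += term1
    # Excess kurtosis of the sample; assumes m2 > 0.
    return n * m4 / (m2 * m2) - 3
```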
Choi and Sweetman[14] offer two alternative methods to compute the skewness and kurtosis, each of which can substantially reduce computer memory requirements and CPU time in certain applications.
One benefit is that the statistical moment calculations can be carried out to arbitrary accuracy such that the computations can be tuned to the precision of, e.g., the data storage format or the original measurement hardware.
A relative histogram of a random variable can be constructed in the conventional way: the range of potential values is divided into bins and the number of occurrences within each bin are counted and plotted such that the area of each rectangle equals the portion of the sample values within that bin:

H(x_k) = \frac{h(x_k)}{N\, \Delta x_k}

where h(x_k) is the number of samples falling in bin k, H(x_k) is the relative (density) value plotted for that bin, \Delta x_k is the bin width and N is the total number of samples. After this normalization, the raw and central moments of the data can be approximated as sums over the bin centres x_k, weighted by H(x_k)\, \Delta x_k. The second approach from Choi and Sweetman[14] is an analytical methodology to combine statistical moments from individual segments of a time-history such that the resulting overall moments are those of the complete time-history.
Once each segment's moments are expressed in terms of the corresponding raw power sums, the sets can be combined by simple addition, and there is no upper limit on the number of segments that can be combined.
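As an illustration of the histogram-based idea only (not of Choi and Sweetman's formulation; the function name, the equal-width binning and the choice of 64 bins are all assumptions), moments can be approximated from a relative histogram as follows:

```python
def moments_from_histogram(data, num_bins=64):
    # Equal-width bins spanning the sample range.
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins or 1.0
    counts = [0] * num_bins
    for x in data:
        k = min(int((x - lo) / width), num_bins - 1)
        counts[k] += 1

    # Relative frequency (portion of samples) and centre of each bin.
    total = len(data)
    rel = [c / total for c in counts]
    centers = [lo + (k + 0.5) * width for k in range(num_bins)]

    # Approximate the mean and central moments as bin-weighted sums.
    mean = sum(c * p for c, p in zip(centers, rel))
    def central(order):
        return sum((c - mean) ** order * p for c, p in zip(centers, rel))
    return mean, central(2), central(3), central(4)
```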
Very similar algorithms can be used to compute the covariance. Even greater accuracy can be achieved by first computing the means, then using the stable one-pass algorithm on the residuals.
A stable one-pass algorithm, analogous to the online algorithm for the variance, maintains the co-moment

C_n = \sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n), \qquad C_n = C_{n-1} + (x_n - \bar{x}_{n-1})(y_n - \bar{y}_n)

Thus the covariance can be computed as

\operatorname{Cov}_n(X, Y) = \frac{C_n}{n} \quad \text{(population)} \qquad \text{or} \qquad \frac{C_n}{n-1} \quad \text{(sample)}

A small modification can also be made to compute the weighted covariance: each update is weighted and the count n is replaced by the running sum of the weights. Likewise, there is a formula for combining the covariances of two sets that can be used to parallelize the computation:[3]

C_X = C_A + C_B + (\bar{x}_A - \bar{x}_B)(\bar{y}_A - \bar{y}_B) \cdot \frac{n_A n_B}{n_X}

A version of the weighted online algorithm that does batched updates also exists.
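A minimal sketch of the one-pass (online) covariance in Python, mirroring the variance update above; it covers neither the weighted nor the batched variants, and the names are illustrative:

```python
def online_covariance(pairs):
    # Single pass over (x, y) pairs, maintaining both means and the
    # co-moment C = sum((x - mean_x) * (y - mean_y)).
    n = 0
    mean_x = 0.0
    mean_y = 0.0
    c = 0.0
    for x, y in pairs:
        n += 1
        dx = x - mean_x
        mean_x += dx / n
        mean_y += (y - mean_y) / n
        c += dx * (y - mean_y)   # old x-deviation times new y-deviation
    # Use c / n for the population covariance.
    return c / (n - 1)
```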