Simple cases, where observations are complete, can be dealt with by using the sample covariance matrix.
Cases involving missing data, heteroscedasticity, or autocorrelated residuals require deeper considerations.
Another issue is robustness to outliers, to which sample covariance matrices are highly sensitive.[2][3][4]
Statistical analyses of multivariate data often involve exploratory studies of the way in which the variables change in relation to one another, and this may be followed up by explicit statistical models involving the covariance matrix of the variables.
Thus the estimation of covariance matrices directly from observational data plays two roles: estimates of covariance matrices are required at the initial stages of principal component analysis and factor analysis, and are also involved in versions of regression analysis that treat the dependent variables in a data set, jointly with the independent variable, as the outcome of a random sample.
Given a sample of n independent observations x1, ..., xn of a p-dimensional random vector X, an unbiased estimator of the p×p covariance matrix Σ is the sample covariance matrix Q = (1/(n − 1)) ∑i (xi − x̄)(xi − x̄)^T, where x̄ is the sample mean and the sum runs over i = 1, ..., n. This is true regardless of the distribution of the random variable X, provided of course that the theoretical means and covariances exist.
A well-known instance is when the random variable X is normally distributed: in this case the maximum likelihood estimator of the covariance matrix is slightly different from the unbiased estimate, and is given by Σ̂ = (1/n) ∑i (xi − x̄)(xi − x̄)^T. A derivation of this result is given below.
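For concreteness, here is a minimal NumPy sketch (not part of the original text) contrasting the unbiased divisor n − 1 with the maximum-likelihood divisor n; the simulated covariance, seed, and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
true_cov = np.array([[2.0, 0.5, 0.0],
                     [0.5, 1.0, 0.3],
                     [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(np.zeros(p), true_cov, size=n)  # n observations of a p-vector

x_bar = X.mean(axis=0)                       # sample mean vector
C = X - x_bar                                # centered observations
Q_unbiased = C.T @ C / (n - 1)               # unbiased sample covariance matrix
Sigma_mle = C.T @ C / n                      # Gaussian maximum-likelihood estimate

# np.cov uses the n - 1 divisor by default; bias=True switches to the n divisor.
assert np.allclose(Q_unbiased, np.cov(X, rowvar=False))
assert np.allclose(Sigma_mle, np.cov(X, rowvar=False, bias=True))
```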
When data are missing and each covariance or correlation is estimated from only those observations that are available for the corresponding pair of variables, this could lead to estimated correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix.
When estimating the cross-covariance of a pair of signals that are wide-sense stationary, missing samples do not need to be random (e.g., sub-sampling by an arbitrary factor is valid).[citation needed]
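As a hedged illustration of this remark (not from the original text), the sketch below estimates the cross-covariance of two simulated zero-mean, jointly wide-sense-stationary signals at one lag, using only a deterministically sub-sampled set of observations of the second signal; the signals, the sub-sampling factor 3, and the lag are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 10_000
x = rng.standard_normal(T)
y = 0.8 * np.roll(x, 2) + 0.2 * rng.standard_normal(T)  # y is (roughly) x delayed by 2

available = np.zeros(T, dtype=bool)
available[::3] = True                                    # keep only every 3rd sample of y

def cross_cov(x, y, available, lag):
    """Average x[t] * y[t + lag] over the times t at which y[t + lag] was observed.
    Both signals are zero-mean, so no mean subtraction is needed here."""
    t = np.arange(T - lag)
    keep = available[t + lag]
    return np.mean(x[t[keep]] * y[t[keep] + lag])

print(cross_cov(x, y, available, lag=2))   # close to 0.8 despite 2/3 of y being missing
```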
A random vector X ∈ Rp (a p×1 "column vector") has a multivariate normal distribution with a nonsingular covariance matrix Σ precisely if Σ ∈ Rp×p is a positive-definite matrix and the probability density function of X is f(x) = (2π)^{−p/2} det(Σ)^{−1/2} exp(−(1/2)(x − μ)^T Σ^{−1}(x − μ)), where μ ∈ Rp×1 is the expected value of X.
The covariance matrix Σ is the multidimensional analog of what in one dimension would be the variance, and the factor (2π)^{−p/2} det(Σ)^{−1/2} normalizes the density f(x) so that it integrates to 1.
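The density written above can be checked numerically; the following short sketch (not part of the original text, with an arbitrary μ, Σ, and evaluation point) compares it with SciPy's implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 0.5]])
x = np.array([0.5, -1.0])
p = len(mu)

quad = (x - mu) @ np.linalg.solve(Sigma, x - mu)   # (x - mu)^T Sigma^{-1} (x - mu)
density = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))

assert np.isclose(density, multivariate_normal(mean=mu, cov=Sigma).pdf(x))
```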
Suppose now that X1, ..., Xn are independent and identically distributed samples from the distribution above. Based on the observed values x1, ..., xn of this sample, we wish to estimate Σ.
The likelihood function is L(μ, Σ) = (2π)^{−np/2} det(Σ)^{−n/2} exp(−(1/2) ∑i (xi − μ)^T Σ^{−1}(xi − μ)). It is fairly readily shown that the maximum-likelihood estimate of the mean vector μ is the "sample mean" vector x̄ = (x1 + ⋯ + xn)/n. See the section on estimation in the article on the normal distribution for details; the process here is similar.
The matrix S = ∑i (xi − x̄)(xi − x̄)^T is sometimes called the scatter matrix, and is positive definite if there exists a subset of the data consisting of p + 1 affinely independent observations (which we will assume).
Substituting μ = x̄ and using the identity ∑i (xi − x̄)^T Σ^{−1}(xi − x̄) = tr(Σ^{−1}S), the likelihood depends on Σ only through det(Σ)^{−n/2} exp(−(1/2) tr(Σ^{−1}S)). Writing B = S^{1/2} Σ^{−1} S^{1/2}, so that det(Σ)^{−1} = det(B)/det(S) and tr(Σ^{−1}S) = tr(B), the expression above becomes det(S)^{−n/2} det(B)^{n/2} exp(−(1/2) tr(B)). The positive-definite matrix B can be diagonalized, and the problem then reduces to finding the value of B that maximizes det(B)^{n/2} exp(−(1/2) tr(B)). Since the trace of a square matrix equals the sum of its eigenvalues ("trace and eigenvalues"), this reduces to the problem of finding the eigenvalues λ1, ..., λp of B that maximize λ1^{n/2}⋯λp^{n/2} exp(−(λ1 + ⋯ + λp)/2). This is just a calculus problem and we get λi = n for all i, so that B = nI and the maximum is attained at Σ̂ = S^{1/2}B^{−1}S^{1/2} = S/n, the maximum-likelihood estimate given earlier.
The random matrix S can be shown to have a Wishart distribution with n − 1 degrees of freedom.
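A numerical sanity check of the derivation and of the role of S may be helpful (not from the original text; the simulated covariance, sample size, and seed are arbitrary): at Σ̂ = S/n the matrix B = S^{1/2} Σ̂^{−1} S^{1/2} has every eigenvalue equal to n, and the Monte Carlo mean of S is close to (n − 1)Σ, consistent with E[S] = (n − 1)Σ for the Wishart distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 4
A = rng.standard_normal((p, p))
Sigma = A @ A.T + np.eye(p)                       # an arbitrary positive-definite covariance

def scatter(X):
    C = X - X.mean(axis=0)
    return C.T @ C                                # scatter matrix S

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
S = scatter(X)
Sigma_hat = S / n                                 # maximum-likelihood estimate

w, V = np.linalg.eigh(S)
S_half = V @ np.diag(np.sqrt(w)) @ V.T            # symmetric square root of S
B = S_half @ np.linalg.inv(Sigma_hat) @ S_half
print(np.linalg.eigvalsh((B + B.T) / 2))          # all eigenvalues equal n = 200

# E[S] = (n - 1) * Sigma, consistent with the Wishart(n - 1) distribution of S.
S_mean = sum(scatter(rng.multivariate_normal(np.zeros(p), Sigma, size=n))
             for _ in range(2000)) / 2000
print(np.max(np.abs(S_mean / (n - 1) - Sigma)))   # small relative to the entries of Sigma
```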
An alternative derivation of the maximum likelihood estimator can be performed via matrix calculus formulae. It also verifies the aforementioned fact about the maximum likelihood estimate of the mean.
Re-write the likelihood in the log form using the trace trick: ln L(μ, Σ) = constant − (n/2) ln det(Σ) − (1/2) tr[Σ^{−1} Sμ], where Sμ = ∑i (xi − μ)(xi − μ)^T. The differential of this log-likelihood is d ln L(μ, Σ) = −(n/2) tr[Σ^{−1} dΣ] + (1/2) tr[Σ^{−1} dΣ Σ^{−1} Sμ] + ∑i (xi − μ)^T Σ^{−1} dμ. It naturally breaks down into the part related to the estimation of the mean and the part related to the estimation of the variance.
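The trace form and its maximizers can be verified numerically; the sketch below (not part of the original text, with simulated data and illustrative names) evaluates the log-likelihood written with the trace trick against SciPy's log-density and checks that the closed-form estimates do not decrease it.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, mu, Sigma):
    """ln L(mu, Sigma) = -np/2 ln(2 pi) - n/2 ln det(Sigma)
       - 1/2 tr[Sigma^{-1} sum_i (x_i - mu)(x_i - mu)^T]  (the trace form above)."""
    n, p = X.shape
    C = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return (-0.5 * n * p * np.log(2 * np.pi)
            - 0.5 * n * logdet
            - 0.5 * np.trace(np.linalg.solve(Sigma, C.T @ C)))

rng = np.random.default_rng(3)
n, p = 100, 3
mu = np.array([1.0, 0.0, -1.0])
Sigma = np.array([[1.0, 0.2, 0.0],
                  [0.2, 2.0, 0.3],
                  [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(mu, Sigma, size=n)

assert np.isclose(log_likelihood(X, mu, Sigma),
                  multivariate_normal(mean=mu, cov=Sigma).logpdf(X).sum())

mu_hat = X.mean(axis=0)                              # sample mean
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / n        # maximum-likelihood covariance
assert log_likelihood(X, mu_hat, Sigma_hat) >= log_likelihood(X, mu, Sigma)
```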
Dwyer[6] points out that decomposition into two terms such as appears above is "unnecessary" and derives the estimator in two lines of working.
Note that it may not be trivial to show that the estimator derived in this way is the unique global maximizer of the likelihood function.
Given a sample of n independent observations x1, ..., xn of a p-dimensional zero-mean Gaussian random variable X with covariance R, the maximum likelihood estimator of R is given by R̂ = (1/n) ∑i xi xi^T. The parameter R belongs to the set of positive-definite matrices, which is a Riemannian manifold, not a vector space, hence the usual vector-space notions of expectation, i.e. "E[R̂]", and estimator bias must be generalized to manifolds to make sense of the problem of intrinsic covariance matrix estimation.
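The zero-mean estimator above is straightforward to compute; a brief illustrative sketch follows (not from the original text, with an arbitrary R and simulated samples).

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 1000, 3
R = np.array([[1.0, 0.4, 0.0],
              [0.4, 1.0, 0.2],
              [0.0, 0.2, 1.0]])
X = rng.multivariate_normal(np.zeros(p), R, size=n)   # zero-mean Gaussian observations

R_hat = X.T @ X / n              # (1/n) sum_i x_i x_i^T; no mean is subtracted
print(np.round(R_hat, 2))        # close to R for large n
```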
For complex Gaussian random variables, the bias vector field of the SCM can be shown[1] to be a scalar multiple of R, with a coefficient that depends only on p and n and is expressed in terms of the digamma function ψ(·). The corresponding intrinsic bias of the sample covariance matrix shrinks toward zero as the sample grows, so the SCM is asymptotically unbiased as n → ∞.
Similarly, the intrinsic inefficiency of the sample covariance matrix depends upon the Riemannian curvature of the space of positive-definite matrices.
If the sample size n is small and the number of variables considered p is large, the above empirical estimators of covariance and correlation are very unstable.
As an alternative, many methods have been suggested to improve the estimation of the covariance matrix.
The resulting shrinkage estimator, a convex combination of the empirical covariance matrix and a suitably chosen structured target (for example, a diagonal matrix), can be shown to outperform the maximum likelihood estimator for small samples. Apart from increased efficiency, the shrinkage estimate has the additional advantage that it is always positive definite and well conditioned.[11]
Software for computing a covariance shrinkage estimator is available in R (packages corpcor[12] and ShrinkCovMat[13]), in Python (the scikit-learn library), and in MATLAB.
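As a closing sketch (not from the original text), the snippet below uses the scikit-learn library mentioned above to compute a Ledoit-Wolf shrinkage estimate in a small-sample, high-dimensional setting, and contrasts it with the ill-conditioned raw estimate and with a hand-rolled convex combination; the fixed weight 0.2 and the diagonal target are arbitrary choices, and Ledoit-Wolf itself shrinks toward a scaled identity rather than this target.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(5)
n, p = 20, 50                                     # fewer samples than variables
X = rng.standard_normal((n, p))

sample_cov = np.cov(X, rowvar=False, bias=True)   # rank-deficient when n <= p
print(np.linalg.matrix_rank(sample_cov))          # at most n - 1 = 19, hence singular

lw = LedoitWolf().fit(X)                          # analytic choice of shrinkage intensity
print(lw.shrinkage_)                              # estimated mixing weight in [0, 1]
print(np.linalg.cond(lw.covariance_))             # well conditioned, unlike sample_cov

# Hand-rolled shrinkage toward a diagonal target with an arbitrary fixed weight.
delta = 0.2
target = np.diag(np.diag(sample_cov))
shrunk = (1 - delta) * sample_cov + delta * target
```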