Sample mean and covariance

The sample mean is the average value (or mean value) of a sample of numbers taken from a larger population of numbers, where "population" refers not to the number of people but to the entirety of relevant data, whether collected or not.

The sample mean is used as an estimator for the population mean, the average value in the entire population, where the estimate is more likely to be close to the population mean if the sample is large and representative.

The reliability of the sample mean is estimated using the standard error, which in turn is calculated using the variance of the sample.
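As a minimal sketch of this calculation (the data here is illustrative), the standard error of the mean can be computed from the sample variance in plain Python:

```python
import math

# Hypothetical sample of measurements
sample = [2.1, 2.5, 1.9, 2.8, 2.3, 2.6]
n = len(sample)

# Sample mean
mean = sum(sample) / n

# Sample variance with Bessel's correction (divide by n - 1)
variance = sum((x - mean) ** 2 for x in sample) / (n - 1)

# Standard error of the mean: larger samples give a smaller standard error
standard_error = math.sqrt(variance / n)
```

Because the standard error shrinks like $1/\sqrt{n}$, quadrupling the sample size halves it.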

The term "sample mean" can also be used to refer to a vector of average values when the statistician is looking at the values of several variables in the sample, e.g. the sales, profits, and employees of a sample of Fortune 500 companies.

In this case, there is not just a sample variance for each variable but a sample variance-covariance matrix (or simply covariance matrix) showing also the relationship between each pair of variables.

The sample covariance is useful in judging the reliability of the sample means as estimators and is also useful as an estimate of the population covariance matrix.

Due to their ease of calculation and other desirable characteristics, the sample mean and sample covariance are widely used in statistics to represent the location and dispersion of the distribution of values in the sample, and to estimate the values for the population.

Let $x_{ij}$ be the $i$th independently drawn observation ($i = 1, \ldots, N$) on the $j$th random variable ($j = 1, \ldots, K$).

These observations can be arranged into $N$ column vectors, each with $K$ entries, with the $K \times 1$ column vector giving the $i$th observations of all variables being denoted $\mathbf{x}_i$ ($i = 1, \ldots, N$).

The sample mean vector $\bar{\mathbf{x}}$ is a column vector whose $j$th element $\bar{x}_j$ is the average value of the $N$ observations of the $j$th variable:

$\bar{x}_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}, \quad j = 1, \ldots, K.$

Thus, the sample mean vector contains the average of the observations for each variable, and is written

$\bar{\mathbf{x}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i.$

The sample covariance matrix is a $K \times K$ matrix $\mathbf{Q} = [q_{jk}]$ with entries

$q_{jk} = \frac{1}{N-1} \sum_{i=1}^{N} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k),$

where $q_{jk}$ is an estimate of the covariance between the $j$th and $k$th variables.
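A sketch of these definitions, assuming NumPy and made-up data ($N = 4$ observations of $K = 3$ variables, arranged as rows):

```python
import numpy as np

# Hypothetical data: N = 4 observations (rows) of K = 3 variables (columns)
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.0, 1.5],
              [3.0, 4.0, 2.5],
              [4.0, 3.0, 3.5]])
N, K = X.shape

# Sample mean vector: average of each column (each variable)
x_bar = X.mean(axis=0)

# Sample covariance entries q_jk, built directly from the definition
Q = np.zeros((K, K))
for j in range(K):
    for k in range(K):
        Q[j, k] = np.sum((X[:, j] - x_bar[j]) * (X[:, k] - x_bar[k])) / (N - 1)

# NumPy's np.cov gives the same matrix (rowvar=False: columns are variables)
assert np.allclose(Q, np.cov(X, rowvar=False))
```

The explicit double loop mirrors the formula for $q_{jk}$; in practice `np.cov` does the same computation vectorized.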

In terms of the observation vectors, the sample covariance is

$\mathbf{Q} = \frac{1}{N-1} \sum_{i=1}^{N} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^{\mathsf{T}}.$

Alternatively, the observation vectors can be arranged as the columns of a matrix, so that

$\mathbf{F} = [\mathbf{x}_1 \; \mathbf{x}_2 \; \cdots \; \mathbf{x}_N],$

which is a matrix of $K$ rows and $N$ columns.

Here, the sample covariance matrix can be computed as

$\mathbf{Q} = \frac{1}{N-1} (\mathbf{F} - \bar{\mathbf{x}} \, \mathbf{1}_N^{\mathsf{T}})(\mathbf{F} - \bar{\mathbf{x}} \, \mathbf{1}_N^{\mathsf{T}})^{\mathsf{T}},$

where $\mathbf{1}_N$ is an $N \times 1$ column vector of ones.
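The matrix form can be sketched as follows (NumPy, illustrative data with observations as columns); broadcasting stands in for the outer product $\bar{\mathbf{x}} \, \mathbf{1}_N^{\mathsf{T}}$:

```python
import numpy as np

# Hypothetical data with observations as columns:
# F is K x N (K = 3 variables, N = 4 observations)
F = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 1.0, 4.0, 3.0],
              [0.5, 1.5, 2.5, 3.5]])
K, N = F.shape

# Sample mean vector as a K x 1 column
x_bar = F.mean(axis=1, keepdims=True)

# Center the data: subtracting x_bar from every column plays the role
# of subtracting the rank-one matrix x_bar @ ones(N).T
F_centered = F - x_bar

# Q = 1/(N-1) * (F - x_bar 1^T)(F - x_bar 1^T)^T
Q = F_centered @ F_centered.T / (N - 1)

# np.cov defaults to rows-as-variables, matching this layout
assert np.allclose(Q, np.cov(F))
```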

If the observations are arranged as rows instead of columns, so that $\mathbf{M} = \mathbf{F}^{\mathsf{T}}$ is an $N \times K$ matrix whose column $j$ is the vector of $N$ observations on variable $j$, then applying transposes in the appropriate places yields

$\mathbf{Q} = \frac{1}{N-1} (\mathbf{M} - \mathbf{1}_N \bar{\mathbf{x}}^{\mathsf{T}})^{\mathsf{T}} (\mathbf{M} - \mathbf{1}_N \bar{\mathbf{x}}^{\mathsf{T}}).$

Like covariance matrices for random vectors, sample covariance matrices are positive semi-definite.
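The row-oriented formula and the positive semi-definiteness property can be checked numerically (NumPy, illustrative data):

```python
import numpy as np

# Observations as rows: M is N x K (N = 4 observations, K = 3 variables)
M = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.0, 1.5],
              [3.0, 4.0, 2.5],
              [4.0, 3.0, 3.5]])
N, K = M.shape

# Center each column, then Q = 1/(N-1) * (M - 1 x_bar^T)^T (M - 1 x_bar^T)
M_centered = M - M.mean(axis=0)
Q = M_centered.T @ M_centered / (N - 1)

# Positive semi-definiteness: all eigenvalues of the symmetric matrix Q
# are >= 0 (up to floating-point error)
eigenvalues = np.linalg.eigvalsh(Q)
```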

Furthermore, a sample covariance matrix is positive definite if and only if the rank of the centered vectors $\mathbf{x}_i - \bar{\mathbf{x}}$ is $K$.

The sample mean and the sample covariance matrix are unbiased estimates of the mean and the covariance matrix of the random vector $\mathbf{X}$, a row vector whose $j$th element ($j = 1, \ldots, K$) is one of the random variables.

The sample covariance matrix has $N - 1$ rather than $N$ in the denominator due to a variant of Bessel's correction: in short, the sample covariance relies on the difference between each observation and the sample mean, but the sample mean is slightly correlated with each observation, since it is defined in terms of all observations.

If the population mean $\boldsymbol{\mu} = \operatorname{E}(\mathbf{X})$ is known, the analogous unbiased estimate using the population mean,

$\mathbf{Q} = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^{\mathsf{T}},$

has $N$ in the denominator.

This is an example of why in probability and statistics it is essential to distinguish between random variables (upper case letters) and realizations of the random variables (lower case letters).

The maximum likelihood estimate of the covariance for the Gaussian distribution case has N in the denominator as well.

The ratio of 1/N to 1/(N − 1) approaches 1 for large N, so the maximum likelihood estimate approximately equals the unbiased estimate when the sample is large.
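The relationship between the two estimates can be illustrated with NumPy's `ddof` parameter on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=1000)
N = sample.size

# Unbiased estimate (Bessel's correction, denominator N - 1) ...
unbiased = sample.var(ddof=1)
# ... versus the maximum likelihood estimate (denominator N)
mle = sample.var(ddof=0)

# Their ratio is exactly (N - 1) / N, which approaches 1 for large N
ratio = mle / unbiased
```

With N = 1000 the two estimates differ by only 0.1%.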

For each random variable, the sample mean is a good estimator of the population mean, where a "good" estimator is defined as being efficient and unbiased.

Of course the estimator will likely not be the true value of the population mean since different samples drawn from the same distribution will give different sample means and hence different estimates of the true mean.

Thus the sample mean is a random variable, not a constant, and consequently has its own distribution.

The sample mean $\bar{x}$ (the arithmetic mean of a sample of values drawn from the population) makes a good estimator of the population mean, as its expected value is equal to the population mean (that is, it is an unbiased estimator).


The distribution of the sample mean $\bar{x}_j$ is centered on the population mean $\operatorname{E}(X_j)$ and has variance $\sigma_j^2 / N$, where $\sigma_j^2$ is the population variance of the $j$th variable; for large $N$, this distribution is approximately normal. The approximate normality is a consequence of the central limit theorem.
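This behavior can be checked by simulation; a sketch assuming NumPy, using an exponential population whose mean and variance are both 1 (so the population is deliberately non-normal):

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw many independent samples of size N from a non-normal population
# and record each sample's mean; the means form their own distribution.
N, trials = 50, 20000
samples = rng.exponential(scale=1.0, size=(trials, N))  # mean 1, variance 1
sample_means = samples.mean(axis=1)

# The distribution of the sample mean is centered on the population mean,
# with variance close to sigma^2 / N = 1 / 50
center = sample_means.mean()
spread = sample_means.var()
```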

In a weighted sample, each vector $\mathbf{x}_i$ (each set of single observations on each of the $K$ random variables) is assigned a weight $w_i \geq 0$. Without loss of generality, assume the weights are normalized so that $\sum_{i=1}^{N} w_i = 1$. Then the weighted mean vector is given by

$\bar{\mathbf{x}} = \sum_{i=1}^{N} w_i \mathbf{x}_i,$

and the weighted covariance matrix is

$\mathbf{Q} = \frac{1}{1 - \sum_{i=1}^{N} w_i^2} \sum_{i=1}^{N} w_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^{\mathsf{T}}.$

If all weights are equal, $w_i = 1/N$, the weighted mean and covariance reduce to the (unbiased) sample mean and sample covariance above.
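A small sketch of the weighted formulas, assuming NumPy and illustrative weights that already sum to 1:

```python
import numpy as np

# Hypothetical weighted sample: N = 4 observations of K = 2 variables,
# with normalized weights w_i (they sum to 1)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
w = np.array([0.1, 0.2, 0.3, 0.4])

# Weighted mean vector: x_bar = sum_i w_i x_i
x_bar = w @ X

# Weighted covariance with the 1 / (1 - sum w_i^2) normalization;
# with equal weights w_i = 1/N this reduces to the usual 1/(N-1) factor
D = X - x_bar                       # centered observations as rows
Q = (D.T * w) @ D / (1.0 - np.sum(w ** 2))
```

Here `(D.T * w) @ D` computes $\sum_i w_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^{\mathsf{T}}$ by scaling each centered observation by its weight.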

As robustness is often a desired trait, particularly in real-world applications, robust alternatives may prove preferable, notably quantile-based statistics such as the sample median for location[4] and the interquartile range (IQR) for dispersion.
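As an illustration of this robustness (made-up data with one outlier), the median and IQR are far less affected by the outlier than the mean:

```python
import numpy as np

# Five ordinary values plus one gross outlier
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

# Classical location estimate vs. robust alternative
mean, median = data.mean(), np.median(data)

# Robust dispersion estimate: interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# The single outlier drags the mean far above the bulk of the data,
# while the median and IQR stay close to the uncontaminated values
```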