Mahalanobis distance

The Mahalanobis distance is a measure of the distance between a point and a distribution, introduced by P. C. Mahalanobis in 1936.[1] The mathematical details of the Mahalanobis distance first appeared in the Journal of The Asiatic Society of Bengal in 1936.[2]

Mahalanobis's definition was prompted by the problem of identifying the similarities of skulls based on measurements (the earliest work related to similarities of skulls dates from 1922, with further work in 1927).

R. C. Bose later obtained the sampling distribution of the Mahalanobis distance, under the assumption of equal dispersion.

Given a probability distribution Q on \(\mathbb{R}^N\) with mean \(\vec{\mu}\) and positive-definite covariance matrix S, the Mahalanobis distance of a point \(\vec{x}\) from Q is

\[ d_M(\vec{x}) = \sqrt{(\vec{x}-\vec{\mu})^\mathsf{T} S^{-1} (\vec{x}-\vec{\mu})}. \]

The Mahalanobis distance is thus unitless, scale-invariant, and takes into account the correlations of the data set.

We can find useful decompositions of the squared Mahalanobis distance that help to explain some reasons for the outlyingness of multivariate observations and also provide a graphical tool for identifying outliers.
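One such decomposition (a standard one, based on the eigendecomposition \(S = \sum_j \lambda_j \vec{v}_j \vec{v}_j^\mathsf{T}\) of the covariance matrix) splits the squared distance into per-component contributions:

\[ d_M^2(\vec{x}) = (\vec{x}-\vec{\mu})^\mathsf{T} S^{-1} (\vec{x}-\vec{\mu}) = \sum_{j=1}^{N} \frac{\left(\vec{v}_j^\mathsf{T}(\vec{x}-\vec{\mu})\right)^2}{\lambda_j}. \]

Each term is the squared, standardized score of the observation along one principal axis, so plotting the terms shows which directions make an observation outlying.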

If the covariance matrix is singular (for example, when there are fewer samples than dimensions), its inverse can be replaced by the Moore–Penrose pseudoinverse; equivalently, the data can be projected onto the affine span of the samples. If k is the dimension of the affine span of the samples, then the Mahalanobis distance can be computed as usual in that k-dimensional subspace.
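As a concrete illustration, here is a minimal NumPy sketch of the definition above (the function name and the example numbers are ours, chosen for illustration):

```python
import numpy as np

def mahalanobis(x, mu, cov):
    """Mahalanobis distance of point x from a distribution
    with mean mu and covariance matrix cov."""
    delta = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    # Solving cov @ z = delta avoids forming an explicit inverse;
    # for a singular covariance matrix, np.linalg.pinv(cov) @ delta
    # (the Moore-Penrose pseudoinverse) can be used instead.
    z = np.linalg.solve(cov, delta)
    return float(np.sqrt(delta @ z))

mu = [0.0, 0.0]
cov = [[1.0, 0.9],
       [0.9, 1.0]]
# Both test points are at Euclidean distance sqrt(2) from the mean,
# but only the second lies against the correlation structure.
print(mahalanobis([1.0, 1.0], mu, cov))   # small distance (about 1.03)
print(mahalanobis([1.0, -1.0], mu, cov))  # much larger distance (about 4.47)
```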

Consider the problem of estimating the probability that a test point in N-dimensional Euclidean space belongs to a set, where we are given sample points that definitely belong to that set.

Our first step would be to find the centroid or center of mass of the sample points.

Intuitively, the closer the point in question is to this center of mass, the more likely it is to belong to the set.

The simplistic approach is to estimate the standard deviation of the distances of the sample points from the center of mass.

This intuitive approach can be made quantitative by defining the normalized distance between the test point \(\vec{x}\) and the set to be

\[ \frac{\lVert \vec{x} - \vec{\mu} \rVert}{\sigma}, \]

that is, the distance from the test point to the center of mass, divided by the standard deviation from the previous step.

By plugging this into the normal distribution, we can derive the probability of the test point belonging to the set.
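A sketch of this one-dimensional reasoning (the sample values are invented for illustration, and the distances are assumed to be normally distributed):

```python
import numpy as np
from scipy.stats import norm

# Distances of the sample points from the center of mass (invented data).
distances = np.array([0.8, 1.1, 0.9, 1.3, 1.0, 0.7])
sigma = distances.std(ddof=1)

# Normalized distance of a test point lying 2.5 units from the center
# of mass, and the two-sided tail probability of seeing a point at
# least that far out under a normal model.
t = 2.5 / sigma
p = 2 * norm.sf(t)
print(t, p)
```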

The drawback of the above approach is that it assumes that the sample points are distributed about the center of mass in a spherical manner.

Were the distribution to be decidedly non-spherical, for instance ellipsoidal, then we would expect the probability of the test point belonging to the set to depend not only on the distance from the center of mass, but also on the direction.

Putting this on a mathematical basis, the ellipsoid that best represents the set's probability distribution can be estimated by building the covariance matrix of the samples. The Mahalanobis distance is then the distance of the test point from the center of mass, divided by the width of the ellipsoid in the direction of the test point.
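A short sketch of this step (the numbers are illustrative): estimate the covariance matrix from the samples, then "whiten" the data with its Cholesky factor so that the ellipsoidal cloud becomes spherical, at which point plain Euclidean distance coincides with the Mahalanobis distance.

```python
import numpy as np

rng = np.random.default_rng(42)
# Draw an ellipsoidal (correlated) cloud of sample points.
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 2.0], [2.0, 2.0]], size=1000)

mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)  # covariance matrix built from the samples

# Whitening with the Cholesky factor of S makes the cloud spherical;
# in these coordinates Euclidean distance equals Mahalanobis distance.
L = np.linalg.cholesky(S)
Z = np.linalg.solve(L, (X - mu).T).T
print(np.cov(Z, rowvar=False).round(2))  # approximately the identity
```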

For a normal distribution in any number of dimensions, the probability density of an observation \(\vec{x}\) is uniquely determined by the Mahalanobis distance d:

\[ P(\vec{x}) = \frac{1}{\sqrt{\det(2\pi S)}} \exp\left(-\frac{d^2}{2}\right). \]

The Mahalanobis distance is proportional, for a normal distribution, to the square root of the negative log-likelihood (after adding a constant so the minimum is at zero).
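Concretely, taking the negative logarithm of the density above gives

\[ -\log P(\vec{x}) = \frac{1}{2} d^2 + \frac{1}{2}\log\det(2\pi S), \]

so subtracting the constant \(\tfrac{1}{2}\log\det(2\pi S)\) leaves \(\tfrac{1}{2}d^2\), whose minimum of zero is attained at \(\vec{x} = \vec{\mu}\); the Mahalanobis distance is then \(\sqrt{2}\) times the square root of this shifted negative log-likelihood.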

The sample mean and covariance matrix can be quite sensitive to outliers; therefore, other approaches for estimating the multivariate location and scatter of data are also commonly used when calculating the Mahalanobis distance.

The Minimum Covariance Determinant approach estimates multivariate location and scatter from a subset numbering h data points whose classical covariance matrix has the lowest possible determinant; the related Minimum Volume Ellipsoid approach instead takes the location and scatter from the smallest-volume ellipsoid covering h of the data points.[11]

Each method varies in its definition of the distribution of the data, and therefore produces different Mahalanobis distances.

The Minimum Covariance Determinant and Minimum Volume Ellipsoid approaches are more robust to samples that contain outliers, while the sample mean and covariance matrix tend to be more reliable with small and biased data sets.
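For example, scikit-learn provides both the classical and the MCD estimators; a minimal comparison sketch with planted outliers (note that scikit-learn's `mahalanobis` method returns squared distances):

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=200)
X[:10] += 6  # plant a few outliers that distort the classical estimates

d2_classical = EmpiricalCovariance().fit(X).mahalanobis(X)
d2_robust = MinCovDet(random_state=0).fit(X).mahalanobis(X)

# The robust (MCD) distances flag the planted outliers far more sharply,
# because the outliers no longer inflate the location and scatter estimates.
print(d2_classical[:10].mean(), d2_robust[:10].mean())
```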

In general, given a normal random variable X with zero mean and unit variance, any other normal random variable R with mean \(\mu_1\) and variance \(S_1\) can be written as \(R = \mu_1 + \sqrt{S_1}\,X\), and conversely \(X = (R - \mu_1)/\sqrt{S_1}\). If we square both sides, and take the square-root, we will get an equation for a metric that looks a lot like the Mahalanobis distance:

\[ D = \sqrt{X^2} = \sqrt{\frac{(R-\mu_1)^2}{S_1}} = \sqrt{(R-\mu_1)\,S_1^{-1}\,(R-\mu_1)}. \]

The resulting magnitude is always non-negative and grows with the distance of the data from the mean.

Mahalanobis distance is widely used in cluster analysis and classification techniques.

It is closely related to Hotelling's T-square distribution, used for multivariate statistical testing, and to Fisher's linear discriminant analysis, used for supervised classification.
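For instance, in the one-sample setting of testing whether a sample mean \(\bar{x}\) of n observations equals a hypothesized mean \(\vec{\mu}_0\), Hotelling's statistic is just a scaled squared Mahalanobis distance computed with the sample covariance matrix S:

\[ T^2 = n\,(\bar{x}-\vec{\mu}_0)^\mathsf{T} S^{-1} (\bar{x}-\vec{\mu}_0) = n\, d_M^2(\bar{x}). \]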

Mahalanobis distance and leverage are often used to detect outliers, especially in the development of linear regression models.

A point that has a greater Mahalanobis distance from the rest of the sample population of points is said to have higher leverage since it has a greater influence on the slope or coefficients of the regression equation.
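For a linear regression with an intercept, this link can be made exact: the leverage \(h_{ii}\) of observation i and the Mahalanobis distance \(d_i\) of its predictor vector from the mean of the predictors satisfy the affine relation (stated here for the sample covariance with denominator \(n-1\))

\[ h_{ii} = \frac{1}{n} + \frac{d_i^2}{n-1}. \]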

Regression techniques can be used to determine if a specific case within a sample population is an outlier via the combination of two or more variable scores.

Even for normal distributions, a point can be a multivariate outlier even if it is not a univariate outlier in any dimension (consider a probability density concentrated along the line \(x_1 = x_2\), for example), making Mahalanobis distance a more sensitive measure than checking dimensions individually.
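A small sketch of this effect (constructed data, concentrated along the line \(x_1 = x_2\)): the test point is unremarkable in each coordinate separately, yet far from the cloud in the Mahalanobis sense.

```python
import numpy as np

rng = np.random.default_rng(0)
# Data concentrated along the line x1 == x2.
z = rng.normal(size=1000)
X = np.column_stack([z, z + 0.1 * rng.normal(size=1000)])

point = np.array([2.0, -2.0])  # roughly 2 sigma in each coordinate
mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)

z_scores = np.abs(point - mu) / X.std(axis=0)               # modest per dimension
d = np.sqrt((point - mu) @ np.linalg.solve(S, point - mu))  # very large
print(z_scores, d)
```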

Another example of usage is in finance, where the Mahalanobis distance has been used to compute an indicator called the "turbulence index",[16] a statistical measure of abnormal behaviour in financial markets.[17]

Many programming languages and statistical packages, such as R and Python, include implementations of Mahalanobis distance.
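For instance, SciPy ships one in `scipy.spatial.distance` (note that it expects the inverse covariance matrix as its third argument):

```python
import numpy as np
from scipy.spatial import distance

cov = np.array([[1.0, 0.5],
                [0.5, 2.0]])
VI = np.linalg.inv(cov)  # inverse covariance matrix, as SciPy expects
print(distance.mahalanobis([1.0, 2.0], [0.0, 0.0], VI))
```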

Figure: Hypothetical two-dimensional example of Mahalanobis distance with three different methods of defining the multivariate location and scatter of the data.