Kendall rank correlation coefficient

In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's τ coefficient (after the Greek letter τ, tau), is a statistic used to measure the ordinal association between two measured quantities.

It is named after Maurice Kendall, who developed it in 1938,[1] though Gustav Fechner had proposed a similar measure in the context of time series in 1897.[2]

Intuitively, the Kendall correlation between two variables will be high when observations have a similar (or identical, for a correlation of 1) rank (i.e. relative position label of the observations within the variable: 1st, 2nd, 3rd, etc.) between the two variables, and low when observations have a dissimilar (or fully different, for a correlation of −1) rank between the two variables.

Its notions of concordance and discordance also appear in other areas of statistics, like the Rand index in cluster analysis.

(See the section "Accounting for ties" for ways of handling non-unique values.)

The denominator is the total number of pair combinations, so the coefficient must be in the range −1 ≤ τ ≤ 1.
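As a worked illustration of why the coefficient is bounded, the tie-free definition can be written with n_c concordant and n_d discordant pairs over the total number of pairs. The symbols n_c and n_d and the four-point data set below are chosen here for illustration only.

```latex
\[
  \tau = \frac{n_c - n_d}{\binom{n}{2}},
  \qquad \binom{n}{2} = \frac{n(n-1)}{2},
  \qquad n_c + n_d = \binom{n}{2} \ \text{(no ties)}
\]
% Worked example: x = (1, 2, 3, 4), y = (1, 3, 2, 4).
% Only the pair formed by the 2nd and 3rd observations is discordant.
\[
  \binom{4}{2} = 6, \quad n_c = 5, \quad n_d = 1,
  \quad \tau = \frac{5 - 1}{6} = \frac{2}{3} \approx 0.667
\]
```

Since n_c and n_d are non-negative and sum to the denominator when there are no ties, their difference cannot exceed the denominator in absolute value.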

Under the null hypothesis of independence of X and Y, the sampling distribution of τ has an expected value of zero.

The proof uses a result from Hoeffding (1948), "A class of statistics with asymptotically normal distribution".

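If (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) are IID samples from the same jointly normal distribution with a known Pearson correlation coefficient r, then the expectation of the Kendall rank correlation has a closed form, E[τ] = (2/π) arcsin(r), a result known as Greiner's equality.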

Since each pair (x_i, y_i) is an IID sample of the jointly normal distribution, the pairing does not matter, so each term in the summation is exactly the same, and so E[τ] = E[sgn(x_1 − x_2) sgn(y_1 − y_2)].

What remains is some tedious matrix exponentiation and trigonometry, which can be skipped over.

Since the standard normal distribution is rotationally symmetric, we need only calculate the angle spanned by each squashed quadrant.
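The closed form for jointly normal data, E[τ] = (2/π) arcsin(r) (Greiner's equality), can be checked numerically. The following is a minimal Monte Carlo sketch using only the Python standard library; the function names `greiner_tau` and `sample_tau`, the sample size, and the seed are illustrative assumptions rather than part of any reference implementation.

```python
import math
import random


def greiner_tau(r):
    """Expected Kendall tau for jointly normal data with Pearson correlation r."""
    return 2.0 / math.pi * math.asin(r)


def sample_tau(r, n, seed=0):
    """Kendall tau of n simulated bivariate normal points (O(n^2) pair scan)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs.append(z1)
        # Mixing two independent normals gives correlation r with z1.
        ys.append(r * z1 + math.sqrt(1.0 - r * r) * z2)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
            s += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant
    return s / (n * (n - 1) / 2)


if __name__ == "__main__":
    r = 0.5
    print("closed form:", greiner_tau(r))
    print("simulated:  ", sample_tau(r, 1000))
```

For r = 0.5 the closed form gives exactly 1/3, and the simulated value should land close to it, with the gap shrinking as the sample size grows.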

When tied pairs arise in the data, the coefficient may be modified in a number of ways to keep it in the range [−1, 1].

The Tau statistic defined by Kendall in 1938[1] was retrospectively renamed Tau-a.

It represents the strength of positive or negative association of two quantitative or ordinal variables without any adjustment for ties.

The Tau-b statistic, unlike Tau-a, makes adjustments for ties.

This Tau-b was first described by Kendall in 1945 under the name Tau-w[12] as an extension of the original Tau statistic supporting ties.

[13] Be aware that some statistical packages, e.g. SPSS, use alternative formulas for computational efficiency, with double the 'usual' number of concordant and discordant pairs.
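The tie adjustment in Tau-b can be sketched as follows, assuming the commonly stated form τ_b = (n_c − n_d) / √((n_0 − n_1)(n_0 − n_2)), where n_0 = n(n − 1)/2 and n_1, n_2 count the pairs tied in x and in y respectively. The function name `kendall_tau_b` is hypothetical, and the quadratic pair scan is chosen for clarity rather than speed.

```python
import math
from collections import Counter
from itertools import combinations


def kendall_tau_b(x, y):
    """Tau-b via a direct O(n^2) pair scan with a tie-corrected denominator."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    nc = nd = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        prod = (xi - xj) * (yi - yj)
        if prod > 0:
            nc += 1
        elif prod < 0:
            nd += 1
        # prod == 0: the pair is tied in x or y and counts as neither
    n0 = n * (n - 1) // 2
    n1 = sum(t * (t - 1) // 2 for t in Counter(x).values())  # pairs tied in x
    n2 = sum(u * (u - 1) // 2 for u in Counter(y).values())  # pairs tied in y
    denom = math.sqrt((n0 - n1) * (n0 - n2))
    if denom == 0:
        raise ValueError("tau-b is undefined when x or y is constant")
    return (nc - nd) / denom


if __name__ == "__main__":
    print(kendall_tau_b([1, 2, 2, 3], [1, 2, 3, 3]))
```

With no ties, n_1 = n_2 = 0 and the expression reduces to Tau-a.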

[16] Contrary to Tau-b, Tau-c can be equal to +1 or −1 for non-square (i.e. rectangular) contingency tables,[15][16] i.e. when the underlying scales of the two variables have different numbers of possible values.

A Tau-c equal to 1 can be interpreted as the best possible positive correlation conditional on the marginal distributions, while a Tau-b equal to 1 can be interpreted as a perfect positive monotonic correlation, in which the distribution of X conditional on Y has zero variance and the distribution of Y conditional on X has zero variance, so that a bijective function f with f(X) = Y exists.
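To illustrate how Tau-c can reach 1 on a rectangular table, here is a minimal sketch assuming the Stuart-Kendall form τ_c = 2(n_c − n_d) / (n²(m − 1)/m), with m the smaller of the number of rows and columns. The function name `kendall_tau_c` is hypothetical, and taking the table dimensions from the distinct observed values is a simplifying assumption.

```python
from itertools import combinations


def kendall_tau_c(x, y):
    """Stuart-Kendall tau-c from raw paired observations (O(n^2) pair scan)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    s = 0  # running value of n_c - n_d
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        prod = (xi - xj) * (yi - yj)
        s += (prod > 0) - (prod < 0)
    # Assumption: rows and columns are the distinct values actually observed.
    m = min(len(set(x)), len(set(y)))
    if m < 2:
        raise ValueError("tau-c needs at least two distinct values per variable")
    return 2 * s / (n * n * (m - 1) / m)


if __name__ == "__main__":
    # A 2 x 3 table: x takes two values, y takes three.
    print(kendall_tau_c([1, 1, 2, 2], [1, 1, 2, 3]))
```

In this 2 × 3 example all four untied pairs are concordant, so Tau-c equals 1, whereas Tau-b for the same data would stay below 1 because of the ties.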

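The Stuart-Kendall Tau-c coefficient is defined as:[16]

τ_c = 2(n_c − n_d) / (n² (m − 1)/m)

where n_c is the number of concordant pairs, n_d is the number of discordant pairs, n is the total number of observations, and m = min(r, c) for a contingency table with r rows and c columns.

When two quantities are statistically dependent, the distribution of τ is not easily characterizable in terms of known distributions. However, for τ_A the following statistic, z_A, is approximately distributed as a standard normal when the variables are statistically independent:

z_A = (n_c − n_d) / √(v_0 / 18)

where v_0 = n(n − 1)(2n + 5).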
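The normal approximation can be turned into a simple significance test. The sketch below assumes tie-free data and the statistic z_A = (n_c − n_d) / √(n(n − 1)(2n + 5)/18); the function name `kendall_z_test` is hypothetical, and the two-sided p-value is obtained from the standard normal tail via `math.erfc`.

```python
import math
from itertools import combinations


def kendall_z_test(x, y):
    """Return (z_A, two-sided p-value) under the null of independence.

    Assumes no ties in x or y; tied data need a corrected variance.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    s = 0  # n_c - n_d
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        prod = (xi - xj) * (yi - yj)
        s += (prod > 0) - (prod < 0)
    v0 = n * (n - 1) * (2 * n + 5)
    z = s / math.sqrt(v0 / 18)
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


if __name__ == "__main__":
    print(kendall_z_test(list(range(10)), list(range(10))))
```

The approximation is asymptotic, so for very small samples an exact permutation-based test may be preferable.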

The direct computation of the numerator n_c − n_d involves two nested iterations over all pairs of observations. Although quick to implement, this algorithm is O(n²) in complexity and becomes very slow on large samples.
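The nested iteration over pairs can be sketched in Python as follows. The function names `kendall_numerator` and `kendall_tau_a` are illustrative, and `itertools.combinations` stands in for the two explicit loops.

```python
from itertools import combinations


def sign(v):
    """Return -1, 0 or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def kendall_numerator(x, y):
    """n_c - n_d by visiting every unordered pair once (O(n^2))."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    total = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie
        total += sign(xi - xj) * sign(yi - yj)
    return total


def kendall_tau_a(x, y):
    """Tau-a: the numerator divided by the total number of pairs."""
    n = len(x)
    return kendall_numerator(x, y) / (n * (n - 1) / 2)


if __name__ == "__main__":
    print(kendall_numerator([1, 2, 3, 4], [1, 3, 2, 4]))
    print(kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4]))
```

The work grows with the number of pairs, n(n − 1)/2, which is what makes this approach impractical for large n.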

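The number of swaps contributed by each merge operation is computed while the two sorted halves are merged. A side effect of the above steps is that you end up with both a sorted version of x and a sorted version of y. With these, the tie counts used to compute Tau-b are easily obtained in a single linear-time pass through the sorted arrays.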
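A merge-sort-based computation can be sketched as follows. This is a simplified version that assumes no ties in x or y, in which case the numerator equals n(n − 1)/2 minus twice the number of swaps a bubble sort would need to order y after sorting the pairs by x. The function names `count_swaps` and `kendall_numerator_fast` are hypothetical.

```python
def count_swaps(seq):
    """Return (sorted copy of seq, number of bubble-sort swaps needed)."""
    n = len(seq)
    if n < 2:
        return list(seq), 0
    mid = n // 2
    left, swaps_left = count_swaps(seq[:mid])
    right, swaps_right = count_swaps(seq[mid:])
    merged = []
    swaps = swaps_left + swaps_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            # right[j] jumps ahead of every element still waiting in left
            merged.append(right[j])
            j += 1
            swaps += len(left) - i
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, swaps


def kendall_numerator_fast(x, y):
    """n_c - n_d in O(n log n); assumes no ties in x or in y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    y_by_x = [yi for _, yi in sorted(zip(x, y))]  # order y by increasing x
    _, swaps = count_swaps(y_by_x)                # swaps = discordant pairs
    return n * (n - 1) // 2 - 2 * swaps


if __name__ == "__main__":
    print(kendall_numerator_fast([1, 2, 3, 4], [1, 3, 2, 4]))
```

Handling ties requires additional correction terms for pairs tied in x, in y, and in both, which this sketch deliberately leaves out.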

Efficient algorithms for calculating the Kendall rank correlation coefficient as per the standard estimator have O(n log n) time complexity.

Fortunately, algorithms do exist to estimate the Kendall rank correlation coefficient in sequential settings.

These algorithms have O(1) update time and space complexity, scaling efficiently with the number of observations.

The first such algorithm[19] presents an approximation to the Kendall rank correlation coefficient based on coarsening the joint distribution of the random variables.

The second algorithm[20] is based on Hermite series estimators and uses an alternative estimator for the exact Kendall rank correlation coefficient, i.e. for the probability of concordance minus the probability of discordance of pairs of bivariate observations.

This algorithm[20] is only applicable to continuous random variables, but it has demonstrated superior accuracy and potential speed gains compared to the first algorithm described,[19] along with the capability to handle non-stationary data without relying on sliding windows.