Medcouple

[1] It is defined as a scaled median difference between the left and right half of a distribution.

Its robustness makes it suitable for identifying outliers in adjusted boxplots.

[2][3] Ordinary box plots do not fare well with skew distributions, since they label the longer unsymmetrical tails as outliers.

As a kind of order statistic, the medcouple belongs to the class of incomplete generalised L-statistics.

[1] Like the ordinary median or mean, the medcouple is a nonparametric statistic, thus it can be computed for any distribution.

For the special case of values tied to the median, we define the kernel by the signum function.

, it belongs to the class of incomplete generalised L-statistics.

Thus, the medcouple is independent of the mean and standard deviation of a distribution, a desirable property for measuring skewness.

For ease of computation, these properties enable us to define the two sets where

, the medcouple kernel reduces to Using the recentred and rescaled set

With properties 1, 2, and 4, we can thus define the following matrix, If we sort the sets

The fact that the rows and columns are sorted allows the implementation of a fast algorithm for computing the medcouple.

The breakdown point is the number of values that a statistic can resist before it becomes meaningless, i.e. the number of arbitrarily large outliers that the data set

For the medcouple, the breakdown point is 25%, since it is a median taken over the couples

[1]: 998 Before presenting medcouple algorithms, we recall that there exist

Since the medcouple is a median, ordinary algorithms for median-finding are important.

entries in the matrix in the case when all elements of the data set

The final call to median on a vector of size

operations, hence the entire naïve medcouple algorithm is of the same complexity.

, we instead exploit the monotonicity in rows and columns, via the following observations.

This method works because of the row-column sorted property of

vector can be visualised as establishing a boundary on the matrix as suggested by the following diagram, where the red entries are all larger than

in the opposite direction, from the top right to the bottom left: This lower boundary can be visualised like so, where the blue entries are smaller than

, with strict inequality occurring only for those rows that have values equal to

For example, the median of the row medians across the entire matrix is less than the upper left quadrant in red, but greater than the lower right quadrant in blue: More generally, using the boundaries given by the

vectors from the previous section, we can assume that after some iterations, we have pinpointed the position of the medcouple to lie between the red left boundary and the blue right boundary:[4]: 149 The yellow entries indicate the median of each row.

If we mentally re-arrange the rows so that the medians align and ignore the discarded entries outside the boundaries, we can select a weighted median of these medians, each entry weighted by the number of remaining entries on this row.

time, since the rows are sorted, and the weighted median can be computed in

[4]: 148 Putting together these two observations, the fast medcouple algorithm proceeds broadly as follows.

In real-world use, the algorithm also needs to account for errors arising from finite-precision floating point arithmetic.

A histogram of 5000 random values sampled from a skew gamma distribution above, and the corresponding histogram of the medcouple kernel values below. The actual medcouple is the median of the bottom distribution, marked at 0.188994 with a yellow line.
A visualisation of the fast medcouple algorithm. It begins with a matrix with sorted rows and sorted columns, where darker squares are smaller than lighter squares. At each iteration, the weighted median of row medians is picked, in yellow. It is then compared to the rest of the matrix to produce candidate red upper and blue lower boundaries. The algorithm then selects the boundary which is known to exclude the global matrix median, by considering the number of entries excluded by this boundary (which is equivalent to considering the rank of the yellow entry). The algorithm then proceeds until the yellow weighted median of row medians is exactly the medcouple, or the number of candidate entries is small enough to perform a selection sort amongst the remaining entries.