Theil–Sen estimator

In Sen's definition, one takes the median of the slopes defined only from pairs of points having distinct x coordinates.
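As a minimal sketch of this definition (the function name `theil_sen_slope` is chosen here for illustration and does not come from any library), the slope can be computed by brute force over all pairs, skipping pairs whose x coordinates coincide:

```python
from itertools import combinations
from statistics import median

def theil_sen_slope(xs, ys):
    """Median of pairwise slopes, using only pairs with distinct x."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x1 != x2  # a pair with equal x has no defined slope
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct x")
    return median(slopes)

# Four points on y = 2x + 1 plus one gross outlier at x = 4.
print(theil_sen_slope([0, 1, 2, 3, 4], [1, 3, 5, 7, 100]))  # 2.0
```

This version takes quadratic time in the number of points, which is acceptable for small samples only.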

[12] As Sen observed, this choice of slope makes the Kendall tau rank correlation coefficient approximately zero when it is used to compare the values x_i with their associated residuals y_i − m x_i − b.

The choice of b does not affect the Kendall coefficient, but causes the median residual to become approximately zero; that is, the fit line passes above and below equal numbers of points.
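The two properties can be checked numerically on a small example. The sketch below assumes that b is taken as the median of the values y_i − m x_i, which is one common choice; the helper names `theil_sen_line` and `kendall_s` are illustrative, and `kendall_s` returns only the numerator of Kendall's tau (concordant minus discordant pairs).

```python
from itertools import combinations
from statistics import median

def theil_sen_line(xs, ys):
    """Return (m, b): median pairwise slope and median of y - m*x."""
    pts = list(zip(xs, ys))
    m = median((y2 - y1) / (x2 - x1)
               for (x1, y1), (x2, y2) in combinations(pts, 2) if x1 != x2)
    b = median(y - m * x for x, y in pts)
    return m, b

def kendall_s(us, vs):
    """Concordant minus discordant pair count (numerator of Kendall's tau)."""
    def sign(t):
        return (t > 0) - (t < 0)
    return sum(sign(u2 - u1) * sign(v2 - v1)
               for (u1, v1), (u2, v2) in combinations(zip(us, vs), 2))

xs = [0, 1, 2, 3, 4]
ys = [0.0, 1.5, 1.0, 4.0, 9.0]
m, b = theil_sen_line(xs, ys)
residuals = [y - m * x - b for x, y in zip(xs, ys)]

print(m, b)                      # 1.875 -0.375
print(median(residuals))         # 0.0
print(kendall_s(xs, residuals))  # 0
```

In this example two residuals are positive, two are negative, and one is exactly zero, so the line has equal numbers of points above and below it. With tied slopes or an even number of points the two quantities are only approximately zero.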

According to simulations, approximately 600 sample pairs are sufficient to determine an accurate confidence interval.
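The following is a rough sketch of how such a sampled interval might be computed. It assumes that the interval is taken as the middle 95% of the slopes of randomly sampled pairs, which the sentence above does not spell out; the function name, the `level` and `seed` parameters, and the simple percentile rule are all illustrative choices.

```python
import random

def sampled_slope_interval(xs, ys, n_pairs=600, level=0.95, seed=0):
    """Approximate interval for the slope from randomly sampled pairs."""
    if len(set(xs)) < 2:
        raise ValueError("need at least two distinct x values")
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    while len(slopes) < n_pairs:
        i, j = rng.sample(range(n), 2)  # two distinct indices
        if xs[i] != xs[j]:              # skip pairs with undefined slope
            slopes.append((ys[j] - ys[i]) / (xs[j] - xs[i]))
    slopes.sort()
    tail = (1 - level) / 2
    lo = slopes[int(tail * (n_pairs - 1))]        # lower percentile
    hi = slopes[int((1 - tail) * (n_pairs - 1))]  # upper percentile
    return lo, hi

# Points near y = 2x with a small alternating perturbation.
xs = list(range(50))
ys = [2 * x + (0.5 if x % 2 else -0.5) for x in xs]
lo, hi = sampled_slope_interval(xs, ys)
```

Sampling a fixed number of pairs keeps the cost independent of the total number of pairs, which grows quadratically with the sample size.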

It can tolerate a greater number of outliers than the Theil–Sen estimator, but known algorithms for computing it efficiently are more complicated and less practical.

The Theil–Sen estimator has a breakdown point of 1 − 1/√2 ≈ 29.3%, meaning that it can tolerate arbitrary corruption of up to 29.3% of the input data points without degradation of its accuracy.
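A short heuristic argument shows where the 29.3% figure comes from. The notation is introduced here for illustration: ε is the fraction of corrupted points, and for a large sample roughly (1 − ε)² of all pairs consist of two uncorrupted points. The median of the pairwise slopes stays under control as long as such pairs make up at least half of all pairs:

```latex
(1-\varepsilon)^2 \ge \tfrac{1}{2}
\quad\Longleftrightarrow\quad
\varepsilon \le 1 - \tfrac{1}{\sqrt{2}} \approx 0.293
```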

[20] A higher breakdown point, 50%, holds for a different robust line-fitting algorithm, the repeated median estimator of Siegel.
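As an illustrative sketch of the repeated median idea (the function name is chosen here, and this is the straightforward quadratic-time version rather than any of the efficient algorithms), one first takes, for each point, the median of the slopes to all other points, and then takes the median of those per-point medians:

```python
from statistics import median

def repeated_median_slope(xs, ys):
    """Median over points of the median slope from that point to the others."""
    pts = list(zip(xs, ys))
    per_point = []
    for i, (xi, yi) in enumerate(pts):
        slopes = [(yj - yi) / (xj - xi)
                  for j, (xj, yj) in enumerate(pts)
                  if j != i and xj != xi]  # skip self and equal-x pairs
        if slopes:
            per_point.append(median(slopes))
    if not per_point:
        raise ValueError("need at least two points with distinct x")
    return median(per_point)

# Five points on y = 2x + 1 and three gross outliers (37.5% corrupted).
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1, 3, 5, 7, 9, 100, 120, 150]
print(repeated_median_slope(xs, ys))  # 2.0
```

On these data only 10 of the 28 pairs join two uncorrupted points, so the plain median of pairwise slopes is pulled above 2, while the repeated median still recovers the slope of the uncorrupted line.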

[22] The problem of performing slope selection exactly but more efficiently than the brute force quadratic time algorithm has been extensively studied in computational geometry.

[24] An estimator for the slope with approximately median rank, having the same breakdown point as the Theil–Sen estimator, may be maintained in the data stream model (in which the sample points are processed one by one by an algorithm that does not have enough persistent storage to represent the entire data set) using an algorithm based on ε-nets.

[26] A free standalone Visual Basic application for Theil–Sen estimation, KTRLine, has been made available by the US Geological Survey.

[29] In biophysics, Fernandes & Leblanc (2005) suggest its use for remote sensing applications such as the estimation of leaf area from reflectance data due to its "simplicity in computation, analytical estimates of confidence intervals, robustness to outliers, testable assumptions regarding residuals and ... limited a priori information regarding measurement errors".

The Theil–Sen estimator of a set of sample points with outliers (black line) compared to the non-robust ordinary least squares line for the same set (blue). The dashed green line represents the ground truth from which the samples were generated.