As in the computation of, for example, standard deviation, the estimation of a quantile depends upon whether one is operating with a statistical population or with a sample drawn from it.
That is, x is a k-th q-quantile for a variable X if Pr[X < x] ≤ k/q and Pr[X ≤ x] ≥ k/q. For a finite population of N equally probable values indexed 1, …, N from lowest to highest, the k-th q-quantile of this population can equivalently be computed via the index Ip = N k/q: if Ip is not an integer, round it up to the next integer and take the value at that position; if Ip is an integer, any value between the Ip-th and the (Ip + 1)-th data value is a valid quantile, and it is conventional to take their average.
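As a concrete sketch of this computation, assuming the rounding convention just described (round a fractional Ip up; average the two adjacent values when Ip is an integer):

```python
# Sketch: k-th q-quantile of a finite population via the index Ip = N*k/q.
# Assumes 0 < k < q and the convention of rounding a fractional Ip up and
# averaging adjacent values when Ip is an integer.
import math

def population_quantile(values, k, q):
    xs = sorted(values)
    N = len(xs)
    Ip = N * k / q
    if Ip != int(Ip):                  # fractional index: round up
        return xs[math.ceil(Ip) - 1]   # convert 1-based index to 0-based
    i = int(Ip)
    return 0.5 * (xs[i - 1] + xs[i])   # integer index: average neighbours

data = [3, 6, 7, 8, 8, 10, 13, 15, 16, 20]
print(population_quantile(data, 1, 2))  # median: Ip = 5, average of 8 and 10
print(population_quantile(data, 1, 4))  # first quartile: Ip = 2.5, round up
```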
This broader terminology is used when quantiles are used to parameterize continuous probability distributions.
Moreover, some software programs (including Microsoft Excel) regard the minimum and maximum as the 0th and 100th percentile, respectively.
By the one-sided Chebyshev (Cantelli) inequality, μ + zσ ≥ Q(z²/(1 + z²)): the value z standard deviations above the mean bounds the corresponding quantile from above. For example, the value that is z = 1 standard deviation above the mean is always greater than or equal to Q(p = 0.5), the median, and the value that is z = 2 standard deviations above the mean is always greater than or equal to Q(p = 0.8), the fourth quintile.
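As a numerical illustration (not a proof) of this bound, the following sketch checks it on a strongly skewed sample; the choice of the exponential distribution and the sample size are arbitrary:

```python
# Numerical illustration of the one-sided Chebyshev (Cantelli) bound
# Q(z^2/(1+z^2)) <= mu + z*sigma, applied to a skewed (exponential) sample.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)
mu, sigma = x.mean(), x.std()     # ddof=0: treat the sample as a population

for z in (1.0, 2.0):
    p = z**2 / (1 + z**2)         # z=1 -> p=0.5, z=2 -> p=0.8
    q = np.quantile(x, p)
    print(f"z={z}: Q({p:.1f}) = {q:.3f} <= mu + z*sigma = {mu + z*sigma:.3f}")
```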
One problem which frequently arises is estimating a quantile of a (very large or infinite) population based on a finite sample of size N. Modern statistical packages rely on a number of techniques to estimate the quantiles.
All methods compute Qp, the estimate for the p-quantile (the k-th q-quantile, where p = k/q) from a sample of size N by computing a real valued index h. When h is an integer, the h-th smallest of the N values, xh, is the quantile estimate.
Mathematica,[3] Matlab,[4] R[5] and GNU Octave[6] programming languages support all nine sample quantile methods.
The nine estimate types differ in the formula used for the index h and in the interpolation scheme applied. Of the techniques, Hyndman and Fan recommend R-8, but most statistical software packages have chosen R-6 or R-7 as the default.
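A minimal sketch of three of these rules, using the index formulas from the Hyndman and Fan numbering (R-6: h = (N + 1)p; R-7: h = (N − 1)p + 1; R-8: h = (N + 1/3)p + 1/3), each followed by linear interpolation between the two nearest order statistics:

```python
# Sample quantile via a real-valued index h into the sorted sample (1-based),
# followed by linear interpolation. Index formulas follow Hyndman & Fan.
import math

def quantile(xs, p, method="R-7"):
    xs = sorted(xs)
    N = len(xs)
    h = {"R-6": (N + 1) * p,
         "R-7": (N - 1) * p + 1,
         "R-8": (N + 1/3) * p + 1/3}[method]
    h = min(max(h, 1), N)              # clamp h into [1, N]
    lo = math.floor(h)
    frac = h - lo                      # fractional part drives interpolation
    hi = min(lo + 1, N)
    return xs[lo - 1] + frac * (xs[hi - 1] - xs[lo - 1])

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for m in ("R-6", "R-7", "R-8"):
    print(m, quantile(data, 0.25, m))
```

With N = 10 and p = 0.25 the three rules give three different first quartiles (2.75, 3.25, and 35/12 ≈ 2.917), which is exactly the spread of defaults across software packages discussed above.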
[14] The sample median is the most studied of the quantiles: it provides an alternative estimate of a location parameter when the expected value of the distribution does not exist, so that the sample mean is not a meaningful estimator of a population characteristic.
A solution to this problem is to use an alternative definition of sample quantiles through the concept of the "mid-distribution" function, which is defined as F_mid(x) = F(x) − ½ Pr(X = x). Defining sample quantiles through the mid-distribution function can be seen as a generalization that covers continuous distributions as a special case.
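Evaluated on the empirical distribution of a sample, the mid-distribution function F_mid(x) = F(x) − ½ Pr(X = x) can be sketched as:

```python
# Sample mid-distribution function: the empirical CDF minus half the
# empirical point mass at x.
def mid_distribution(xs, x):
    N = len(xs)
    le = sum(1 for v in xs if v <= x)  # empirical F(x)
    eq = sum(1 for v in xs if v == x)  # empirical Pr(X = x)
    return (le - 0.5 * eq) / N

data = [1, 2, 2, 3, 5]
print(mid_distribution(data, 2))  # F(2) = 3/5, minus half the mass 2/5 at 2
```

For a continuous sample with no ties, the correction term vanishes and this reduces to the ordinary empirical CDF, which is the sense in which the definition generalizes the continuous case.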
The t-digest maintains a data structure of bounded size using an approach motivated by k-means clustering to group similar values.
The KLL algorithm uses a more sophisticated "compactor" method that gives better control of the error bounds, at the cost of requiring unbounded size if errors must be bounded relative to p. Both methods belong to the family of data sketches, a subset of streaming algorithms with useful properties: t-digest or KLL sketches computed on separate data streams can be combined into a sketch of their union.
Another class of algorithms exists which assumes that the data are realizations of a random process.
There are a number of such algorithms such as those based on stochastic approximation[18][19] or Hermite series estimators.
[20] These statistics-based algorithms typically have constant update time and space complexity, but they make stronger assumptions and carry different error-bound guarantees than the computer-science methods above. They do offer certain advantages, however, particularly in the non-stationary streaming setting, i.e. with time-varying data.
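One classical estimator in this family, based on stochastic approximation, updates a running quantile estimate by a shrinking step each time an observation arrives. The following is a sketch under a Robbins–Monro-style update; the step constant c is a tuning choice, not a prescribed value:

```python
# Stochastic-approximation (Robbins-Monro style) streaming p-quantile sketch:
# the estimate moves up when a new observation exceeds it, down otherwise,
# with step sizes shrinking like c/n. Constant update time and O(1) space.
import random

def streaming_quantile(stream, p, c=2.0):
    q = None
    for n, x in enumerate(stream, start=1):
        if q is None:
            q = x                         # initialize at the first observation
        else:
            q += (c / n) * (p - (x <= q))  # indicator (x <= q) is 0 or 1
    return q

random.seed(0)
data = (random.gauss(0, 1) for _ in range(200_000))
print(streaming_quantile(data, 0.5))  # should approach the true median, 0
```

Note the assumption this class of methods makes: the observations must behave like draws from an underlying distribution for the estimate to converge, which is exactly the extra assumption the text contrasts with the sketch-based methods.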
[21] Standardized test results are commonly reported as a student scoring "in the 80th percentile", for example.
[22] This separate meaning of percentile is also used in peer-reviewed scientific research articles.
This is because the exponential distribution has a long tail for positive values but is zero for negative numbers.
Quantiles are useful measures because they are less susceptible than means to long-tailed distributions and outliers.
Empirically, if the data being analyzed are not actually distributed according to an assumed distribution, or if there are other potential sources for outliers that are far removed from the mean, then quantiles may be more useful descriptive statistics than means and other moment-related statistics.
Closely related is the subject of least absolute deviations, a method of regression that is more robust to outliers than is least squares, in which the sum of the absolute value of the observed errors is used in place of the squared error.
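The robustness claim is easy to illustrate numerically: a single extreme outlier shifts the mean substantially but leaves the median unchanged.

```python
# One far-out observation moves the mean from 10 to 120 but leaves the
# median at 10.
import statistics

clean = [9, 10, 10, 11, 10, 9, 11, 10]
dirty = clean + [1000]                # a single extreme outlier

print(statistics.mean(clean), statistics.median(clean))
print(statistics.mean(dirty), statistics.median(dirty))
```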
The quantiles of a random variable are preserved under increasing transformations, in the sense that, for example, if m is the median of a random variable X, then 2m is the median of 2X, unless an arbitrary choice has been made from a range of values to specify a particular quantile.
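This equivariance can be checked directly on a small sample with an odd number of points (so the median is a data value and no interpolation choice intervenes); the exponential transform below is one arbitrary choice of increasing function:

```python
# Quantiles commute with increasing transformations; the mean does not.
import math
import statistics

data = [1, 2, 3, 4, 5]
m = statistics.median(data)  # 3

# Linear transform, as in the text: median(2X) == 2 * median(X)
assert statistics.median([2 * x for x in data]) == 2 * m

# Nonlinear increasing transform: the median still commutes...
exp_data = [math.exp(x) for x in data]
print(statistics.median(exp_data), math.exp(m))                      # equal
# ...but the mean does not (Jensen's inequality).
print(statistics.mean(exp_data), math.exp(statistics.mean(data)))    # not equal
```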
Values that divide sorted data into equal-sized subsets numbering other than four have their own names.