Cumulative frequency analysis

Cumulative frequency analysis is performed to obtain insight into how often a certain phenomenon (feature) is below a certain value.

This may help in describing or explaining a situation in which the phenomenon is involved, or in planning interventions, for example in flood protection.

It can be adapted to bring in things like climate change causing wetter winters and drier summers.

Another way is to take into account the possibility that in rare cases X may assume values larger than the observed maximum Xmax.

This can be done dividing the cumulative frequency M by N+1 instead of N. The estimate then becomes: There exist also other proposals for the denominator (see plotting positions).

Probability distributions can be fitted by several methods,[2] for example: Application of both types of methods using for example often shows that a number of distributions fit the data well and do not yield significantly different results, while the differences between them may be small compared to the width of the confidence interval.

The figure gives an example of a useful introduction of such a discontinuous distribution for rainfall data in northern Peru, where the climate is subject to the behavior Pacific Ocean current El Niño.

According to the normal theory, the binomial distribution can be approximated and for large N standard deviation Sd can be calculated as follows: where Pc is the cumulative probability and N is the number of data.

It is seen that the standard deviation Sd reduces at an increasing number of observations N. The determination of the confidence interval of Pc makes use of Student's t-test (t).

Then, the lower (L) and upper (U) confidence limits of Pc in a symmetrical distribution are found from: This is known as Wald interval.

The probability of exceedance Pe (also called survival function) is found from: The return period T defined as: and indicates the expected number of observations that have to be done again to find the value of the variable in study greater than the value used for T. The upper (TU) and lower (TL) confidence limits of return periods can be found respectively as: For extreme values of the variable in study, U is close to 1 and small changes in U originate large changes in TU.

The strict notion of return period actually has a meaning only when it concerns a time-dependent phenomenon, like point rainfall.

[1] The confidence belt around an experimental cumulative frequency or return period curve gives an impression of the region in which the true distribution may be found.

Often it is desired to combine the histogram with a probability density function as depicted in the black and white picture.

Cumulative frequency distribution, adapted cumulative probability distribution, and confidence intervals
Ranked cumulative probabilities
Different cumulative normal probability distributions with their parameters
Cumulative frequency distribution with a discontinuity
Binomial distributions for Pc = 0.1 (blue), 0.5 (green) and 0.8 (red) in a sample of size N = 20 . The distribution is symmetrical only when Pc = 0.5
90% binomial confidence belts on a log scale.
Return periods and confidence belt. The curve of the return periods increases exponentially.
Nine return-period curves of 50-year samples from a theoretical 1000-year record (base line)
Histogram derived from the adapted cumulative probability distribution
Histogram and probability density function, derived from the cumulative probability distribution, for a logistic distribution .