In medical research, it is often used to measure the fraction of patients living for a certain amount of time after treatment.
In other fields, Kaplan–Meier estimators may be used to measure the length of time people remain unemployed after a job loss,[3] the time-to-failure of machine parts, or how long fleshy fruits remain on plants before they are removed by frugivores.
The estimator is named after Edward L. Kaplan and Paul Meier, who each submitted similar manuscripts to the Journal of the American Statistical Association.[4]
The journal editor, John Tukey, convinced them to combine their work into one paper, which has been cited more than 34,000 times since its publication in 1958.
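The estimator of the survival function $S(t)$ (the probability that life is longer than $t$) is given by $$\hat S(t) = \prod_{i:\ t_i \le t} \left(1 - \frac{d_i}{n_i}\right),$$ with $t_i$ a time when at least one event happened, $d_i$ the number of events (e.g., deaths) that happened at time $t_i$, and $n_i$ the individuals known to have survived (have not yet had an event or been censored) up to time $t_i$.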
A plot of the Kaplan–Meier estimator is a series of declining horizontal steps which, with a large enough sample size, approaches the true survival function for that population.
The value of the survival function between successive distinct sampled observations ("clicks") is assumed to be constant.
An important advantage of the Kaplan–Meier curve is that the method can take into account some types of censored data, particularly right-censoring, which occurs if a patient withdraws from a study, is lost to follow-up, or is alive without event occurrence at last follow-up.
On the plot, small vertical tick-marks indicate individual patients whose survival times have been right-censored.
When no truncation or censoring occurs, the Kaplan–Meier curve is the complement of the empirical distribution function.
In medical statistics, a typical application might involve grouping patients into categories, for instance, those with Gene A profile and those with Gene B profile.
To generate a Kaplan–Meier estimator, at least two pieces of data are required for each patient (or each subject): the status at last observation (event occurrence or right-censored), and the time to event (or time to censoring).
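The two required pieces of data map directly onto a small computation. The following is a minimal sketch, not a reference implementation: the function name `kaplan_meier` and the sample data are hypothetical, and it assumes the usual convention that a subject censored at time $t$ still counts as at risk at $t$. It multiplies the factors $1 - d_i/n_i$ over the distinct event times, where $d_i$ is the number of events at time $t_i$ and $n_i$ the number of subjects still at risk.

```python
from collections import Counter
from fractions import Fraction


def kaplan_meier(times, events):
    """Return [(t_i, S_hat(t_i))] at each distinct event time t_i.

    times  -- time to event or time to censoring, one entry per subject
    events -- True if the event was observed, False if right-censored
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # events and censorings both leave the risk set
    at_risk = len(times)      # n_i: subjects with observed time >= t_i
    survival = Fraction(1)
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)  # d_i: events observed exactly at time t
        if d:
            survival *= 1 - Fraction(d, at_risk)
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Hypothetical data: six subjects, two of them right-censored (at 5 and 10).
curve = kaplan_meier([3, 5, 5, 8, 10, 12],
                     [True, True, False, True, False, True])
# S_hat is 5/6 at t=3, 2/3 at t=5, 4/9 at t=8 and 0 at t=12.
```

Exact fractions are used here only to keep the steps easy to check by hand; floating-point arithmetic would serve equally well for real data. With no censored subjects the same function returns one minus the empirical distribution function at each event time.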
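Let $\tau \ge 0$ be a random variable giving the time that passes between the start of the possible exposure period, $t_0$, and the time that the event of interest takes place, $t_1$.
The goal is to estimate the survival function $S(t) = \operatorname{Prob}(\tau > t)$ underlying $\tau$.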
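The observations are pairs $(\tilde\tau_k, c_k)$ for $k = 1, \dots, n$, where $c_k \ge 0$ is a fixed, deterministic integer, the censoring time of event $k$, and $\tilde\tau_k = \min(\tau_k, c_k)$ is the time actually observed for the random event time $\tau_k$.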
The Kaplan–Meier estimator can be derived in two ways; both are based on rewriting the survival function in terms of what is sometimes called the hazard, or mortality rates.
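A basic argument shows that the following proposition holds: if the censoring time $c_k$ of event $k$ satisfies $c_k \ge t$, then $\tilde\tau_k \ge t$ if and only if $\tau_k \ge t$, where $\tilde\tau_k = \min(\tau_k, c_k)$ is the observed time.
Let $X_k = \mathbb{I}(\tilde\tau_k \ge t)$ and consider only those $k \in C(t) := \{k : c_k \ge t\}$; then $(X_k)_{k \in C(t)}$ is a sequence of independent, identically distributed Bernoulli random variables with common parameter $\operatorname{Prob}(\tau \ge t)$, which suggests the naive estimator $\frac{1}{m(t)} \sum_{k \in C(t)} X_k$ with $m(t) = |C(t)|$.
The quality of this estimate is poor when $m(t)$ is small, which happens, by definition, when a lot of the events are censored.
The estimator also ignores every observation whose censoring time precedes $t$, although these observations carry information: when for many events with $c_k < t$ the inequality $\tau_k < c_k$ also holds, we can infer that events often happen early, which implies that $\operatorname{Prob}(\tau \le t)$ is large and hence that $S(t)$ must be small.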
Note that the naive estimator cannot be improved when censoring does not take place; so whether an improvement is possible critically hinges upon whether censoring is in place.
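The naive estimator is simple enough to sketch in a few lines. The function name `naive_survival` and the data below are hypothetical, and the sketch assumes that the fixed censoring time $c_k$ is known for every subject, including those whose event was observed. It returns the fraction of subjects with observed time at least $t$ among those whose censoring time is at least $t$.

```python
from fractions import Fraction


def naive_survival(observed, censoring, t):
    """Naive estimate of Prob(tau >= t) from pairs (observed_k, c_k).

    observed  -- min(tau_k, c_k) for each subject
    censoring -- the fixed censoring time c_k for each subject
    """
    # Only subjects that could not have been censored before t are used.
    eligible = [o for o, c in zip(observed, censoring) if c >= t]
    if not eligible:
        raise ValueError("no subject has a censoring time >= t")
    still_event_free = sum(1 for o in eligible if o >= t)
    return Fraction(still_event_free, len(eligible))


# Hypothetical data: the first three subjects have c_k < 5 and are discarded
# at t = 5, even though each of them had its event before being censored.
observed = [1, 2, 2, 6, 7]
censoring = [3, 3, 4, 8, 9]
estimate = naive_survival(observed, censoring, 5)
# estimate == 1, although three of the five events occurred before t = 5.
```

In this made-up sample the estimate rests on only two subjects and discards the three early events. If every censoring time is at least $t$, no subject is discarded and the function returns the plain empirical fraction.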
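By elementary calculations, $$S(t) = \operatorname{Prob}(\tau > t \mid \tau > t-1)\,\operatorname{Prob}(\tau > t-1) = \bigl(1 - \operatorname{Prob}(\tau = t \mid \tau \ge t)\bigr)\,S(t-1) = q(t)\,S(t-1),$$ where the second to last equality used that $\tau$ is integer valued and for the last line we introduced $q(t) = 1 - \operatorname{Prob}(\tau = t \mid \tau \ge t)$. By a recursive expansion of the equality $S(t) = q(t)\,S(t-1)$, we get $S(t) = q(t)\,q(t-1) \cdots q(0)$.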
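Let $d_i$ be the number of events and $n_i$ the number of individuals at risk at time $t_i$, and let the discrete hazard $h_i$ be the probability that an individual at risk has the event at $t_i$. The likelihood of the hazards is proportional to $\prod_i h_i^{d_i} (1 - h_i)^{n_i - d_i}$, and setting the derivative of its logarithm with respect to $h_i$ to zero yields: $$\frac{d_i}{\hat h_i} - \frac{n_i - d_i}{1 - \hat h_i} = 0 \;\Rightarrow\; \hat h_i = \frac{d_i}{n_i},$$ where hat is used to denote maximum likelihood estimation.
Given this result, we can write: $$\hat S(t) = \prod_{i:\ t_i \le t} \bigl(1 - \hat h_i\bigr) = \prod_{i:\ t_i \le t} \left(1 - \frac{d_i}{n_i}\right).$$ More generally (for continuous as well as discrete survival distributions), the Kaplan–Meier estimator may be interpreted as a nonparametric maximum likelihood estimator.[9]
The Kaplan–Meier estimator is one of the most frequently used methods of survival analysis.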
The estimate may be useful to examine recovery rates, the probability of death, and the effectiveness of treatment.
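Greenwood's formula is derived[12] by noting that the probability of observing $d_i$ failures out of $n_i$ cases follows a binomial distribution with failure probability $h_i$.
As a result, for the maximum likelihood hazard rate $\hat h_i = d_i / n_i$ we have $E(\hat h_i) = h_i$ and $\operatorname{Var}(\hat h_i) = h_i (1 - h_i) / n_i$.
To avoid dealing with multiplicative probabilities, we compute the variance of the logarithm of $\hat S(t)$ and use the delta method to convert it back to the original variance: $$\operatorname{Var}\bigl(\log \hat S(t)\bigr) \approx \frac{1}{\hat S(t)^2} \operatorname{Var}\bigl(\hat S(t)\bigr) \;\Rightarrow\; \operatorname{Var}\bigl(\hat S(t)\bigr) \approx \hat S(t)^2 \operatorname{Var}\bigl(\log \hat S(t)\bigr).$$
Using the martingale central limit theorem, it can be shown that the variance of the sum in the following equation is equal to the sum of variances:[12] $$\log \hat S(t) = \sum_{i:\ t_i \le t} \log\bigl(1 - \hat h_i\bigr).$$
As a result we can write: $$\operatorname{Var}\bigl(\hat S(t)\bigr) \approx \hat S(t)^2 \sum_{i:\ t_i \le t} \operatorname{Var}\bigl(\log(1 - \hat h_i)\bigr).$$
Using the delta method once more: $$\operatorname{Var}\bigl(\hat S(t)\bigr) \approx \hat S(t)^2 \sum_{i:\ t_i \le t} \left(\frac{1}{1 - \hat h_i}\right)^2 \frac{\hat h_i (1 - \hat h_i)}{n_i} = \hat S(t)^2 \sum_{i:\ t_i \le t} \frac{\hat h_i}{n_i (1 - \hat h_i)} = \hat S(t)^2 \sum_{i:\ t_i \le t} \frac{d_i}{n_i (n_i - d_i)},$$ as desired.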
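The end result of this derivation, $\operatorname{Var}(\hat S(t)) \approx \hat S(t)^2 \sum_{i:\ t_i \le t} d_i / (n_i (n_i - d_i))$, can be accumulated alongside the survival estimate itself. The sketch below is illustrative only: the function name `km_with_greenwood` and the data are hypothetical, and returning `None` for the variance once every remaining subject has had the event (where the sum has a zero denominator) is an assumption of this sketch rather than part of the formula.

```python
from collections import Counter
from fractions import Fraction


def km_with_greenwood(times, events):
    """Return [(t_i, S_hat, variance)] at each distinct event time t_i."""
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)       # events and censorings leave the risk set
    at_risk = len(times)           # n_i
    survival = Fraction(1)
    greenwood_sum = Fraction(0)    # running sum of d_i / (n_i * (n_i - d_i))
    rows = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)       # d_i
        if d:
            survival *= 1 - Fraction(d, at_risk)
            if d < at_risk:
                greenwood_sum += Fraction(d, at_risk * (at_risk - d))
                variance = survival ** 2 * greenwood_sum
            else:
                variance = None    # n_i == d_i: the term is undefined
            rows.append((t, survival, variance))
        at_risk -= leaving[t]
    return rows


# Hypothetical data: six subjects, two of them right-censored (at 5 and 10).
rows = km_with_greenwood([3, 5, 5, 8, 10, 12],
                         [True, True, False, True, False, True])
# The variance is 5/216 at t=3, 1/27 at t=5 and 4/81 at t=8.
```

At the first event time, before any censoring, the value 5/216 coincides with the binomial variance $\hat S (1 - \hat S) / n$ for $\hat S = 5/6$ and $n = 6$, which is a quick sanity check on the accumulation.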