Missing data

Missing data can occur because of nonresponse: no information is provided for one or more items or for a whole unit ("subject").

Attrition is a type of missingness that can occur in longitudinal studies, for instance a study of development in which a measurement is repeated after a certain period of time.

Data are often missing in research in economics, sociology, and political science because governments or private entities choose not to, or fail to, report critical statistics,[1] or because the information is not available.

Because of these problems, methodologists routinely advise researchers to design studies to minimize the occurrence of missing values.

When data are missing completely at random (MCAR), the random assignment of treatments is assumed to be preserved, but MCAR is usually an unrealistically strong assumption in practice.
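What MCAR buys can be seen in a small simulation. The sketch below is illustrative only: the names `values`, `mcar_observed`, and `dependent_observed`, and the drop probabilities 0.3, 0.6, and 0.1, are assumptions made for this example and do not come from any cited study. It drops values either independently of their size (MCAR) or more often when they are positive (a non-MCAR contrast), then compares the resulting means.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

n = 20000
values = [random.gauss(0.0, 1.0) for _ in range(n)]


def mean(xs):
    return sum(xs) / len(xs)


# MCAR: every value is dropped with the same probability (0.3),
# regardless of how large or small it is.
mcar_observed = [v for v in values if random.random() > 0.3]

# Contrast (not MCAR): positive values are dropped much more often (0.6)
# than non-positive ones (0.1), so missingness depends on the value itself.
dependent_observed = [
    v for v in values if random.random() > (0.6 if v > 0 else 0.1)
]

full_mean = mean(values)
mcar_mean = mean(mcar_observed)
dependent_mean = mean(dependent_observed)

print(f"full sample mean:      {full_mean:.3f}")
print(f"mean under MCAR:       {mcar_mean:.3f}")
print(f"mean, value-dependent: {dependent_mean:.3f}")
```

Under the MCAR mechanism the observed values remain a random subsample, so their mean stays close to the full-sample mean. Under the value-dependent mechanism the observed mean is pulled clearly downward.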

Depending on the analysis method, these data can still induce parameter bias in analyses due to the contingent emptiness of cells (for example, the cell for males with very high depression may have zero entries).

Samuelson and Spirer (1992) discussed how missing and/or distorted data about demographics, law enforcement, and health could be indicators of patterns of human rights violations.

[9] Missing data can also arise in subtle ways that are not well accounted for in classical theory.

An increasingly common problem arises when data may not satisfy the missing-at-random (MAR) assumption, yet the missing values exhibit an association or structure, either explicitly or implicitly.

In these situations, missing values may relate to the various sampling methodologies used to collect the data or reflect characteristics of the wider population of interest, and so may impart useful information.

For instance, in a health context, structured missingness has been observed as a consequence of linking clinical, genomic and imaging data.

[10] The presence of structured missingness may hinder the effective use of data at scale, including through both classical statistical and current machine learning methods.

For example, the reasons why some data are missing in patterns may themselves be biased, which can have implications for predictive fairness in machine learning models.

Furthermore, established methods for dealing with missing data, such as imputation, usually do not take into account the structure of the missing data, so new formulations are needed to handle structured missingness appropriately and effectively.

Finally, characterising structured missingness within the classical framework of MCAR, MAR, and MNAR is a work in progress.

[12] In some practical applications, the experimenters can control the level of missingness and prevent missing values before gathering the data.

So missing values due to the participant are eliminated by this type of questionnaire, though this method may not be permitted by an ethics board overseeing the research.

In survey research, it is common to make multiple efforts to contact each individual in the sample, often sending letters to attempt to persuade those who have decided not to participate to change their minds.

[13]: 161–187 However, such techniques can either help or hurt in terms of reducing the negative inferential effects of missing data, because the kinds of people who are willing to be persuaded to participate after initially refusing or not being home are likely to be significantly different from the kinds of people who will still refuse or remain unreachable after additional effort.

An analysis is robust when we are confident that mild to moderate violations of the technique's key assumptions will produce little or no bias or distortion in the conclusions drawn about the population.

Rubin (1987) argued that repeating imputation even a few times (5 or fewer) enormously improves the quality of estimation.

However, too few imputations can lead to a substantial loss of statistical power, and some scholars now recommend 20 to 100 or more.
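The number of imputations matters because the results from the separately imputed datasets are combined at the end. A common way to combine them is Rubin's rules, sketched below. The function name `pool_rubin` and the example numbers are hypothetical, chosen only for illustration. Each imputed dataset is assumed to have already produced a point estimate and its squared standard error.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine per-imputation results with Rubin's rules.

    estimates: point estimate from each of the m imputed datasets
    variances: squared standard error of each of those estimates
    """
    m = len(estimates)
    pooled = mean(estimates)        # pooled point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, within, between, total


# Hypothetical results from m = 5 imputed datasets.
estimates = [10.0, 10.4, 9.8, 10.2, 10.1]
variances = [0.25, 0.25, 0.25, 0.25, 0.25]

pooled, within, between, total = pool_rubin(estimates, variances)
print(f"pooled estimate: {pooled:.2f}, total variance: {total:.3f}")
```

The factor (1 + 1/m) inflates the between-imputation variance to account for using a finite number of imputations. It shrinks toward 1 as m grows, which is one way to see why larger m gives more stable inferences.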

[15] Methods such as listwise deletion, which discards incomplete cases rather than imputing values, have been used to handle missing data, but they have been found to introduce additional bias.
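A minimal sketch of listwise deletion on a toy table follows. The records, field names, and helper functions `listwise_delete` and `available_mean` are invented for illustration. Any row with at least one missing field is dropped entirely, so observed values in that row are lost as well.

```python
def listwise_delete(rows):
    """Keep only the rows in which every field is observed (not None)."""
    return [row for row in rows if all(v is not None for v in row.values())]


def available_mean(rows, field):
    """Mean of one field over every row where that field is observed."""
    observed = [row[field] for row in rows if row[field] is not None]
    return sum(observed) / len(observed)


# Hypothetical survey records; None marks a missing item.
rows = [
    {"age": 25, "income": 30, "score": 3},
    {"age": 40, "income": None, "score": 4},
    {"age": 35, "income": 50, "score": None},
    {"age": 50, "income": 80, "score": 5},
    {"age": None, "income": 45, "score": 2},
    {"age": 60, "income": 90, "score": 4},
]

complete_rows = listwise_delete(rows)
age_all_observed = available_mean(rows, "age")             # uses 5 observed ages
age_complete_cases = available_mean(complete_rows, "age")  # uses 3 complete rows

print(len(rows), "rows before,", len(complete_rows), "rows after deletion")
print("mean age, all observed ages:", age_all_observed)
print("mean age, complete cases:   ", age_complete_cases)
```

In this toy table half of the rows are discarded, and the mean age computed from complete cases differs from the mean over all observed ages. Whether such a shift amounts to bias in a real study depends on why the values are missing.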

[17] The expectation-maximization algorithm is an approach in which the statistics that would be computed from a complete dataset are estimated (imputed), taking into account the pattern of missing data.
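As a sketch of the idea, the code below runs expectation-maximization for a bivariate normal model in which x is always observed and y is sometimes missing. The model choice, the function name `em_bivariate_normal`, the toy data, and the fixed iteration count are all assumptions made for this illustration. The E-step fills in the expected sufficient statistics for the missing y values given x, and the M-step re-estimates the mean and covariance terms from those completed statistics.

```python
def em_bivariate_normal(pairs, n_iter=2000):
    """EM estimates for (x, y) pairs where x is complete and y may be None."""
    n = len(pairs)
    xs = [x for x, _ in pairs]
    observed_y = [y for _, y in pairs if y is not None]

    # x is fully observed, so its mean and variance stay at their sample values.
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n

    # Starting values based only on the observed y values.
    mu_y = sum(observed_y) / len(observed_y)
    s_yy = sum((y - mu_y) ** 2 for y in observed_y) / len(observed_y)
    s_xy = 0.0

    for _ in range(n_iter):
        # E-step: expected y, y^2 and x*y for each row given current parameters.
        beta = s_xy / s_xx
        resid_var = s_yy - s_xy ** 2 / s_xx
        sum_y = sum_yy = sum_xy = 0.0
        for x, y in pairs:
            if y is None:
                ey = mu_y + beta * (x - mu_x)
                eyy = ey ** 2 + resid_var
            else:
                ey, eyy = y, y ** 2
            sum_y += ey
            sum_yy += eyy
            sum_xy += x * ey

        # M-step: update the parameters from the expected sufficient statistics.
        mu_y = sum_y / n
        s_yy = sum_yy / n - mu_y ** 2
        s_xy = sum_xy / n - mu_x * mu_y

    return {"mu_x": mu_x, "mu_y": mu_y, "s_xx": s_xx, "s_xy": s_xy, "s_yy": s_yy}


# Toy data: y is missing for the four largest x values.
pairs = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8),
         (5, None), (6, None), (7, None), (8, None)]

estimates = em_bivariate_normal(pairs)
print(f"EM estimate of the mean of y: {estimates['mu_y']:.3f}")
```

In this toy dataset the mean of the observed y values is 5.0, while the EM estimate settles near 8.88. The algorithm uses the relationship between x and y in the complete rows to account for the rows where only x is seen.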

[23][24][25] When data fall into the MNAR category, techniques are available for consistently estimating parameters when certain conditions hold in the model.

Finally, the estimands that emerge from these techniques are derived in closed form and do not require iterative procedures, such as expectation-maximization, that are susceptible to local optima.