Imputation (statistics)

In statistics, imputation is the process of replacing missing data with substituted values.

When one or more values are missing for a case, most statistical packages default to discarding the entire case (listwise deletion), which may introduce bias or affect the representativeness of the results.

Imputation preserves all cases by replacing missing data with an estimated value based on other available information.[2]

There have been many theories embraced by scientists to account for missing data, but the majority of them introduce bias.

For example, if 1000 cases are collected but 80 have missing values, the effective sample size after listwise deletion is 920.
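As a rough illustration of that arithmetic, the sketch below builds a simulated data set matching those numbers and applies listwise deletion with pandas; the column names and the choice of which column is incomplete are purely illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated data: 1000 cases, then 80 of them made incomplete in one column.
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["x1", "x2", "y"])
df.loc[rng.choice(1000, size=80, replace=False), "x2"] = np.nan

complete = df.dropna()          # listwise (complete-case) deletion
print(len(df), len(complete))   # 1000 920
```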

When pairwise deletion is used, the total N for analysis will not be consistent across parameter estimations.

Because different parameters are estimated from different subsets of cases, pairwise deletion can introduce mathematically impossible situations, such as estimated correlations over 100%.[5]
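A minimal sketch of that contrast, reusing the simulated DataFrame df from the previous example: pandas computes correlations with pairwise deletion by default, so each entry of the matrix can rest on a different N, whereas deleting incomplete cases first gives a listwise result in which every entry uses the same, smaller N.

```python
print(df.count())                   # per-column N: x1 and y have 1000, x2 has 920

pairwise_corr = df.corr()           # pairwise deletion (pandas default)
listwise_corr = df.dropna().corr()  # listwise deletion: every entry uses N = 920

print(pairwise_corr.round(3))
print(listwise_corr.round(3))
```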

The one advantage complete case deletion has over other methods is that it is straightforward and easy to implement.

This simplicity is a large part of why complete-case deletion remains the most popular method of handling missing data, in spite of its many disadvantages.

The term "hot deck" dates back to the storage of data on punched cards, and indicates that the information donors come from the same dataset as the recipients.

One form of hot-deck imputation is called "last observation carried forward" (or LOCF for short), which involves sorting a dataset according to any of a number of variables, thus creating an ordered dataset; each missing value is then replaced with the last observed value that precedes it in the ordering.
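A minimal LOCF sketch with pandas, assuming an illustrative longitudinal data set sorted by subject and time; within each subject, every gap is filled with the last observed value.

```python
import numpy as np
import pandas as pd

panel = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2],
    "time":    [1, 2, 3, 1, 2, 3],
    "score":   [5.0, np.nan, np.nan, 7.0, 6.5, np.nan],
}).sort_values(["subject", "time"])

# Carry the last observed score forward within each subject.
panel["score_locf"] = panel.groupby("subject")["score"].ffill()
print(panel)
```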

Cold-deck imputation, by contrast, replaces missing values with response values for similar items drawn from past surveys, i.e. the donors come from a different dataset.

Mean imputation can be carried out within classes (i.e. categories such as gender), and can be expressed as $\hat{y}_i = \bar{y}_h$, where $\hat{y}_i$ is the imputed value for record $i$ and $\bar{y}_h$ is the sample mean of the respondent data within class $h$. This is a special case of generalized regression imputation:

$\hat{y}_{mi} = b_{r0} + \sum_{j} b_{rj} z_{mij} + \hat{e}_{mi}$

Here the coefficients $b_{r0}, b_{rj}$ are estimated from the respondent data, $z_{mij}$ is a dummy variable for class membership, and data are split into respondent ($r$) and missing ($m$) groups.
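A minimal pandas sketch of mean imputation within classes; the grouping variable and column names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["f", "f", "f", "m", "m", "m"],
    "income": [30.0, np.nan, 34.0, 40.0, 44.0, np.nan],
})

# Each missing income is replaced by the respondent mean of its own class.
class_means = df.groupby("gender")["income"].transform("mean")
df["income_imputed"] = df["income"].fillna(class_means)
print(df)
```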

In regression imputation, a model fitted to the complete cases is used to predict the missing values of a variable; in other words, available information for complete and incomplete cases is used to predict the value of that specific variable.

Because the imputed values fall exactly on the regression line, with no residual variance, this causes relationships to be over-identified and suggests greater precision in the imputed values than is warranted.

Stochastic regression imputation, which adds a randomly drawn residual to each regression prediction, shows much less bias than the above-mentioned techniques, but it still misses one thing: if data are imputed, then intuitively one would expect more noise to be introduced into the problem than the simple residual variance alone.[5]
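A sketch of both variants using scikit-learn on simulated data: deterministic regression imputation fills gaps with the bare model predictions, while the stochastic variant adds a residual drawn from the estimated residual distribution. All variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
y[rng.choice(n, size=40, replace=False)] = np.nan   # 40 missing outcomes
df = pd.DataFrame({"x": x, "y": y})

obs = df.dropna()
model = LinearRegression().fit(obs[["x"]], obs["y"])
resid_sd = np.std(obs["y"] - model.predict(obs[["x"]]), ddof=2)

miss = df["y"].isna()
pred = model.predict(df.loc[miss, ["x"]])

df["y_regression"] = df["y"]
df.loc[miss, "y_regression"] = pred                  # deterministic: no residual variance

df["y_stochastic"] = df["y"]
df.loc[miss, "y_stochastic"] = pred + rng.normal(scale=resid_sd, size=miss.sum())
```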

In order to deal with this additional uncertainty due to imputation, Rubin (1987)[10] developed a method of averaging the outcomes across multiple imputed data sets.
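A minimal sketch of the pooling step under Rubin's rules: the pooled estimate is the average of the per-imputation estimates, and the total variance combines the average within-imputation variance with the between-imputation variance. The numbers below are purely illustrative.

```python
import numpy as np

# One estimate and one squared standard error per imputed data set (m = 5).
estimates = np.array([1.02, 0.97, 1.10, 0.95, 1.05])
variances = np.array([0.040, 0.038, 0.045, 0.041, 0.039])

m = len(estimates)
q_bar = estimates.mean()                  # pooled point estimate
u_bar = variances.mean()                  # within-imputation variance
b = estimates.var(ddof=1)                 # between-imputation variance
total_var = u_bar + (1 + 1 / m) * b       # Rubin's total variance
print(q_bar, np.sqrt(total_var))          # pooled estimate and its standard error
```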

"[15] MICE is designed for missing at random data, though there is simulation evidence to suggest that with a sufficient number of auxiliary variables it can also work on data that are missing not at random.

However, MICE can suffer from performance problems when the number of observations is large and the data have complex features, such as nonlinearities and high dimensionality.
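As a hedged sketch, scikit-learn's IterativeImputer implements a chained-equations approach inspired by MICE; each call returns one completed data set, so several imputations can be generated by re-running it with sample_posterior=True and different random seeds. The data below are simulated for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.1] = np.nan   # roughly 10% missing completely at random

# Five completed data sets, each drawn with a different random seed.
imputations = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]
```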

More recent approaches to multiple imputation use machine learning techniques to improve its performance.

MIDAS (Multiple Imputation with Denoising Autoencoders), for instance, uses denoising autoencoders, a type of unsupervised neural network, to learn fine-grained latent representations of the observed data.[16]

MIDAS has been shown to provide accuracy and efficiency advantages over traditional multiple imputation strategies.
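The sketch below is not the MIDAS implementation; it is a from-scratch toy illustrating the general idea of denoising-autoencoder imputation under simplifying assumptions: train a small autoencoder to reconstruct the observed entries from randomly corrupted inputs, then use its reconstructions to fill the missing cells. Every function name and parameter here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dae_impute(X, n_hidden=8, n_epochs=500, lr=0.1, corrupt_p=0.2):
    """Impute missing entries of a 2-D float array X (NaN = missing) with a
    single-hidden-layer denoising autoencoder trained by gradient descent."""
    X = np.asarray(X, dtype=float)
    mask = ~np.isnan(X)                              # True where observed
    Xf = np.where(mask, X, np.nanmean(X, axis=0))    # start from column means

    n, d = Xf.shape
    W1 = rng.normal(scale=0.1, size=(d, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, d)); b2 = np.zeros(d)

    for _ in range(n_epochs):
        # Corrupt the current filled-in data, then reconstruct it.
        Xc = np.where(rng.random(Xf.shape) < corrupt_p, 0.0, Xf)
        H = np.tanh(Xc @ W1 + b1)                    # encoder
        R = H @ W2 + b2                              # decoder (reconstruction)

        # Squared-error loss restricted to the observed entries.
        dR = 2 * (R - Xf) * mask / mask.sum()
        dW2 = H.T @ dR;   db2 = dR.sum(axis=0)
        dZ1 = dR @ W2.T * (1 - H ** 2)
        dW1 = Xc.T @ dZ1; db1 = dZ1.sum(axis=0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

    # Final clean pass; keep observed values, fill gaps with reconstructions.
    R = np.tanh(Xf @ W1 + b1) @ W2 + b2
    return np.where(mask, X, R)

# Illustrative usage on simulated data with about 10% missingness.
X = rng.normal(size=(100, 5))
X[rng.random(X.shape) < 0.1] = np.nan
X_completed = dae_impute(X)
```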

Neglecting the uncertainty in the imputation can lead to overly precise results and errors in any conclusions drawn.

As expected, combining uncertainty estimation with deep learning for imputation is among the best strategies and has been used to model heterogeneous drug discovery data.