Oversampling and undersampling in data analysis

These terms are used both in statistical sampling, survey design methodology and in machine learning.

Oversampling and undersampling are opposite and roughly equivalent techniques.[1][2] Both oversampling and undersampling involve introducing a bias to select more samples from one class than from another, to compensate for an imbalance that is either already present in the data, or likely to develop if a purely random sample were taken.

Specifically, while one needs a suitably large sample size to draw valid statistical conclusions, the data must be cleaned before it can be used.

Random oversampling involves supplementing the training data with multiple copies of some of the minority classes.[3] Instead of duplicating every sample in the minority class, some of them may be randomly chosen with replacement.
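
A minimal sketch of random oversampling in Python (the function name, signature, and the assumption that X and y are NumPy arrays are illustrative, not taken from any particular library):

```python
import numpy as np

def random_oversample(X, y, minority_label, seed=None):
    """Duplicate randomly chosen minority-class rows (with replacement)
    until the two classes are the same size. Assumes the class labelled
    `minority_label` is the smaller one."""
    rng = np.random.default_rng(seed)
    minority_idx = np.where(y == minority_label)[0]
    majority_idx = np.where(y != minority_label)[0]
    n_needed = len(majority_idx) - len(minority_idx)
    extra = rng.choice(minority_idx, size=n_needed, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]
```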

There are a number of methods available to oversample a dataset used in a typical classification problem (using a classification algorithm to classify a set of images, given a labelled training set of images).[4] The most common technique is known as SMOTE (Synthetic Minority Over-sampling Technique). However, this technique has been shown to yield poorly calibrated models, with an overestimated probability of belonging to the minority class.

As an example, consider a dataset of birds for classification, where the feature space for the minority class we want to oversample could be beak length, wingspan, and weight (all continuous). To oversample, take a sample from the dataset and consider its k nearest neighbors (in feature space). To create a synthetic data point, take the vector between one of those k neighbors and the current data point, multiply this vector by a random number x between 0 and 1, and add it to the current data point.
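
A sketch of this interpolation step, assuming NumPy arrays and scikit-learn's nearest-neighbor search (the function name and parameters are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_minority, n_synthetic, k=5, seed=None):
    """Generate synthetic minority points by interpolating between a sample
    and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    # Column 0 of the neighbor index matrix is the point itself, so drop it.
    neighbors = nn.kneighbors(X_minority, return_distance=False)[:, 1:]
    synthetic = np.empty((n_synthetic, X_minority.shape[1]))
    for i in range(n_synthetic):
        j = rng.integers(len(X_minority))           # a random minority sample
        nb = X_minority[rng.choice(neighbors[j])]   # one of its k neighbors
        x = rng.random()                            # random factor in [0, 1)
        synthetic[i] = X_minority[j] + x * (nb - X_minority[j])
    return synthetic
```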

Many modifications and extensions have been made to the SMOTE method ever since its proposal.[6] The adaptive synthetic sampling approach, or ADASYN algorithm,[7] builds on the methodology of SMOTE by shifting the importance of the classification boundary to those minority-class examples that are difficult to learn.

ADASYN uses a weighted distribution for different minority class examples according to their level of difficulty in learning, where more synthetic data is generated for minority class examples that are harder to learn.
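
A simplified ADASYN-style sketch under the same assumptions (edge cases, such as a minority class with fewer than k samples, are not handled):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_like_oversample(X, y, minority_label, k=5, seed=None):
    """Generate more synthetic points around minority samples whose local
    neighborhoods contain many majority samples (i.e. the 'hard' ones)."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    n_new = int((y != minority_label).sum() - len(X_min))

    # Difficulty of each minority sample: share of majority-class points
    # among its k nearest neighbors in the full dataset.
    idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(
        X_min, return_distance=False)[:, 1:]
    r = (y[idx] != minority_label).mean(axis=1)
    weights = r / r.sum() if r.sum() > 0 else np.full(len(X_min), 1 / len(X_min))

    # Allocate the synthetic samples in proportion to difficulty.
    counts = rng.multinomial(n_new, weights)

    # Interpolate towards minority-class neighbors, as in SMOTE.
    idx_min = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(
        X_min, return_distance=False)[:, 1:]
    synthetic = []
    for i, c in enumerate(counts):
        for _ in range(c):
            nb = X_min[rng.choice(idx_min[i])]
            synthetic.append(X_min[i] + rng.random() * (nb - X_min[i]))
    return np.array(synthetic)
```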

Data augmentation, which increases the amount of training data by adding slightly modified copies or newly created synthetic versions of existing samples, acts as a regularizer and helps reduce overfitting when training a machine learning model.[8] (See: Data augmentation.)
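
One simple form of augmentation (an illustrative choice, not the only one) perturbs copies of minority samples with a small amount of Gaussian noise:

```python
import numpy as np

def jitter_augment(X_minority, n_new, scale=0.01, seed=None):
    """Create `n_new` noisy copies of randomly chosen minority samples.
    The noise level is a fraction `scale` of each feature's standard deviation."""
    rng = np.random.default_rng(seed)
    base = X_minority[rng.integers(len(X_minority), size=n_new)]
    noise = rng.normal(scale=scale * X_minority.std(axis=0), size=base.shape)
    return base + noise
```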

Random undersampling involves randomly removing samples from the majority class, with or without replacement. This is one of the earliest techniques used to alleviate imbalance in a dataset; however, it may increase the variance of the classifier and is very likely to discard useful or important samples.
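
A minimal sketch of random undersampling without replacement (names are illustrative):

```python
import numpy as np

def random_undersample(X, y, majority_label, seed=None):
    """Randomly drop majority-class rows until the majority class is the
    same size as the rest of the data."""
    rng = np.random.default_rng(seed)
    majority_idx = np.where(y == majority_label)[0]
    minority_idx = np.where(y != majority_label)[0]
    kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.concatenate([minority_idx, kept_majority])
    return X[keep], y[keep]
```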

By removing overlapping examples, one can establish well-defined clusters in the training set, which can lead to improved classification performance.
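
One well-known rule of this kind is the Tomek-link criterion (not named in the text above, so this is an illustrative choice): two points from different classes that are each other's nearest neighbors form a link, and the majority-class member of each link is removed. A minimal sketch:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def remove_majority_tomek_links(X, y, majority_label):
    """Drop majority-class points that participate in a Tomek link."""
    nearest = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(
        X, return_distance=False)[:, 1]             # nearest *other* point
    mutual = nearest[nearest] == np.arange(len(X))  # mutually nearest pairs
    is_link = mutual & (y != y[nearest])            # ...from different classes
    drop = is_link & (y == majority_label)
    return X[~drop], y[~drop]
```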

A recent study shows that the combination of undersampling with ensemble learning can achieve better results; see IFME: information filtering by multiple examples with under-sampling in a digital library environment.[10]
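
The cited IFME method is not reproduced here; the sketch below shows the general pattern of combining undersampling with an ensemble (in the spirit of EasyEnsemble): each member is trained on a different balanced undersample and the predictions are averaged.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def undersample_ensemble(X, y, majority_label, n_members=10, seed=None):
    """Train one classifier per balanced random undersample of the majority class."""
    rng = np.random.default_rng(seed)
    minority_idx = np.where(y != majority_label)[0]
    majority_idx = np.where(y == majority_label)[0]
    members = []
    for _ in range(n_members):
        sampled = rng.choice(majority_idx, size=len(minority_idx), replace=False)
        keep = np.concatenate([minority_idx, sampled])
        members.append(DecisionTreeClassifier(random_state=0).fit(X[keep], y[keep]))
    return members

def ensemble_predict_proba(members, X):
    """Average the members' predicted class probabilities."""
    return np.mean([m.predict_proba(X) for m in members], axis=0)
```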

Although sampling techniques have been developed mostly for classification tasks, growing attention is being paid to the problem of imbalanced regression.[11] Adaptations of popular strategies are available, including undersampling, oversampling and SMOTE.[15]
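
As a rough illustration of how oversampling carries over to regression (the threshold-based notion of a "rare" target below is an assumption made for this example; published adaptations such as SMOTER instead use a relevance function over the target values):

```python
import numpy as np

def oversample_rare_targets(X, y, rare_threshold, seed=None):
    """Duplicate samples whose continuous target exceeds `rare_threshold`
    until they are as numerous as the remaining samples."""
    rng = np.random.default_rng(seed)
    rare = np.where(y >= rare_threshold)[0]
    common = np.where(y < rare_threshold)[0]
    extra = rng.choice(rare, size=max(len(common) - len(rare), 0), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]
```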

It is possible to combine oversampling and undersampling techniques into a hybrid strategy.
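
For example, the third-party imbalanced-learn library (not mentioned in the text, and assumed here to be installed) ships ready-made hybrid strategies such as SMOTE oversampling followed by Tomek-link cleaning:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek  # SMOTE + Tomek-link cleaning

# Synthetic 95%/5% imbalanced data, for illustration only.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)
print(Counter(y), Counter(y_res))  # class counts before and after resampling
```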

Additional ways of learning on imbalanced datasets include weighting training instances, introducing different misclassification costs for positive and negative examples, and bootstrapping.[16]
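
Class or instance weighting is supported directly by many learning libraries; for instance, scikit-learn classifiers expose a class_weight parameter (the data below are synthetic and for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# 'balanced' weights each class inversely proportional to its frequency;
# an explicit dict such as {0: 1.0, 1: 10.0} encodes asymmetric misclassification costs.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```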

One criticism of this practice puts it as follows: "Poor models in [the binary classification] setting are often a result of—any combination of—fitting deterministic classifiers, using re-sampling or re-weighting methods to balance class frequencies in the training data and evaluating the model with a score such as accuracy."

Predicted class-membership probabilities, which depend on the class prior through Bayes' rule, will be wrongly calibrated if the natural class distribution is modified by resampling or reweighting during training.[17]

This point can be illustrated with a simple example: assume there are no predictive variables, so the best a probabilistic classifier can do is predict the prevalence of the positive class. If the training data are artificially balanced, the model will predict a probability of about 0.5 for every instance, even when the true prevalence is, say, 1%.
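
A small simulation (illustrative, not taken from the cited work) makes the miscalibration concrete:

```python
import numpy as np
from sklearn.dummy import DummyClassifier  # predicts the class prior it was trained on

rng = np.random.default_rng(0)
y = (rng.random(100_000) < 0.01).astype(int)   # true prevalence: 1%
X = np.zeros((len(y), 1))                      # a single, uninformative feature

# Trained on the natural distribution: predicted probability ~ 0.01 (calibrated).
natural = DummyClassifier(strategy="prior").fit(X, y)

# Trained on an artificially balanced sample: predicted probability ~ 0.5.
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
balanced_idx = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
balanced = DummyClassifier(strategy="prior").fit(X[balanced_idx], y[balanced_idx])

print(natural.predict_proba(X[:1])[0, 1])   # close to 0.01
print(balanced.predict_proba(X[:1])[0, 1])  # close to 0.5, a ~50x overestimate
```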

Additionally, these techniques may be applied by practitioners in multi-class classification or in situations with a very imbalanced cost structure.

Finding the best multi-class classification performance or the best tradeoff between precision and recall is, however, inherently a multi-objective optimization problem.

It is well known that these problems typically have multiple incomparable Pareto optimal solutions.

Oversampling or undersampling, as well as assigning weights to samples, are implicit ways to find a certain Pareto optimum (and they sacrifice the calibration of the estimated probabilities).

A more explicit way than oversampling or downsampling is to select a Pareto optimum directly, for example by tuning the decision threshold of a calibrated probabilistic classifier to reach the desired trade-off between precision and recall.
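
A sketch of such an explicit selection, assuming scikit-learn and synthetic data (the recall requirement of 0.8 is an arbitrary illustration, and the curve is evaluated in-sample for brevity):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

# Every threshold is one operating point on the precision-recall trade-off;
# choosing one is an explicit (rather than implicit) selection among them.
precision, recall, thresholds = precision_recall_curve(y, scores)
ok = recall[:-1] >= 0.8                                  # points meeting the recall target
best = np.argmax(np.where(ok, precision[:-1], -np.inf))  # highest precision among them
print("chosen threshold:", thresholds[best])
```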