Curse of dimensionality

[1][2] The curse generally refers to issues that arise when the number of datapoints is small (in a suitably defined sense) relative to the intrinsic dimension of the data.

Dimensionally cursed phenomena occur in domains such as numerical analysis, sampling, combinatorics, machine learning, data mining and databases.

In order to obtain a reliable result, the amount of data needed often grows exponentially with the dimensionality.

There is an exponential increase in volume associated with adding extra dimensions to a mathematical space.

For example, sampling the unit interval with a spacing of 0.01 between adjacent points requires 10^2 sample points, while an equivalent sampling of the 10-dimensional unit hypercube with the same spacing requires 10^20 points. In general, with a sampling distance of 10^−n, the 10-dimensional hypercube appears to be a factor of 10^(9n) "larger" than the unit interval; here n = 2, giving a factor of 10^18.
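
For illustration, these counts can be reproduced with a short Python sketch (the convention of 100 grid points per axis is an assumption chosen to match the factor above):

    def grid_points(dimension, spacing):
        # 100 points per axis at a spacing of 0.01
        points_per_axis = int(round(1 / spacing))
        return points_per_axis ** dimension

    spacing = 0.01
    for d in (1, 2, 5, 10):
        print(d, "dimensions:", grid_points(d, spacing), "sample points")

    # Factor between the 10-dimensional grid and the unit interval: 10^18
    print(grid_points(10, spacing) // grid_points(1, spacing))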

When solving dynamic optimization problems by numerical backward induction, the objective function must be computed for each combination of values of the state variables.
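
A minimal sketch of such a backward induction (in Python; the grid size, horizon, action set, reward and transition below are illustrative assumptions, not taken from any particular model) shows how the number of objective-function evaluations explodes with the number of state variables:

    import itertools

    # Minimal backward induction over a discretized state space. The value
    # function must be evaluated at every combination of state-variable values
    # at every stage, so the work grows exponentially with the number of
    # state variables.
    points_per_dim = 8
    horizon = 5
    actions = (-1, 0, 1)                  # shift applied to the first state variable

    def reward(state, action):
        return -sum(x * x for x in state) - abs(action)

    def step(state, action):              # deterministic transition, clipped to the grid
        first = min(max(state[0] + action, 0), points_per_dim - 1)
        return (first,) + state[1:]

    for n_vars in (1, 2, 3, 4, 5):
        states = list(itertools.product(range(points_per_dim), repeat=n_vars))
        value = dict.fromkeys(states, 0.0)             # terminal values
        evaluations = 0
        for stage in range(horizon):                   # work backward in time
            value = {s: max(reward(s, a) + value[step(s, a)] for a in actions)
                     for s in states}
            evaluations += len(states) * len(actions)
        print(n_vars, "state variables:", evaluations, "objective evaluations")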

[3] In machine learning problems that involve learning a "state-of-nature" from a finite number of data samples in a high-dimensional feature space with each feature having a range of possible values, typically an enormous amount of training data is required to ensure that there are several samples with each combination of values.
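
A small simulation (the number of feature values and the sample size are arbitrary illustrative choices) shows how quickly a fixed-size data set fails to cover the possible combinations of feature values:

    import random

    # With a fixed number of samples, the share of possible feature-value
    # combinations that appear even once collapses as features are added.
    random.seed(0)
    values_per_feature = 4
    n_samples = 1000

    for n_features in (2, 4, 6, 8, 10):
        total = values_per_feature ** n_features
        observed = {tuple(random.randrange(values_per_feature)
                          for _ in range(n_features))
                    for _ in range(n_samples)}
        print(f"{n_features} features: {len(observed)} of {total} combinations observed")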

[6] This phenomenon, known as the Hughes or peaking phenomenon, states that with a fixed number of training samples, the average (expected) predictive power of a classifier or regressor first increases as the number of dimensions or features used is increased, but beyond a certain dimensionality it starts deteriorating instead of improving steadily.

In other words, both the number of additional features and their (relative) cumulative discriminatory effect are important in determining whether the average predictive power increases or decreases.
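
The peaking effect can be reproduced in a toy experiment; the following numpy-only sketch uses an assumed setting (two Gaussian classes, five informative features, a nearest-centroid classifier, and a small training set), not the design of the cited study:

    import numpy as np

    # Peaking (Hughes) effect: with a fixed, small training set, test accuracy
    # of a simple classifier first rises and then falls as features are added.
    # Only the first 5 features are informative; the rest are pure noise.
    rng = np.random.default_rng(0)
    n_informative, shift = 5, 1.0
    n_train, n_test = 20, 2000                   # samples per class

    def make_data(n, dim):
        mean = np.zeros(dim)
        mean[:min(n_informative, dim)] = shift
        return rng.normal(0.0, 1.0, (n, dim)), rng.normal(0.0, 1.0, (n, dim)) + mean

    for dim in (1, 2, 5, 10, 50, 200, 1000):
        tr0, tr1 = make_data(n_train, dim)
        te0, te1 = make_data(n_test, dim)
        c0, c1 = tr0.mean(axis=0), tr1.mean(axis=0)      # estimated class centroids
        pred0 = np.linalg.norm(te0 - c1, axis=1) < np.linalg.norm(te0 - c0, axis=1)
        pred1 = np.linalg.norm(te1 - c1, axis=1) < np.linalg.norm(te1 - c0, axis=1)
        accuracy = (np.mean(~pred0) + np.mean(pred1)) / 2
        print(f"{dim:5d} features: accuracy = {accuracy:.3f}")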

[11] A loss function for unitary-invariant dissimilarity between word embeddings was found to be minimized in high dimensions.

A common practice of data mining in this domain would be to create association rules between genetic mutations that lead to the development of cancers.

Such an algorithm can require evaluating every pairwise combination of genes for each individual or row.

This problem critically affects both computational time and space when searching for associations or optimal features to consider.

Each genetic mutation, whether or not it correlates with cancer, will have some input or weight in the model that guides the decision-making process of the algorithm.
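
To get a sense of the scale involved, the number of gene pairs that such a pairwise search must examine can be computed directly (the gene and individual counts below are illustrative figures, not taken from the article):

    from math import comb

    # Number of gene pairs examined per individual when testing all pairwise
    # associations; 20,000 genes and 1,000 individuals are illustrative figures.
    n_genes = 20_000
    n_individuals = 1_000
    pairs_per_individual = comb(n_genes, 2)
    print(pairs_per_individual)                      # 199,990,000 pairs per row
    print(pairs_per_individual * n_individuals)      # total pairwise tests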

One way to illustrate the "vastness" of high-dimensional Euclidean space is to compare the proportion of an inscribed hypersphere with radius r and dimension d to that of a hypercube with edges of length 2r; as the dimension d increases, the hypersphere becomes an insignificant fraction of the volume of the hypercube.
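
The ratio can be evaluated with the standard formula for the volume of a d-dimensional ball (a short Python sketch):

    from math import pi, gamma

    # Ratio of the volume of a hypersphere of radius r to the volume of the
    # enclosing hypercube with edge length 2r; the ratio is independent of r.
    def sphere_to_cube_ratio(d):
        return pi ** (d / 2) / (gamma(d / 2 + 1) * 2 ** d)

    for d in (2, 3, 5, 10, 20, 50):
        print(d, sphere_to_cube_ratio(d))
    # d=2: 0.785..., d=10: 0.0025..., d=50: about 1.5e-28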

Thus, when uniformly generating points in high dimensions, both the "middle" of the hypercube and the corners are empty, and almost all of the volume is concentrated near the surface of a sphere of "intermediate" radius proportional to √d.

By the law of large numbers, the distribution of the squared distance from the center concentrates in a narrow band around d times the squared standard deviation (σ²) of the original one-dimensional distribution, that is, around dσ².

This illuminates the chi-squared distribution and also illustrates that most of the volume of the d-cube concentrates near the boundary of a sphere of radius σ√d.
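
This concentration is easy to observe numerically; in the Monte Carlo sketch below (sample sizes are arbitrary), points are drawn uniformly from the cube [−1, 1]^d, for which σ² = 1/3 per coordinate, so distances from the center cluster around σ√d = √(d/3):

    import numpy as np

    # Distances from the center of the cube [-1, 1]^d for uniformly drawn
    # points concentrate around sigma * sqrt(d) with sigma^2 = 1/3.
    rng = np.random.default_rng(0)
    n_points = 10_000

    for d in (2, 10, 100, 1000):
        x = rng.uniform(-1.0, 1.0, (n_points, d))
        dist = np.linalg.norm(x, axis=1)
        print(f"d={d:5d}  mean={dist.mean():7.3f}  "
              f"predicted={np.sqrt(d / 3):7.3f}  "
              f"relative spread={dist.std() / dist.mean():.4f}")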

[15] When attributes are correlated, data can become easier to work with and can provide higher distance contrast; the signal-to-noise ratio was also found to play an important role, so feature selection should be used.

[15] More recently, it has been suggested that there may be a conceptual flaw in the argument that contrast-loss creates a curse in high dimensions.

The curse's derivation assumes all instances are independent, identically distributed outcomes of a single high-dimensional generative process.

If there is only one generative process, there would exist only one (naturally occurring) class and machine learning would be conceptually ill-defined in both high and low dimensions.

From this perspective, contrast-loss makes high dimensional distances especially meaningful and not especially non-meaningful as is often argued.

In nearest-neighbor search, for instance, it is not possible to quickly reject candidates by using the difference in one coordinate as a lower bound for a distance based on all the dimensions.
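
A quick experiment (with uniform random data and arbitrary sample sizes, purely for illustration) shows why: the distance even to the nearest neighbor soon exceeds any single-coordinate difference, so such a test rejects essentially nothing in high dimensions:

    import numpy as np

    # Fraction of candidates that could be rejected because the gap in a single
    # coordinate already exceeds the distance to the true nearest neighbor.
    rng = np.random.default_rng(0)
    n_points = 2_000

    for d in (2, 10, 100, 1000):
        data = rng.uniform(0.0, 1.0, (n_points, d))
        query = rng.uniform(0.0, 1.0, d)
        dists = np.linalg.norm(data - query, axis=1)
        best = dists.min()                            # distance to the nearest neighbor
        one_coordinate_gap = np.abs(data[:, 0] - query[0])
        print(f"d={d:5d}  nearest={best:6.3f}  "
              f"fraction rejectable={np.mean(one_coordinate_gap > best):.3f}")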

In time series analysis, where the data are inherently high-dimensional, distance functions also work reliably as long as the signal-to-noise ratio is high enough.

[21] This phenomenon can have a considerable impact on various techniques for classification (including the k-NN classifier), semi-supervised learning, and clustering,[22] and it also affects information retrieval.

Surprisingly and despite the expected "curse of dimensionality" difficulties, common-sense heuristics based on the most straightforward methods "can yield results which are almost surely optimal" for high-dimensional problems.

[24] Donoho in his "Millennium manifesto" clearly explained why the "blessing of dimensionality" will form a basis of future data mining.

[25] The effects of the blessing of dimensionality were discovered in many applications and found their foundation in the concentration of measure phenomena.

"[27] For example, the typical property of essentially high-dimensional probability distributions in a high-dimensional space is: the squared distance of random points to a selected point is, with high probability, close to the average (or median) squared distance.