Clustering high-dimensional data

High-dimensional spaces of data are often encountered in areas such as medicine, where DNA microarray technology can produce many measurements at once, and the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the vocabulary.

Four problems need to be overcome for clustering in high-dimensional data.[1] Recent research indicates that the discrimination problems only occur when there is a high number of irrelevant dimensions, and that shared-nearest-neighbor approaches can improve results.[1]

An altogether different approach is to find clusters based on patterns in the data matrix, often referred to as biclustering, a technique frequently used in bioinformatics.

Bottom-up methods (such as CLIQUE) heuristically identify relevant dimensions by dividing the data space into a grid structure, selecting dense units, and then iteratively linking them if they are adjacent and dense.
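A minimal sketch of this grid-based step (not CLIQUE itself; the bin count, density threshold, and synthetic data are arbitrary illustrative choices):

```python
import numpy as np

def dense_units_1d(X, n_bins=10, density_threshold=0.2):
    """Find dense one-dimensional grid units, the starting point of a
    bottom-up subspace search: each dimension is partitioned into
    equal-width intervals, and a unit is kept if it contains more than
    `density_threshold` of all points."""
    n, d = X.shape
    dense = {}  # (dimension, bin index) -> member point indices
    for dim in range(d):
        edges = np.linspace(X[:, dim].min(), X[:, dim].max(), n_bins + 1)
        bins = np.clip(np.digitize(X[:, dim], edges) - 1, 0, n_bins - 1)
        for b in range(n_bins):
            members = np.where(bins == b)[0]
            if len(members) > density_threshold * n:
                dense[(dim, b)] = members
    return dense

rng = np.random.default_rng(0)
# Two tight clusters along dimension 0; dimension 1 is uniform noise.
X = np.column_stack([
    np.concatenate([rng.normal(0, 0.1, 100), rng.normal(5, 0.1, 100)]),
    rng.uniform(0, 10, 200),
])
units = dense_units_1d(X)  # dense units appear only in dimension 0
```

Adjacent dense units would then be merged into one-dimensional clusters, and candidate units of higher dimensionality built from them.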

[3] The adjacent image shows a mere two-dimensional space where a number of clusters can be identified.

Such a group of points cannot be considered a cluster in a two-dimensional (sub-)space, since it is too sparsely distributed in one of the two dimensions.

Hence, subspace clustering algorithms utilize some kind of heuristic to remain computationally feasible, at the risk of producing inferior results.

The downward-closure property (as used for association rules) can be used to build higher-dimensional subspaces only by combining lower-dimensional ones: any cluster in a space S is also present in every subspace T ⊆ S, so a subspace needs to be examined only if all of its lower-dimensional subspaces contain clusters. This approach is taken by most of the traditional algorithms, such as CLIQUE[4] and SUBCLU.
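The pruning enabled by this property can be sketched as apriori-style candidate generation (a simplified illustration, with subspaces represented as sets of dimension indices):

```python
from itertools import combinations

def candidate_subspaces(dense_k):
    """Apriori-style candidate generation: a (k+1)-dimensional subspace is
    considered only if every one of its k-dimensional subspaces already
    contains a cluster (the downward-closure property)."""
    dense_k = {frozenset(s) for s in dense_k}
    k = len(next(iter(dense_k)))
    candidates = set()
    for a in dense_k:
        for b in dense_k:
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in dense_k for sub in combinations(union, k)
            ):
                candidates.add(union)
    return candidates

# Suppose clusters were found in the 2-D subspaces {0,1}, {0,2}, {1,2}, {2,3}.
cands = candidate_subspaces([{0, 1}, {0, 2}, {1, 2}, {2, 3}])
# Only {0, 1, 2} survives: e.g. {0, 2, 3} is pruned because {0, 3} holds no cluster.
```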

[5] It is also possible to define a subspace using different degrees of relevance for each dimension, an approach taken by iMWK-Means,[6] EBK-Modes[7] and CBK-Modes.

The general approach is to use a special distance function together with a regular clustering algorithm.
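A minimal sketch of this idea, assuming hand-picked relevance weights rather than weights learned by any particular algorithm:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# Two clusters that are only separated in the first two dimensions;
# the third dimension is pure noise.
X = np.vstack([
    rng.normal([0, 0, 0], [0.1, 0.1, 5.0], size=(50, 3)),
    rng.normal([3, 3, 0], [0.1, 0.1, 5.0], size=(50, 3)),
])

# Special distance function: a weighted Euclidean metric that assigns
# zero relevance to the noisy third dimension.
weights = np.array([1.0, 1.0, 0.0])
D = pdist(X, metric="minkowski", p=2, w=weights)

# Regular clustering algorithm on the weighted distances.
labels = fcluster(linkage(D, method="average"), t=2, criterion="maxclust")
```

Here the third dimension is ignored entirely; algorithms such as iMWK-Means learn such weights from the data instead of fixing them in advance.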

Projection-based clustering is based on a nonlinear projection of high-dimensional data into a two-dimensional space.[11]

Typical projection methods such as t-distributed stochastic neighbor embedding (t-SNE)[12] or the neighbor retrieval visualizer (NeRV)[13] project the data explicitly into two dimensions, disregarding the subspaces of dimension higher than two and preserving only the relevant neighborhoods of the high-dimensional data.
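A small sketch of this two-step scheme (t-SNE followed by an ordinary clustering algorithm; the synthetic data and all parameter values are illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)
# Two well-separated Gaussian clusters in a 20-dimensional space.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 20)),
    rng.normal(8.0, 1.0, size=(100, 20)),
])

# Nonlinear projection into two dimensions ...
emb = TSNE(n_components=2, random_state=0).fit_transform(X)

# ... followed by an ordinary clustering algorithm on the embedding.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
```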

[15] The shortest paths are then used in the clustering process, which involves two choices, depending on the type of structure in the high-dimensional data.
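The shortest-path step can be sketched as follows, assuming a k-nearest-neighbor graph as the underlying structure (the two-moons data, neighborhood size, and single linkage are illustrative assumptions, not details of the cited method):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import squareform
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# Geodesic distances: shortest paths through a k-nearest-neighbor graph
# follow the curved shape of each cluster instead of cutting across it.
knn = kneighbors_graph(X, n_neighbors=10, mode="distance")
geo = shortest_path(knn, directed=False)

# Points in different graph components have infinite distance; cap it.
finite_max = geo[np.isfinite(geo)].max()
geo = np.where(np.isfinite(geo), geo, 2.0 * finite_max)
np.fill_diagonal(geo, 0.0)

# Regular hierarchical clustering on the geodesic distances.
labels = fcluster(linkage(squareform(geo, checks=False), method="single"),
                  t=2, criterion="maxclust")
```

Unlike plain Euclidean distances, the geodesic distances keep each curved "moon" together, so single-linkage clustering recovers the two shapes.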

[11] Projection-based clustering is available in the open-source R package "ProjectionBasedClustering" on CRAN.

[18][19] Since high-dimensional data are likely to have many non-informative features, weights can be used during the bagging process to increase the impact of the more informative ones.
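A sketch of such weighted feature bagging (the dispersion-based weights and the co-association consensus used here are simplified illustrative choices, not a specific published method):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Two clusters in the first two features; eight uninformative features.
X = np.hstack([
    np.vstack([rng.normal(0, 0.2, (60, 2)), rng.normal(3, 0.2, (60, 2))]),
    rng.uniform(-1, 1, (120, 8)),
])

# Crude informativeness weights: multimodal features give uneven histograms.
w = np.array([np.histogram(col, bins=10)[0].std() for col in X.T])
w = w / w.sum()

# Weighted feature bagging: sample feature subsets by weight, cluster each
# subset, and accumulate agreements in a co-association matrix.
co = np.zeros((len(X), len(X)))
for _ in range(25):
    feats = rng.choice(X.shape[1], size=3, replace=False, p=w)
    lab = KMeans(n_clusters=2, n_init=5, random_state=0).fit_predict(X[:, feats])
    co += lab[:, None] == lab[None, :]

# Consensus clustering: treat disagreement across the ensemble as a distance.
D = co.max() - co
np.fill_diagonal(D, 0.0)
labels = fcluster(linkage(squareform(D, checks=False), method="average"),
                  t=2, criterion="maxclust")
```

Because informative features are sampled more often, points from the same true cluster end up together in most ensemble members, and the consensus step recovers the clusters despite the noisy features.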

[23] Another hybrid approach is to include a human in the algorithmic loop: human domain expertise can help to reduce an exponential search space through the heuristic selection of samples.

This can be beneficial in the health domain, where, e.g., medical doctors are confronted with high-dimensional descriptions of patient conditions and measurements of the success of certain therapies.

An important task with such data is to compare and correlate patient conditions and therapy results across combinations of dimensions.

This is because irrelevant, redundant, and conflicting dimensions can negatively affect effectiveness and efficiency of the whole analytic process.

[24] Another type of subspace is considered in correlation clustering (data mining).

Example 2D space with subspace clusters