Cluster analysis

Cluster analysis can be achieved by various algorithms that differ significantly in their understanding of what constitutes a cluster and in how to find clusters efficiently.

Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and error.

It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties.

Hierarchical clustering methods will not produce a unique partitioning of the data set, but a hierarchy from which the user still needs to choose appropriate clusters.
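
A minimal sketch of this workflow, using SciPy's hierarchical-clustering routines (the toy data and the choice to cut the hierarchy at three clusters are assumptions made for illustration):

```python
# Minimal sketch: agglomerative (hierarchical) clustering with SciPy.
# The random data and the cut at three clusters are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                      # toy data set

Z = linkage(X, method="ward")                     # build the full cluster hierarchy
labels = fcluster(Z, t=3, criterion="maxclust")   # the user still chooses where to cut
print(labels)
```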

The k-means algorithm does, however, only find a local optimum, and it is commonly run multiple times with different random initializations.

Furthermore, centroid-based algorithms prefer clusters of approximately similar size, as they will always assign an object to the nearest centroid.
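
The multiple-restart strategy can be sketched as follows with scikit-learn's KMeans (the toy data and k = 3 are assumptions chosen for the example):

```python
# Minimal sketch: k-means with several random restarts (scikit-learn).
# The data set and k=3 are assumptions made for the example.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))

# n_init runs the algorithm with different random centroid initializations
# and keeps the solution with the lowest within-cluster sum of squares,
# mitigating (but not eliminating) the local-optimum problem.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10], km.inertia_)
```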

This makes it possible to apply the well-developed algorithmic solutions from the facility location literature to the presently considered centroid-based clustering problem.

While the theoretical foundation of these model-based methods is excellent, they suffer from overfitting unless constraints are put on the model complexity.

Standard model-based clustering methods include more parsimonious models based on the eigenvalue decomposition of the covariance matrices, which provide a balance between overfitting and fidelity to the data.
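
As a rough illustration, scikit-learn's GaussianMixture allows the covariance matrices to be constrained (a simpler analogue of the eigendecomposition-based parsimonious models; the toy data and the number of components are assumptions made for the example):

```python
# Minimal sketch: constraining model complexity in Gaussian mixture clustering.
# covariance_type restricts the covariance matrices (e.g. shared across clusters);
# the data and the number of components are assumptions for the example.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))

full = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
tied = GaussianMixture(n_components=3, covariance_type="tied", random_state=0).fit(X)

# BIC penalizes the extra parameters of the less constrained model.
print(full.bic(X), tied.bic(X))
```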

Objects in sparse areas – which are required to separate clusters – are usually considered to be noise and border points.

In contrast to many newer methods, DBSCAN features a well-defined cluster model called "density-reachability".[15]

However, it only connects points that satisfy a density criterion, defined in the original variant as a minimum number of other objects within a given radius.

Another interesting property of DBSCAN is that its complexity is fairly low – it requires only a linear number of range queries on the database – and that it discovers essentially the same results in each run (it is deterministic for core and noise points, but not for border points), so there is no need to run it multiple times.
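
A minimal sketch using scikit-learn's DBSCAN, where eps corresponds to the radius and min_samples to the minimum number of objects within it (both values, like the toy data, are assumptions made for the example):

```python
# Minimal sketch: DBSCAN with scikit-learn.
# eps is the radius and min_samples the minimum number of objects within it;
# both values, like the toy data, are assumptions for the example.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
# Points labelled -1 did not satisfy the density criterion and are treated as noise.
print(set(labels))
```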

The key drawback of DBSCAN and OPTICS is that they expect some kind of density drop to detect cluster borders.

Mean-shift is a clustering approach where each object is moved to the densest area in its vicinity, based on kernel density estimation.

Due to the expensive iterative procedure and density estimation, mean-shift is usually slower than DBSCAN or k-means.

Besides that, the applicability of the mean-shift algorithm to multidimensional data is hindered by the non-smooth behaviour of the kernel density estimate, which results in over-fragmentation of cluster tails.
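
A minimal sketch of mean-shift clustering with scikit-learn (the kernel bandwidth and the toy data are assumptions chosen for the example; scikit-learn uses a flat kernel):

```python
# Minimal sketch: mean-shift clustering with scikit-learn.
# The bandwidth and the toy data are assumptions for the example;
# each point is iteratively shifted towards the densest region in its neighbourhood.
import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

ms = MeanShift(bandwidth=1.0).fit(X)
print(ms.cluster_centers_)
```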

Using genetic algorithms, a wide range of different fit functions can be optimized, including mutual information.[32]

Belief propagation, a recent development in computer science and statistical physics, has also led to the creation of new types of clustering algorithms.[33]
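
One widely known algorithm built on message passing between data points is affinity propagation; a minimal sketch using scikit-learn (the toy data and the damping value are assumptions made for the example):

```python
# Minimal sketch: affinity propagation, a clustering algorithm based on
# message passing between data points (scikit-learn implementation).
# The toy data and the damping value are assumptions for the example.
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)
print(len(ap.cluster_centers_indices_), "exemplars found")
```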

Internal evaluation measures suffer from the problem that they represent functions that themselves can be seen as a clustering objective.[36]

By using such an internal measure for evaluation, one rather compares the similarity of the optimization problems,[36] and not necessarily how useful the clustering is.
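
For instance, a compactness-based internal measure such as the silhouette coefficient tends to favour algorithms that optimize a similar objective; a minimal sketch (the data, the algorithms compared, and the parameter choices are assumptions made for the example):

```python
# Minimal sketch: an internal evaluation measure (silhouette coefficient).
# The data, algorithms, and parameters are assumptions for the example;
# a compactness-based score like this tends to favour algorithms that optimize
# a similar objective (e.g. k-means), regardless of how useful the clustering is.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))

for labels in (KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
               DBSCAN(eps=0.3, min_samples=5).fit_predict(X)):
    if len(set(labels)) > 1:          # silhouette needs at least two clusters
        print(silhouette_score(X, labels))
```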

On the other hand, the labels only reflect one possible partitioning of the data set, which does not imply that there does not exist a different, and maybe even better, clustering.

Neither of these approaches can therefore ultimately judge the actual quality of a clustering; this requires human evaluation,[36] which is highly subjective.

Nevertheless, such statistics can be quite informative in identifying bad clusterings,[37] but one should not dismiss subjective human evaluation.

Validity as measured by such an index depends on the claim that this kind of structure exists in the data set.[5]

On a data set with non-convex clusters, neither the use of k-means nor of an evaluation criterion that assumes convexity is sound.
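
This can be sketched on a synthetic two-moons data set (all parameter values here are assumptions made for the example):

```python
# Minimal sketch: why convexity assumptions fail on non-convex clusters.
# make_moons produces two interleaving half-circles; all parameter values are
# assumptions made for the example.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# k-means cuts both moons with a straight boundary, yet a compactness-based
# internal score may still rate it favourably; comparison with the generating
# labels shows which partition actually matches the non-convex structure.
for labels in (km, db):
    print(silhouette_score(X, labels), adjusted_rand_score(y, labels))
```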

These types of evaluation methods measure how close the clustering is to the predetermined benchmark classes.[35]

However, it has recently been discussed whether this is adequate for real data or only for synthetic data sets with a factual ground truth, since classes can contain internal structure, the attributes present may not allow separation of clusters, or the classes may contain anomalies.

In place of counting the number of times a class was correctly assigned to a single data point (known as true positives), such pair counting metrics assess whether each pair of data points that is truly in the same cluster is predicted to be in the same cluster.

The F-measure addresses this concern,[citation needed] as does the chance-corrected adjusted Rand index.
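
A minimal sketch of pair-counting evaluation against known classes (the two label vectors are made-up assumptions for the example; the chance-corrected index is computed with scikit-learn):

```python
# Minimal sketch: pair-counting external evaluation against known classes.
# The two label vectors are assumptions made up for the example.
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

truth     = [0, 0, 0, 1, 1, 1]   # benchmark classes
predicted = [0, 0, 1, 1, 1, 1]   # clustering result

# A pair counts as a true positive if both points share a class AND a cluster.
tp = sum(1 for i, j in combinations(range(len(truth)), 2)
         if truth[i] == truth[j] and predicted[i] == predicted[j])
print("true-positive pairs:", tp)

# Chance-corrected pair-counting index.
print("adjusted Rand index:", adjusted_rand_score(truth, predicted))
```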

Figure: The result of a cluster analysis shown as the coloring of the squares into three clusters.