Determining the number of clusters in a data set

The correct choice of k is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user.

In addition, increasing k without penalty will always reduce the amount of error in the resulting clustering, to the extreme case of zero error if each data point is considered its own cluster (i.e., when k equals the number of data points, n).

If an appropriate value of k is not apparent from prior knowledge of the properties of the data set, it must be chosen somehow.

A common heuristic, the elbow method, plots a measure such as the percentage of variance explained as a function of the number of clusters and picks the point where the curve visibly bends. In most datasets, this "elbow" is ambiguous,[1] making the method subjective and unreliable.

Because the scale of the axes is arbitrary, the concept of an angle is not well-defined, and even on uniform random data, the curve produces an "elbow", making the method rather unreliable.
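As an illustration, the following minimal sketch (assuming scikit-learn is available; the synthetic array of points and the range of candidate k are placeholders) computes the within-cluster sum of squares for each candidate k; the "elbow" must still be judged from the resulting curve.

    import numpy as np
    from sklearn.cluster import KMeans

    def within_cluster_sse(data, k_values=range(1, 11), random_state=0):
        # Within-cluster sum of squares (k-means "inertia") for each candidate k.
        sse = {}
        for k in k_values:
            km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(data)
            sse[k] = km.inertia_
        return sse

    # Example: print the curve and look for the k after which the decrease flattens.
    data = np.random.default_rng(0).normal(size=(300, 2))
    for k, value in within_cluster_sse(data).items():
        print(k, round(value, 1))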

In statistics and data mining, X-means clustering is a variation of k-means clustering that refines cluster assignments by repeatedly attempting subdivision, and keeping the best resulting splits, until a criterion such as the Akaike information criterion (AIC) or Bayesian information criterion (BIC) is reached.[6]
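The sketch below is a rough, simplified illustration of such a procedure rather than the published X-means algorithm: it repeatedly attempts to split each cluster in two with k-means and keeps a split only when a simplified spherical-Gaussian BIC score improves. The BIC formula, the minimum cluster size, and the cap k_max are assumptions made for the example.

    import numpy as np
    from sklearn.cluster import KMeans

    def bic_spherical(points, centers, labels):
        # Simplified BIC under a spherical Gaussian model (higher is better);
        # a rough stand-in for the score used in the published algorithm.
        n, p = points.shape
        k = len(centers)
        sse = sum(np.sum((points[labels == j] - centers[j]) ** 2) for j in range(k))
        var = max(sse / (n - k), 1e-12)
        loglik = -0.5 * n * p * np.log(2 * np.pi * var) - 0.5 * sse / var
        n_params = k * (p + 1)                  # rough parameter count
        return loglik - 0.5 * n_params * np.log(n)

    def xmeans_like(points, k_max=20, random_state=0):
        clusters = [np.asarray(points)]         # start with a single cluster
        improved = True
        while improved and len(clusters) < k_max:
            improved = False
            next_clusters = []
            for c in clusters:
                if len(c) < 4:                  # too small to split meaningfully
                    next_clusters.append(c)
                    continue
                parent = KMeans(n_clusters=1, n_init=1, random_state=random_state).fit(c)
                child = KMeans(n_clusters=2, n_init=5, random_state=random_state).fit(c)
                keep_split = (bic_spherical(c, child.cluster_centers_, child.labels_) >
                              bic_spherical(c, parent.cluster_centers_, parent.labels_))
                if keep_split:                  # keep the subdivision only if BIC improves
                    next_clusters.append(c[child.labels_ == 0])
                    next_clusters.append(c[child.labels_ == 1])
                    improved = True
                else:
                    next_clusters.append(c)
            clusters = next_clusters
        return len(clusters)                    # estimated number of clusters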

Rate distortion theory has been applied to choosing k in an approach called the "jump" method, which determines the number of clusters that maximizes efficiency while minimizing error by information-theoretic standards.[7]

The strategy of the algorithm is to generate a distortion curve for the input data by running a standard clustering algorithm such as k-means for all values of k between 1 and n, and computing the distortion (described below) of the resulting clustering.

The distortion curve is then transformed by a negative power chosen based on the dimensionality of the data. Jumps in the transformed values then signify reasonable choices for k, with the largest jump representing the best choice.

The distortion of a clustering of some input data is formally defined as follows: let the data set be modeled as a p-dimensional random variable, X, consisting of a mixture distribution of G components with common covariance, Γ. If c_1, ..., c_K are a set of K cluster centers, with c_X denoting the center closest to a given sample of X, then the minimum average distortion per dimension when fitting the K centers to the data is d_K = (1/p) min over c_1, ..., c_K of E[(X - c_X)^T Γ^(-1) (X - c_X)], that is, the average Mahalanobis distance per dimension between X and its closest cluster center.

The pseudo-code for the jump method with an input set of p-dimensional data points X therefore amounts to: cluster X for each k from 1 to n, transform each resulting distortion by the power -(p/2), and return the k with the largest jump between consecutive transformed values (taking the transformed distortion at k = 0 to be zero).
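A minimal sketch of this procedure, assuming k-means as the base clusterer, an identity covariance Γ (so the distortion reduces to the mean squared distance per dimension), and a practical cap k_max in place of n, could look as follows; it is illustrative rather than a reference implementation.

    import numpy as np
    from sklearn.cluster import KMeans

    def jump_method_k(data, k_max=10, random_state=0):
        n, p = data.shape
        y = p / 2.0                             # transform power Y = p/2
        transformed = [0.0]                     # transformed distortion at k = 0
        for k in range(1, k_max + 1):
            km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(data)
            d = km.inertia_ / (n * p)           # mean squared distance per dimension
            transformed.append(d ** (-y))
        jumps = np.diff(transformed)            # J(k) = D[k] - D[k - 1]
        return int(np.argmax(jumps)) + 1        # k with the largest jump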

The choice of the transform power -(p/2) is motivated by asymptotic reasoning from rate distortion theory. Let the data X have a single, arbitrary p-dimensional Gaussian distribution, and fix K = ⌊α^p⌋ for some α greater than one. Then, in the limit as p goes to infinity, the distortion of a clustering with K centers approaches α^(-2), so the distortion raised to the power -(p/2) is asymptotically proportional to α^p, which is approximately the number of clusters K. For a single Gaussian distribution, increasing K beyond the true number of clusters (which should be one) therefore causes only linear growth in the transformed distortion.

This behavior is important in the general case of a mixture of multiple distribution components.

Although the mathematical support for the method is given in terms of asymptotic results, the algorithm has been empirically verified to work well in a variety of data sets with reasonable dimensionality.

The broken line method identifies the jump point in the graph of the transformed distortion by doing a simple least squares error line fit of two line segments, which in theory will fall along the x-axis for K < G, and along the linearly increasing phase of the transformed distortion plot for K ≥ G. The broken line method is more robust than the jump method in that its decision is global rather than local, but it also relies on the assumption of Gaussian mixture components, whereas the jump method is fully non-parametric and has been shown to be viable for general mixture distributions.
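A minimal sketch of such a two-segment least-squares fit, applied to the transformed distortion values for k = 1, ..., n, is given below; the exact fitting details are assumptions made for the example.

    import numpy as np

    def broken_line_k(transformed_distortion):
        # transformed_distortion[i] is the transformed distortion for k = i + 1.
        d = np.asarray(transformed_distortion, dtype=float)
        ks = np.arange(1, len(d) + 1)
        best_k, best_err = None, np.inf
        for g in range(2, len(d)):              # candidate true number of clusters G
            left = d[:g - 1]                    # k < G: should lie along the x-axis
            right_k, right_d = ks[g - 1:], d[g - 1:]
            err_left = np.sum(left ** 2)        # squared deviation from the zero line
            slope, intercept = np.polyfit(right_k, right_d, 1)
            err_right = np.sum((right_d - (slope * right_k + intercept)) ** 2)
            if err_left + err_right < best_err:
                best_k, best_err = g, err_left + err_right
        return best_k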

The average silhouette of the data is another useful criterion for assessing the natural number of clusters. The silhouette of a data point measures how closely it is matched to the other points within its cluster and how loosely it is matched to the points of the neighbouring cluster, i.e., the cluster whose average distance from that point is lowest.

Optimization techniques such as genetic algorithms are useful in determining the number of clusters that gives rise to the largest silhouette.[9]
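As a simpler alternative to a genetic algorithm, the sketch below exhaustively searches a range of candidate k and returns the value with the largest average silhouette, using scikit-learn's silhouette_score; the search range is an assumption made for the example.

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def best_k_by_silhouette(data, k_values=range(2, 11), random_state=0):
        scores = {}
        for k in k_values:
            labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(data)
            scores[k] = silhouette_score(data, labels)   # average silhouette over all points
        return max(scores, key=scores.get)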

It is also possible to re-scale the data in such a way that the silhouette is more likely to be maximized at the correct number of clusters.

The kernel matrix, which encodes the pairwise similarity (closeness) of the input points, can also be analyzed in order to find the optimal number of clusters.

The method proceeds by an eigenvalue decomposition of the kernel matrix; the eigenvalues and eigenvectors are then analyzed to obtain a measure of the compactness of the input distribution.

Unlike previous methods, this technique does not need to perform any clustering a priori.
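The sketch below illustrates one simple variant of this idea using only the eigenvalue spectrum of a Gaussian (RBF) kernel matrix: the number of dominant eigenvalues before the largest relative drop is taken as the estimate. The kernel bandwidth gamma and the drop rule are assumptions made for the example, not the published method.

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel

    def kernel_spectrum_k(data, gamma=1.0):
        kernel = rbf_kernel(data, gamma=gamma)       # pairwise similarity (kernel) matrix
        eigvals = np.linalg.eigvalsh(kernel)[::-1]   # eigenvalues, largest first
        eigvals = np.clip(eigvals, 0.0, None)
        # Number of dominant eigenvalues before the largest relative drop.
        ratios = eigvals[1:] / np.maximum(eigvals[:-1], 1e-12)
        return int(np.argmin(ratios)) + 1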

Robert Tibshirani, Guenther Walther, and Trevor Hastie proposed estimating the number of clusters in a data set via the gap statistic.[13]

Based on theoretical grounds, the gap statistic measures how far the pooled within-cluster sum of squares around the cluster centers falls below the sum of squares expected under a null reference distribution of the data.

The optimal number of clusters is then estimated as the value of k for which the observed sum of squares falls farthest below the null reference.
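A minimal sketch of the gap statistic, drawing uniform reference data from the bounding box of the input and omitting the standard-error adjustment from the original proposal, could look like this; the number of reference draws and the range of k are assumptions made for the example.

    import numpy as np
    from sklearn.cluster import KMeans

    def log_wk(data, k, random_state=0):
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(data)
        return np.log(km.inertia_)                   # log pooled within-cluster sum of squares

    def gap_statistic_k(data, k_max=10, n_refs=10, random_state=0):
        rng = np.random.default_rng(random_state)
        lo, hi = data.min(axis=0), data.max(axis=0)
        gaps = []
        for k in range(1, k_max + 1):
            ref = np.mean([log_wk(rng.uniform(lo, hi, size=data.shape), k)
                           for _ in range(n_refs)])  # expected value under the null
            gaps.append(ref - log_wk(data, k))       # gap(k)
        return int(np.argmax(gaps)) + 1              # k where data falls farthest below the null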

Figure: explained variance as a function of the number of clusters; the "elbow", indicated by the red circle, suggests choosing 4 clusters.