Silhouette is a method of interpretation and validation of consistency within clusters of data.
The technique provides a succinct graphical representation of how well each object has been classified.[1] It was proposed by Belgian statistician Peter Rousseeuw in 1987.
The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
If most objects have a high value, then the clustering configuration is appropriate.
A clustering with an average silhouette width of over 0.7 is considered "strong", a value over 0.5 "reasonable", and a value over 0.25 "weak". However, with increasing dimensionality of the data it becomes difficult to achieve such high values because of the curse of dimensionality, as the distances become more similar.
Assume the data have been clustered via any technique, such as k-medoids or k-means, into k clusters. For a data point i in cluster C_I, let

 a(i) = (1 / (|C_I| − 1)) Σ_{j ∈ C_I, j ≠ i} d(i, j)

be the mean distance between i and all other data points in the same cluster, where |C_I| is the size of cluster C_I and d(i, j) is the distance between points i and j, and let

 b(i) = min_{J ≠ I} (1 / |C_J|) Σ_{j ∈ C_J} d(i, j)

be the smallest mean distance from i to the points of any other cluster; the minimizing cluster is called the "neighboring cluster" of i. We now define a silhouette (value) of one data point i as

 s(i) = (b(i) − a(i)) / max{a(i), b(i)} if |C_I| > 1.

Note that a(i) is not clearly defined for clusters of size 1, in which case we set s(i) = 0. This choice is arbitrary, but neutral in the sense that 0 is the midpoint of the bounds, −1 and 1. From this definition it follows that −1 ≤ s(i) ≤ 1. As a(i) measures how dissimilar i is to its own cluster, a small value means it is well matched. An s(i) close to 1 means that the datum is appropriately clustered, while an s(i) near zero means that the datum is on the border of two natural clusters.
Thus silhouette plots and means may be used to determine the natural number of clusters within a dataset.
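As a concrete illustration, the per-point silhouette values can be computed directly from their definition. The following is a minimal NumPy sketch assuming Euclidean distances; the function name and toy data are illustrative, not from any library:

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouette s(i), computed from all pairwise Euclidean distances."""
    # Full pairwise distance matrix: O(N^2) space and time.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # singleton cluster: s(i) = 0 by convention
        # Mean distance to the other members of i's own cluster
        # (D[i, i] = 0, so dividing by |C_I| - 1 excludes i itself).
        a = D[i, own].sum() / (own.sum() - 1)
        # Smallest mean distance to any other cluster (the "neighboring cluster").
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Two well-separated toy clusters: silhouettes should be close to 1.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(silhouette_values(X, labels))
```

With clearly separated clusters, a(i) is tiny relative to b(i), so every s(i) approaches 1; shrinking the gap between the two groups drives the values toward zero.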
One can also increase the likelihood of the silhouette being maximized at the correct number of clusters by re-scaling the data using feature weights that are cluster specific.[4] Kaufman et al. introduced the term silhouette coefficient for the maximum value of the mean s(i) over all data of the entire dataset:

 SC = max_k S̃(k),

where S̃(k) denotes the mean s(i) over all data for a specific number of clusters k. Computing the silhouette coefficient requires all O(N²) pairwise distances, making this evaluation much more costly than clustering with k-means.
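In practice, the silhouette coefficient can be obtained by scanning candidate numbers of clusters and keeping the maximum mean silhouette. The following self-contained sketch uses a naive Lloyd-style k-means with random restarts; all function names and the toy data are illustrative assumptions, not part of any cited method:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=50, restarts=10):
    """Naive Lloyd's algorithm with restarts; returns the labels of the lowest-inertia run."""
    best, best_inertia = None, np.inf
    for _ in range(restarts):
        centers = X[rng.choice(len(X), k, replace=False)].copy()
        for _ in range(iters):
            labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(axis=0)
        inertia = ((X - centers[labels]) ** 2).sum()
        if inertia < best_inertia:
            best, best_inertia = labels, inertia
    return best

def mean_silhouette(X, labels):
    """Mean s(i) over all points, per the definition in the text."""
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    s = []
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            s.append(0.0)  # singleton cluster convention
            continue
        a = D[i, own].sum() / (own.sum() - 1)
        b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        s.append((b - a) / max(a, b))
    return float(np.mean(s))

# Three well-separated blobs; the scan should peak at k = 3.
X = np.vstack([rng.normal(c, 0.2, size=(20, 2)) for c in ([0, 0], [4, 0], [0, 4])])
scores = {k: mean_silhouette(X, kmeans(X, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])  # the silhouette coefficient is this maximum mean silhouette
```

The scan makes the O(N²) cost concrete: each candidate k requires a full mean-silhouette evaluation on top of the clustering itself.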
If each point's dissimilarity is computed only to the cluster centers μ_{C_J}, so that a'(i) = d(i, μ_{C_I}) and b'(i) = min_{J ≠ I} d(i, μ_{C_J}), then a'(i) is always defined; one then defines accordingly the simplified silhouette

 s'(i) = (b'(i) − a'(i)) / max{a'(i), b'(i)}

and the simplified silhouette coefficient.[6] If the cluster centers are medoids (as in k-medoids clustering) instead of arithmetic means (as in k-means clustering), this is also called the medoid-based silhouette[7] or medoid silhouette.[8] If every object is assigned to the nearest medoid (as in k-medoids clustering), we know that a'(i) ≤ b'(i), and hence s'(i) = 1 − a'(i)/b'(i).
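Under this nearest-medoid assignment, the medoid silhouette reduces to comparing each point's distance to its own medoid against its distance to the nearest other medoid. A sketch with a hypothetical helper, assuming a precomputed distance matrix and distinct medoids:

```python
import numpy as np

def medoid_silhouette_values(D, medoid_idx, labels):
    """Medoid silhouette s'(i) = 1 - a'(i)/b'(i).

    D: full pairwise distance matrix; medoid_idx: index of each cluster's medoid,
    ordered by cluster label. Assumes nearest-medoid assignment, so a'(i) <= b'(i).
    """
    dm = D[:, medoid_idx]                  # distance from each point to each medoid
    n = len(D)
    a = dm[np.arange(n), labels]           # distance to own medoid
    dm_other = dm.copy()
    dm_other[np.arange(n), labels] = np.inf
    b = dm_other.min(axis=1)               # distance to nearest other medoid
    return 1.0 - a / b

# Toy example: two clusters with medoids at indices 0 and 2.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
labels = np.array([0, 0, 1, 1])
vals = medoid_silhouette_values(D, [0, 2], labels)
print(vals)
```

Only the N·k point-to-medoid distances are needed per evaluation, which is what makes this variant so much cheaper than the full silhouette.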
[8] Instead of using the average silhouette to evaluate a clustering obtained from, e.g., k-medoids or k-means, we can try to directly find a solution that maximizes the silhouette. We do not have a closed-form solution to maximize this, but it will usually be best to assign points to the nearest cluster, as done by these methods.
Van der Laan et al.[7] proposed to adapt the standard algorithm for k-medoids, PAM, for this purpose and call this algorithm PAMSIL:

1. Choose initial medoids by using PAM.
2. Compute the average silhouette of this initial solution.
3. For each pair of a medoid m and a non-medoid x: swap m and x, compute the average silhouette of the resulting solution, keep the best swap, and un-swap m and x before trying the next pair.
4. Repeat step 3 until no further improvement is found.

The loop in step 3 is executed for O(k(N − k)) pairs, and each candidate requires computing the silhouette in O(N²), hence this algorithm needs O(i·k·N³) time, where i is the number of iterations.
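The swap search can be sketched as follows. This is an unoptimized illustration, not the authors' code: the average silhouette is recomputed from scratch for every candidate swap, which is exactly what makes each pass expensive.

```python
import numpy as np

def avg_silhouette(D, labels):
    """Mean silhouette from a pairwise distance matrix D."""
    s = 0.0
    for i in range(len(D)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # singleton cluster contributes s(i) = 0
        a = D[i, own].sum() / (own.sum() - 1)
        b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        s += (b - a) / max(a, b)
    return s / len(D)

def pamsil_swap(D, medoids):
    """PAMSIL-style search: try every (medoid, non-medoid) swap,
    keep swaps that improve the average silhouette, until no improvement."""
    medoids = list(medoids)
    assign = lambda m: np.argmin(D[:, m], axis=1)  # nearest-medoid labels
    best = avg_silhouette(D, assign(medoids))
    improved = True
    while improved:
        improved = False
        for mi in range(len(medoids)):
            for x in range(len(D)):
                if x in medoids:
                    continue
                cand = medoids.copy()
                cand[mi] = x
                score = avg_silhouette(D, assign(cand))  # O(N^2) per candidate
                if score > best:
                    best, medoids, improved = score, cand, True
    return medoids, best

# Two well-separated blobs; both initial medoids start in the first blob.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(15, 2)) for c in ([0, 0], [4, 4])])
D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
meds, score = pamsil_swap(D, [0, 1])
print(meds, round(score, 3))
```

The search moves one medoid into the second blob, since that maximizes the average silhouette.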
Because this is a fairly expensive operation, the authors propose to also use the medoid-based silhouette, and call the resulting algorithm PAMMEDSIL, which needs O(i·k²·N²) time.
Batool et al. propose a similar algorithm under the name OSil, along with a CLARA-like sampling strategy for larger data sets that solves the problem only for a sub-sample.
[9] By adopting recent improvements to the PAM algorithm, FastMSC reduces the runtime using the medoid silhouette to just O(N²).
[8] By starting with a maximum number of clusters kmax and iteratively removing the worst center (in terms of a change in silhouette) and re-optimizing, the best (highest medoid silhouette) clustering can be automatically determined.
As data structures can be reused, this reduces the computation cost substantially over repeatedly running the algorithm for different numbers of clusters.
The O(N²) memory requirement is the main limiting factor for applying this to very large data sets.