The Calinski–Harabasz index (CHI), also known as the Variance Ratio Criterion (VRC), is a metric for evaluating clustering algorithms, introduced by Tadeusz Caliński and Jerzy Harabasz in 1974.
Given a data set of n points {x1, ..., xn} and an assignment of these points to k clusters {C1, ..., Ck}, the Calinski–Harabasz (CH) index is defined as the ratio of the between-cluster separation (BCSS) to the within-cluster dispersion (WCSS), each normalized by its number of degrees of freedom:
$$\mathrm{CH} = \frac{\mathrm{BCSS} / (k - 1)}{\mathrm{WCSS} / (n - k)}$$
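BCSS (Between-Cluster Sum of Squares) is the weighted sum of squared Euclidean distances between each cluster centroid (mean) and the overall data centroid (mean):
$$\mathrm{BCSS} = \sum_{i=1}^{k} n_i \lVert \mathbf{c}_i - \mathbf{c} \rVert^2$$
where n_i is the number of points in cluster C_i, c_i is the centroid of C_i, and c is the overall centroid of the data.
BCSS measures how well the clusters are separated from each other (the higher the better).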
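WCSS (Within-Cluster Sum of Squares) is the sum of squared Euclidean distances between the data points and their respective cluster centroids:
$$\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} \lVert \mathbf{x} - \mathbf{c}_i \rVert^2$$
where c_i is the centroid of cluster C_i.
WCSS measures the compactness or cohesiveness of the clusters (the smaller the better).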
Minimizing the WCSS is the objective of centroid-based clustering algorithms such as k-means.
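The numerator of the CH index is the between-cluster separation (BCSS) divided by its degrees of freedom, k − 1: the k cluster centroids are tied together by the overall centroid, which leaves k − 1 of them free.
The denominator of the CH index is the within-cluster dispersion (WCSS) divided by its degrees of freedom, n − k: the n points are measured against k centroids estimated from the same data.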
Dividing both the BCSS and WCSS by their degrees of freedom helps to normalize the values, making them comparable across different numbers of clusters.
Without this normalization, the CH index could be artificially inflated for higher values of k, making it hard to determine whether an increase in the index value is due to genuinely better clustering or just due to the increased number of clusters.
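The definition can be made concrete with a small from-scratch computation. The following is a minimal sketch in plain Python, not a reference implementation: the helper names (`centroid`, `sq_dist`, `calinski_harabasz`) and the toy data are hypothetical and chosen only for illustration, and the sketch assumes a hard clustering with 2 ≤ k ≤ n − 1 and a nonzero WCSS.

```python
from math import fsum

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(fsum(coords) / n for coords in zip(*points))

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return fsum((a - b) ** 2 for a, b in zip(p, q))

def calinski_harabasz(points, labels):
    """Return (BCSS, WCSS, CH) for a clustering given as one label per point."""
    n = len(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if not 1 < k < n:
        raise ValueError("CH is defined only for 2 <= k <= n - 1")

    overall = centroid(points)
    bcss = 0.0
    wcss = 0.0
    for members in clusters.values():
        c = centroid(members)
        # Between-cluster term: cluster size times squared centroid offset.
        bcss += len(members) * sq_dist(c, overall)
        # Within-cluster term: squared distances of members to their centroid.
        wcss += fsum(sq_dist(p, c) for p in members)

    # Assumes wcss > 0 (at least one point differs from its cluster centroid).
    ch = (bcss / (k - 1)) / (wcss / (n - k))
    return bcss, wcss, ch

# Two tight, well-separated squares of four points each.
points = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0),
          (10.0, 10.0), (10.0, 12.0), (12.0, 10.0), (12.0, 12.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

bcss, wcss, ch = calinski_harabasz(points, labels)
print(bcss, wcss, ch)  # BCSS = 400.0, WCSS = 16.0, CH is approximately 150
```

Here the cluster centroids are (1, 1) and (11, 11) and the overall centroid is (6, 6), so BCSS = 4·50 + 4·50 = 400 and WCSS = 8·2 = 16; with k = 2 and n = 8 the index is (400 / 1) / (16 / 6) = 150.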
Although there is no satisfactory probabilistic foundation to support the use of the CH index, the criterion has some desirable mathematical properties.
In addition, it is analogous to the F-test statistic in univariate analysis.
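The analogy can be checked numerically: for one-dimensional data, with the groups taken as the clusters, the CH index has the same form as the one-way ANOVA F statistic (between-group mean square over within-group mean square). The sketch below assumes NumPy, SciPy and scikit-learn are installed; the synthetic groups are made up for illustration.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.metrics import calinski_harabasz_score

# Three synthetic one-dimensional groups (illustrative data only).
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=1.0, size=30) for mu in (0.0, 1.5, 4.0)]

# One-way ANOVA F statistic across the three groups.
f_stat = f_oneway(*groups).statistic

# CH index of the same data, treating each group as a cluster.
X = np.concatenate(groups).reshape(-1, 1)   # shape (90, 1)
labels = np.repeat([0, 1, 2], 30)           # group membership per point
ch = calinski_harabasz_score(X, labels)

print(f_stat, ch)  # the two values should agree up to floating-point error
```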
Similar to other clustering evaluation metrics such as the silhouette score, the CH index can be used to find the optimal number of clusters k in algorithms like k-means, where the value of k is not known a priori.
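This can be done by running the clustering algorithm for a range of candidate values of k, computing the CH index for each resulting clustering, and choosing the value of k that yields the highest index.
The scikit-learn Python library provides an implementation of this metric in the sklearn.metrics module, as the function calinski_harabasz_score.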
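Selecting k with the CH index amounts to clustering the data for several candidate values of k and keeping the one with the highest score. The following sketch uses scikit-learn's `KMeans`, `make_blobs` and `calinski_harabasz_score`; the synthetic data set, the candidate range 2 to 6, and the fixed random seeds are assumptions made for the example, not recommendations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic data: three well-separated Gaussian blobs (illustrative only).
X, _ = make_blobs(
    n_samples=300,
    centers=[(-10, -10), (0, 10), (10, -10)],
    cluster_std=1.0,
    random_state=0,
)

# Cluster for each candidate k and record the CH index of the result.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)

# The candidate with the highest CH index is taken as the number of clusters.
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(k, round(score, 1))
print("selected k:", best_k)
```

On data with less clearly separated groups the maximum may be less pronounced, so it can help to inspect the whole score curve rather than only the selected k.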