Features and functionality described on this page are available with Prism Enterprise.
The Calinski-Harabasz index, also known as the variance ratio criterion, is one of the most widely used methods for determining the optimal number of clusters in clustering analysis. Developed by Calinski and Harabasz in 1974, this method evaluates clustering quality by comparing the between-cluster dispersion to the within-cluster dispersion. The optimal number of clusters corresponds to the maximum value of the Calinski-Harabasz index.
The fundamental principle behind this method is that good clustering should maximize the separation between clusters while minimizing the variance within clusters. The Calinski-Harabasz index captures this trade-off by calculating a ratio: clusters that are well-separated from each other and internally cohesive will produce higher index values.
The Calinski-Harabasz index is calculated as follows:
$$\mathrm{CH}(k) = \frac{\mathrm{BCSS}/(k-1)}{\mathrm{WCSS}/(n-k)}$$
where:
•k is the number of clusters
•n is the total number of data points
•BCSS is the between-cluster sum of squares
•WCSS is the within-cluster sum of squares
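As a quick arithmetic check of the ratio, here is a minimal Python sketch. It assumes the standard form of the index, (BCSS / (k - 1)) / (WCSS / (n - k)). The function name `calinski_harabasz` and the input numbers are illustrative only and are not taken from Prism.

```python
def calinski_harabasz(bcss: float, wcss: float, n: int, k: int) -> float:
    """Variance ratio criterion: (BCSS / (k - 1)) / (WCSS / (n - k))."""
    # The ratio divides by (k - 1) and (n - k), so it needs 1 < k < n.
    if not 1 < k < n:
        raise ValueError("the index is only defined for 1 < k < n")
    if wcss == 0:
        raise ValueError("WCSS is zero, so the ratio is undefined")
    return (bcss / (k - 1)) / (wcss / (n - k))


# Made-up inputs for illustration: 23 points split into 3 clusters.
ch = calinski_harabasz(bcss=120.0, wcss=30.0, n=23, k=3)
print(ch)  # (120 / 2) / (30 / 20) = 60 / 1.5 = 40.0
```

Because the denominators are k - 1 and n - k, the index cannot be evaluated for a single cluster (k = 1) or when every point is its own cluster (k = n).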
The between-cluster sum of squares measures the dispersion between cluster centroids:
$$\mathrm{BCSS} = \sum_{i=1}^{k} n_i \, \lVert c_i - \bar{c} \rVert^2$$
where:
•$n_i$ is the number of points in cluster $i$
•$c_i$ is the centroid of cluster $i$
•$\bar{c}$ is the overall centroid of all data points
•$\lVert c_i - \bar{c} \rVert^2$ is the squared Euclidean distance between the centroid of cluster $i$ and the overall centroid
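The between-cluster sum of squares can be sketched in plain Python as follows. The helper names (`centroid`, `squared_distance`, `bcss`) and the four-point dataset are assumptions made for illustration; this is not Prism's implementation.

```python
from collections import defaultdict


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def bcss(points, labels):
    """Sum over clusters of n_i * ||c_i - overall centroid||^2."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    overall = centroid(points)
    return sum(
        len(members) * squared_distance(centroid(members), overall)
        for members in clusters.values()
    )


# Toy data: two clusters of two points each, overall centroid at (2, 1).
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
labels = [0, 0, 1, 1]
print(bcss(points, labels))  # 2 * 4 + 2 * 4 = 16.0
```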
The within-cluster sum of squares measures the dispersion within clusters:
$$\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert^2$$
where:
•$C_i$ is the set of points in cluster $i$
•$x$ represents each individual data point in cluster $i$
•$c_i$ is the centroid of cluster $i$
•$\lVert x - c_i \rVert^2$ is the squared Euclidean distance between point $x$ and its cluster centroid
Note that WCSS is identical to the Within-Cluster Sum of Squares used in the elbow method.
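The within-cluster sum of squares can be sketched in plain Python as follows. The helper names (`centroid`, `squared_distance`, `wcss`) and the toy data are illustrative assumptions rather than Prism's implementation.

```python
from collections import defaultdict


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def wcss(points, labels):
    """Sum over clusters of squared distances from each point to its centroid."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    total = 0.0
    for members in clusters.values():
        c = centroid(members)
        total += sum(squared_distance(x, c) for x in members)
    return total


# Toy data: each cluster has centroid 1 unit away from both of its points.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
labels = [0, 0, 1, 1]
print(wcss(points, labels))  # (1 + 1) + (1 + 1) = 4.0
```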
The Calinski-Harabasz index increases when clusters are dense and well-separated. Higher values indicate better clustering:
•Higher CH values: Suggest that the clustering solution has compact, well-separated clusters
•Lower CH values: Indicate either that clusters overlap significantly or that points within clusters are widely dispersed
The optimal number of clusters is determined by finding the value of k that maximizes the Calinski-Harabasz index. Unlike some other methods that look for "elbow" points or threshold values, this method simply selects the peak value.
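The selection rule can be illustrated with a small self-contained sketch. It assumes the candidate labelings for each k have already been produced by some clustering algorithm. Here they are written by hand for a one-dimensional toy dataset, and all names (`ch_index`, `candidate_labelings`) are hypothetical.

```python
from collections import defaultdict


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def ch_index(points, labels):
    """Calinski-Harabasz index: (BCSS / (k - 1)) / (WCSS / (n - k))."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    n, k = len(points), len(clusters)
    if not 1 < k < n:
        raise ValueError("the index is only defined for 1 < k < n")
    overall = centroid(points)
    bcss = 0.0
    wcss = 0.0
    for members in clusters.values():
        c = centroid(members)
        bcss += len(members) * squared_distance(c, overall)
        wcss += sum(squared_distance(x, c) for x in members)
    if wcss == 0:
        raise ValueError("WCSS is zero, so the ratio is undefined")
    return (bcss / (k - 1)) / (wcss / (n - k))


# Toy data with three visually obvious groups near 0, 10, and 20.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (20.0,), (21.0,)]

# Hand-written labelings standing in for the output of a clustering run.
candidate_labelings = {
    2: [0, 0, 1, 1, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
    4: [0, 0, 1, 1, 2, 3],
}

scores = {k: ch_index(points, labels) for k, labels in candidate_labelings.items()}
best_k = max(scores, key=scores.get)  # the k with the largest index value
print(scores)
print(best_k)
```

On this toy data the index peaks at k = 3: merging two groups (k = 2) inflates the within-cluster dispersion, while splitting a tight pair (k = 4) gains almost no separation but pays for the extra cluster through the k - 1 and n - k terms.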
The Calinski-Harabasz index offers several advantages:
•It's computationally efficient and doesn't require additional simulations
•The interpretation is straightforward (higher is better)
•It works well when clusters are roughly spherical and similar in size
However, like most clustering validation methods, it may be less reliable when:
•Clusters have very different sizes or densities
•Clusters have non-spherical shapes
•The data contains significant noise or outliers
The Calinski-Harabasz index is one of 17 methods used in Prism's consensus approach for determining optimal cluster numbers, as described on the cluster metrics page.