Features and functionality described on this page are available with Prism Enterprise.

The Calinski-Harabasz index, also known as the variance ratio criterion, is one of the most widely used methods for determining the optimal number of clusters in clustering analysis. Developed by Calinski and Harabasz in 1974, this method evaluates clustering quality by comparing the between-cluster dispersion to the within-cluster dispersion. The optimal number of clusters corresponds to the maximum value of the Calinski-Harabasz index.

The fundamental principle behind this method is that good clustering should maximize the separation between clusters while minimizing the variance within clusters. The Calinski-Harabasz index captures this trade-off by calculating a ratio: clusters that are well-separated from each other and internally cohesive will produce higher index values.

Mathematical calculations

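The Calinski-Harabasz index (CH) is the ratio of the between-cluster dispersion to the within-cluster dispersion, each divided by its degrees of freedom:

$$\mathrm{CH} = \frac{\mathrm{BCSS} / (k - 1)}{\mathrm{WCSS} / (n - k)}$$

where: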

k is the number of clusters

n is the total number of data points

BCSS is the between-cluster sum of squares

WCSS is the within-cluster sum of squares
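As a minimal sketch of the ratio defined above, the following Python function computes the index from the two sums of squares. The function name `calinski_harabasz` and the input checks are illustrative assumptions, not part of Prism. The check on k reflects the fact that the ratio is undefined for k = 1 (division by k - 1 = 0) and for k = n (division by n - k = 0).

```python
def calinski_harabasz(bcss: float, wcss: float, n: int, k: int) -> float:
    """Return (BCSS / (k - 1)) / (WCSS / (n - k)).

    Hypothetical helper for illustration only.
    """
    # The ratio is undefined unless 2 <= k < n.
    if not 2 <= k < n:
        raise ValueError("the index requires 2 <= k < n")
    # Assumption: a zero within-cluster sum of squares is treated as an error
    # here rather than returning infinity.
    if wcss <= 0:
        raise ValueError("WCSS must be positive")
    return (bcss / (k - 1)) / (wcss / (n - k))


# Example with made-up sums of squares: 33 points in 3 clusters.
print(calinski_harabasz(bcss=120.0, wcss=30.0, n=33, k=3))  # (120/2) / (30/30) = 60.0
```

With these made-up inputs, the between-cluster mean square is 60 and the within-cluster mean square is 1, so the index is 60.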

Between-cluster sum of squares (BCSS)

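The between-cluster sum of squares measures the dispersion of the cluster centroids around the overall centroid, weighted by cluster size:

$$\mathrm{BCSS} = \sum_{i=1}^{k} n_i \, \lVert c_i - \bar{c} \rVert^2$$

where: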

n_i is the number of points in cluster i

c_i is the centroid of cluster i

c̄ is the overall centroid of all data points

||c_i - c̄||^2 is the squared Euclidean distance between cluster centroid i and the overall centroid
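The BCSS definition above can be sketched in plain Python as follows. The helper names (`centroid`, `sq_dist`, `between_cluster_ss`) and the four-point dataset are assumptions made for illustration; they are not taken from Prism.

```python
from collections import defaultdict


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(coord) / n for coord in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def between_cluster_ss(points, labels):
    """Sum over clusters of n_i * ||c_i - overall centroid||^2."""
    overall = centroid(points)
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    return sum(
        len(members) * sq_dist(centroid(members), overall)
        for members in clusters.values()
    )


# Illustrative data: two clusters of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
labels = [0, 0, 1, 1]
print(between_cluster_ss(points, labels))  # 16.0
```

Here the overall centroid is (2, 1) and the cluster centroids are (0, 1) and (4, 1), each at squared distance 4 from it, so BCSS = 2 × 4 + 2 × 4 = 16.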

Within-cluster sum of squares (WCSS)

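The within-cluster sum of squares measures the dispersion of the data points around their own cluster centroids:

$$\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert^2$$

where: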

C_i is the set of points in cluster i

x represents each individual data point in cluster i

c_i is the centroid of cluster i

c̄ is the overall centroid of all data points (used in BCSS; it does not appear in the WCSS formula)

||x - c_i||^2 is the squared Euclidean distance between point x and its cluster centroid

Note that WCSS is identical to the Within-Cluster Sum of Squares used in the elbow method.
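A matching sketch for WCSS, again in plain Python with hypothetical helper names and an illustrative four-point dataset:

```python
from collections import defaultdict


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(coord) / n for coord in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def within_cluster_ss(points, labels):
    """Sum over clusters of ||x - c_i||^2 for every point x in cluster i."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    total = 0.0
    for members in clusters.values():
        c = centroid(members)
        total += sum(sq_dist(p, c) for p in members)
    return total


# Illustrative data: two clusters of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
labels = [0, 0, 1, 1]
print(within_cluster_ss(points, labels))  # 4.0
```

Each point lies at squared distance 1 from its cluster centroid, so WCSS = 4. For the same data, the total sum of squares around the overall centroid (2, 1) is 20, and BCSS is the remaining 16.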

Interpretation

The Calinski-Harabasz index increases when clusters are dense and well-separated. Higher values indicate better clustering:

Higher CH values: Suggest that the clustering solution has compact, well-separated clusters

Lower CH values: Indicate either that clusters overlap significantly or that points within clusters are widely dispersed

The optimal number of clusters is determined by finding the value of k that maximizes the Calinski-Harabasz index. Unlike some other methods that look for "elbow" points or threshold values, this method simply selects the peak value.
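The selection rule can be sketched as follows: compute the index for each candidate value of k and keep the value with the highest score. In this sketch the candidate partitions are written by hand for a tiny one-dimensional dataset; in practice they would come from a clustering algorithm run once per k. All names (`ch_index`, `candidates`, `best_k`) are illustrative assumptions, not Prism's implementation.

```python
from collections import defaultdict


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(coord) / n for coord in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ch_index(points, labels):
    """Calinski-Harabasz index: (BCSS / (k - 1)) / (WCSS / (n - k))."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    n, k = len(points), len(clusters)
    if not 2 <= k < n:
        raise ValueError("the index requires 2 <= k < n")
    overall = centroid(points)
    bcss = wcss = 0.0
    for members in clusters.values():
        c = centroid(members)
        bcss += len(members) * sq_dist(c, overall)
        wcss += sum(sq_dist(p, c) for p in members)
    # Note: raises ZeroDivisionError if every point equals its centroid.
    return (bcss / (k - 1)) / (wcss / (n - k))


# Illustrative 1-D data with three visible groups near 0, 10, and 20.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (20.0,), (21.0,)]

# Hand-written candidate partitions, one per k (an assumption for this sketch).
candidates = {
    2: [0, 0, 0, 0, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
    4: [0, 0, 1, 1, 2, 3],
}

scores = {k: ch_index(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(best_k)  # 3
```

For this data the three-cluster partition scores 400, compared with about 11.8 for two clusters and 267 for four, so k = 3 is selected.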

Advantages and considerations

The Calinski-Harabasz index offers several advantages:

It's computationally efficient and doesn't require additional simulations

The interpretation is straightforward (higher is better)

It works well when clusters are roughly spherical and similar in size

However, like most clustering validation methods, it may be less reliable when:

Clusters have very different sizes or densities

Clusters have non-spherical shapes

The data contains significant noise or outliers

The Calinski-Harabasz index is one of 17 methods used in Prism's consensus approach for determining optimal cluster numbers, as described on the cluster metrics page.

© 1995-2019 GraphPad Software, LLC. All rights reserved.