Features and functionality described on this page are available with Prism Enterprise.

The Hartigan index is a clustering validation metric introduced by Hartigan in 1975 that evaluates the improvement in clustering quality when increasing the number of clusters. It measures the relative change in the within-cluster sum of squares when moving from k to k+1 clusters, scaled by the remaining degrees of freedom, (n-k-1).

The Hartigan index is based on the principle that adding a new cluster should provide a significant improvement in the within-cluster sum of squares. When this improvement becomes small relative to the remaining degrees of freedom, it suggests that the optimal number of clusters has been reached.

Mathematical calculation

The Hartigan index is calculated as follows:

$$\mathrm{Hartigan}(k) = \left(\frac{\mathrm{WCSS}_k}{\mathrm{WCSS}_{k+1}} - 1\right)(n - k - 1)$$

where:

k is the number of clusters

WCSS_k is the within-cluster sum of squares for k clusters

WCSS_{k+1} is the within-cluster sum of squares for k+1 clusters

n is the total number of data points
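
As a concrete illustration of the formula (a minimal sketch, not Prism's implementation; the function name and arguments are hypothetical), the index can be computed directly from the two WCSS values:

```python
def hartigan_index(wcss_k, wcss_k_plus_1, n, k):
    """Hartigan index for k clusters (illustrative helper, not Prism code).

    wcss_k        : within-cluster sum of squares with k clusters
    wcss_k_plus_1 : within-cluster sum of squares with k+1 clusters
    n             : total number of data points
    k             : current number of clusters
    """
    # Relative improvement from k to k+1 clusters, scaled by the
    # remaining degrees of freedom (n - k - 1).
    return (wcss_k / wcss_k_plus_1 - 1.0) * (n - k - 1)
```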

Within-cluster sum of squares

The within-cluster sum of squares (WCSS) is calculated as:

$$\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert^2$$

where:

C_i is the i-th cluster

c_i is the centroid of cluster i

||x - c_i||^2 is the squared Euclidean distance from point x to centroid c_i
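
For illustration only (a minimal sketch using NumPy, which this page does not itself assume), the WCSS can be computed from the data points, their cluster assignments, and the cluster centroids:

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-cluster sum of squares (illustrative helper).

    X         : (n, d) array of data points
    labels    : (n,) array of cluster assignments, 0 .. k-1
    centroids : (k, d) array of cluster centroids
    """
    # Squared Euclidean distance from each point to its own centroid,
    # summed over all points.
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))
```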

Interpretation

The Hartigan index measures the scaled improvement when adding one more cluster:

Higher Hartigan values: Indicate that adding another cluster provides substantial improvement in within-cluster sum of squares

Lower Hartigan values: Suggest that adding another cluster provides diminishing returns

The optimal number of clusters can be determined by finding the largest difference in the index between adjacent numbers of clusters. This corresponds to the largest drop in the Hartigan index, indicating the point where adding more clusters no longer provides a significant improvement.
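
As a minimal sketch of this "largest drop" criterion, assuming the Hartigan index has already been computed for a range of k values (the function name is hypothetical):

```python
def k_at_largest_drop(hartigan_by_k):
    """Return the k after which the Hartigan index drops the most.

    hartigan_by_k : dict mapping k -> Hartigan(k)
    """
    ks = sorted(hartigan_by_k)
    # Drop in the index when moving from k to k+1 clusters.
    drops = {k: hartigan_by_k[k] - hartigan_by_k[k + 1]
             for k in ks if k + 1 in hartigan_by_k}
    return max(drops, key=drops.get)
```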

Decision rule

A common rule of thumb is that if Hartigan(k) > 10, then k+1 clusters provide a significantly better solution than k clusters. Under this rule, clusters are added as long as Hartigan(k) > 10, and the optimal number of clusters is the smallest k for which Hartigan(k) ≤ 10.
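
The following sketch applies this rule of thumb, assuming the within-cluster sum of squares has already been computed for consecutive values of k (the helper name and its fallback behavior are illustrative choices, not Prism's implementation):

```python
def optimal_k_by_hartigan(wcss_by_k, n, threshold=10.0):
    """Smallest k whose Hartigan index is at or below the threshold.

    wcss_by_k : dict mapping k -> within-cluster sum of squares
    n         : total number of data points
    """
    for k in sorted(wcss_by_k):
        if k + 1 not in wcss_by_k:
            break
        h = (wcss_by_k[k] / wcss_by_k[k + 1] - 1.0) * (n - k - 1)
        if h <= threshold:
            return k  # adding another cluster no longer helps significantly
    return max(wcss_by_k)  # fall back to the largest k that was evaluated
```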

Advantages and considerations

The Hartigan index offers several advantages:

It provides a direct measure of clustering improvement

It's computationally efficient to calculate

It includes a penalty for the number of clusters through the (n-k-1) term

The interpretation is straightforward in terms of variance reduction

However, there are some limitations:

It's primarily designed for methods that minimize within-cluster sum of squares (like k-means)

It may not work well with non-spherical clusters

The index can be sensitive to outliers

The threshold value (10) is somewhat arbitrary and may not be appropriate for all datasets

The Hartigan index is particularly effective when used with k-means clustering or other methods that optimize within-cluster sum of squares, as it directly measures the improvement in the objective function being optimized.
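
As a rough end-to-end sketch (using scikit-learn's KMeans purely for illustration; Prism does not rely on this library), note that the inertia_ attribute of a fitted KMeans model is exactly the within-cluster sum of squares on which the Hartigan index is built:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three well-separated Gaussian blobs in two dimensions.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
n = X.shape[0]

# inertia_ is the within-cluster sum of squares for the fitted model.
wcss = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in range(1, 8)}

for k in range(1, 7):
    h = (wcss[k] / wcss[k + 1] - 1.0) * (n - k - 1)
    print(f"Hartigan({k}) = {h:.1f}")
```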

The Hartigan index is one of 17 methods used in Prism's consensus approach for determining optimal cluster numbers, as described on the cluster metrics page.
