
Features and functionality described on this page are available with Prism Enterprise.

The C Index is a clustering validation metric that was introduced by Hubert and Levin in 1976. It evaluates clustering quality by comparing the sum of within-cluster distances to the smallest and largest sums that the same number of pairwise distances could produce, giving a normalized measure of how well the clustering minimizes internal distances.

The C Index is based on the principle that good clustering should result in small within-cluster distances relative to the full set of pairwise distances in the data. It normalizes the within-cluster distance sum against the minimum and maximum sums achievable with the same number of pairs.

Mathematical calculation

The C Index is calculated as follows:

$$C = \frac{S_w - S_{min}}{S_{max} - S_{min}}$$

where:

Sw is the sum of within-cluster distances

Smin is the sum of the Nw smallest distances in the entire dataset

Smax is the sum of the Nw largest distances in the entire dataset

Nw is the number of within-cluster pairs

The index is only defined when Smin ≠ Smax, and C ∈ [0, 1].
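The definition above can be sketched in a few lines of Python. This is a minimal illustration and not Prism's implementation: the function name `c_index`, the use of Euclidean distance, and the sample points are all assumptions made for the example.

```python
from itertools import combinations
from math import dist


def c_index(points, labels):
    """Return (S_w - S_min) / (S_max - S_min) for one clustering.

    points: list of coordinate tuples; labels: one cluster label per point.
    Euclidean distance is an assumption of this sketch.
    """
    pairs = list(combinations(range(len(points)), 2))
    distances = sorted(dist(points[i], points[j]) for i, j in pairs)

    # S_w and N_w: total distance and count over pairs that share a label.
    within = [dist(points[i], points[j]) for i, j in pairs if labels[i] == labels[j]]
    s_w, n_w = sum(within), len(within)

    # S_min / S_max: sums of the N_w smallest / largest pairwise distances.
    s_min = sum(distances[:n_w])
    s_max = sum(distances[len(distances) - n_w:])

    if s_max == s_min:
        raise ValueError("C Index is undefined when S_min equals S_max")
    return (s_w - s_min) / (s_max - s_min)


# Hypothetical 1-D data: two tight pairs that lie far apart.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(c_index(points, [0, 0, 1, 1]))  # close points grouped together
print(c_index(points, [0, 1, 1, 0]))  # distant points grouped together
```

The explicit check for Smin = Smax mirrors the condition stated above. It triggers, for example, when every point is placed in a single cluster, because the within-cluster pairs are then all pairs in the dataset.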

Within-cluster distances

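The sum of within-cluster distances is:

$$S_w = \sum_{i=1}^{k} \; \sum_{\substack{p < q \\ x_p, x_q \in \text{cluster } i}} d(x_p, x_q)$$

The total number of within-cluster pairs is:

$$N_w = \sum_{i=1}^{k} \frac{n_i (n_i - 1)}{2}$$

where k is the number of clusters, ni is the number of points in cluster i, and d(x_p, x_q) is the distance between points x_p and x_q.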
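As a small worked case with hypothetical cluster sizes, consider five points split into clusters of 3 and 2 points. A cluster of ni points contains ni(ni - 1)/2 distinct pairs, so:

$$N_w = \frac{3 \cdot 2}{2} + \frac{2 \cdot 1}{2} = 3 + 1 = 4$$

The dataset as a whole has 5 · 4 / 2 = 10 pairs, so in this case Sw, Smin, and Smax would each be a sum of four of those ten distances.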

Reference bounds

Smin: The sum of the Nw smallest distances among all possible pairwise distances in the dataset

Smax: The sum of the Nw largest distances among all possible pairwise distances in the dataset

These bounds represent the best and worst possible cases for the sum of Nw distances.

Interpretation

The C Index provides a normalized measure of clustering quality:

Lower C values (closer to 0): Indicate better clustering, where within-cluster distances are close to the minimum possible sum

Higher C values (closer to 1): Suggest poor clustering, where within-cluster distances approach the maximum possible sum

The optimal number of clusters corresponds to the minimum value of the C Index. This occurs when the within-cluster distances are as small as possible relative to the range of potential distance sums.
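The selection rule above might be sketched as follows. The data, the candidate labelings, and the helper name `c_index_1d` are hypothetical. In practice the labelings would come from a clustering algorithm run at each cluster count, and Prism combines this index with other metrics instead of using it alone.

```python
from itertools import combinations


def c_index_1d(values, labels):
    """C Index for 1-D data, using absolute difference as the distance.

    Compact sketch: assumes S_min != S_max for every labeling passed in.
    """
    pairs = list(combinations(range(len(values)), 2))
    all_d = sorted(abs(values[i] - values[j]) for i, j in pairs)
    within = [abs(values[i] - values[j]) for i, j in pairs if labels[i] == labels[j]]
    n_w = len(within)
    s_min = sum(all_d[:n_w])
    s_max = sum(all_d[len(all_d) - n_w:])
    return (sum(within) - s_min) / (s_max - s_min)


# Hypothetical data with three visible groups, plus hypothetical labelings
# such as a clustering algorithm might return for k = 2, 3 and 4.
values = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0]
candidates = {
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],
    4: [0, 0, 0, 1, 1, 1, 2, 2, 3],
}

# Score each candidate and keep the cluster count with the smallest C.
scores = {k: c_index_1d(values, labels) for k, labels in candidates.items()}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

With these made-up values the three-cluster labeling scores lowest, since its within-cluster pairs are exactly the nine smallest distances in the data.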

Conceptual foundation

The value Sw is the sum of distances between points that share a cluster. The value Smin is the smallest sum that any Nw pairwise distances in the dataset could produce, so the difference Sw - Smin measures how far the applied clustering exceeds that minimum. Similarly, Smax is the largest sum that any Nw pairwise distances could produce, so the difference Smax - Smin is the full range of attainable sums. Dividing the first difference by the second yields a value between 0 and 1 that places the clustering between the best possible (C = 0) and the worst possible (C = 1) solution.

C = 0: the within-cluster distances are the smallest possible values, meaning the clustering managed to group the closest possible pairs of points. This is the best possible clustering.

C = 1: the within-cluster distances are the largest possible values, meaning the clustering managed to group the farthest possible pairs of points. This is the worst possible clustering (even worse than assigning points at random).
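As a small worked illustration (the four points are hypothetical, chosen for easy arithmetic), take points at 0, 1, 10, and 11 on a line, split into two clusters of two points each, so that Nw = 2. The six pairwise distances, sorted, are 1, 1, 9, 10, 10, 11, which gives:

$$S_{min} = 1 + 1 = 2, \qquad S_{max} = 10 + 11 = 21$$

$$\{0, 1\}, \{10, 11\}: \quad S_w = 1 + 1 = 2, \quad C = \frac{2 - 2}{21 - 2} = 0$$

$$\{0, 11\}, \{1, 10\}: \quad S_w = 11 + 9 = 20, \quad C = \frac{20 - 2}{21 - 2} = \frac{18}{19} \approx 0.95$$

The second split comes close to C = 1 but cannot reach it. The largest distance (11) is between 0 and 11, and each distance of 10 shares a point with that pair, so the two largest distances cannot both fall inside disjoint two-point clusters.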

Good clustering methods should result in grouping close points together, while placing points farther away into separate clusters. This minimizes the within-cluster distances. By normalizing our result to the bounds of what's actually possible with the data, we obtain a scale-independent measure of clustering quality.

Advantages and considerations

The C Index offers several advantages:

It provides a normalized measure that is easily interpretable (0 = best, 1 = worst)

It's scale-independent, making it suitable for comparing clustering results across different datasets

It doesn't make assumptions about cluster shape, size, or distribution

The bounds are mathematically guaranteed based on the actual data, not theoretical assumptions

However, there are some limitations:

It requires calculation of all pairwise distances, which can be computationally expensive for large datasets (O(n²) complexity)

It may be sensitive to outliers, as extreme distances can disproportionately affect Smin and Smax

The index doesn't directly account for cluster separation - it focuses primarily on within-cluster compactness

Performance can vary depending on the choice of distance metric used
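The pairwise-distance cost noted above can be partly eased on the memory side. The sketch below uses Python's `heapq.nsmallest` and `heapq.nlargest` to obtain Smin and Smax without storing or fully sorting every distance. The function name `reference_bounds`, the Euclidean distance, and the sample points are assumptions for illustration. Every pair is still visited, so the running time remains quadratic in the number of points.

```python
import heapq
from itertools import combinations
from math import dist


def reference_bounds(points, n_w):
    """Return (S_min, S_max) for n_w pairs without sorting every distance."""
    def pairwise():
        # Generator: yields each of the n(n-1)/2 distances once, on demand.
        return (dist(p, q) for p, q in combinations(points, 2))

    # Each call keeps only n_w values in memory while scanning all pairs.
    s_min = sum(heapq.nsmallest(n_w, pairwise()))
    s_max = sum(heapq.nlargest(n_w, pairwise()))
    return s_min, s_max


# Hypothetical 2-D points; n_w = 2 is an arbitrary choice for the demo.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (30.0, 40.0)]
n_pairs = len(points) * (len(points) - 1) // 2
print(n_pairs, reference_bounds(points, n_w=2))
```

This version trades a second pass over the pairs for lower memory use. Whether that trade is worthwhile depends on the dataset size and on how large Nw is relative to the total number of pairs.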

The C Index is particularly useful when:

You want a bounded, interpretable measure of clustering quality

Computational resources allow for extensive distance calculations

You need a method that doesn't assume specific cluster characteristics

You're comparing clustering solutions across different algorithms or parameters

The C Index is one of 17 methods used in Prism's consensus approach for determining optimal cluster numbers, as described on the cluster metrics page.

© 1995-2019 GraphPad Software, LLC. All rights reserved.