Features and functionality described on this page are available with Prism Enterprise.
The C Index is a clustering validation metric that was reviewed by Hubert and Levin in 1976. This index evaluates clustering quality by comparing the sum of within-cluster distances to the range of possible distance sums, providing a normalized measure of how well the clustering minimizes internal distances.
The C Index is based on the principle that good clustering should result in small within-cluster distances relative to what would be expected by chance. It normalizes the within-cluster distance sum against the minimum and maximum possible sums.
The C Index is calculated as follows:
$$C = \frac{S_w - S_{min}}{S_{max} - S_{min}}$$
where:
•Sw is the sum of within-cluster distances
•Smin is the sum of the Nw smallest distances in the entire dataset
•Smax is the sum of the Nw largest distances in the entire dataset
•Nw is the number of within-cluster pairs
The index is only defined when Smin ≠ Smax, in which case C ∈ [0, 1].
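The sum of within-cluster distances is:
$$S_w = \sum_{i=1}^{k} \sum_{\substack{p, q \in C_i \\ p < q}} d(p, q)$$
The total number of within-cluster pairs is:
$$N_w = \sum_{i=1}^{k} \frac{n_i (n_i - 1)}{2}$$
where k is the number of clusters, Ci is the set of points in cluster i, d(p, q) is the distance between points p and q (each pair counted once), and ni is the number of points in cluster i.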
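The two quantities above can be sketched in a few lines of Python. This is only an illustration, not Prism's implementation: the function name `within_cluster_sum_and_pairs`, the Euclidean distance, and the four sample points are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def within_cluster_sum_and_pairs(points, labels):
    """Return (S_w, N_w) for coordinate tuples and their cluster labels."""
    s_w = 0.0
    n_w = 0
    # Visit every unordered pair once; keep only pairs sharing a cluster label.
    for (p, label_p), (q, label_q) in combinations(zip(points, labels), 2):
        if label_p == label_q:
            s_w += dist(p, q)  # assumption: Euclidean distance
            n_w += 1
    return s_w, n_w


# Hypothetical data: four points on a line, two clusters of two points each.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
s_w, n_w = within_cluster_sum_and_pairs(points, labels)

# Cross-check N_w against the closed form: sum of n_i * (n_i - 1) / 2.
n_w_formula = sum(n * (n - 1) // 2 for n in Counter(labels).values())
```

Counting pairs directly and using the closed-form sum give the same Nw; the closed form is cheaper when only the count is needed.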
The bounds are taken over all pairwise distances in the dataset, regardless of cluster membership:
•Smin: The sum of the Nw smallest distances among all possible pairwise distances in the dataset
•Smax: The sum of the Nw largest distances among all possible pairwise distances in the dataset
These bounds represent the best and worst possible cases for the sum of Nw distances.
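A minimal sketch of the bounds calculation, assuming Euclidean distances and a hypothetical helper named `distance_sum_bounds`: sort every pairwise distance once, then sum the Nw smallest and the Nw largest values.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def distance_sum_bounds(points, n_within):
    """Return (S_min, S_max): sums of the n_within smallest and largest distances."""
    # All pairwise distances in the dataset, ignoring cluster membership.
    all_distances = sorted(dist(p, q) for p, q in combinations(points, 2))
    if not 0 < n_within <= len(all_distances):
        raise ValueError("n_within must be between 1 and the number of pairs")
    s_min = sum(all_distances[:n_within])
    s_max = sum(all_distances[-n_within:])
    return s_min, s_max


# Hypothetical data: the sorted distances are 1, 1, 9, 10, 10, 11.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
s_min, s_max = distance_sum_bounds(points, n_within=2)
```

Here Smin is 1 + 1 = 2 and Smax is 10 + 11 = 21. Neither bound depends on the cluster labels, only on how many within-cluster pairs the clustering produces.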
The C Index provides a normalized measure of clustering quality:
•Lower C values (closer to 0): Indicate better clustering, where within-cluster distances are close to the minimum possible sum
•Higher C values (closer to 1): Suggest poor clustering, where within-cluster distances approach the maximum possible sum
The optimal number of clusters corresponds to the minimum value of the C Index. This occurs when the within-cluster distances are as small as possible relative to the range of potential distance sums.
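As a hedged illustration of choosing the number of clusters this way, the sketch below scores a few candidate labelings and keeps the one with the smallest C Index. The `c_index` helper, the nine sample points, and the candidate labelings are all made up for the example; in practice the labelings would come from whatever clustering method is being evaluated.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def c_index(points, labels):
    """C = (S_w - S_min) / (S_max - S_min), assuming Euclidean distances."""
    pairs = list(combinations(range(len(points)), 2))
    distances = [dist(points[i], points[j]) for i, j in pairs]
    # Distances between points that share a cluster label.
    within = [d for d, (i, j) in zip(distances, pairs) if labels[i] == labels[j]]
    n_w = len(within)
    ordered = sorted(distances)
    s_min = sum(ordered[:n_w])
    s_max = sum(ordered[len(ordered) - n_w:])
    if s_min == s_max:
        raise ValueError("C Index is undefined when S_min equals S_max")
    return (sum(within) - s_min) / (s_max - s_min)


# Hypothetical data: three groups of three points on a line.
points = [(x,) for x in (0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0)]

# Hypothetical candidate labelings, keyed by the number of clusters k.
candidates = {
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],
    4: [0, 0, 1, 2, 2, 2, 3, 3, 3],
}

scores = {k: c_index(points, labels) for k, labels in candidates.items()}
best_k = min(scores, key=scores.get)  # k with the smallest C Index
```

For this toy data the three-cluster labeling reaches C = 0, while merging two groups (k = 2) or splitting one (k = 4) gives a slightly larger value. The guard for Smin = Smax also covers degenerate labelings such as a single cluster or all singletons.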
Sw is the sum of the distances between all pairs of points that share a cluster. Smin is the smallest sum that any Nw pairwise distances in the dataset could produce, so the difference Sw - Smin measures how far the clustering's within-cluster sum exceeds that minimum. Similarly, Smax is the largest sum that any Nw pairwise distances could produce, so Smax - Smin is the full range of possible sums. Dividing the first difference by the second gives a value between 0 and 1 that locates the clustering between the best possible (C = 0) and the worst possible (C = 1) outcome.
•C = 0: the within-cluster distances are the Nw smallest distances in the dataset, meaning the clustering grouped the closest possible pairs of points. This is the best possible clustering
•C = 1: the within-cluster distances are the Nw largest distances in the dataset, meaning the clustering grouped the farthest possible pairs of points. This is the worst possible clustering (even worse than assigning at random!)
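As a small worked illustration (the four points and both groupings are made up for this example), take points on a line at 0, 1, 10, and 11, split into two clusters of two points each, so Nw = 2. The six pairwise distances, sorted, are 1, 1, 9, 10, 10, 11:

$$
\begin{aligned}
S_{min} &= 1 + 1 = 2, \qquad S_{max} = 10 + 11 = 21 \\
\{0, 1\}, \{10, 11\}: \quad S_w &= 1 + 1 = 2, \quad C = \frac{2 - 2}{21 - 2} = 0 \\
\{0, 11\}, \{1, 10\}: \quad S_w &= 11 + 9 = 20, \quad C = \frac{20 - 2}{21 - 2} = \frac{18}{19} \approx 0.95
\end{aligned}
$$

The second grouping does not reach exactly C = 1 because the largest distance (11) shares a point with each pair at distance 10, and two clusters of two points cannot contain overlapping pairs.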
Good clustering methods should result in grouping close points together, while placing points farther away into separate clusters. This minimizes the within-cluster distances. By normalizing our result to the bounds of what's actually possible with the data, we obtain a scale-independent measure of clustering quality.
The C Index offers several advantages:
•It provides a normalized measure that is easily interpretable (0 = best, 1 = worst)
•It's scale-independent, making it suitable for comparing clustering results across different datasets
•It doesn't make assumptions about cluster shape, size, or distribution
•The bounds are mathematically guaranteed based on the actual data, not theoretical assumptions
However, there are some limitations:
•It requires calculation of all pairwise distances, which can be computationally expensive for large datasets (O(n²) complexity)
•It may be sensitive to outliers, as extreme distances can disproportionately affect Smin and Smax
•The index doesn't directly account for cluster separation - it focuses primarily on within-cluster compactness
•Performance can vary depending on the choice of distance metric used
The C Index is particularly useful when:
•You want a bounded, interpretable measure of clustering quality
•Computational resources allow for extensive distance calculations
•You need a method that doesn't assume specific cluster characteristics
•You're comparing clustering solutions across different algorithms or parameters
The C Index is one of 17 methods used in Prism's consensus approach for determining optimal cluster numbers, as described on the cluster metrics page.