GraphPad Prism 10 Statistics Guide

C Index

Features and functionality described on this page are available with Prism Enterprise.

The C Index is a clustering validation metric that was proposed by Hubert and Levin in 1976. This index evaluates clustering quality by comparing the sum of within-cluster distances to the smallest and largest sums that the same number of pairwise distances could produce, providing a normalized measure of how well the clustering minimizes internal distances.

The C Index is based on the principle that good clustering should result in small within-cluster distances relative to what the data allows. It normalizes the within-cluster distance sum against the minimum and maximum possible sums of the same number of distances.

Mathematical calculation

The C Index is calculated as follows:

$$C = \frac{S_w - S_{min}}{S_{max} - S_{min}}$$

where:

•Sw is the sum of within-cluster distances

•Smin is the sum of the Nw smallest distances in the entire dataset

•Smax is the sum of the Nw largest distances in the entire dataset

•Nw is the number of within-cluster pairs

The index is only defined when Smin ≠ Smax, and C lies in the closed interval [0, 1].
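The definitions above can be sketched in a few lines of Python. This is a minimal illustration and not Prism's implementation: the function name `c_index`, the use of Euclidean distance, and the four-point dataset are all assumptions made for the example.

```python
from itertools import combinations
from math import dist


def c_index(points, labels):
    """Return C = (Sw - Smin) / (Smax - Smin) for one clustering (lower is better)."""
    pairs = list(combinations(range(len(points)), 2))
    # All pairwise distances, one per unordered pair (Euclidean is an assumption).
    distances = [dist(points[i], points[j]) for i, j in pairs]
    # Sw and Nw: sum and count of the distances whose endpoints share a cluster.
    within = [d for d, (i, j) in zip(distances, pairs) if labels[i] == labels[j]]
    s_w, n_w = sum(within), len(within)
    if n_w == 0:
        raise ValueError("no within-cluster pairs (every cluster is a singleton)")
    ordered = sorted(distances)
    s_min = sum(ordered[:n_w])   # sum of the Nw smallest distances
    s_max = sum(ordered[-n_w:])  # sum of the Nw largest distances
    if s_min == s_max:
        raise ValueError("C Index is undefined when Smin equals Smax")
    return (s_w - s_min) / (s_max - s_min)


# Hypothetical data: four points on a line, forming two obvious pairs.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = c_index(points, [0, 0, 1, 1])  # neighbours grouped together
bad = c_index(points, [0, 1, 0, 1])   # distant points grouped together
print(good, bad)
```

Here the sorted distances are 1, 1, 9, 10, 10, 11 and Nw = 2, so Smin = 2 and Smax = 21. The first labeling has Sw = 2 and therefore C = 0; the second has Sw = 20 and C = 18/19, roughly 0.947.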

Within-cluster distances

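The sum of within-cluster distances is:

$$S_w = \sum_{i=1}^{k} \sum_{\substack{p, q \in C_i \\ p < q}} d(p, q)$$

where the outer sum runs over the k clusters, Ci is the set of points in cluster i, and d(p, q) is the distance between points p and q, with each pair counted once.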

The total number of within-cluster pairs is:

$$N_w = \sum_{i=1}^{k} \frac{n_i (n_i - 1)}{2}$$

where ni is the number of points in cluster i and the sum runs over all k clusters.
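As a rough sketch of these two quantities, the following Python groups points by cluster label and then sums the distances and counts the pairs inside each group. The function name `within_cluster_stats`, the Euclidean distance, and the example data are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import comb, dist


def within_cluster_stats(points, labels):
    """Return (Sw, Nw): the sum and the count of within-cluster pairwise distances."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    # Sw: every unordered pair inside a cluster contributes its distance once.
    s_w = sum(
        dist(p, q)
        for members in clusters.values()
        for p, q in combinations(members, 2)
    )
    # Nw: a cluster of size n contributes n * (n - 1) / 2 pairs.
    n_w = sum(comb(len(members), 2) for members in clusters.values())
    return s_w, n_w


# Hypothetical data: clusters of size 3, 2, and 1.
points = [(0.0,), (1.0,), (3.0,), (10.0,), (12.0,), (20.0,)]
labels = [0, 0, 0, 1, 1, 2]
s_w, n_w = within_cluster_stats(points, labels)
print(s_w, n_w)
```

With cluster sizes 3, 2, and 1, the pair counts are 3, 1, and 0, so Nw = 4. The within-cluster distances are 1, 3, and 2 in the first cluster and 2 in the second, so Sw = 8. A singleton cluster contributes nothing to either quantity.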

Reference bounds

•Smin: The sum of the Nw smallest distances among all possible pairwise distances in the dataset

•Smax: The sum of the Nw largest distances among all possible pairwise distances in the dataset

These bounds represent the best and worst possible cases for the sum of Nw distances.
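The bounds depend only on the data and on Nw, not on which points were assigned to which cluster. A minimal sketch, assuming Euclidean distance and a hypothetical helper named `reference_bounds`:

```python
from itertools import combinations
from math import dist


def reference_bounds(points, n_w):
    """Return (Smin, Smax): sums of the n_w smallest and n_w largest pairwise distances."""
    ordered = sorted(dist(p, q) for p, q in combinations(points, 2))
    if not 0 < n_w <= len(ordered):
        raise ValueError("n_w must be between 1 and the total number of pairs")
    return sum(ordered[:n_w]), sum(ordered[-n_w:])


# Hypothetical data: four points on a line, giving six pairwise distances.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
s_min, s_max = reference_bounds(points, n_w=2)
print(s_min, s_max)
```

The six distances here are 1, 1, 9, 10, 10, 11, so with Nw = 2 the bounds are Smin = 1 + 1 = 2 and Smax = 10 + 11 = 21. When Nw equals the total number of pairs (all points in one cluster), both bounds equal the sum of all distances and the index is undefined.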

Interpretation

The C Index provides a normalized measure of clustering quality:

•Lower C values (closer to 0): Indicate better clustering, where within-cluster distances are close to the minimum possible sum

•Higher C values (closer to 1): Suggest poor clustering, where within-cluster distances approach the maximum possible sum

The optimal number of clusters corresponds to the minimum value of the C Index. This occurs when the within-cluster distances are as small as possible relative to the range of potential distance sums.
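The selection rule can be illustrated by scoring several candidate partitions and keeping the one with the smallest C. This is a sketch under stated assumptions: the data and the candidate labelings are hypothetical and hand-written, whereas in practice each labeling would come from running a clustering algorithm for that number of clusters. The compact `c_index` helper is an illustrative implementation, not Prism's code.

```python
from itertools import combinations
from math import dist


def c_index(points, labels):
    """C = (Sw - Smin) / (Smax - Smin), using Euclidean distance (an assumption).

    Assumes at least one within-cluster pair and Smin != Smax.
    """
    pairs = list(combinations(range(len(points)), 2))
    distances = [dist(points[i], points[j]) for i, j in pairs]
    within = [d for d, (i, j) in zip(distances, pairs) if labels[i] == labels[j]]
    ordered = sorted(distances)
    s_min = sum(ordered[:len(within)])
    s_max = sum(ordered[-len(within):])
    return (sum(within) - s_min) / (s_max - s_min)


# Hypothetical data: nine points on a line forming three tight groups.
points = [(x,) for x in (0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0)]

# Hypothetical candidate partitions, keyed by the number of clusters.
candidates = {
    2: [0, 0, 0, 1, 1, 1, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],
    4: [0, 0, 0, 1, 1, 1, 2, 2, 3],
}

scores = {k: c_index(points, labels) for k, labels in candidates.items()}
best_k = min(scores, key=scores.get)  # number of clusters with the smallest C
print(scores, best_k)
```

For this data the three-cluster partition uses exactly the nine smallest distances, so its C is 0 and it is selected. Merging two groups or splitting one off both raise Sw above Smin and give a positive C.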

Conceptual foundation

The value Sw is the sum of the Nw within-cluster distances produced by the clustering. The value Smin is the smallest sum that any Nw pairwise distances in the dataset can produce, so the difference Sw - Smin measures how far the clustering's sum exceeds that minimum. Similarly, Smax is the largest sum that any Nw pairwise distances can produce, so the difference Smax - Smin is the full range of possible sums. Dividing the first difference by the second gives a value between 0 and 1 that places the clustering between the best possible (C = 0) and the worst possible (C = 1) outcome.

•C = 0: The within-cluster distances are the smallest distances in the dataset, meaning that the clustering grouped the closest possible pairs of points. This is the best possible clustering.

•C = 1: The within-cluster distances are the largest distances in the dataset, meaning that the clustering grouped the farthest possible pairs of points. This is the worst possible clustering (even worse than assigning points at random).

Good clustering methods should result in grouping close points together, while placing points farther away into separate clusters. This minimizes the within-cluster distances. By normalizing our result to the bounds of what's actually possible with the data, we obtain a scale-independent measure of clustering quality.
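The scale-independence claim can be checked directly from the definition of C as the ratio (Sw - Smin) / (Smax - Smin). As a sketch, assume every pairwise distance is multiplied by the same constant a > 0, for example because all variables are re-expressed in a different unit (the symbol a is introduced here only for illustration). The ordering of the distances is unchanged, so Sw, Smin, and Smax are each multiplied by a, and the factor cancels:

$$C' = \frac{a S_w - a S_{min}}{a S_{max} - a S_{min}} = \frac{a (S_w - S_{min})}{a (S_{max} - S_{min})} = C$$

The cancellation relies on all distances changing by the same factor. Rescaling only some of the variables can change the ordering of the distances and therefore the value of C.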

Advantages and considerations

The C Index offers several advantages:

•It provides a normalized measure that is easily interpretable (0 = best, 1 = worst)

•It's scale-independent, making it suitable for comparing clustering results across different datasets

•It doesn't make assumptions about cluster shape, size, or distribution

•The bounds are mathematically guaranteed based on the actual data, not theoretical assumptions

However, there are some limitations:

•It requires calculation of all pairwise distances, which can be computationally expensive for large datasets (O(n²) complexity)

•It may be sensitive to outliers, as extreme distances can disproportionately inflate Smax and, with it, the normalizing range Smax - Smin

•The index doesn't directly account for cluster separation; it focuses primarily on within-cluster compactness

•Performance can vary depending on the choice of distance metric used

The C Index is particularly useful when:

•You want a bounded, interpretable measure of clustering quality

•Computational resources allow for extensive distance calculations

•You need a method that doesn't assume specific cluster characteristics

•You're comparing clustering solutions across different algorithms or parameters

The C Index is one of 17 methods used in Prism's consensus approach for determining optimal cluster numbers, as described on the cluster metrics page.