Features and functionality described on this page are available with Prism Enterprise.
The Gamma index is a clustering validation metric adapted from Goodman and Kruskal's Gamma statistic. Developed by Baker and Hubert in 1975, it measures the agreement between within-cluster and between-cluster dissimilarities to assess clustering quality.
The Gamma index is based on the principle that in good clustering, distances within clusters should be consistently smaller than distances between clusters. It evaluates this by comparing all within-cluster dissimilarities against all between-cluster dissimilarities.
The Gamma index is calculated as follows:
$$\Gamma = \frac{s(+) - s(-)}{s(+) + s(-)}$$
where:
•s(+) is the number of concordant comparisons
•s(-) is the number of discordant comparisons
The comparisons are made between all within-cluster dissimilarities and all between-cluster dissimilarities:
•Concordant comparison s(+): A comparison where a within-cluster dissimilarity is strictly less than a between-cluster dissimilarity
•Discordant comparison s(-): A comparison where a within-cluster dissimilarity is strictly greater than a between-cluster dissimilarity
Note that equal dissimilarities between the two sets are disregarded in the calculation of the index.
1. Calculate all pairwise distances within each cluster (within-cluster dissimilarities)
2. Calculate all pairwise distances between points in different clusters (between-cluster dissimilarities)
3. For each within-cluster distance, compare it with each between-cluster distance:
•If within-cluster distance < between-cluster distance → increment s(+)
•If within-cluster distance > between-cluster distance → increment s(-)
•If distances are equal → ignore this comparison
4. Calculate the Gamma index using the formula above
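The four steps above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not Prism's or NbClust's implementation: the function name `gamma_index`, the use of Euclidean distance, and the four sample points are all assumptions made for the example. The index is computed as the difference between concordant and discordant counts divided by their sum, with ties skipped.

```python
from itertools import combinations
import math


def gamma_index(points, labels):
    """Brute-force Gamma index from points and cluster labels (illustrative)."""
    within, between = [], []
    # Steps 1-2: split every pairwise Euclidean distance into the two sets.
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        (within if labels[i] == labels[j] else between).append(d)
    # Step 3: compare each within-cluster distance with each between-cluster one.
    # Equal distances satisfy neither test, so ties are ignored.
    s_plus = sum(1 for w in within for b in between if w < b)
    s_minus = sum(1 for w in within for b in between if w > b)
    # Step 4: apply the formula, guarding against an empty denominator.
    total = s_plus + s_minus
    return (s_plus - s_minus) / total if total else float("nan")


# Two tight pairs of points that are far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = gamma_index(points, [0, 0, 1, 1])  # clusters match the two pairs
bad = gamma_index(points, [0, 1, 0, 1])   # clusters split each pair
print(good, bad)  # prints 1.0 -0.5
```

With the matching labels every within-cluster distance is smaller than every between-cluster distance, so all 8 comparisons are concordant. With the mismatched labels 2 comparisons are concordant and 6 are discordant, giving (2 - 6) / 8 = -0.5.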
The Gamma index ranges from -1 to +1 and measures the quality of clustering:
•Higher Gamma values (closer to +1): Indicate better clustering where within-cluster distances are consistently smaller than between-cluster distances
•Lower Gamma values (closer to -1): Suggest poor clustering where within-cluster distances are often larger than between-cluster distances
•Gamma values near 0: Indicate that concordant and discordant comparisons are about equally frequent, so within-cluster distances are not systematically smaller than between-cluster distances
The optimal number of clusters corresponds to the maximum value of the Gamma index.
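As a rough illustration of this selection rule, the sketch below scores a few candidate partitions of the same data and keeps the number of clusters with the largest Gamma. The one-dimensional data, the hand-written candidate labelings, and the helper name `gamma_index` are hypothetical; in practice the labelings would come from running a clustering algorithm once per candidate number of clusters.

```python
from itertools import combinations


def gamma_index(values, labels):
    """Gamma index for 1-D data using absolute differences as dissimilarities."""
    pairs = list(combinations(range(len(values)), 2))
    within = [abs(values[i] - values[j]) for i, j in pairs if labels[i] == labels[j]]
    between = [abs(values[i] - values[j]) for i, j in pairs if labels[i] != labels[j]]
    s_plus = sum(w < b for w in within for b in between)   # concordant
    s_minus = sum(w > b for w in within for b in between)  # discordant
    return (s_plus - s_minus) / (s_plus + s_minus)


# Hypothetical data with three obvious groups.
values = [10, 12, 14, 50, 52, 54, 90, 92, 94]

# Hypothetical labelings for k = 2, 3, 4 (normally produced by a clustering step).
candidates = {
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],
    4: [0, 0, 0, 1, 1, 1, 2, 2, 3],
}

scores = {k: gamma_index(values, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)  # number of clusters with the largest Gamma
print(scores, best_k)
```

Here the three-cluster labeling has no discordant comparisons, so it reaches a Gamma of 1.0 and is selected. Merging two groups (k = 2) or splitting one (k = 4) introduces discordant comparisons and lowers the index.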
The Gamma index offers several advantages:
•It provides a comprehensive comparison of all distance relationships
•It doesn't make assumptions about cluster shape or distribution
•The interpretation is intuitive: values near +1 mean that almost every comparison is concordant
•It's based on established statistical theory (Goodman and Kruskal's Gamma)
However, there are significant limitations:
•Computationally very expensive: Every within-cluster distance is compared with every between-cluster distance
•Rapid growth with dataset size: With n observations there are n(n-1)/2 pairwise distances, so the number of comparisons can grow on the order of n^4
•Time complexity: Can become prohibitive for large datasets
•Memory requirements: Holding all pairwise distances in memory may require substantial resources
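To give a sense of scale, the following sketch counts how many within-versus-between comparisons a brute-force calculation would perform for a given labeling. The function name `comparison_count` and the choice of four equally sized clusters are assumptions for illustration only.

```python
from collections import Counter


def comparison_count(labels):
    """Number of (within-cluster, between-cluster) distance pairs to compare."""
    n = len(labels)
    total_pairs = n * (n - 1) // 2
    # Pairs of points that share a cluster label.
    within_pairs = sum(size * (size - 1) // 2 for size in Counter(labels).values())
    between_pairs = total_pairs - within_pairs
    return within_pairs * between_pairs


for n in (100, 1_000, 10_000):
    labels = [i % 4 for i in range(n)]  # four equally sized clusters
    print(n, comparison_count(labels))
```

For 100 observations in four equal clusters there are 1,200 within-cluster and 3,750 between-cluster distances, so 4,500,000 comparisons are needed. Each tenfold increase in the number of observations multiplies that count by roughly 10,000.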
Due to its high computational cost, the Gamma index is typically calculated only when specifically requested (e.g., using index = "gamma" or index = "alllong" in the NbClust package) and may not be suitable for large datasets without sufficient computational resources.
The Gamma index is one of 17 methods used in Prism's consensus approach for determining optimal cluster numbers, as described on the cluster metrics page.