Features and functionality described on this page are available with Prism Enterprise.
The GPlus index is a clustering validation metric that was reviewed by Rohlf in 1974 and examined by Milligan in 1981. This index measures clustering quality by evaluating the proportion of discordant comparisons between within-cluster and between-cluster distances relative to the total number of possible comparisons.
The GPlus index is based on the principle that in good clustering, distances within clusters should be smaller than distances between clusters. It specifically focuses on cases where this expectation is violated (discordant comparisons).
The GPlus index is calculated as follows:
GPlus = 2 × s(-) / (Nt × (Nt - 1))
where:
•s(-) is the number of discordant comparisons
•Nt is the total number of pairwise distances in the dataset
s(-) is the number of times a pair of points in the same cluster is farther apart than a pair of points in different clusters. In other words, it counts violations of the expectation that within-cluster distances should be smaller than between-cluster distances.
The total number of pairwise distances in a dataset of n points is:
Nt = n × (n - 1) / 2
The index is computed in five steps:
1. Calculate all pairwise distances in the dataset
2. Identify which distances are within-cluster and which are between-cluster
3. For each within-cluster distance, count how many between-cluster distances are smaller
4. Sum these counts to get s(-) (discordant comparisons)
5. Calculate the GPlus index using the formula above
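The steps above can be sketched in Python as follows. This is a minimal illustration and not Prism's or NbClust's implementation: the function name `gplus_index`, the use of Euclidean distance, and the four example points are all assumptions made for the sketch. The brute-force counting loop is only practical for small datasets.

```python
from itertools import combinations
from math import dist


def gplus_index(points, labels):
    """Return the GPlus index for `points` partitioned by `labels`.

    Hypothetical helper for illustration; Euclidean distance is assumed.
    """
    within, between = [], []
    # Steps 1-2: compute every pairwise distance and classify it.
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        (within if labels[i] == labels[j] else between).append(d)

    n_t = len(within) + len(between)
    if n_t < 2:
        raise ValueError("at least three points are needed")

    # Steps 3-4: a comparison is discordant when a between-cluster
    # distance is strictly smaller than a within-cluster distance.
    s_minus = sum(1 for w in within for b in between if b < w)

    # Step 5: normalise by the number of pairs of distances.
    return 2 * s_minus / (n_t * (n_t - 1))


# Two well-separated groups of two points each (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = gplus_index(points, [0, 0, 1, 1])  # labels follow the two groups
bad = gplus_index(points, [0, 1, 0, 1])   # labels cut across the groups
```

With these four points there are Nt = 6 distances. The first labelling has no discordant comparisons, so `good` is 0. The second has s(-) = 4, so `bad` is 2 × 4 / (6 × 5) ≈ 0.267.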
The GPlus index measures the proportion of discordant distance relationships:
•Lower GPlus values: Indicate better clustering with fewer violations of the within < between distance expectation
•Higher GPlus values: Suggest poor clustering with many cases where within-cluster distances exceed between-cluster distances
The optimal number of clusters corresponds to the minimum value of the GPlus index. This occurs when there are relatively few discordant comparisons, meaning most within-cluster distances are smaller than between-cluster distances.
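As a rough sketch of this selection rule, the example below scores three candidate partitions of a small one-dimensional dataset and keeps the one with the lowest GPlus value. The data, the hand-written label vectors, and the helper name `gplus` are assumptions for illustration. In practice the candidate partitions would come from a clustering algorithm run at each number of clusters.

```python
from itertools import combinations


def gplus(values, labels):
    """GPlus for 1-D `values`; distance is the absolute difference."""
    pairs = list(combinations(range(len(values)), 2))
    within = [abs(values[i] - values[j]) for i, j in pairs
              if labels[i] == labels[j]]
    between = [abs(values[i] - values[j]) for i, j in pairs
               if labels[i] != labels[j]]
    # Discordant: a between-cluster distance smaller than a within-cluster one.
    s_minus = sum(1 for w in within for b in between if b < w)
    n_t = len(pairs)
    return 2 * s_minus / (n_t * (n_t - 1))


# Made-up data with three obvious groups.
values = [10, 12, 14, 50, 52, 54, 90, 92, 94]

# Hypothetical candidate partitions, keyed by number of clusters.
candidates = {
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1],  # merges the first two groups
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],  # matches the three groups
    4: [0, 0, 0, 1, 1, 1, 2, 2, 3],  # splits one point off the last group
}

scores = {k: gplus(values, labels) for k, labels in candidates.items()}
best_k = min(scores, key=scores.get)  # number of clusters with the lowest GPlus
```

Here the three-cluster partition has no discordant comparisons and is selected. Merging two groups (k = 2) or splitting one (k = 4) both introduce discordant comparisons and raise the index.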
The GPlus index offers several advantages:
•It provides a direct measure of clustering violations
•The interpretation is intuitive (proportion of "bad" comparisons)
•It doesn't make assumptions about cluster shape or distribution
•It's normalized by the total number of possible comparisons
However, there are significant limitations:
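•Computationally very expensive: The number of pairwise distances grows quadratically with dataset size, and every within-cluster distance must be compared with every between-cluster distance, so a direct count of comparisons can grow with the fourth power of the number of points
•Memory requirements: Storing all pairwise distances can become prohibitive for large datasets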
•Sensitivity to outliers: A few extreme distances can disproportionately affect the index
Due to its high computational cost, the GPlus index is typically calculated only when specifically requested (e.g., using index = "gplus" or index = "alllong" in the NbClust package) and may not be practical for large datasets without substantial computational resources.
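To give a sense of scale, the following sketch counts how many distances and how many within-versus-between comparisons a direct, brute-force calculation would need for a given set of cluster sizes. The function name `comparison_counts` and the example cluster sizes are assumptions for illustration. The counts describe a naive implementation and say nothing about how Prism or NbClust organizes the computation internally.

```python
def comparison_counts(cluster_sizes):
    """Count distances and naive comparisons for the given cluster sizes."""
    n = sum(cluster_sizes)
    n_t = n * (n - 1) // 2                              # all pairwise distances
    n_w = sum(s * (s - 1) // 2 for s in cluster_sizes)  # within-cluster distances
    n_b = n_t - n_w                                     # between-cluster distances
    # A naive count compares every within distance with every between distance.
    return {"distances": n_t, "comparisons": n_w * n_b}


small = comparison_counts([50, 50])    # 100 points in two equal clusters
large = comparison_counts([500, 500])  # 1,000 points in two equal clusters
```

For 100 points this gives 4,950 distances and 6,125,000 comparisons. For 1,000 points it gives 499,500 distances and 62,375,000,000 comparisons, so a tenfold increase in points multiplies the comparison count by roughly ten thousand.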
The GPlus index is particularly useful when you want a comprehensive assessment of how well the clustering respects distance relationships, but it should be used judiciously due to its computational requirements.
The GPlus index is one of 17 methods used in Prism's consensus approach for determining optimal cluster numbers, as described on the cluster metrics page.