Features and functionality described on this page are available with Prism Enterprise.
The Ratkowsky index is a clustering validation metric proposed by Ratkowsky and Lance in 1978 that evaluates clustering quality based on the average proportion of variance explained by the clustering across all variables. This index measures how well the clustering captures the between-group variance relative to the total variance in the dataset.
The Ratkowsky index is based on the principle that good clustering should explain a large proportion of the total variance in the data through between-group differences, similar to the concept of R² in regression analysis.
The Ratkowsky index is calculated as follows:

Ratkowsky index = S̄ / √k

where:
•S̄ is the average proportion of variance explained across all variables
•k is the number of clusters
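As a minimal sketch (in Python, with illustrative names), the top-level calculation is a single division once S̄ is available; the subsections below show how S̄ and its components are obtained:

```python
import numpy as np

def ratkowsky_index(s_bar: float, k: int) -> float:
    """Ratkowsky index: average explained-variance proportion divided by sqrt(k)."""
    return s_bar / np.sqrt(k)
```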
The average variance proportion is calculated as:

S̄ = (1/p) Σj (BGSSj / TSSj)

where:
•p is the number of variables
•BGSSj is the between-group sum of squares for variable j
•TSSj is the total sum of squares for variable j
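A sketch of this step, assuming the data are held in an (n × p) NumPy array X and the cluster assignment of each point in an integer vector labels (both names are illustrative):

```python
import numpy as np

def average_variance_proportion(X: np.ndarray, labels: np.ndarray) -> float:
    """S-bar: mean over the p variables of BGSS_j / TSS_j."""
    overall_mean = X.mean(axis=0)                  # x-bar_j for every variable
    tss = ((X - overall_mean) ** 2).sum(axis=0)    # TSS_j, one value per variable
    bgss = np.zeros(X.shape[1])
    for c in np.unique(labels):
        members = X[labels == c]                   # points assigned to cluster c
        bgss += len(members) * (members.mean(axis=0) - overall_mean) ** 2
    return float((bgss / tss).mean())
```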
For each variable j, the between-group sum of squares is:

BGSSj = Σi ni (cij − x̄j)²

where:
•ni is the number of points in cluster i
•cij is the mean of variable j in cluster i
•x̄j is the overall mean of variable j
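For a single column x_j of the data (names again illustrative), the between-group sum of squares follows this definition directly; the vectorized sketch above computes the same quantity for all variables at once:

```python
import numpy as np

def between_group_ss(x_j: np.ndarray, labels: np.ndarray) -> float:
    """BGSS_j: sum over clusters of n_i * (c_ij - x-bar_j)^2 for one variable."""
    x_bar_j = x_j.mean()                           # overall mean of variable j
    return float(sum(
        (labels == c).sum() * (x_j[labels == c].mean() - x_bar_j) ** 2
        for c in np.unique(labels)
    ))
```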
For each variable j, the total sum of squares is:

TSSj = Σl (xlj − x̄j)²

where:
•n is the total number of data points
•xlj is the value of variable j for point l
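And the matching total sum of squares for the same column:

```python
import numpy as np

def total_ss(x_j: np.ndarray) -> float:
    """TSS_j: sum of squared deviations of variable j from its overall mean."""
    return float(((x_j - x_j.mean()) ** 2).sum())
```

The ratio between_group_ss(x_j, labels) / total_ss(x_j) is the per-variable proportion of variance explained, i.e. the quantity that S̄ averages over all p variables.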
The Ratkowsky index measures the quality of clustering through variance explanation:
•Higher Ratkowsky values: Indicate better clustering with larger proportions of variance explained by between-group differences
•Lower Ratkowsky values: Suggest that clustering explains less of the total variance in the data
The optimal number of clusters corresponds to the maximum value of the Ratkowsky index. This occurs when the clustering captures the maximum amount of between-group variance while accounting for the number of clusters through the √k normalization.
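Putting the pieces together, a common way to use the index is to cluster the data for a range of candidate k values and keep the k that maximizes it. The sketch below uses scikit-learn's KMeans on synthetic data and follows the definitions above; it is an illustration, not Prism's implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def ratkowsky(X: np.ndarray, labels: np.ndarray) -> float:
    """Ratkowsky index as defined above: S-bar / sqrt(k)."""
    overall_mean = X.mean(axis=0)
    tss = ((X - overall_mean) ** 2).sum(axis=0)
    bgss = np.zeros(X.shape[1])
    for c in np.unique(labels):
        members = X[labels == c]
        bgss += len(members) * (members.mean(axis=0) - overall_mean) ** 2
    s_bar = (bgss / tss).mean()
    return float(s_bar / np.sqrt(len(np.unique(labels))))

# Three well-separated synthetic clusters in two dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(center, 0.5, size=(50, 2))
               for center in ([0, 0], [4, 0], [0, 4])])

scores = {k: ratkowsky(X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
          for k in range(2, 8)}
best_k = max(scores, key=scores.get)   # k at which the Ratkowsky index peaks
print(best_k, scores)
```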
The division by √k serves as a penalty term (see the numeric illustration after this list) that:
•Prevents the index from always favoring more clusters
•Balances variance explanation against cluster complexity
•Provides a trade-off between fit and parsimony
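A small numeric illustration with hypothetical S̄ values shows the trade-off: if a fourth cluster raises the explained-variance proportion only slightly, the √k penalty can still favor three clusters:

```python
import numpy as np

s_bar = {3: 0.60, 4: 0.66}                          # hypothetical average variance proportions
index = {k: v / np.sqrt(k) for k, v in s_bar.items()}
print(index)  # k=3 -> ~0.346, k=4 -> 0.330, so k=3 wins despite its lower S-bar
```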
The Ratkowsky index offers several advantages:
•It provides an intuitive measure based on variance explanation
•It accounts for multiple variables simultaneously
•The √k penalty prevents overfitting to too many clusters
•It's analogous to familiar statistical concepts (like R²)
However, there are some limitations:
•It assumes that variance explanation is the primary criterion for good clustering
•It may favor spherical clusters due to its reliance on means and variances
•Performance can be affected by variables with very different scales
•It may not work well with non-linear cluster structures
The Ratkowsky index is particularly effective when:
•Variables are measured on similar scales or have been standardized (see the sketch after this list)
•Clusters are expected to be roughly spherical and similar in size
•You want to maximize variance explanation across multiple dimensions
•You are working with clustering methods that optimize variance-based criteria (like k-means)
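For example, when variables sit on very different scales, a typical workflow is to standardize them first, run a variance-based method such as k-means, and only then evaluate the index (the data and helper names below are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Two variables on very different scales (e.g., grams vs. kilometres)
rng = np.random.default_rng(1)
X_raw = np.column_stack([rng.normal(0, 1, 150), rng.normal(0, 1000, 150)])

X = StandardScaler().fit_transform(X_raw)      # give every variable unit variance
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# Evaluate the Ratkowsky index on the standardized data (e.g., with the ratkowsky()
# sketch shown earlier) so that no single variable dominates BGSS_j / TSS_j.
```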
The index provides a multivariate extension of the concept of variance explanation and is especially useful when the goal is to find clusters that represent distinct groups across multiple measured characteristics.
The Ratkowsky index is one of 17 methods used in Prism's consensus approach for determining optimal cluster numbers, as described on the cluster metrics page.