Features and functionality described on this page are available with Prism Enterprise.
The Ratkowsky index is a clustering validation metric proposed by Ratkowsky and Lance in 1978 that evaluates clustering quality based on the average proportion of variance explained by the clustering across all variables. This index measures how well the clustering captures the between-group variance relative to the total variance in the dataset.
The Ratkowsky index is based on the principle that good clustering should explain a large proportion of the total variance in the data through between-group differences, similar to the concept of R² in regression analysis.
The Ratkowsky index is calculated as follows:

Ratkowsky index = S̄ / √k

where:
•S̄ is the average proportion of variance explained across all variables
•k is the number of clusters
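As a minimal sketch (in Python, with illustrative names), the top-level calculation is a single division once S̄ is available; the subsections below show how S̄ and its components are obtained:

```python
import numpy as np

def ratkowsky_index(s_bar: float, k: int) -> float:
    """Ratkowsky index: average explained-variance proportion divided by sqrt(k)."""
    return s_bar / np.sqrt(k)
```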
The average variance proportion is calculated as:

S̄ = (1/p) Σj (BGSSj / TSSj)

where:
•p is the number of variables
•BGSSj is the between-group sum of squares for variable j
•TSSj is the total sum of squares for variable j
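A sketch of this step, assuming the data are held in an (n × p) NumPy array X and the cluster assignment of each point in an integer vector labels (both names are illustrative):

```python
import numpy as np

def average_variance_proportion(X: np.ndarray, labels: np.ndarray) -> float:
    """S-bar: mean over the p variables of BGSS_j / TSS_j."""
    overall_mean = X.mean(axis=0)                  # x-bar_j for every variable
    tss = ((X - overall_mean) ** 2).sum(axis=0)    # TSS_j, one value per variable
    bgss = np.zeros(X.shape[1])
    for c in np.unique(labels):
        members = X[labels == c]                   # points assigned to cluster c
        bgss += len(members) * (members.mean(axis=0) - overall_mean) ** 2
    return float((bgss / tss).mean())
```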
For each variable j, the between-group sum of squares is:

BGSSj = Σi ni (cij − x̄j)²

where:
•ni is the number of points in cluster i
•cij is the mean of variable j in cluster i
•x̄j is the overall mean of variable j
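For a single column x_j of the data (names again illustrative), the between-group sum of squares follows this definition directly; the vectorized sketch above computes the same quantity for all variables at once:

```python
import numpy as np

def between_group_ss(x_j: np.ndarray, labels: np.ndarray) -> float:
    """BGSS_j: sum over clusters of n_i * (c_ij - x-bar_j)^2 for one variable."""
    x_bar_j = x_j.mean()                           # overall mean of variable j
    return float(sum(
        (labels == c).sum() * (x_j[labels == c].mean() - x_bar_j) ** 2
        for c in np.unique(labels)
    ))
```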
For each variable j, the total sum of squares is:

TSSj = Σl (xlj − x̄j)²

where:
•n is the total number of data points
•xlj is the value of variable j for point l
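And the matching total sum of squares for the same column:

```python
import numpy as np

def total_ss(x_j: np.ndarray) -> float:
    """TSS_j: sum of squared deviations of variable j from its overall mean."""
    return float(((x_j - x_j.mean()) ** 2).sum())
```

The ratio between_group_ss(x_j, labels) / total_ss(x_j) is the per-variable proportion of variance explained, i.e. the quantity that S̄ averages over all p variables.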
The Ratkowsky index measures the quality of clustering through variance explanation:
•Higher Ratkowsky values: Indicate better clustering with larger proportions of variance explained by between-group differences
•Lower Ratkowsky values: Suggest that clustering explains less of the total variance in the data
The optimal number of clusters corresponds to the maximum value of the Ratkowsky index. This occurs when the clustering captures the maximum amount of between-group variance while accounting for the number of clusters through the √k normalization.
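Putting the pieces together, a common way to use the index is to cluster the data for a range of candidate k values and keep the k that maximizes it. The sketch below uses scikit-learn's KMeans on synthetic data and follows the definitions above; it is an illustration, not Prism's implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def ratkowsky(X: np.ndarray, labels: np.ndarray) -> float:
    """Ratkowsky index as defined above: S-bar / sqrt(k)."""
    overall_mean = X.mean(axis=0)
    tss = ((X - overall_mean) ** 2).sum(axis=0)
    bgss = np.zeros(X.shape[1])
    for c in np.unique(labels):
        members = X[labels == c]
        bgss += len(members) * (members.mean(axis=0) - overall_mean) ** 2
    s_bar = (bgss / tss).mean()
    return float(s_bar / np.sqrt(len(np.unique(labels))))

# Three well-separated synthetic clusters in two dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(center, 0.5, size=(50, 2))
               for center in ([0, 0], [4, 0], [0, 4])])

scores = {k: ratkowsky(X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
          for k in range(2, 8)}
best_k = max(scores, key=scores.get)   # k at which the Ratkowsky index peaks
print(best_k, scores)
```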
The division by √k serves as a penalty term (see the numeric illustration after this list) that:
•Prevents the index from always favoring more clusters
•Balances variance explanation against cluster complexity
•Provides a trade-off between fit and parsimony
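A small numeric illustration with hypothetical S̄ values shows the trade-off: if a fourth cluster raises the explained-variance proportion only slightly, the √k penalty can still favor three clusters:

```python
import numpy as np

s_bar = {3: 0.60, 4: 0.66}                          # hypothetical average variance proportions
index = {k: v / np.sqrt(k) for k, v in s_bar.items()}
print(index)  # k=3 -> ~0.346, k=4 -> 0.330, so k=3 wins despite its lower S-bar
```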
The Ratkowsky index offers several advantages:
•It provides an intuitive measure based on variance explanation
•It accounts for multiple variables simultaneously
•The √k penalty prevents overfitting to too many clusters
•It's analogous to familiar statistical concepts (like R²)
However, there are some limitations:
•It assumes that variance explanation is the primary criterion for good clustering
•It may favor spherical clusters due to its reliance on means and variances
•Performance can be affected by variables with very different scales
•It may not work well with non-linear cluster structures
The Ratkowsky index is particularly effective when:
•Variables are measured on similar scales or have been standardized (see the sketch after this list)
•Clusters are expected to be roughly spherical and similar in size
•You want to maximize variance explanation across multiple dimensions
•You are working with clustering methods that optimize variance-based criteria (like k-means)
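For example, when variables sit on very different scales, a typical workflow is to standardize them first, run a variance-based method such as k-means, and only then evaluate the index (the data and helper names below are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Two variables on very different scales (e.g., grams vs. kilometres)
rng = np.random.default_rng(1)
X_raw = np.column_stack([rng.normal(0, 1, 150), rng.normal(0, 1000, 150)])

X = StandardScaler().fit_transform(X_raw)      # give every variable unit variance
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# Evaluate the Ratkowsky index on the standardized data (e.g., with the ratkowsky()
# sketch shown earlier) so that no single variable dominates BGSS_j / TSS_j.
```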
The index provides a multivariate extension of the concept of variance explanation and is especially useful when the goal is to find clusters that represent distinct groups across multiple measured characteristics.
The Ratkowsky index is one of 17 methods used in Prism's consensus approach for determining optimal cluster numbers, as described on the cluster metrics page.