
Features and functionality described on this page are available with Prism Enterprise.

The Ratkowsky index is a clustering validation metric proposed by Ratkowsky and Lance in 1978 that evaluates clustering quality based on the average proportion of variance explained by the clustering across all variables. This index measures how well the clustering captures the between-group variance relative to the total variance in the dataset.

The Ratkowsky index is based on the principle that good clustering should explain a large proportion of the total variance in the data through between-group differences, similar to the concept of R² in regression analysis.

Mathematical calculation

The Ratkowsky index is calculated as follows:

C = \frac{\bar{S}}{\sqrt{k}}

where:

\bar{S} is the square root of the average proportion of variance explained across all variables (defined below)

k is the number of clusters
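
As a purely illustrative example with hypothetical numbers: if the clustering explains, on average, 60% of each variable's variance (\bar{S}^2 = 0.60, as defined in the next section) and there are k = 4 clusters, then

C = \frac{\sqrt{0.60}}{\sqrt{4}} \approx \frac{0.775}{2} \approx 0.39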

Average variance proportion

The average variance proportion is calculated as:

\bar{S}^2 = \frac{1}{p} \sum_{j=1}^{p} \frac{\mathrm{BGSS}_j}{\mathrm{TSS}_j}

where:

p is the number of variables

\mathrm{BGSS}_j is the between-group sum of squares for variable j

\mathrm{TSS}_j is the total sum of squares for variable j
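
For example, with p = 3 variables whose (hypothetical) explained proportions \mathrm{BGSS}_j/\mathrm{TSS}_j are 0.70, 0.50, and 0.60, the average is \bar{S}^2 = (0.70 + 0.50 + 0.60)/3 = 0.60, giving \bar{S} \approx 0.775.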

Between-group sum of squares

For each variable j, the between-group sum of squares is:

\mathrm{BGSS}_j = \sum_{i=1}^{k} n_i \left(c_{ij} - \bar{x}_j\right)^2

where:

n_i is the number of points in cluster i

c_{ij} is the mean of variable j in cluster i

\bar{x}_j is the overall mean of variable j
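
As a small hypothetical example for a single variable j: two clusters with n_1 = 3 points (cluster mean c_{1j} = 2) and n_2 = 2 points (cluster mean c_{2j} = 7) have an overall mean \bar{x}_j = (3 \cdot 2 + 2 \cdot 7)/5 = 4, so

\mathrm{BGSS}_j = 3(2 - 4)^2 + 2(7 - 4)^2 = 12 + 18 = 30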

Total sum of squares

For each variable j, the total sum of squares is:

\mathrm{TSS}_j = \sum_{l=1}^{n} \left(x_{lj} - \bar{x}_j\right)^2

where:

n is the total number of data points

x_{lj} is the value of variable j for point l
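
The following is a minimal NumPy sketch of these calculations, assuming the observations are stored in an n × p array X and the cluster assignments in a length-n array labels. The function name ratkowsky_index and all details of the code are illustrative; this is not Prism's implementation.

import numpy as np

def ratkowsky_index(X, labels):
    # Ratkowsky index: sqrt(mean over j of BGSS_j / TSS_j), divided by sqrt(k)
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)                 # overall mean of each variable j

    # Between-group sum of squares per variable: sum_i n_i * (c_ij - x_bar_j)^2
    bgss = np.zeros(X.shape[1])
    for c in clusters:
        members = X[labels == c]
        bgss += members.shape[0] * (members.mean(axis=0) - overall_mean) ** 2

    # Total sum of squares per variable: sum_l (x_lj - x_bar_j)^2
    tss = ((X - overall_mean) ** 2).sum(axis=0)

    s_bar_squared = np.mean(bgss / tss)           # average proportion of variance explained
    return np.sqrt(s_bar_squared) / np.sqrt(k)

Because each \mathrm{BGSS}_j/\mathrm{TSS}_j lies between 0 and 1, the returned value falls between 0 and 1/\sqrt{k}.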

Interpretation

The Ratkowsky index measures the quality of clustering through variance explanation:

Higher Ratkowsky values: Indicate better clustering with larger proportions of variance explained by between-group differences

Lower Ratkowsky values: Suggest that clustering explains less of the total variance in the data

The optimal number of clusters corresponds to the maximum value of the Ratkowsky index. This occurs when the clustering captures the maximum amount of between-group variance while accounting for the number of clusters through the √k normalization.
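
As a sketch of this selection rule, assuming the ratkowsky_index function above and scikit-learn's KMeans for the clustering step (any clustering method that produces labels could be substituted; Prism's own consensus procedure is described on the cluster metrics page):

from sklearn.cluster import KMeans

def best_k_by_ratkowsky(X, k_values=range(2, 11)):
    # Cluster at each candidate k and keep the k that maximizes the index
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = ratkowsky_index(X, labels)
    return max(scores, key=scores.get), scores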

Scaling factor

The division by √k serves as a penalty term that:

Prevents the index from always favoring more clusters

Balances variance explanation against cluster complexity

Provides a trade-off between fit and parsimony
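
For instance, suppose (with hypothetical numbers) that increasing from k = 3 to k = 4 raises the average explained proportion only slightly, from \bar{S}^2 = 0.60 to 0.65:

C(3) = \frac{\sqrt{0.60}}{\sqrt{3}} \approx 0.447 \qquad C(4) = \frac{\sqrt{0.65}}{\sqrt{4}} \approx 0.403

Because the modest gain in explained variance does not offset the larger \sqrt{k} denominator, the index favors k = 3.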

Advantages and considerations

The Ratkowsky index offers several advantages:

It provides an intuitive measure based on variance explanation

It accounts for multiple variables simultaneously

The √k penalty prevents overfitting to too many clusters

It's analogous to familiar statistical concepts (like R²)

However, there are some limitations:

It assumes that variance explanation is the primary criterion for good clustering

It may favor spherical clusters due to its reliance on means and variances

Performance can be affected by variables with very different scales

It may not work well with non-linear cluster structures

The Ratkowsky index is particularly effective when:

Variables are measured on similar scales or have been standardized (see the sketch after this list)

Clusters are expected to be roughly spherical and similar in size

You want to maximize variance explanation across multiple dimensions

Working with clustering methods that optimize variance-based criteria (like k-means)
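
As a minimal sketch of the standardization point above, assuming the data are in a NumPy array X with no constant variables, each variable can be z-scored before clustering and before computing the index:

import numpy as np

def standardize(X):
    X = np.asarray(X, dtype=float)
    # z-score each variable so that distance-based clustering (e.g., k-means)
    # is not dominated by variables measured on large scales
    return (X - X.mean(axis=0)) / X.std(axis=0)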

The index provides a multivariate extension of the concept of variance explanation and is especially useful when the goal is to find clusters that represent distinct groups across multiple measured characteristics.

The Ratkowsky index is one of 17 methods used in Prism's consensus approach for determining optimal cluster numbers, as described on the cluster metrics page.
