Features and functionality described on this page are available with Prism Enterprise.

The Tau index is a clustering validation metric reviewed by Rohlf in 1974 and tested by Milligan in 1981 that measures the correlation between distance matrices and cluster assignments. This index evaluates clustering quality by comparing the distance matrix with a binary matrix indicating whether pairs of points belong to the same cluster or different clusters.

The Tau index is based on the principle that good clustering should show strong agreement between distance relationships and cluster membership assignments, similar to a rank correlation coefficient.

Mathematical calculation

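The Tau index is calculated as follows:

$$\tau = \frac{s(+) - s(-)}{\sqrt{\left(\frac{N_t(N_t-1)}{2} - T\right)\cdot\frac{N_t(N_t-1)}{2}}}$$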

where:

s(+) is the number of concordant comparisons

s(-) is the number of discordant comparisons

Nt is the total number of distances

T is the number of tied comparisons
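As a minimal sketch, the normalization can be written as a small Python function that takes the four quantities defined above. It assumes the commonly cited form of the index, (s(+) - s(-)) / sqrt((Nt(Nt-1)/2 - T) * (Nt(Nt-1)/2)); the function and parameter names are illustrative and are not part of Prism.

```python
from math import sqrt


def tau_from_counts(s_plus, s_minus, n_t, tied):
    """Tau index from precomputed comparison counts.

    Assumed form: (s(+) - s(-)) / sqrt((Nt(Nt-1)/2 - T) * (Nt(Nt-1)/2)).
    """
    total_comparisons = n_t * (n_t - 1) // 2  # all pairs of distances
    denominator = sqrt((total_comparisons - tied) * total_comparisons)
    if denominator == 0:
        raise ValueError("Tau is undefined without within/between comparisons")
    return (s_plus - s_minus) / denominator


# Four points in two clusters of two: Nt = 6 distances and T = 7.
# A perfect split gives s(+) = 8 and s(-) = 0.
print(tau_from_counts(s_plus=8, s_minus=0, n_t=6, tied=7))  # about 0.730
```

In this small example even a perfect split gives a value below 1, because the denominator counts all comparisons between distances, not only the within/between ones.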

Concordant and discordant comparisons

The comparisons are made between distance relationships and cluster membership:

Concordant comparison s(+): Cases where two points not clustered together have a larger distance than two points within the same cluster

Discordant comparison s(-): Cases where two points within the same cluster have a larger distance than two points not clustered together
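The two definitions above can be illustrated with a direct count. This sketch assumes the within-cluster and between-cluster distances have already been collected into two lists; the function name is hypothetical. Equal distances are counted as neither concordant nor discordant.

```python
def count_comparisons(within, between):
    """Count concordant s(+) and discordant s(-) comparisons.

    `within` holds distances between points in the same cluster,
    `between` holds distances between points in different clusters.
    """
    s_plus = 0
    s_minus = 0
    for w in within:
        for b in between:
            if b > w:
                s_plus += 1   # between-cluster distance is larger: concordant
            elif b < w:
                s_minus += 1  # within-cluster distance is larger: discordant
    return s_plus, s_minus


print(count_comparisons(within=[1.0, 6.0], between=[5.0, 5.0, 7.0]))  # (4, 2)
```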

Tied comparisons

T is the number of comparisons in which the two pairs of points being compared are of the same type:

Both pairs are within-cluster comparisons, OR

Both pairs are between-cluster comparisons

These tied cases are excluded from the concordant/discordant classification.
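Because T depends only on how many within-cluster and between-cluster pairs exist, it can be computed from two counts without examining any distances. The sketch below assumes those counts are already known; the function name is illustrative.

```python
from math import comb


def tied_comparisons(n_within, n_between):
    """Number of comparisons where both pairs are of the same type."""
    return comb(n_within, 2) + comb(n_between, 2)


# Four points in two clusters of two: 2 within-cluster pairs, 4 between-cluster pairs.
t = tied_comparisons(n_within=2, n_between=4)
print(t)               # 7
print(comb(6, 2) - t)  # 8 comparisons remain, equal to 2 * 4
```

The remaining comparisons always equal the product of the two counts, since each one matches a within-cluster pair with a between-cluster pair.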

Total distances

The total number of pairwise distances in a dataset of n points:

$$N_t = \frac{n(n-1)}{2}$$
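The number of unordered pairs among n points is n(n-1)/2. A short check, using an illustrative helper name, compares that closed form with an explicit enumeration of the pairs:

```python
from itertools import combinations


def total_distances(n):
    """Number of unordered pairs (pairwise distances) among n points."""
    return n * (n - 1) // 2


n = 5
pairs = list(combinations(range(n), 2))  # every unordered pair of point indices
print(total_distances(n), len(pairs))    # 10 10
```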

Interpretation

The Tau index measures the agreement between distance structure and clustering assignments:

Higher Tau values (closer to +1): Indicate better clustering with strong agreement between distances and cluster assignments

Lower Tau values (closer to -1): Suggest poor clustering with disagreement between distance structure and clustering

Tau values near 0: Indicate random or weak relationship between distances and clustering

The optimal number of clusters corresponds to the maximum value of the Tau index. This occurs when there is the strongest correlation between the distance matrix and the cluster assignment pattern.
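Selecting the cluster count by this rule amounts to taking the maximum over candidate values. In the sketch below, the function name is hypothetical and the Tau values are placeholders chosen only for illustration, not results from any dataset.

```python
def best_cluster_count(tau_by_k):
    """Return the number of clusters k with the largest Tau value."""
    return max(tau_by_k, key=tau_by_k.get)


# Placeholder Tau values for k = 2..5 (illustrative only).
tau_by_k = {2: 0.41, 3: 0.58, 4: 0.52, 5: 0.47}
print(best_cluster_count(tau_by_k))  # 3
```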

Computational considerations

The Tau index involves extensive pairwise comparisons:

1. Calculate all pairwise distances in the dataset

2. Create a binary matrix indicating cluster membership for each pair

3. Compare each distance relationship with each cluster membership relationship

4. Count concordant, discordant, and tied comparisons

5. Apply the normalization formula
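The five steps above can be sketched in plain Python. This is a direct, unoptimized illustration, not Prism's implementation. It assumes Euclidean distance, points given as coordinate tuples, and the commonly cited normalization (s(+) - s(-)) / sqrt((Nt(Nt-1)/2 - T) * (Nt(Nt-1)/2)). The function name is hypothetical.

```python
from itertools import combinations
from math import comb, dist, sqrt


def tau_index(points, labels):
    """Direct (unoptimized) Tau index for points with cluster labels."""
    # Steps 1-2: every pairwise distance, flagged by same-cluster membership.
    pairs = [
        (dist(points[i], points[j]), labels[i] == labels[j])
        for i, j in combinations(range(len(points)), 2)
    ]
    within = [d for d, same in pairs if same]
    between = [d for d, same in pairs if not same]

    # Steps 3-4: compare each within-cluster distance with each
    # between-cluster distance; equal distances count for neither side.
    s_plus = sum(b > w for w in within for b in between)
    s_minus = sum(b < w for w in within for b in between)
    total = comb(len(pairs), 2)                           # Nt(Nt-1)/2
    tied = comb(len(within), 2) + comb(len(between), 2)   # T

    # Step 5: normalization.
    denominator = sqrt((total - tied) * total)
    if denominator == 0:
        raise ValueError("Tau is undefined without within/between comparisons")
    return (s_plus - s_minus) / denominator


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(tau_index(points, [0, 0, 1, 1]))  # well separated: about 0.730
print(tau_index(points, [0, 1, 1, 0]))  # mismatched: about -0.730
```

The nested comparison makes the cost visible: the work grows with the product of the number of within-cluster and between-cluster distances.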

Advantages and considerations

The Tau index offers several advantages:

It provides a correlation-based measure with clear statistical interpretation

It's similar to established rank correlation coefficients (like Kendall's τ)

It doesn't make assumptions about cluster shape or distribution

The normalization accounts for tied comparisons appropriately

However, there are significant limitations:

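Computationally very expensive: A direct count compares every pair of distances, so a dataset with Nt = n(n-1)/2 distances involves Nt(Nt-1)/2 comparisons, roughly n^4/8 for large n

High computational demand: Processing time therefore grows rapidly with dataset size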

Memory requirements: Can become prohibitive for large datasets

Complex calculation: The tied comparison adjustment adds computational complexity
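To give a sense of scale, the short sketch below counts how many comparisons between pairs of distances a direct count would consider for a dataset of n points. The helper name is illustrative, and the figures describe a naive count, not any particular implementation.

```python
from math import comb


def comparison_count(n):
    """Comparisons between pairs of distances for n points (naive count)."""
    n_distances = comb(n, 2)     # Nt = n(n-1)/2
    return comb(n_distances, 2)  # Nt(Nt-1)/2


for n in (10, 100, 1000):
    print(n, comparison_count(n))
# 10 990
# 100 12248775
# 1000 124749875250
```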

Due to its high computational cost, the Tau index may not be practical for large datasets.

The Tau index is particularly useful when:

You want a rank correlation-based validation measure

The dataset is small to medium-sized

You need a comprehensive assessment of distance-clustering agreement

Computational resources are adequate for the analysis

The index provides a thorough evaluation of how well clustering assignments correspond to the underlying distance structure, but may cause performance issues when used with very large datasets due to its computational requirements.

The Tau index is one of 17 methods used in Prism's consensus approach for determining optimal cluster numbers, as described on the cluster metrics page.

© 1995-2019 GraphPad Software, LLC. All rights reserved.