Features and functionality described on this page are available with Prism Enterprise.

The Point-biserial index is a clustering validation metric examined by Milligan (1980, 1981) and Kraemer (1982) that measures the point-biserial correlation between the raw input dissimilarity matrix and a binary matrix indicating cluster membership. This index evaluates how well the clustering assignment corresponds to the underlying distance structure of the data.

The Ptbiserial index is based on the principle that good clustering should show a strong correlation between cluster assignment and distance relationships. Points in the same cluster should have smaller distances, while points in different clusters should have larger distances.

Mathematical calculation

The Ptbiserial index is calculated as follows:

\[
\text{Ptbiserial} = \frac{\left(\bar{S}_b - \bar{S}_w\right)\sqrt{N_w N_b / N_t^2}}{s_d}
\]

where:

$\bar{S}_b$ is the average between-cluster distance

$\bar{S}_w$ is the average within-cluster distance

$N_w$ is the number of within-cluster pairs

$N_b$ is the number of between-cluster pairs

$N_t$ is the total number of pairs

$s_d$ is the standard deviation of all distances
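The calculation can be sketched in a few lines of Python. This is a minimal illustration rather than Prism's implementation: the function name `ptbiserial`, the sample points, and the two labelings are hypothetical, distances are assumed to be Euclidean, and the standard deviation is assumed to be the population form (`statistics.pstdev`). Using the sample standard deviation instead would rescale the result slightly.

```python
from itertools import combinations
from math import dist, sqrt
from statistics import pstdev


def ptbiserial(points, labels):
    """Point-biserial index for points (coordinate tuples) and cluster labels.

    Assumes at least one within-cluster pair and one between-cluster pair.
    """
    within, between = [], []
    for (p, a), (q, b) in combinations(zip(points, labels), 2):
        # Same label: within-cluster pair; different label: between-cluster pair.
        (within if a == b else between).append(dist(p, q))

    n_w, n_b = len(within), len(between)
    n_t = n_w + n_b
    mean_w = sum(within) / n_w      # average within-cluster distance
    mean_b = sum(between) / n_b     # average between-cluster distance
    s_d = pstdev(within + between)  # assumption: population standard deviation
    return (mean_b - mean_w) * sqrt(n_w * n_b / n_t**2) / s_d


# Hypothetical data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = [0, 0, 0, 1, 1, 1]  # labels that follow the groups
bad = [0, 1, 0, 1, 0, 1]   # labels that ignore the groups

good_score = ptbiserial(points, good)
bad_score = ptbiserial(points, bad)
```

With these inputs, the labeling that follows the two groups scores close to 1, while the labeling that mixes them scores lower.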

Distance components

The average distances are calculated as:

\[
\bar{S}_w = \frac{S_w}{N_w}, \qquad \bar{S}_b = \frac{S_b}{N_b}
\]

where:

$S_w$ is the sum of within-cluster distances

$S_b$ is the sum of between-cluster distances

Pair counts

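The number of pairs is calculated as:

\[
N_t = \frac{n(n-1)}{2}, \qquad N_w = \sum_{i=1}^{k} \frac{n_i(n_i-1)}{2}, \qquad N_b = N_t - N_w
\]

where:

$n$ is the total number of data points

$n_i$ is the number of points in cluster $i$

$k$ is the number of clusters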
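The pair counts depend only on the cluster sizes, so they can be computed without touching the distances. The sketch below assumes the usual counting of unordered pairs; the helper name `pair_counts` and the example labels are hypothetical.

```python
from collections import Counter


def pair_counts(labels):
    """Return (within-cluster pairs, between-cluster pairs, total pairs)."""
    n = len(labels)
    sizes = Counter(labels).values()             # points per cluster
    n_t = n * (n - 1) // 2                       # all unordered pairs
    n_w = sum(s * (s - 1) // 2 for s in sizes)   # pairs inside each cluster
    n_b = n_t - n_w                              # remaining pairs span two clusters
    return n_w, n_b, n_t


# Two clusters of three points each: 3 + 3 within-cluster pairs out of 15.
counts = pair_counts([0, 0, 0, 1, 1, 1])  # → (6, 9, 15)
```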

Interpretation

The Point-biserial index measures the correlation between cluster assignments and distance relationships, so its value lies between -1 and 1:

Higher Ptbiserial values: Indicate better clustering, with between-cluster distances tending to be larger than within-cluster distances

Lower Ptbiserial values: Suggest weaker correlation between clustering and distance structure

The optimal number of clusters corresponds to the maximum value of the Point-biserial index. This occurs when:

Within-cluster distances are consistently smaller than between-cluster distances

The clustering assignment best reflects the underlying distance structure

The correlation between binary cluster membership and distances is strongest
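As a rough illustration of the selection rule, the sketch below scores several candidate partitions of the same data and keeps the number of clusters with the largest index. It is not Prism's procedure: the data, the candidate labelings in `candidates`, and the function name `ptbiserial` are hypothetical, the labelings are assumed to come from some clustering step that is not shown, and the population standard deviation is assumed.

```python
from itertools import combinations
from math import dist, sqrt
from statistics import pstdev


def ptbiserial(points, labels):
    """Point-biserial index; assumes Euclidean distances and population sd."""
    within, between = [], []
    for (p, a), (q, b) in combinations(zip(points, labels), 2):
        (within if a == b else between).append(dist(p, q))
    n_w, n_b = len(within), len(between)
    n_t = n_w + n_b
    diff = sum(between) / n_b - sum(within) / n_w
    return diff * sqrt(n_w * n_b / n_t**2) / pstdev(within + between)


# Hypothetical data: three tight groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 0), (10, 1), (11, 0),
          (5, 9), (5, 10), (6, 9)]

# Hypothetical candidate partitions, keyed by number of clusters.
candidates = {
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1],  # first two groups merged
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],  # one cluster per group
    4: [0, 0, 1, 2, 2, 2, 3, 3, 3],  # first group split
}

scores = {k: ptbiserial(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)  # number of clusters with the largest index
```

Here merging two groups inflates the within-cluster distances and splitting a group adds short between-cluster distances, so both alternatives score below the three-cluster partition.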

Conceptual foundation

The index works by creating a binary matrix where:

0 is assigned if two points are in the same cluster

1 is assigned if two points are in different clusters

The point-biserial correlation then measures how well this binary clustering indicator correlates with the actual distance matrix. High correlation indicates that the clustering successfully captures the distance structure.
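The link between the binary indicator and the formula can be checked numerically. The sketch below builds the 0/1 indicator for every pair of points, computes an ordinary Pearson correlation with the distances, and compares it with the closed-form expression. The helper names and data are hypothetical, and the two values agree only under the assumption that the standard deviation is the population form.

```python
from itertools import combinations
from math import dist, sqrt
from statistics import pstdev


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def indicator_and_distances(points, labels):
    """For each unordered pair: 0 if same cluster, 1 if different, plus distance."""
    pairs = list(combinations(range(len(points)), 2))
    indicator = [0 if labels[i] == labels[j] else 1 for i, j in pairs]
    distances = [dist(points[i], points[j]) for i, j in pairs]
    return indicator, distances


# Hypothetical data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]

indicator, distances = indicator_and_distances(points, labels)
r = pearson(indicator, distances)

# Closed form built from the average distances and pair counts.
n_t = len(distances)
n_b = sum(indicator)
n_w = n_t - n_b
mean_b = sum(d for d, z in zip(distances, indicator) if z) / n_b
mean_w = sum(d for d, z in zip(distances, indicator) if not z) / n_w
closed_form = (mean_b - mean_w) * sqrt(n_w * n_b / n_t**2) / pstdev(distances)
```

Because the indicator is 1 for between-cluster pairs, a positive correlation means that pairs in different clusters tend to be farther apart.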

Advantages and considerations

The Point-biserial index offers several advantages:

It provides a correlation-based measure with a clear statistical interpretation

It directly relates clustering assignments to the underlying distance structure

It accounts for both within-cluster and between-cluster relationships

The calculation incorporates the variability of distances through the standard deviation

However, there are some limitations:

It requires calculation of all n(n-1)/2 pairwise distances, which grows quadratically with the number of data points and can be computationally expensive

It assumes that distance relationships should correlate with cluster membership

Performance may vary depending on the distance metric used

It may be sensitive to outliers that affect the standard deviation

The Point-biserial index is particularly useful when:

You want to assess how well clustering reflects distance relationships

The underlying distance metric has a meaningful interpretation

You need a statistically grounded correlation measure

You are working with clustering methods that should respect distance structure

The index is especially effective when combined with other validation measures to provide a comprehensive evaluation of clustering quality.

The Point-biserial index is one of 17 methods used in Prism's consensus approach for determining optimal cluster numbers, as described on the cluster metrics page.

© 1995-2019 GraphPad Software, LLC. All rights reserved.