Features and functionality described on this page are available with Prism Enterprise.
The Point-biserial index is a clustering validation metric examined by Milligan (1980, 1981) and Kraemer (1982) that measures the point-biserial correlation between the raw input dissimilarity matrix and a binary matrix indicating cluster membership. This index evaluates how well the clustering assignment corresponds to the underlying distance structure of the data.
The Ptbiserial index is based on the principle that a good clustering should show a strong correlation between cluster assignment and pairwise distance: pairs of points in the same cluster should be close together, while pairs of points in different clusters should be farther apart.
The Ptbiserial index is calculated as follows:

$$\text{Ptbiserial} = \frac{(\bar{S}_b - \bar{S}_w)\,\sqrt{N_w N_b / N_t^2}}{s_d}$$
where:
•S̄b is the average between-cluster distance
•S̄w is the average within-cluster distance
•Nw is the number of within-cluster pairs
•Nb is the number of between-cluster pairs
•Nt is the total number of pairs
•sd is the standard deviation of all distances
The average distances are calculated as:

$$\bar{S}_w = \frac{S_w}{N_w}, \qquad \bar{S}_b = \frac{S_b}{N_b}$$
where:
•Sw is the sum of within-cluster distances
•Sb is the sum of between-cluster distances
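The number of pairs is calculated as:

$$N_t = \frac{n(n-1)}{2}, \qquad N_w = \sum_{i=1}^{k} \frac{n_i(n_i-1)}{2}, \qquad N_b = N_t - N_w$$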
where:
•n is the total number of data points
•ni is the number of points in cluster i
•k is the number of clusters
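The quantities defined above can be put together in a short Python sketch. This is an illustration only, not Prism's implementation: it assumes the index has the form (S̄b − S̄w) · √(Nw · Nb / Nt²) / sd, uses Euclidean distances, and takes sd to be the sample standard deviation of all pairwise distances (the convention Prism uses is not stated on this page). The function name `ptbiserial` and the toy data are hypothetical.

```python
import math
from itertools import combinations
from statistics import stdev


def ptbiserial(points, labels):
    """Point-biserial index for labelled points (illustrative sketch)."""
    within, between = [], []
    # Visit every unordered pair once: Nt = n * (n - 1) / 2 pairs in total.
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])  # assumption: Euclidean distance
        (within if labels[i] == labels[j] else between).append(d)

    n_w, n_b = len(within), len(between)
    n_t = n_w + n_b
    if n_w == 0 or n_b == 0:
        raise ValueError("need both within-cluster and between-cluster pairs")

    mean_w = sum(within) / n_w           # average within-cluster distance
    mean_b = sum(between) / n_b          # average between-cluster distance
    s_d = stdev(within + between)        # assumption: sample standard deviation
    if s_d == 0:
        raise ValueError("all pairwise distances are equal")
    return (mean_b - mean_w) * math.sqrt(n_w * n_b / n_t**2) / s_d


# Toy data: two tight groups of three points each (n = 6, so Nt = 15).
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
good = ptbiserial(points, [0, 0, 0, 1, 1, 1])  # labels follow the two groups
bad = ptbiserial(points, [0, 1, 0, 1, 0, 1])   # labels cut across the groups
print(good, bad)
```

With the first labelling, Nw = 3 + 3 = 6 and Nb = 15 − 6 = 9. The labelling that follows the two groups gives a clearly higher value than the one that cuts across them.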
The Point-biserial index measures the correlation between cluster assignments and pairwise distances. As a correlation coefficient, it lies between −1 and 1:
•Higher Ptbiserial values: Indicate better clustering, with between-cluster distances tending to be larger than within-cluster distances
•Lower Ptbiserial values: Suggest a weaker correspondence between the clustering and the distance structure
The optimal number of clusters corresponds to the maximum value of the Point-biserial index. This occurs when:
•Within-cluster distances are consistently smaller than between-cluster distances
•The clustering assignment best reflects the underlying distance structure
•The correlation between binary cluster membership and distances is strongest
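The selection rule can be sketched as a comparison of the index across candidate partitions. The example below is a minimal illustration under stated assumptions: the candidate labellings are written by hand and stand in for the output of whatever clustering method is used, the helper names `ptbiserial` and `best_number_of_clusters` are hypothetical, distances are Euclidean, and sd is taken as the sample standard deviation.

```python
import math
from itertools import combinations
from statistics import stdev


def ptbiserial(points, labels):
    # Sketch of (mean_b - mean_w) * sqrt(Nw * Nb / Nt^2) / sd
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])  # assumption: Euclidean distance
        (within if labels[i] == labels[j] else between).append(d)
    n_w, n_b = len(within), len(between)
    n_t = n_w + n_b
    mean_w, mean_b = sum(within) / n_w, sum(between) / n_b
    # assumption: sample standard deviation of all pairwise distances
    return (mean_b - mean_w) * math.sqrt(n_w * n_b / n_t**2) / stdev(within + between)


def best_number_of_clusters(points, candidate_labels):
    """Return (k, scores), where k has the largest index among the candidates."""
    scores = {k: ptbiserial(points, labels) for k, labels in candidate_labels.items()}
    return max(scores, key=scores.get), scores


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
# Hand-made partitions for k = 2, 3, 4 (hypothetical clustering results).
candidate_labels = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 2],
    4: [0, 0, 1, 2, 2, 3],
}
best_k, scores = best_number_of_clusters(points, candidate_labels)
print(best_k, scores)
```

For this toy data the two-cluster partition scores highest, because splitting either tight group moves short distances into the between-cluster set and lowers the index.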
The index works by creating a binary matrix where:
•0 is assigned if two points are in the same cluster
•1 is assigned if two points are in different clusters
The point-biserial correlation then measures how well this binary clustering indicator correlates with the actual distance matrix. High correlation indicates that the clustering successfully captures the distance structure.
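The link between the binary indicator and the index can be checked numerically. The sketch below builds the 0/1 indicator and the distances for every pair, computes their Pearson correlation directly, and compares it with the mean-difference form (S̄b − S̄w) · √(Nw · Nb / Nt²) / sd. The two agree exactly when sd is the population standard deviation; with the sample standard deviation the value differs slightly. All function names here are hypothetical, and Euclidean distance is an assumption.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def pair_vectors(points, labels):
    """Distances and 0/1 indicators (0 = same cluster, 1 = different) per pair."""
    distances, indicator = [], []
    for i, j in combinations(range(len(points)), 2):
        distances.append(math.dist(points[i], points[j]))  # assumption: Euclidean
        indicator.append(0 if labels[i] == labels[j] else 1)
    return distances, indicator


def pearson(x, y):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(ssx * ssy)


def ptbiserial_from_means(distances, indicator):
    """Mean-difference form of the index, using the population standard deviation."""
    within = [d for d, b in zip(distances, indicator) if b == 0]
    between = [d for d, b in zip(distances, indicator) if b == 1]
    n_w, n_b, n_t = len(within), len(between), len(distances)
    scale = math.sqrt(n_w * n_b / n_t**2)
    return (mean(between) - mean(within)) * scale / pstdev(distances)


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]
distances, indicator = pair_vectors(points, labels)
r = pearson(distances, indicator)
index = ptbiserial_from_means(distances, indicator)
print(r, index)
```

Coding the indicator as 1 for pairs in different clusters is what makes a good clustering produce a positive correlation: large distances line up with 1 and small distances with 0.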
The Point-biserial index offers several advantages:
•It provides a correlation-based measure with clear statistical interpretation
•It directly relates clustering assignments to the underlying distance structure
•It accounts for both within-cluster and between-cluster relationships
•The calculation incorporates the variability of distances through the standard deviation
However, there are some limitations:
•It requires calculation of all n(n − 1)/2 pairwise distances, which can be computationally expensive for large data sets
•It assumes that distance relationships should correlate with cluster membership
•Performance may vary depending on the distance metric used
•It may be sensitive to outliers that affect the standard deviation
The Point-biserial index is particularly useful when:
•You want to assess how well clustering reflects distance relationships
•The underlying distance metric has a meaningful interpretation
•You need a statistically grounded correlation measure
•You are working with clustering methods that should respect distance structure
The index is especially effective when combined with other validation measures to provide a comprehensive evaluation of clustering quality.
The Point-biserial index is one of 17 methods used in Prism's consensus approach for determining optimal cluster numbers, as described on the cluster metrics page.