Features and functionality described on this page are available with Prism Enterprise.
When performing a clustering analysis, a crucial step is selecting the number of clusters into which to group your data. Clustering analyses are known as “unsupervised machine learning techniques” because they operate without labels on the data telling the algorithm which group each observation belongs to. Instead, these analyses identify patterns or relationships within the data without any specific guidance from the human researcher (this is why they’re called “unsupervised”). Because of this, there’s no way for a clustering algorithm to know a priori how many clusters it should group the data into.
In most cases, it’s up to you, as the “human supervisor” of the machine learning algorithm, to decide what the “optimal” number of clusters is. If we knew exactly how many groups were present in the data, we probably wouldn’t need to run a clustering analysis at all, so this decision can be difficult. If you choose too many clusters, you may inadvertently cause the algorithm to overfit the data (reducing the generalizability of the results). If you choose too few, groups that are actually distinct may be merged, and valuable distinctions will be lost. Fortunately, clustering algorithms can provide information that assists in making this decision about the optimal number of clusters.
For clustering analyses in Prism, information that can assist in selecting the optimal number of clusters may be provided graphically through one of three different graph types (along with corresponding analytical values):
1. Elbow plot (and within-cluster sum of squares)
2. Silhouette plot (and silhouette score)
3. Gap plot (and gap statistic)
The linked pages discuss each of these graph types and the analytical scores associated with them. They also demonstrate how each of these values relates to the others, and provide recommendations on how to use this information to decide on the optimal number of clusters for a given clustering analysis on a given data set.
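To make these three diagnostics concrete, here is a minimal Python sketch (using scikit-learn and NumPy on synthetic data from make_blobs) that computes each value for a range of candidate cluster counts. It illustrates the general calculations only, not Prism's internal implementation, and it uses a simplified form of the gap statistic:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with a known group structure (4 true clusters).
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(2, 11)
rng = np.random.default_rng(0)
n_refs = 10  # uniform reference data sets used for the gap statistic
wss, sil, gap = [], [], []

for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Within-cluster sum of squares: plotted against k, this forms the elbow plot.
    wss.append(km.inertia_)
    # Mean silhouette score across all observations (higher is better).
    sil.append(silhouette_score(X, km.labels_))
    # Simplified gap statistic: expected log(WSS) under a uniform
    # "no structure" reference, minus the observed log(WSS).
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_log_wss = [
        np.log(KMeans(n_clusters=k, n_init=10, random_state=0)
               .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
        for _ in range(n_refs)
    ]
    gap.append(np.mean(ref_log_wss) - np.log(km.inertia_))

for k, w, s, g in zip(ks, wss, sil, gap):
    print(f"k={k}: WSS={w:.1f}, silhouette={s:.3f}, gap={g:.3f}")
```

On an elbow plot, you would look for the value of k at which WSS stops decreasing sharply; for the other two diagnostics, higher silhouette scores and larger gap values generally indicate better-separated clusterings.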
For K-means clustering, Prism also offers the option to determine the optimal number of clusters automatically. This is done by calculating indices from 17 different methods, each developed to estimate the optimal number of clusters for a given data set. The optimal number of clusters according to each method is calculated, and the "consensus" is determined as the number of clusters most frequently identified as optimal across this set of methods. The methods used in Prism (along with their calculations and interpretations) can be found on the following pages:
2. C index
5. Dunn index
6. Frey index
7. Gamma index
9. GPlus index
10. Hartigan index
14. Ratkowsky index
15. Silhouette score
16. Tau index
17. TraceW index
Note that for these consensus methods, a minimum of k=2 clusters is considered. The reason for this is twofold. First, many of the methods cannot calculate the corresponding indices for k=1 (in other words, when ALL of the data are considered as a single cluster and no additional clustering is applied). Second, this process is intended to help you determine how many clusters there are within your data, and treating k=1 (a single cluster for all of your data) as the optimal solution would amount to saying "there aren't any clusters." It's assumed that you've already determined that there should be clusters within your data (or that you expect clusters in your data), so this tool uses that assumption to determine how many clusters would be optimal based on your specific data. A simple sketch of this consensus approach is shown below.
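As a rough illustration of the consensus idea (not Prism's actual set of 17 methods), the following Python sketch lets three common indices available in scikit-learn each "vote" for the k they consider optimal, then takes the most frequent vote as the consensus:

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score,
                             silhouette_score)

# Synthetic data with a known group structure (4 true clusters).
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Candidate cluster counts. k=1 is excluded: most indices are undefined
# when all of the data form a single cluster.
ks = list(range(2, 11))
labels = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
          for k in ks}

# Each method "votes" for the k at which its index is best. Silhouette and
# Calinski-Harabasz are maximized; Davies-Bouldin is minimized.
votes = [
    max(ks, key=lambda k: silhouette_score(X, labels[k])),
    max(ks, key=lambda k: calinski_harabasz_score(X, labels[k])),
    min(ks, key=lambda k: davies_bouldin_score(X, labels[k])),
]

# The consensus is the number of clusters chosen most often across methods.
consensus, n_votes = Counter(votes).most_common(1)[0]
print(f"votes: {votes} -> consensus k = {consensus} "
      f"({n_votes} of {len(votes)} methods)")
```

With 17 methods instead of three, ties are less likely and the consensus is more robust, but the majority-vote logic is the same.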
Selecting the optimal number of clusters is a decision that can only truly be made with full knowledge of the scientific background of the experiment, the conditions under which the data were collected, what the data represent in the “real world”, and a myriad of other factors. The graphs and associated values on this page can provide suggestions or guidance to consider when making this decision, but ultimately there’s no single “right” way to choose an optimal cluster number.
Additionally, it must be noted that the “optimal” number of clusters heavily depends on the specific data being analyzed as well as the parameters of the clustering analysis. If any of the values in the data change, or any of the analysis options are modified, the “optimal” number of clusters may very likely change as well!
With that out of the way, let’s get into the methods that Prism provides to help decide on the optimal number of clusters from a clustering analysis.