Features and functionality described on this page are available with Prism Enterprise.
This sheet provides a general overview of the analysis, including important information about the experimental design, the inputs, and a summary of the clustering results. Contents of this sheet include:
• The name of the input data table analyzed
• A K-means summary table, including
o The total number of clusters into which the algorithm attempted to fit the data
o The total within cluster sum of squares (WCSS) calculated for a given total number of clusters. Note that the WCSS will always decrease as more clusters are used to group the data. When there is one cluster (all data assigned to the same cluster), the WCSS is equal to the total sum of squares. This value decreases as more clusters are added until the number of clusters is equal to the number of observations, at which point the WCSS will equal exactly zero.
o The percent of total variation, which may also be thought of as the proportion (or percent) of unexplained variation. This value gives the fraction of the total variation that remains within the clusters after the model has been fit, and is calculated as the WCSS divided by the total sum of squares of the data. A higher percent of total variation indicates that a large portion of the total variance is still inside the clusters; intuitively, this means that there is still a large amount of variance in each cluster, and thus the data in the clusters are quite spread out. In other words, a high percent of total variation suggests that the clustering did not do a great job of organizing the data into compact groups.
o The percent of explained variance is the complement of the percent of total variation above. This value represents the relative amount of variation in the data that is “explained” by the model. If a single cluster is fit to the data, then the “within cluster sum of squares” (WCSS, see above) will simply be equal to the total sum of squares in the data. This means that with one cluster, the model has not contributed to understanding the variance of the data. With two clusters, the WCSS will decrease compared to the WCSS when only using one cluster. How much this value decreases can be used as a rough estimate of how much variance the model has explained. The formula for this is %Var = 1 - (WCSS/Total SS). A high percent of explained variance means that most of the variance is between the clusters rather than within them. Intuitively, this suggests that the clustering analysis has done a decent job of organizing the data into compact groups.
o Silhouette score - this is the average silhouette score across all observations after clustering them into the specified number of clusters (see above). The details of how a silhouette score is calculated involve a bit of math covered elsewhere, but the general concept is that the silhouette score for an observation represents how close (or far) that observation is to the other points in its assigned cluster vs how close (or far) it is to the points of the next nearest cluster. This value can range from -1 to 1, with values close to 1 indicating that an observation is well matched with the others in its cluster. Identifying the number of clusters that results in the maximum silhouette score is one method (of many) to help identify an optimal number of clusters into which to group your data. These values are displayed graphically in the silhouette plot.
o Gap statistic - this value is only shown if the option to display the gap plot for the K-means clustering analysis was selected. Without getting into the details of how this value is calculated, the idea is that the gap statistic represents how much better your data group into a specific number of clusters than randomly simulated data with the same range. If groupings actually exist in your data, it’s assumed that your data will group into clusters better than randomly generated data. If your data don’t group into clusters better than randomly generated data, this suggests that there may be no inherent groupings in your data. This metric can take on any value, but in practice it’s generally only ever zero or positive. A value of zero indicates that the input data do not group into the specified number of clusters any better than randomly generated data. A positive value indicates that your data group into the specified number of clusters better than randomly generated data, and the magnitude of the metric can generally be taken as a measure of how well your data cluster together.
o Number of iterations - this indicates how many times the selected K-means algorithm had to loop in order to reach convergence.
o Algorithm - this simply reports the algorithm used by the analysis. The default algorithm is the Hartigan-Wong algorithm.
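To make the WCSS and percent-of-variation quantities above concrete, here is a minimal numeric sketch (not Prism's implementation) using a made-up 1-D dataset with a hypothetical two-cluster assignment:

```python
# Illustrative sketch of WCSS, percent of total variation, and percent of
# explained variance. The data and cluster assignment are hypothetical.

def sum_sq(points, center):
    """Sum of squared distances from each point to the given center."""
    return sum((x - center) ** 2 for x in points)

def mean(points):
    return sum(points) / len(points)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]         # two obvious groups
clusters = [[1.0, 1.2, 0.8], [9.0, 9.5, 8.5]]  # hypothetical assignment

# With one cluster, the WCSS equals the total sum of squares:
total_ss = sum_sq(data, mean(data))

# With the two-cluster assignment, the WCSS shrinks dramatically:
wcss = sum(sum_sq(c, mean(c)) for c in clusters)

pct_unexplained = wcss / total_ss              # percent of total variation
pct_explained = 1 - wcss / total_ss            # %Var = 1 - (WCSS / Total SS)
```

For this toy dataset, nearly all of the variance lies between the two groups, so the percent of explained variance is close to 1, matching the intuition described above.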
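The silhouette idea described above can also be sketched in a few lines. This is a rough pure-Python illustration for 1-D points (assuming every cluster has at least two points), not Prism's exact implementation:

```python
# Sketch of the silhouette score: for each point, compare the mean distance
# to its own cluster (a) against the mean distance to the nearest other
# cluster (b). Data are hypothetical 1-D values.

def silhouette(clusters):
    """Mean silhouette score over all points; clusters is a list of lists."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            # a: mean distance to the other points in p's own cluster
            others = [q for q in cluster if q is not p]
            a = sum(abs(p - q) for q in others) / len(others)
            # b: smallest mean distance to the points of any other cluster
            b = min(sum(abs(p - q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

score = silhouette([[1.0, 1.2, 0.8], [9.0, 9.5, 8.5]])
```

Because these two toy groups are well separated, the average silhouette score comes out close to 1; overlapping or poorly assigned clusters would pull it toward 0 or below.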
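The number-of-iterations and gap-statistic entries can likewise be illustrated with a small sketch. Note two assumptions: the data are made up, and the K-means variant below is the simple Lloyd algorithm rather than the Hartigan-Wong algorithm that Prism uses by default:

```python
import math
import random

# Illustrative sketch only: Lloyd-style 1-D K-means returning the WCSS and
# the number of iterations needed to converge, then a gap-statistic-style
# comparison against uniform reference data spanning the same range.

def kmeans_1d(data, k, rng, max_iter=100):
    """Return (WCSS, iterations to convergence) for 1-D Lloyd K-means."""
    centers = rng.sample(data, k)
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in range(k)]
        for x in data:
            groups[min(range(k), key=lambda c: abs(x - centers[c]))].append(x)
        # Update step: move each center to the mean of its group.
        new_centers = [sum(g) / len(g) if g else centers[i]
                       for i, g in enumerate(groups)]
        if new_centers == centers:          # converged: no center moved
            break
        centers = new_centers
    wcss = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
               for g in groups if g)
    return wcss, iteration

rng = random.Random(0)
data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]       # two well-separated groups
k = 2

wcss, n_iter = kmeans_1d(data, k, rng)

# Gap statistic idea: mean log-WCSS of uniform reference datasets over the
# same range, minus log-WCSS of the real data. A positive gap means the real
# data cluster better than featureless noise.
ref_logs = []
for _ in range(20):
    ref = [rng.uniform(min(data), max(data)) for _ in data]
    ref_logs.append(math.log(kmeans_1d(ref, k, rng)[0]))
gap = sum(ref_logs) / len(ref_logs) - math.log(wcss)
```

Since these data separate cleanly into two groups, the gap comes out positive, in line with the interpretation above.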
• (Optional) The "Optimal number of clusters" section (displayed if the option "Determine optimal number of clusters" was selected on the analysis options dialog). Within this section, you will find a summary of the consensus method of selecting the optimal number of clusters, including:
o The best cluster count determined by the consensus method
o A list of metrics/methods used
o The optimal cluster count determined from each method
o The metric value corresponding to each method for the given optimal cluster count
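One way to picture a consensus across methods is a simple majority vote: each method proposes an optimal cluster count, and the most common proposal wins. The method names and counts below are hypothetical, and Prism's actual consensus rule may differ:

```python
from collections import Counter

# Hypothetical per-method results; each method proposes an optimal k.
optimal_k_by_method = {
    "Silhouette": 3,
    "Elbow (WCSS)": 3,
    "Gap statistic": 4,
}

# Majority-vote sketch: the most frequently proposed k is the consensus.
consensus_k, votes = Counter(optimal_k_by_method.values()).most_common(1)[0]
```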
• A summary of the input data, including
o The number of variables (columns)
o The total number of rows in the input table
o The number of rows skipped (due to missing data)
o The number of observations used in the analysis (total number of rows - rows skipped)
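The bookkeeping in the last item is simple arithmetic; with hypothetical counts:

```python
# Hypothetical counts: observations used = total rows - rows skipped.
total_rows = 150
rows_skipped = 4        # rows excluded due to missing data
observations_used = total_rows - rows_skipped
```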