Features and functionality described on this page are available with Prism Enterprise.
It may seem obvious or overly simplified, but clustering is simply a process that groups objects or observations (individuals, subjects, samples, etc.) so that the most similar ones end up together. Specifically, clustering methods attempt to group observations in such a way that the objects within a group are more similar to each other than to those in other groups. The groups that the objects are organized into are called "clusters", and can be useful in identifying patterns or relationships within the data that might otherwise have gone undetected. However, it's important to realize that "clustering" is not a single technique, but rather a collection of machine learning algorithms that share the general objective of creating these groups or clusters. Because these algorithms may use very different methods to generate the clusters, they may arrive at very different results even when starting from the same dataset.
Currently, Prism offers two clustering analyses: hierarchical clustering and K-means clustering. Some of the concepts discussed in this section (like methods for calculating distances) may apply to one or both of these analyses, while others (like the linkage method) apply to only one. When necessary, each section will clearly indicate which concepts apply to each of the clustering analyses offered in Prism.
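To make this concrete, here is a minimal sketch in Python (not Prism's implementation; the dataset and parameters are invented for illustration) showing how hierarchical clustering and K-means, asked for the same number of clusters on the same data, can still group the observations differently:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(30, 4))  # hypothetical dataset: 30 observations, 4 variables

# Hierarchical clustering: build the full hierarchy, then cut it into 3 clusters
tree = linkage(data, method="ward", metric="euclidean")
hier_labels = fcluster(tree, t=3, criterion="maxclust")

# K-means clustering: directly partition the same data into 3 clusters
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)

# The two algorithms may assign the same observations to different groups
print(hier_labels)
print(kmeans_labels)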
Linkage methods in hierarchical clustering
Selecting the optimal number of clusters
Prism provides multiple approaches for determining the optimal number of clusters in your data. For both K-means clustering and hierarchical clustering, you can choose from three graphical methods to help determine the optimal number of clusters (the Elbow plot, the Silhouette plot, and the Gap plot). Alternatively, if performing K-means clustering, you can choose the comprehensive "Determine optimal number of clusters" option, which automatically evaluates your data using 17 different cluster validation indices and applies a consensus method to recommend the best number of clusters.
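As a rough illustration of what the first two graphical methods compute (a sketch only, using hypothetical data and scikit-learn rather than Prism), you can run K-means over a range of candidate cluster counts and record the within-cluster sum of squares (WCSS) and average silhouette score for each:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
data = rng.normal(size=(60, 4))  # hypothetical dataset

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    wcss = km.inertia_  # within-cluster sum of squares for this k
    sil = silhouette_score(data, km.labels_)  # average silhouette score
    print(f"k={k}  WCSS={wcss:.1f}  silhouette={sil:.3f}")

# Elbow plot: WCSS vs. k; choose the k where the decrease levels off (the "elbow").
# Silhouette plot: prefer the k with the highest average silhouette score.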
Graphical methods
Elbow plot and within-cluster sum of squares (WCSS)
Silhouette score and the silhouette plot
Consensus methods
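Conceptually, a consensus method works like majority voting: each validation index recommends a number of clusters, and the value recommended most often wins. Below is a minimal sketch using just three common indices from scikit-learn (hypothetical data; Prism's actual procedure with its 17 indices is not reproduced here):

import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

rng = np.random.default_rng(0)
data = rng.normal(size=(60, 4))  # hypothetical dataset

candidates = range(2, 9)
labels_by_k = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).labels_
               for k in candidates}

# Each index "votes" for the k it rates best
votes = [
    max(candidates, key=lambda k: silhouette_score(data, labels_by_k[k])),
    max(candidates, key=lambda k: calinski_harabasz_score(data, labels_by_k[k])),
    min(candidates, key=lambda k: davies_bouldin_score(data, labels_by_k[k])),  # lower is better
]

best_k, n_votes = Counter(votes).most_common(1)[0]
print(f"Consensus recommendation: k={best_k} ({n_votes} of {len(votes)} votes)")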
Biological data is inherently complex and multi-dimensional. Often, we design experiments in such a way that we "control" for as many of these variables as possible so that we can explicitly examine the effects of one or two specific variables of interest. Much time and effort is dedicated to preparing our samples and tests in such a way that we can observe these effects without worrying about all of these other variables.
However, we often face scenarios in which we should not - or cannot - control for the variables in our samples. For example, data may be collected at multiple locations, under different environmental conditions, by different researchers, in different regions, etc. Because the objective of clustering is to organize observations based on their similarities (or dissimilarities) across a wide array of different variables, it can highlight underlying patterns in data collected under these sorts of conditions that we may not have seen or expected otherwise.
Clustering analyses can be used for a wide variety of tasks, such as:
•Understanding biological diversity in different samples such as genes, proteins, cells, or strains by categorizing them into related groups
•Aiding in data interpretation by revealing the underlying structure of the data through visualizations such as dendrograms or clustered scatter plots
•Assisting in the development of hypotheses for future experiments by identifying potentially novel groups within a collected dataset
•Classifying new observations based on the patterns derived from previous data (sketched in the example after this list)
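For the last of these tasks, here is a minimal sketch (hypothetical data and parameters, using scikit-learn rather than Prism) of assigning new observations to clusters learned from previous data:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
previous_data = rng.normal(size=(40, 4))  # hypothetical observations already collected
new_data = rng.normal(size=(5, 4))        # hypothetical new observations

# Learn cluster centroids from the previous data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(previous_data)

# Each new observation is assigned to the cluster with the nearest centroid
print(km.predict(new_data))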