Principal Component Analysis (PCA) is a powerful exploratory model that reduces the dimension of your data. It’s particularly useful when you have a lot of variables (columns). It can even be used with tables that have more columns than rows!
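The case of more columns than rows can be illustrated with a minimal NumPy sketch. The table below is hypothetical random data, not output from Prism, and the sketch only centers the columns (it does not standardize them). It counts how many components carry any variance.

```python
import numpy as np

# Hypothetical table for illustration: 5 rows (observations), 12 columns (variables).
rng = np.random.default_rng(1)
wide_table = rng.normal(size=(5, 12))

# Center each column, then compute the singular values of the centered table.
centered = wide_table - wide_table.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)

# Centering removes one degree of freedom, so at most (rows - 1) components
# can carry any variance, no matter how many columns there are.
n_informative = int(np.sum(singular_values > 1e-10))
print(n_informative)
```

The calculation runs without difficulty on the wide table, but the number of informative components is limited by the number of rows rather than the number of columns.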
The main purposes for PCA are:
1. Visualizing your data for exploratory analyses. You can uncover interesting characteristics of your data by plotting the rows of data along any two principal components with a score plot or the columns of data with a loadings plot.
2. Reducing the number of predictors for future analyses, such as Principal Component Regression.
PCA uses computational linear algebra to determine the underlying linear structure inherent in the matrix (rows and columns) of data. The primary mathematics behind PCA is the Singular Value Decomposition (SVD), which generalizes the eigenvalue decomposition to rectangular matrices.
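As a rough illustration of how the Singular Value Decomposition leads to principal components, here is a minimal NumPy sketch. It is not Prism's implementation. The function name `pca_svd` and the random data are made up for this example, and the sketch only centers the columns (standardizing them first is a common alternative). The `eigenvectors` returned here have unit length. Software packages often rescale them before reporting loadings.

```python
import numpy as np

def pca_svd(table):
    """Sketch of PCA via the SVD of the column-centered data matrix."""
    X = np.asarray(table, dtype=float)
    # PCA describes variation about the column means, so center first.
    centered = X - X.mean(axis=0)
    # Thin SVD: centered = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = U * s                 # coordinates of each row on the components
    eigenvectors = Vt.T            # each column is one component direction
    # Squared singular values give the eigenvalues of the covariance matrix.
    eigenvalues = s**2 / (X.shape[0] - 1)
    explained = eigenvalues / eigenvalues.sum()
    return scores, eigenvectors, eigenvalues, explained

# Hypothetical data: 20 rows, 5 columns.
rng = np.random.default_rng(0)
data = rng.normal(size=(20, 5))
scores, eigenvectors, eigenvalues, explained = pca_svd(data)
print(np.round(explained, 3))
```

The eigenvalues computed this way equal those of the covariance matrix of the columns, which is the link between the SVD and the eigenvalue decomposition.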
Understanding how these methods work is not strictly necessary for understanding PCA and its results. However, a grasp of the basic principles involved can be extremely helpful when interpreting PCA results.
This page provides some of the technical details involved in how PCA is performed and what it can tell you about your data.
PCA works by extracting linear relationships in the data. Linear relationships are often sufficient in practice, although part of PCA's popularity admittedly comes from the fact that assuming linearity greatly simplifies the calculations.
A major limitation of PCA is that it is blind to nonlinear relationships. For example, consider three columns of data, X1, X2, and X3. If X1 = X2*X3 (a nonlinear relationship), then PCA won’t be able to accurately extract that relationship. In contrast, PCA is very capable of extracting relationships that involve many variables at once, as long as those relationships are linear.
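The X1, X2, X3 example can be tried with a short NumPy sketch. The data are simulated (independent standard normal X2 and X3, an assumption made only for this illustration), and the helper `variance_fractions` is a hypothetical name. When X1 is a linear combination of the other columns, the last component carries essentially no variance. When X1 = X2*X3, PCA sees three roughly uncorrelated columns and cannot compress them.

```python
import numpy as np

def variance_fractions(table):
    """Fraction of total variance carried by each principal component."""
    centered = table - table.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values**2
    return variances / variances.sum()

# Simulated columns X2 and X3 (assumed independent and standard normal).
rng = np.random.default_rng(42)
x2 = rng.normal(size=2000)
x3 = rng.normal(size=2000)

# Linear case: X1 = X2 + X3, so the three columns span only two dimensions.
linear = np.column_stack([x2 + x3, x2, x3])
# Nonlinear case: X1 = X2 * X3, a relationship PCA cannot detect.
nonlinear = np.column_stack([x2 * x3, x2, x3])

linear_fractions = variance_fractions(linear)
nonlinear_fractions = variance_fractions(nonlinear)
print("linear:   ", np.round(linear_fractions, 3))
print("nonlinear:", np.round(nonlinear_fractions, 3))
```

In the linear case two components are enough to describe the data completely. In the nonlinear case all three components are needed, even though X1 is fully determined by X2 and X3.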
One common point of confusion is that, unlike most statistical models, PCA by itself does not require you to define a response variable. Instead, all of the variables are entered as predictors. However, as mentioned, PCA is often used as a precursor to further analyses. One of the most common analyses following PCA is Principal Component Regression (PCR). In order to perform PCR, you must designate an outcome variable, which cannot be one of the variables entered into the PCA.
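The two stages of PCR can be sketched in NumPy as follows. This is an illustrative outline under simplifying assumptions, not Prism's procedure. The function `pcr_fit` and the simulated data are hypothetical, the predictors are only centered, and the number of components is chosen by hand. Note that the outcome is kept separate from the table that goes into the PCA.

```python
import numpy as np

def pcr_fit(predictors, outcome, n_components):
    """Sketch of PCR: PCA on the predictors, then least squares on the scores."""
    X = np.asarray(predictors, dtype=float)
    y = np.asarray(outcome, dtype=float)
    x_mean = X.mean(axis=0)
    centered = X - x_mean
    # Stage 1: PCA on the predictors only (the outcome is not included).
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    directions = Vt[:n_components].T
    scores = centered @ directions
    # Stage 2: ordinary least squares of the centered outcome on the scores.
    gamma, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
    # Map the coefficients back to the original predictor variables.
    coef = directions @ gamma
    intercept = y.mean() - x_mean @ coef
    return intercept, coef

# Simulated data: 50 rows, 4 predictors, and one separate outcome column.
rng = np.random.default_rng(7)
predictors = rng.normal(size=(50, 4))
outcome = 2.0 + predictors @ np.array([1.0, -0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=50)

intercept, coef = pcr_fit(predictors, outcome, n_components=2)
fitted = intercept + predictors @ coef
```

If every component is retained, this procedure gives the same fit as ordinary multiple regression on the original predictors. The dimension reduction comes from keeping fewer components.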
Another common point of confusion is the relationship between PCA and Factor Analysis (FA). Factor Analysis is popular in the social sciences and attempts to find interpretable linear relationships among the variables, called factors. In other words, Factor Analysis relies on the concept that there is an "underlying" or "latent" factor that can't be directly measured, but that causes the pattern of measured values in the variables in the dataset. The principal components in PCA do not have the same interpretation. Instead, PCA is simply a useful process to reduce the number of observed variables to a smaller set of uncorrelated components. The advantages of PCA are the score plots, loadings plots, and biplots, and the ability to run further analyses using the dimension-reduced scores. GraphPad Prism does not perform FA (yet).