

Q & A: Principal Component Analysis


Principal Component Analysis (PCA) is an unsupervised* learning method that uses patterns present in high-dimensional data (data with many variables) to reduce the complexity of the data while retaining most of the information.

*Unsupervised is a term used in machine learning to indicate that a technique does not use outcomes or labels when processing the data. To understand this, compare it to supervised learning. Regression is an example of a supervised learning method because it uses a known set of outcome values (the dependent variable) and builds a model to connect the predictor variables (sometimes called "features" in machine learning) to these outcomes. In contrast, an unsupervised learning method (like PCA) does not use any labels (outcomes) when conducting the analysis. You don't define any outcome (dependent) or predictor (independent) variables. Instead, it simply looks at the properties of the data (in the case of PCA, the variance in the data).
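As a rough illustration of the distinction (a minimal numpy sketch with made-up data, not how Prism performs these analyses), regression cannot be fit without a vector of outcomes y, while PCA operates on the data matrix X alone:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                      # 50 observations, 4 variables
y = X @ np.array([1.0, 0.5, 0.0, -2.0]) + rng.normal(size=50)

# Supervised: regression needs the outcome values y to estimate coefficients
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Unsupervised: PCA uses only X (here, the covariance of the centered data)
Xc = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Xc, rowvar=False))
```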

 

Big Picture

When is PCA helpful?

Because the primary objective of PCA is to reduce the number of variables required to describe a dataset, it is most useful when a dataset contains too many variables to explore or visualize easily.

Variables in a dataset may exhibit multicollinearity, meaning that there is substantial correlation between two or more variables, so that the values of one variable can largely be described by the values of others. However, many statistical models require that the variables be independent of each other (hence the common term "independent variables"). When this isn't the case (i.e. when variables exhibit multicollinearity), interpreting the results of various statistical models or analyses becomes difficult or even impossible. The principal components generated by PCA exhibit no collinearity. Another way to say this is that the principal components are perfectly orthogonal to each other (the correlation between any two principal components is zero).

When the principal components are used as input to multiple regression, PCA can help eliminate problems with overfitting (a problem that occurs when a model fits the sample data too closely and consequently performs poorly when predicting values from the larger population from which the data were sampled). Overfitting often happens because there are too many variables in the data compared to the number of observations; in these situations, noise (random error) in the data has too large an impact on the model. Because PCA can be used to reduce the number of variables, it can help overcome problems with overfitting.
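To see the mechanism, here is a toy numpy sketch (hypothetical data, not Prism's procedure): with nearly as many variables as observations, ordinary regression has enough freedom to fit the noise, while a regression on just a few PC scores does not:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 15                              # few observations, many variables
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(size=n)           # only one variable truly matters

def train_r2(design, y):
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1 - resid.var() / y.var()

# All 15 variables: the training R-squared is inflated by fitted noise
print(train_r2(X, y))

# First 3 principal components: far fewer parameters with which to chase noise
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
print(train_r2(Xc @ Vt[:3].T, y))
```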

Is PCA the same as variable selection?

No. In PCA, each principal component (PC) is a linear combination of every original variable, so information from all variables is used to define each PC. In contrast, variable selection involves eliminating entire variables from the dataset based on given criteria. Prism does not offer any form of automatic variable selection.
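The loadings make this concrete. In the following minimal numpy sketch (hypothetical data, not Prism output), each row of the loadings matrix defines one PC, and every original variable typically receives a nonzero weight:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# Each row of Vt is one PC expressed as weights on ALL four variables.
# Variable selection, by contrast, would drop whole columns of X.
print(np.round(Vt, 3))
```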

 

Analysis Choices

Why is the choice for PCR (Principal Component Regression) grayed out (not available)?

Performing PCR requires that a dependent variable be chosen, and this dependent variable must not also be included in the PCA. By default, Prism selects all (continuous) variables for inclusion in the PCA, so there are no variables left to serve as the dependent variable for PCR. As a result, PCR is grayed out. As soon as a variable is deselected from the list of variables to be included in the PCA, the choice for PCR becomes available.

Should I center my data? Should I scale my data?

When in doubt, standardize your data.

Centering data involves first determining the mean value for each variable, and then subtracting that mean from each value of the variable. In the resulting dataset, every variable has a mean of zero. Note that centering alone does not change the standard deviation of a variable.

Standardizing data involves first centering the variables (see above). Then, the standard deviation of each variable is determined, and every centered value is divided by the standard deviation of its variable. This results in a dataset in which every variable has a mean of zero and a standard deviation of 1 (and thus a variance of 1).
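For illustration, here is a minimal numpy sketch of both operations on made-up data (which standard deviation formula Prism uses internally is an assumption here; the sketch uses the sample SD):

```python
import numpy as np

X = np.array([[170.0, 65.0],
              [180.0, 80.0],
              [160.0, 55.0]])                    # e.g. height (cm), weight (kg)

centered = X - X.mean(axis=0)                    # every column now has mean 0
standardized = centered / X.std(axis=0, ddof=1)  # ... and sample SD 1

print(standardized.mean(axis=0))                 # approximately [0, 0]
print(standardized.std(axis=0, ddof=1))          # [1, 1]
```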

It is rare to run PCA on data that is neither centered nor standardized (although it is done in a small number of disciplines). Prism does not offer this option.

PCA works by analyzing the variance of a dataset, and variables with larger variances have a greater impact on the results. However, differences in variance may simply be due to differences in measurement scale (for example, length measurements in millimeters will have a greater variance than the same lengths in meters, due only to the units). In some cases it may be important to preserve the relationships among the variances in the dataset, but it is generally recommended to standardize the data (setting the variance of each variable equal to 1, see above).
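A quick numpy illustration of the scale effect (made-up lengths): converting the same measurements from meters to millimeters multiplies the variance by 1000² = 1,000,000, so the millimeter version would dominate an unstandardized PCA:

```python
import numpy as np

rng = np.random.default_rng(3)
length_m = rng.normal(1.5, 0.1, size=100)   # lengths in meters
length_mm = length_m * 1000                 # the same lengths in millimeters

print(length_m.var(ddof=1))                 # ~0.01
print(length_mm.var(ddof=1))                # ~10,000: scale alone did this
```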

How should I choose the number of PCs to retain?

We recommend using Parallel Analysis (PA) as a means to select the number of PCs to retain. Other methods based on eigenvalues (the Kaiser rule, etc.) or the proportion of explained variance were common historically. However, it is generally agreed that PA is the best empirical method for component selection.
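The idea behind parallel analysis can be sketched in a few lines of numpy (a simplified illustration, not Prism's exact procedure; the 95th-percentile threshold is one common choice): simulate many datasets of pure noise with the same shape as the real data, and retain only the components whose eigenvalues exceed what noise alone produces.

```python
import numpy as np

def parallel_analysis(X, n_sims=1000, seed=0):
    """Toy parallel analysis based on the correlation matrix."""
    rng = np.random.default_rng(seed)       # the random seed, for reproducibility
    n, p = X.shape
    observed = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]

    # Eigenvalues from Monte Carlo simulations of uncorrelated noise
    noise = np.empty((n_sims, p))
    for i in range(n_sims):
        R = np.corrcoef(rng.normal(size=(n, p)), rowvar=False)
        noise[i] = np.linalg.eigvalsh(R)[::-1]

    threshold = np.percentile(noise, 95, axis=0)
    return int(np.sum(observed > threshold))  # number of PCs to retain
```

Because the thresholds come from random simulations, rerunning with a different seed can occasionally change the answer, which is why the seed matters for reproducibility (see the next question).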

What is the random seed that Prism asks for and is shown in the tabular results?

Parallel analysis utilizes Monte Carlo simulations, and the random number generator needs a starting value: a seed. If you want to repeat an analysis exactly, you need to use the same seed each time. To make this possible, Prism displays the random seed used on the tabular results sheet whenever parallel analysis was selected, and you may also enter a seed value on the parameters dialog. Note that random seeds are only relevant for parallel analysis; none of the other methods used to select components uses a random seed.

 

Understanding PCA Results

What relationships can't PCA see in the data?

PCA reduces the dimensionality of a dataset by creating linear combinations of the original variables. PCA cannot identify nonlinear relationships between variables.
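A classic illustration (a minimal numpy sketch with synthetic data): points on a circle have a perfect nonlinear relationship between x and y, yet their linear correlation is essentially zero, so PCA finds no structure to exploit.

```python
import numpy as np

theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])  # points on a circle

# x and y are exactly related (x**2 + y**2 == 1), but linearly uncorrelated,
# so the correlation that PCA relies on is ~0
print(np.corrcoef(X, rowvar=False)[0, 1])
```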

What would happen if I take the PCs from PCA and use those as input to another PCA?

By definition, each PC is orthogonal to every other PC, meaning the correlation between any two is exactly zero. A second PCA therefore has no correlation structure left to exploit and would simply return the same components. In this situation, Prism does not create a table of PC scores or loadings because the analysis would be pointless.

What does the correlation matrix of Principal Components look like?

Each PC is orthogonal to every other PC, meaning the correlation between any two is exactly zero. The correlation matrix will show a value equal (or very close) to zero for all pairs of PCs (and 1.0 for the correlation of a PC with itself). You can test this yourself by performing the Correlation Matrix analysis on the PC Scores table in Prism. The values may not be exactly zero because of floating-point round-off in numerical calculations.
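The same check in a few lines of numpy (synthetic data for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))  # correlated variables

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                                       # PC scores

# Identity matrix, up to floating-point round-off
print(np.round(np.corrcoef(scores, rowvar=False), 10))
```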

When is the number of components different than the number of variables?

The tabular results sheet from PCA lists both the number of components generated from a dataset and the number of original variables contained in that same dataset. These values are almost always the same. (Note that the total number of components is generally larger than the number of selected components.) The only situations in which the total number of components is less than the number of variables are: i) if two (or more) variables are identical to each other, or ii) if one variable is a linear combination of one or more other variables.
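A minimal numpy sketch of case ii) with made-up data: when one column is an exact linear combination of the others, one eigenvalue of the covariance matrix is (numerically) zero, leaving fewer components than variables.

```python
import numpy as np

rng = np.random.default_rng(5)
a = rng.normal(size=100)
b = rng.normal(size=100)
X = np.column_stack([a, b, a + b])   # the third column is a linear combination

Xc = X - X.mean(axis=0)
eigenvalues = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
print(np.round(eigenvalues, 10))     # the last eigenvalue is effectively zero
```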

Why are some rows skipped?

PCA will only include rows that contain values for every variable (column) included in the analysis. A row is skipped when the value for any variable on that row is blank (missing) or excluded. The tabular results of PCA display how many rows were excluded.
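This row-wise ("listwise") exclusion can be mimicked in numpy (illustrative data):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],                 # this row would be skipped
              [4.0, 5.0]])

complete = X[~np.isnan(X).any(axis=1)]       # keep rows with no missing values
print(complete)                              # 2 of the 3 rows remain
```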

What should I do with my PCA results?

If you ran Principal Component Regression (PCR) as part of PCA in Prism, the PCR results are what you'll want to look at. If you didn't run PCR, you may want to select and copy - or export - the PC scores table for further analysis. Often, the goal of PCA is simply to look at some of the graphs Prism creates of the data projected onto the first few PCs. These visualizations can often provide useful information on trends (groups, clusters, etc.) within the observations.

 

Understanding PCR Results

Why do my PCR results have more coefficients than the number of principal components that were selected?

Principal Component Regression (PCR) is the process of performing multiple linear regression using a specified outcome (dependent) variable and the selected PCs from PCA as predictor variables. After performing the linear regression, the coefficients are converted back to the scale of the original variables (using the linear combination of the original variables that defines each PC). This is why the results report one coefficient per original variable rather than one per selected PC.
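A minimal numpy sketch of the back-transformation (synthetic data; not Prism's exact computation): regress y on k PC scores, then map the k coefficients through the loadings to obtain one coefficient per original variable.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 2.0]) + rng.normal(size=100)

# PCA on the centered predictors, keeping k = 2 components
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
scores = Xc @ Vt[:k].T

# Multiple regression of y on the k PC scores (plus an intercept)
A = np.column_stack([np.ones(len(y)), scores])
gamma, *_ = np.linalg.lstsq(A, y, rcond=None)

# Map the k slope coefficients back through the loadings: the result has
# one coefficient per ORIGINAL (centered) variable -- five here, not two
beta_original = Vt[:k].T @ gamma[1:]
print(beta_original)
```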

Why does the ANOVA table in the PCR results show so few degrees of freedom for the regression?

The number of df for the regression equals the number of components selected by the PCA to be independent variables in the regression. So the number of coefficients almost always exceeds the number of regression degrees of freedom. That is, essentially, the whole point of PCR!

How should I interpret my PCR results?

Principal Component Regression (PCR) is multiple regression that uses the principal components (PCs) created by PCA as independent variables, along with another variable that you select (not part of the PCA) as the dependent variable. The structure of the results for PCR is identical to that of the results generated by multiple linear regression. Review the analysis checklist for multiple linear regression for more information.

 
