Principal component regression (PCR) is a combination of multiple linear regression and principal component analysis.

One of the primary goals of Principal Component Analysis is to reduce the number of predictors for a future analysis, which also reduces the number of degrees of freedom in the model. For example, you might start with 50 columns of X variables, and with PCA, perhaps it’s possible to explain a large (70-95%) amount of the variation within the X variables by using only a few (3, 5, 10, ...) principal components. Then, to model your outcome variable, Y, you could use only the selected principal components as predictors instead of the original 50 columns of data.
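As a rough illustration of that idea, the following Python sketch (NumPy is assumed here; it is not part of Prism) computes the share of variance explained by each PC for a simulated 50-column data set, then counts how many PCs are needed to reach 80%. The data, the 80% threshold, and the function name `explained_variance_ratio` are hypothetical, and standardizing the columns first is an assumption.

```python
import numpy as np

def explained_variance_ratio(X):
    """Return the fraction of total variance explained by each PC."""
    X = np.asarray(X, dtype=float)
    # Assumption: columns are standardized before PCA
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Singular values relate to PC variances: var_k = s_k**2 / (n - 1)
    s = np.linalg.svd(Z, compute_uv=False)
    var = s**2 / (Z.shape[0] - 1)
    return var / var.sum()

rng = np.random.default_rng(0)
# Hypothetical data: 200 rows, 50 columns driven by 3 latent factors plus noise
latent = rng.normal(size=(200, 3))
loadings = rng.normal(size=(3, 50))
X = latent @ loadings + 0.5 * rng.normal(size=(200, 50))

ratio = explained_variance_ratio(X)
cumulative = np.cumsum(ratio)
# Smallest number of PCs whose cumulative share reaches 80%
n_components = int(np.searchsorted(cumulative, 0.80) + 1)
print(n_components, cumulative[:5].round(3))
```

A cumulative-variance threshold is only one possible way to choose the number of components; it is used here because it matches the percentages quoted above.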

When you opt to run PCR from the PCA dialog, you must choose which variable is the outcome (dependent) variable. It cannot be one of the input variables. Prism then performs the following steps:

1. Perform PCA on all of the selected variables

2. Extract the appropriate number of principal components using the selected method

3. Perform multiple linear regression, using the selected principal component scores as predictors along with an intercept

4. Convert the parameter coefficient estimates, which were calculated using the PC scores, back to the scale of the original variables (using the linear combinations of variables defined for each PC)
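The four steps above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not Prism's implementation: the function name `pcr_fit` is hypothetical, the predictors are standardized before PCA, and the number of components is passed in directly rather than chosen by a selection method.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Fit PCR; return (intercept, coefficients) on the original X scale."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mean, sd = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - mean) / sd                      # standardize (assumption)

    # Step 1: PCA via SVD; rows of Vt are the PC weight vectors
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    # Step 2: keep the first n_components PCs
    V = Vt[:n_components].T                  # shape (p, k)
    scores = Z @ V                           # shape (n, k)

    # Step 3: least squares of y on the PC scores plus an intercept
    A = np.column_stack([np.ones(len(y)), scores])
    gamma, *_ = np.linalg.lstsq(A, y, rcond=None)

    # Step 4: map the score coefficients back to the original variables
    beta_std = V @ gamma[1:]                 # coefficients for standardized X
    beta = beta_std / sd                     # coefficients for raw X
    intercept = gamma[0] - mean @ beta
    return intercept, beta

# Hypothetical data: 4 predictors on very different scales
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4)) * np.array([1.0, 5.0, 0.2, 10.0]) \
    + np.array([0.0, 3.0, -2.0, 7.0])
y = 2.0 + X @ np.array([1.0, -0.5, 3.0, 0.1]) + 0.1 * rng.normal(size=60)

intercept, beta = pcr_fit(X, y, n_components=2)
print(intercept, beta)   # one coefficient per original variable
```

One way to check such a sketch: if every component is retained (`n_components` equal to the number of predictors), the back-transformed coefficients should match those of ordinary multiple linear regression on the original variables.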

If you want more flexibility with your regression model (such as fitting a logistic or Poisson model), you can run the regression yourself: copy the PC scores table and paste it into a new multiple variables data table along with the dependent (outcome) variable of interest. Note that in this case, any resulting slope coefficients will be in terms of the principal components as opposed to the original variables.
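Outside of Prism, the same idea can be sketched in Python with NumPy: compute the PC scores, then fit a different model type on them. The example below is an assumption-laden illustration rather than a reference implementation. The helper names `pc_scores` and `fit_logistic` are hypothetical, the data are simulated, and the logistic fit uses a plain Newton-Raphson loop with a fixed iteration count.

```python
import numpy as np

def pc_scores(X, n_components):
    """Scores on the first n_components PCs of standardized X."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T

def fit_logistic(T, y, n_iter=25):
    """Newton-Raphson logistic regression; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(y)), T])
    coef = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(A @ coef)))    # fitted probabilities
        W = p * (1.0 - p)                        # Newton weights
        grad = A.T @ (y - p)                     # score vector
        hess = A.T @ (A * W[:, None])            # observed information
        coef = coef + np.linalg.solve(hess, grad)
    return coef

# Hypothetical data: 6 predictors driven by 2 latent factors, binary outcome
rng = np.random.default_rng(2)
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(300, 6))
prob = 1.0 / (1.0 + np.exp(-0.8 * latent[:, 0]))
y = (rng.random(300) < prob).astype(float)

T = pc_scores(X, n_components=2)
coef = fit_logistic(T, y)
print(coef)   # intercept, then one slope per PC (not per original variable)
```

The fitted slopes describe the change in log-odds per unit of each PC score, so they are not directly interpretable in terms of the original variables.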

Is PCR the same as variable selection?

No, PCR is a different concept than variable selection. Variable selection is the process of deciding which variables to include in, and which to exclude from, your regression model. If you then want to use the model to predict future values of the outcome variable, you only need to measure the variables that were retained in your model.

In contrast, with PCR, all of the original variables are used to calculate each principal component (with variable weighting). This means that in order to use the model generated by PCR to predict future values of the outcome variable, you would still need to obtain measurements for all of the original variables.

So you might have only one predictor (one PC) in your model, but that predictor is calculated from all of your original variables. This is why we say that PCA reduces the dimension of your data, not the number of variables you need to measure. If you really want to understand the nuts and bolts of this, you’ll need to familiarize yourself with linear algebra and the singular value decomposition.
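A small Python sketch (NumPy assumed, simulated data, hypothetical helper name `pc1_score`) makes this concrete: the first PC carries a weight for every original variable, so scoring a new observation on that single PC requires a measurement for each one.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # hypothetical data: 5 original variables
mean, sd = X.mean(axis=0), X.std(axis=0, ddof=1)
Z = (X - mean) / sd                     # standardize (assumption)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
w = Vt[0]                               # PC1 weights, one per original variable

def pc1_score(x_new):
    """Score of a new observation on PC1; needs every original variable."""
    x_new = np.asarray(x_new, dtype=float)
    return float(((x_new - mean) / sd) @ w)

print(np.round(w, 3))                   # in practice, no weight is exactly zero
print(pc1_score(X[0]))
```

With variable selection, by contrast, the excluded variables would simply drop out of the prediction equation.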

Is PCR "cheating" like variable selection and p-hacking?

No.

© 1995-2019 GraphPad Software, LLC. All rights reserved.