 Preparing data for analysis

The scale of data for PCA is extremely important. The first step in performing PCA is to ensure that the variables being analyzed are all on similar measurement scales. This is almost always accomplished by standardizing the data. The math of standardization is simple:

$$x_{std} = \frac{x_i - \bar{x}}{s_x}$$

Where $x_{std}$ is the standardized value, $x_i$ is the original value, $\bar{x}$ is the variable’s mean, and $s_x$ is the variable’s standard deviation. In effect, this transforms the data so that each variable has a mean of zero and a standard deviation of 1. Consequently, each variable also has a variance of 1, since variance is simply the square of the standard deviation:

$$\mathrm{var}_x = s_x^2$$
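As a minimal sketch of the standardization formula, the following NumPy snippet applies it column by column. The `standardize` helper and the batch values are hypothetical illustrations, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not say which estimator is intended.

```python
import numpy as np

def standardize(data):
    """Subtract each column's mean and divide by its standard deviation."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    std = data.std(axis=0, ddof=1)
    return (data - mean) / std

# Hypothetical batches: columns are grain mass (g) and fermentation time (hours).
batches = np.array([
    [5000.0, 120.0],
    [5003.0, 240.0],
    [4998.0, 336.0],
    [5001.0, 500.0],
])

z = standardize(batches)
col_means = z.mean(axis=0)        # approximately zero for every column
col_vars = z.var(axis=0, ddof=1)  # approximately one for every column
print(col_means, col_vars)
```

After the transformation, both columns have a mean of zero and a variance of one (up to floating-point rounding), regardless of their original units.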

This step is important for interpreting the results of PCA correctly, because PCA is very sensitive to the variances of the original variables. Specifically, PCA decides which variables are the “most important” for reducing the dimension of the dataset by determining which variables present the largest variance (more on this in the next section). If the original variables have variances that are quite different from each other, the analysis will end up favoring variables with larger variances and ignoring those with smaller variances.
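This sensitivity can be illustrated with a small simulation. The sketch below uses an eigendecomposition of the covariance matrix as a stand-in for a full PCA routine. The simulated data, the variable names, and the `first_component_loadings` helper are all hypothetical, chosen only to show the effect.

```python
import numpy as np

def first_component_loadings(data):
    """Return the covariance-matrix eigenvector with the largest eigenvalue."""
    cov = np.cov(data, rowvar=False)
    # eigh returns eigenvalues in ascending order, so the last column
    # of the eigenvector matrix belongs to the largest eigenvalue.
    _, eigenvectors = np.linalg.eigh(cov)
    return eigenvectors[:, -1]

# Simulated (hypothetical) data: two mildly correlated variables
# measured on very different scales.
rng = np.random.default_rng(0)
n = 200
shared = rng.normal(size=n)
time_hours = 300 + 100 * (0.5 * shared + rng.normal(size=n))
volume_l = 20 + 0.05 * (0.5 * shared + rng.normal(size=n))

raw = np.column_stack([time_hours, volume_l])
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

raw_loadings = first_component_loadings(raw)
std_loadings = first_component_loadings(standardized)
print("unscaled:    ", np.abs(raw_loadings))
print("standardized:", np.abs(std_loadings))
```

On the unscaled data, the first component places nearly all of its weight on the time variable, simply because its variance is several orders of magnitude larger. After standardization, the two variables contribute with equal magnitude.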

This may not seem like a bad thing at first, but often these differences in variance between variables aren’t due to the data itself, but to the scale on which it was measured. As an example, consider some variables that could be involved in making beer. These may include the mass (in g) of grain used, the temperature (in °C) at which the beer is brewed, the volume (in L) of water used, or the length of time (in hours, days, or maybe weeks) that you allow for fermentation. Each of these variables is measured on a very different scale, and the amount of variance that you would expect across multiple batches will probably be very different from variable to variable. For example, you may anticipate a difference of 2-5 °C or a few grams of grain from one run to another, but the volume may differ by as little as 0.05 L. At the other extreme, the time that you take to ferment the beer may range from about 5 days (120 hours) to three or more weeks (500+ hours).

The variance of each of those variables will be quite different simply because they are measured on different scales. The example of time is even more revealing, since the variance will be greater if time is measured in hours (values between 120 and 500) than if it is measured in days (values between 5 and 21). Standardization solves this issue by setting the variance of each variable to 1.
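The effect of the unit choice can be checked directly. In this sketch the fermentation times are made-up values within the range mentioned above, and the sample variance (`ddof=1`) is an assumption. Converting days to hours multiplies every value by 24, so the variance grows by a factor of 24 squared, while the standardized values are identical in either unit.

```python
import numpy as np

# Hypothetical fermentation times for five batches, in days.
fermentation_days = np.array([5.0, 7.0, 10.0, 14.0, 21.0])
fermentation_hours = fermentation_days * 24

var_days = fermentation_days.var(ddof=1)
var_hours = fermentation_hours.var(ddof=1)
ratio = var_hours / var_days  # equals 24**2, i.e. 576

def standardize(x):
    """Standardize a 1-D array using the sample standard deviation."""
    return (x - x.mean()) / x.std(ddof=1)

z_days = standardize(fermentation_days)
z_hours = standardize(fermentation_hours)
print(ratio, np.allclose(z_days, z_hours))
```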

If the variables being used with PCA were all measured on the same scale and already have similar variances, then standardizing the data may not be necessary. Instead, the data are prepared by only subtracting the mean from each variable, so that the transformed variables all have a mean of zero. This is called centering. It is less common, and is recommended only if you are sure that the scales of measurement for the variables are comparable.
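For comparison, centering can be sketched in the same way. The `center` helper and the temperature readings below are hypothetical. Unlike standardization, centering leaves each variable's variance unchanged.

```python
import numpy as np

def center(data):
    """Subtract each column's mean, leaving its spread untouched."""
    data = np.asarray(data, dtype=float)
    return data - data.mean(axis=0)

# Hypothetical readings: three temperatures (°C), all on the same scale,
# recorded for four batches.
temps = np.array([
    [18.0, 19.5, 20.0],
    [17.5, 20.0, 21.0],
    [19.0, 18.5, 19.5],
    [18.5, 19.0, 20.5],
])

centered = center(temps)
centered_means = centered.mean(axis=0)         # approximately zero
original_vars = temps.var(axis=0, ddof=1)
centered_vars = centered.var(axis=0, ddof=1)   # same as original_vars
print(centered_means, centered_vars)
```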