GraphPad Curve Fitting Guide

Multicollinearity

What is multicollinearity?

It is important to understand the concept of multicollinearity, as it can interfere with proper interpretation of multiple regression results.

To understand multicollinearity, first consider an absurd example. Imagine that you are running multiple regression to predict blood pressure from age and weight. Now imagine that you've entered weight-in-pounds and weight-in-kilograms as two separate X variables. The two X variables measure exactly the same thing; the only difference is their units. The P value for the overall fit is likely to be low, telling you that blood pressure is linearly related to age and weight.

Then you'd look at the individual P values. The P value for weight-in-pounds would be very high: since the equation has already taken into account the effect of weight-in-kilograms on blood pressure, adding weight-in-pounds contributes no new information. But the P value for weight-in-kilograms would also be high for the same reason. Once weight-in-pounds is in the model, including weight-in-kilograms does not improve the goodness-of-fit. When you see these results, you might mistakenly conclude that weight does not influence blood pressure at all, since both weight variables have very high P values. The problem is that the P values only assess the incremental effect of each variable. In this example, neither variable has any incremental effect on the model. The two variables are collinear.

That example is a bit absurd, since the two variables are identical except for units. A more typical example is modeling blood pressure as a function of age, weight, and gender. It is hard to separate the effects of age and weight if the older subjects tend to weigh more than the younger subjects. It is hard to separate the effects of weight and gender if the men weigh more than the women. Since the X variables are intertwined, multicollinearity will make it difficult to interpret the multiple regression results.

Quantifying multicollinearity

Multicollinearity is an intrinsic problem of multiple regression, and it can frustrate your ability to make sense of the data. All Prism can do is warn you about the problem. It does this by asking how well each independent (X) variable can be predicted from the other X variables (ignoring the Y variable), expressing the results in two ways:

R² with other X variables.  The fraction of the variance in one X variable that can be predicted from the other X variables. The Y variable is not part of the calculation.

Variance Inflation Factor (VIF).  If the X variables contain no redundant information, you expect VIF to equal one. If the X variables are collinear (contain redundant information), then VIF will be greater than one. VIF is related to R² by this equation: VIF = 1/(1 - R²).

Some programs also compute Tolerance, but Prism does not. You can easily calculate it yourself for each variable as 1.0 - R².
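The calculation described above can be sketched in a few lines of Python with NumPy. This is an illustration only, not Prism's implementation: the function name collinearity_diagnostics and the simulated age, weight, and sodium data are assumptions made for the example. Each X variable is regressed on the other X variables (plus an intercept), and R², VIF, and Tolerance follow from that fit.

```python
import numpy as np


def collinearity_diagnostics(X):
    """Regress each column of X on the remaining columns (plus an intercept).

    Returns three arrays with one entry per column: R² with the other
    X variables, VIF = 1/(1 - R²), and Tolerance = 1 - R².
    A perfectly collinear column gives R² = 1 and an infinite VIF.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    r2 = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2[j] = 1.0 - ss_res / ss_tot
    tolerance = 1.0 - r2
    vif = 1.0 / tolerance
    return r2, vif, tolerance


# Simulated data (hypothetical): weight rises with age, sodium is unrelated.
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(20, 70, n)
weight = 50 + 0.5 * age + rng.normal(scale=5, size=n)
sodium = rng.normal(size=n)

r2, vif, tolerance = collinearity_diagnostics(
    np.column_stack([age, weight, sodium])
)
for name, r, v, t in zip(["age", "weight", "sodium"], r2, vif, tolerance):
    print(f"{name:>6}: R2 = {r:.3f}  VIF = {v:.2f}  Tolerance = {t:.3f}")
```

In this simulated data, age and weight should show a noticeably higher R² (and VIF) than the unrelated sodium variable. Note that the Y variable never enters the function.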

When multicollinearity is high

If R² and VIF are high for some X variables, then multicollinearity is a problem in your data. How high is high? Any threshold is arbitrary, but here is one rule of thumb. If any of the R² values are greater than 0.75 (so VIF is greater than 4.0), suspect that multicollinearity might be a problem. If any of the R² values are greater than 0.90 (so VIF is greater than 10), then conclude that multicollinearity is a serious problem.
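As a small sketch of how this rule of thumb could be applied to each X variable, the following Python function converts an R² value to a VIF and labels it. The function name and the labels "ok", "suspect", and "serious" are made up for this illustration, and the thresholds are the arbitrary ones given above.

```python
def classify_multicollinearity(r2):
    """Return (VIF, label) for one X variable's R² with the other X variables.

    Thresholds follow the rule of thumb in the text:
    R² > 0.75 (VIF > 4) is suspect, R² > 0.90 (VIF > 10) is serious.
    """
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must be in the range [0, 1)")
    vif = 1.0 / (1.0 - r2)
    if r2 > 0.90:
        return vif, "serious"
    if r2 > 0.75:
        return vif, "suspect"
    return vif, "ok"


for value in (0.50, 0.80, 0.95):
    vif, label = classify_multicollinearity(value)
    print(f"R2 = {value:.2f}  VIF = {vif:.1f}  {label}")
```

Because the thresholds are strict inequalities, an R² of exactly 0.75 (VIF of exactly 4) is not flagged in this sketch.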

If multicollinearity is a big problem, the results of multiple regression are unlikely to be helpful. Possible methods to resolve the issue:

1. Make sure you didn't include redundant information. Say your study includes both men and women so you have one independent variable "Woman" that is 1 for females and 0 for males, and another variable "Man" that is 0 for females and 1 for males. You've introduced collinearity because the two variables encode the same information. Only one variable is needed.

2. Combine variables. An example of correlated variables would be including both weight and height in a model, as people who are taller also tend to weigh more. One way around this issue would be to compute the body mass index (BMI) from height and weight, and only include that single variable in the model, rather than including both height and weight.

3. In some cases, removing one or more variables from the model will reduce multicollinearity to an acceptable level.

4. In other cases, you may be able to reduce multicollinearity by collecting data over a wider range of experimental conditions.
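The redundancy in the first item can be checked numerically. The sketch below uses a small made-up sample and assumes the model includes an intercept. Since "Woman" plus "Man" equals 1 for every subject, the two columns together duplicate the intercept, and the design matrix has fewer independent columns than it appears to.

```python
import numpy as np

# Made-up sample of eight subjects.
woman = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)
man = 1.0 - woman  # exactly the same information, coded the other way round
intercept = np.ones_like(woman)

with_both = np.column_stack([intercept, woman, man])
with_one = np.column_stack([intercept, woman])

# matrix_rank counts the linearly independent columns.
rank_both = np.linalg.matrix_rank(with_both)
rank_one = np.linalg.matrix_rank(with_one)

print(f"intercept + Woman + Man: {with_both.shape[1]} columns, rank {rank_both}")
print(f"intercept + Woman:       {with_one.shape[1]} columns, rank {rank_one}")
```

Dropping either "Woman" or "Man" leaves every remaining column independent, which is why only one of the two is needed.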

This is a difficult problem, and you may need to seek statistical guidance elsewhere.

Notes

Don't confuse the individual R² values for each X variable with the overall R². The individual R² values quantify how well each X variable can be predicted from the other X variables. The overall R² quantifies goodness-of-fit of the entire multiple regression model. Generally you want the overall R² to be high (good fit) and all the individual R² values to be low (little multicollinearity).

Multicollinearity increases the width of the confidence interval (which is proportional to the square root of variance) by a factor equal to the square root of VIF. If a variable has a VIF of 9, the confidence interval of that coefficient is three times wider than it would be were it not for multicollinearity.  
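The square-root relationship can be verified numerically. The sketch below uses simulated data with two deliberately correlated X variables (all names and values are assumptions for the illustration). It compares the variance factor of one coefficient, taken from the diagonal of (X'X)^-1, with the factor it would have if that variable were uncorrelated with the other. The comparison holds the residual variance and degrees of freedom fixed, so it isolates the effect of collinearity alone.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.5, size=n)  # deliberately correlated with x1

# Design matrix with an intercept; the diagonal of (X'X)^-1 scales the
# sampling variance of each fitted coefficient.
X = np.column_stack([np.ones(n), x1, x2])
xtx_inv = np.linalg.inv(X.T @ X)
var_factor_actual = xtx_inv[1, 1]

# Variance factor x1 would have if it were uncorrelated with x2.
var_factor_baseline = 1.0 / np.sum((x1 - x1.mean()) ** 2)

# With only two X variables, R² of x1 from x2 is their squared correlation.
r2 = np.corrcoef(x1, x2)[0, 1] ** 2
vif = 1.0 / (1.0 - r2)

width_ratio = np.sqrt(var_factor_actual / var_factor_baseline)
print(f"VIF = {vif:.3f}")
print(f"sqrt(VIF) = {np.sqrt(vif):.3f}")
print(f"CI width ratio = {width_ratio:.3f}")
```

The last two printed values should agree, showing that the confidence interval for the coefficient is widened by the square root of VIF.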

When you have only two independent variables, the problem is called collinearity. With three or more, the term multicollinearity is used.