## Multicollinearity in logistic regression

Strongly correlated predictors, or more generally linearly dependent predictors, make the coefficient estimates unstable. What is meant by “linearly dependent predictors”? It simply means that one variable can be written as a linear function of one or more of the others. For example, variables X1 and X2 are linearly dependent if X2 = 3*X1 + 6. This is a very simple example of linear dependence, but it shows that once you know the value of X1, you automatically know the value of X2. Thus, as a predictor, X2 adds no new information to the model if X1 is already included.
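As a small illustration outside of Prism, the dependence X2 = 3*X1 + 6 can be checked numerically. The sketch below uses NumPy (an assumption, since Prism itself requires no code) and made-up values for X1. It builds a design matrix with an intercept column and both predictors, then shows that the matrix has fewer independent columns than it has columns.

```python
import numpy as np

# Made-up values for X1; X2 is an exact linear function of X1.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 3 * x1 + 6

# Design matrix: intercept column, X1, X2.
X = np.column_stack([np.ones_like(x1), x1, x2])

n_columns = X.shape[1]
rank = np.linalg.matrix_rank(X)

# The rank is smaller than the number of columns because
# X2 = 3*X1 + 6*(intercept column), so X2 carries no new information.
print(n_columns, rank)
```

A rank lower than the number of columns is the numerical signature of exact linear dependence among the predictors.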

In predictive modeling, this problem is called multicollinearity. In the extreme case, if two X columns in the model are exactly equal, the optimization algorithm can't determine the coefficient estimates for either column, because infinitely many sets of coefficients fit the data equally well. To see this more clearly, consider a simple case where the estimated logistic regression model is logit(Y) = 1 + 2*X1. Now, say that we create X2, which is a duplicate of X1, and attempt to refit the model with both predictors. The fitted model could be represented in a number of equivalent ways, such as:

logit(Y) = 1 + X1 + X2

logit(Y) = 1 + 2*X1

logit(Y) = 1 + 0.5*X1 + 1.5*X2

In fact, there are an infinite number of ways this equation could be rewritten with different coefficients: any pair of coefficients for X1 and X2 that sums to 2 gives exactly the same predictions. In statistics, such a model is said to be non-identifiable. In this extreme situation, standard errors, confidence intervals and P values can't be calculated.
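The equivalence of these coefficient sets can be verified directly. The following sketch (Python with NumPy, using made-up X1 values and a hypothetical helper `predicted_probability`) evaluates the three models listed above on a duplicated predictor and confirms that they give the same predicted probabilities.

```python
import numpy as np

# Made-up predictor values; X2 is an exact duplicate of X1.
x1 = np.linspace(-2.0, 2.0, 9)
x2 = x1.copy()
X = np.column_stack([np.ones_like(x1), x1, x2])

# Coefficients (intercept, X1, X2) for the three equivalent models.
coefficient_sets = [
    np.array([1.0, 1.0, 1.0]),   # logit(Y) = 1 + X1 + X2
    np.array([1.0, 2.0, 0.0]),   # logit(Y) = 1 + 2*X1
    np.array([1.0, 0.5, 1.5]),   # logit(Y) = 1 + 0.5*X1 + 1.5*X2
]

def predicted_probability(X, beta):
    """Inverse logit of the linear predictor X @ beta."""
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

probs = [predicted_probability(X, b) for b in coefficient_sets]
all_equal = all(np.allclose(probs[0], p) for p in probs[1:])
print(all_equal)  # True
```

Because the data cannot distinguish between these coefficient sets, no fitting procedure can prefer one over another.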

More common in practice is to have predictor columns that are strongly, but not perfectly, correlated. Although Prism will produce estimates in this case, a similar problem occurs. The multicollinearity increases the uncertainty in the parameter estimates, which widens the confidence intervals and increases the P values.
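To get a feel for how correlation inflates uncertainty, the sketch below compares large-sample standard errors for two simulated data sets, one with uncorrelated predictors and one with a correlation of 0.95. It is an illustration in Python with NumPy, not Prism's procedure: the function name `asymptotic_se`, the sample size and the coefficient values are all assumptions, and the standard errors come from the inverse of the logistic regression information matrix X'WX evaluated at the assumed coefficients rather than at fitted ones.

```python
import numpy as np

def asymptotic_se(rho, n=500, beta=(1.0, 1.0, 1.0), seed=0):
    """Large-sample standard errors for a logistic model whose two
    predictors have correlation rho (hypothetical simulation setup)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    predictors = rng.multivariate_normal(np.zeros(2), cov, size=n)
    X = np.column_stack([np.ones(n), predictors])

    # Probabilities implied by the assumed coefficients.
    p = 1.0 / (1.0 + np.exp(-(X @ np.asarray(beta))))

    # Information matrix X' W X with weights W = p * (1 - p).
    information = X.T @ (X * (p * (1.0 - p))[:, None])
    return np.sqrt(np.diag(np.linalg.inv(information)))

se_uncorrelated = asymptotic_se(rho=0.0)
se_correlated = asymptotic_se(rho=0.95)

print("SE (intercept, X1, X2), rho = 0.00:", se_uncorrelated)
print("SE (intercept, X1, X2), rho = 0.95:", se_correlated)
```

The standard errors for the X1 and X2 coefficients are noticeably larger in the correlated case, even though the sample size and the assumed coefficients are unchanged.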

If your only concern is prediction, then large standard errors are not actually a problem. However, if you are interested in interpreting the sign or magnitude of the coefficient estimates (e.g. the larger X1 is, the higher the probability of a success), then multicollinearity is a problem.

In Prism, you can evaluate multicollinearity using variance inflation factors (VIFs). The general rule of thumb is that VIFs greater than 10 indicate strong multicollinearity. In that event, you probably want to remove one of the columns with a high VIF, refit the model and repeat as necessary. VIFs are described in more detail elsewhere in this guide.
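For readers who want to see what a VIF measures, here is a minimal sketch using the usual textbook definition: regress each predictor on all the others and compute 1 / (1 - R²). Whether Prism computes VIFs in exactly this way for logistic regression is not stated here, so treat this as an assumption. The function name `variance_inflation_factors` and the simulated data are hypothetical.

```python
import numpy as np

def variance_inflation_factors(predictors):
    """Textbook VIF for each column: 1 / (1 - R^2), where R^2 comes from
    a least-squares regression of that column on the remaining columns."""
    predictors = np.asarray(predictors, dtype=float)
    n, k = predictors.shape
    vifs = np.empty(k)
    for j in range(k):
        target = predictors[:, j]
        others = np.delete(predictors, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # include intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Simulated example: x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs)
```

In this simulated data the first two predictors exceed the rule-of-thumb threshold of 10, while the unrelated third predictor has a VIF close to 1.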

You can also choose to have Prism output a correlation matrix. This presents the pairwise correlations between the predictors in matrix form. Predictors that are highly correlated with other predictors in the model will cause problems with estimating standard errors, confidence intervals and P values.
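A comparable correlation matrix can be produced outside of Prism. The sketch below uses NumPy's `corrcoef` on simulated predictors and lists the pairs whose absolute correlation exceeds a cutoff. The cutoff of 0.9 is an arbitrary value chosen for illustration, not a Prism setting.

```python
import numpy as np

# Simulated predictors: x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)

# rowvar=False treats each column as one variable.
corr = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)
print(np.round(corr, 2))

# Flag strongly correlated pairs (0.9 is an arbitrary illustrative cutoff).
threshold = 0.9
k = corr.shape[0]
flagged_pairs = [
    (i, j, corr[i, j])
    for i in range(k)
    for j in range(i + 1, k)
    if abs(corr[i, j]) > threshold
]
print(flagged_pairs)
```

Only the pair formed by the first two predictors is flagged in this simulated data.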