Analysis checklist: Multiple logistic regression

To check that multiple logistic regression is an appropriate analysis for these data, ask yourself these questions.

Is the outcome (Y) variable binary (dichotomous)?

The dependent (Y) variable may take on only two values, and in Prism these must be coded as 0 and 1.
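If your data are prepared outside Prism, a quick check of the coding can catch problems before you enter them. The following is a minimal Python sketch, not a Prism feature; the function name and the example values are made up for illustration.

```python
def check_binary_outcome(y):
    """Return the values in y that are not coded as 0 or 1."""
    # Anything other than 0 or 1 (including missing values such as None) is flagged.
    return [value for value in y if value not in (0, 1)]

outcome = [0, 1, 1, 0, 2, 1]   # made-up data; the 2 is a coding error
invalid = check_binary_outcome(outcome)
print(invalid)                 # [2]
```

An empty list means every row is coded as 0 or 1.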

Are the rows of Y independent observations?

One of the fundamental assumptions of logistic regression is that each row of data is a unique, independent observation. An example of independent observations is a study of 100 randomly selected people where a 1 indicates a positive outcome and a 0 a negative outcome, and each person is recorded on a single row. If each person were measured more than once (say, at various time points in the study), then the observations are not independent and logistic regression isn't appropriate. Similarly, if the study was of 50 married couples, it is not appropriate to treat the data as 100 independent observations, because the two members of a couple are likely to have related outcomes.

Does the model fit and predict the data well?

All models are wrong, but some are useful…

Prism offers a variety of metrics to evaluate how well the specified model fits the entered data. However, keep in mind that fitting models to data and interpreting model fits is, to some extent, subjective. Some questions to consider when evaluating a given model include:

Does the logistic model classify data well? In other words, given an appropriate cutoff value (such as 0.5), does the model correctly predict the observed 0s and 1s? You can evaluate this in Prism using the Predicted vs. Observed graph, a classification table, Tjur's R-squared, an ROC plot, and the row classification table.
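To make two of these metrics concrete, here is a minimal Python sketch of a classification table at a chosen cutoff and of Tjur's R-squared (the mean predicted probability among the observed 1s minus the mean among the observed 0s). The function names and the data are hypothetical, and classifying a probability exactly equal to the cutoff as a 1 is an assumption of this sketch, not a statement about how Prism handles ties.

```python
def classification_table(y, p, cutoff=0.5):
    """Count correct and incorrect classifications at the given cutoff."""
    table = {"true_pos": 0, "false_neg": 0, "true_neg": 0, "false_pos": 0}
    for observed, prob in zip(y, p):
        predicted = 1 if prob >= cutoff else 0   # assumption: ties count as 1
        if observed == 1:
            table["true_pos" if predicted == 1 else "false_neg"] += 1
        else:
            table["true_neg" if predicted == 0 else "false_pos"] += 1
    return table

def tjur_r2(y, p):
    """Mean predicted probability for observed 1s minus that for observed 0s."""
    ones = [prob for obs, prob in zip(y, p) if obs == 1]
    zeros = [prob for obs, prob in zip(y, p) if obs == 0]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)

y = [1, 1, 1, 0, 0, 0]                  # observed outcomes (made up)
p = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]      # predicted probabilities (made up)

table = classification_table(y, p)
r2 = tjur_r2(y, p)
print(table)
print(round(r2, 3))
```

With these made-up values, four of the six rows are classified correctly at a cutoff of 0.5. A Tjur's R-squared near 1 means the predicted probabilities separate the two groups well; a value near 0 means they barely differ.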

Does the logistic model outperform an intercept-only model? You can test this with the likelihood ratio hypothesis test. You may also want to run the Hosmer-Lemeshow test.
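The idea behind the likelihood ratio test can be sketched in a few lines of Python. The statistic is twice the difference in log-likelihood between the fitted model and the intercept-only model, compared against a chi-square distribution whose degrees of freedom equal the number of parameters beyond the intercept. This is an illustration under those assumptions, not a description of Prism's internal calculation; the function names and fitted probabilities are hypothetical.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 outcomes y given probabilities p (0 < p < 1)."""
    return sum(math.log(prob if obs == 1 else 1.0 - prob) for obs, prob in zip(y, p))

def chi2_sf(x, df):
    """Upper-tail chi-square probability, using the series expansion of the
    regularized lower incomplete gamma function."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def likelihood_ratio_test(ll_model, ll_null, df):
    """Return the likelihood ratio statistic and its P value."""
    g2 = 2.0 * (ll_model - ll_null)
    return g2, chi2_sf(g2, df)

y = [1, 1, 1, 0, 0, 0]                      # observed outcomes (made up)
p_model = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]    # hypothetical fitted probabilities
# The intercept-only model predicts the overall proportion of 1s for every row.
p_null = [sum(y) / len(y)] * len(y)

g2, p_value = likelihood_ratio_test(
    log_likelihood(y, p_model), log_likelihood(y, p_null), df=1
)
print(round(g2, 3), round(p_value, 3))
```

A small P value suggests that the X variables, taken together, improve the fit over the intercept-only model. For the test to be valid, the probabilities must come from a maximum likelihood fit rather than being invented as they are here.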

Are the X variables linearly dependent?

If the X variables have high multicollinearity (that is, one X variable can be closely predicted from the others), the standard errors are inflated and the P values for the individual variables become very difficult to interpret. Read more about multicollinearity.
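One common way to quantify multicollinearity is the variance inflation factor (VIF), defined as 1/(1 - R²), where R² comes from regressing one X variable on all the others. With exactly two X variables, that R² is simply the squared Pearson correlation between them, which is the special case sketched below in Python. The function names and data are hypothetical, and the general case with more than two X variables requires a full regression that this sketch does not implement.

```python
import math

def pearson_r(x1, x2):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    sxy = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    sxx = sum((a - m1) ** 2 for a in x1)
    syy = sum((b - m2) ** 2 for b in x2)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two X variables: 1 / (1 - r^2).
    Raises ZeroDivisionError if the two variables are perfectly correlated."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

x1 = [1, 2, 3, 4, 5]   # made-up predictor values
x2 = [2, 1, 4, 3, 5]   # made-up predictor values

r = pearson_r(x1, x2)
vif = vif_two_predictors(x1, x2)
print(round(r, 3), round(vif, 3))
```

A VIF of 1 means the variable is uncorrelated with the other; larger values mean that more of its variation is shared with the other X variable.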

Do you have sufficient data to trust your results?

As with all statistical modeling, the more data the better. At the bottom of the Tabular results sheet of the analysis results, Prism reports how many observations were in the model (Rows analyzed), how many parameters were included in the model, and the ratio of these two values (#observations/#parameters). One rule of thumb is to have at least ten rows of 0s and ten rows of 1s for each independent (X) variable.
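The rule of thumb and the ratio described above are simple arithmetic, sketched here in Python. The function and key names are hypothetical, and the parameter count assumes one intercept plus one coefficient per X variable.

```python
def sample_size_summary(y, n_x_variables, n_parameters):
    """Summarize whether a 0/1 outcome has enough rows for the model."""
    n_ones = sum(1 for value in y if value == 1)
    n_zeros = sum(1 for value in y if value == 0)
    return {
        "rows_analyzed": len(y),
        "obs_per_parameter": len(y) / n_parameters,
        # At least ten 0s and ten 1s for each X variable.
        "meets_rule_of_thumb": min(n_ones, n_zeros) >= 10 * n_x_variables,
    }

y = [1] * 25 + [0] * 40   # made-up outcome: 25 ones and 40 zeros

two_x = sample_size_summary(y, n_x_variables=2, n_parameters=3)
three_x = sample_size_summary(y, n_x_variables=3, n_parameters=4)
print(two_x["meets_rule_of_thumb"])     # True  (25 >= 20)
print(three_x["meets_rule_of_thumb"])   # False (25 < 30)
```

Note that the less frequent outcome is what limits the model: 65 rows support two X variables here, but adding a third would require at least 30 rows of each outcome.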

Are you overfitting?

Do changes in your X variables actually contribute to changes in the log(odds) of success? If not, do you still want to include those variables in the model? Sometimes it's important to keep a variable in the model because it helps explain the results, or because the design of the experiment and your knowledge of the science involved call for it. However, if a variable isn't necessary, it may be enough to report that it doesn't help predict the outcome in the presence of the other X variables, and then remove it. Removing variables is controversial, though, so don't do so without a lot of thought.

Are you underfitting?

If your prediction performance isn't as good as desired, then perhaps you are missing some key variables that you either didn't measure or didn't include in the model. If the key variable is one that you didn't measure, there isn't much you can do other than go back and collect more data. However, if you left some measured variables out of the model, you might want to examine how they affect model performance when they're included. You can also explore fitting models with interactions and transformations of the X variables.