Which model is 'best'? At first, the answer seems simple. The goal of nonlinear regression is to minimize the sum-of-squares, so it seems that the model with the smaller sum-of-squares is best. If the two alternative models have the same number of parameters, then that is indeed the best approach.

But that approach is too simple when the models have different numbers of parameters, which is usually the case. A model with more parameters can have more inflection points, so it can bend and twist more to come closer to the data points. Accordingly, a two-phase model almost always fits better than a one-phase model, and a three-phase model fits better still. So any method that compares a simple model with a more complicated model has to balance the decrease in sum-of-squares against the increase in the number of parameters.

Prism offers three approaches to comparing models with different numbers of parameters. These are not the only methods that have been developed to solve this problem, but they are the most commonly used.

The extra sum-of-squares F test is based on traditional statistical hypothesis testing. It is used only for least-squares regression (not Poisson regression).

The null hypothesis is that the simpler model (the one with fewer parameters) is correct. The improvement of the more complicated model is quantified as the difference in sum-of-squares. You expect some improvement just by chance, and the amount you expect by chance is determined by the number of data points and the number of parameters in each model. The F test compares the difference in sum-of-squares with the difference you would expect by chance. The result is expressed as the F ratio, from which a P value is calculated.

The P value answers this question:

If the simpler model were correct, in what fraction of experiments (the size of yours) would the difference in sum-of-squares be as large as you observed, or even larger?

If the P value is small, conclude that the simple model (the null hypothesis) is wrong, and accept the more complicated model. Usually the threshold P value is set at its traditional value of 0.05. If the P value is less than 0.05, then you reject the simpler (null) model and conclude that the more complicated model fits significantly better.
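The F test described above can be sketched in a few lines of Python. This is a minimal illustration using the standard textbook form of the extra sum-of-squares F ratio, not a description of Prism's internal code. The function names (`extra_ss_f_test`, `f_sf`, `betainc`) and the example numbers are hypothetical. The P value is the upper tail of the F distribution, computed here from the regularized incomplete beta function so that only the standard library is needed.

```python
import math

def _beta_cf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        even = m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m))
        odd = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1))
        for num in (even, odd):
            d = 1.0 + num * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + num / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b

def f_sf(f_ratio, df_num, df_den):
    """Upper-tail probability P(F >= f_ratio) for an F distribution."""
    if f_ratio <= 0.0:
        return 1.0
    x = df_den / (df_den + df_num * f_ratio)
    return betainc(df_den / 2.0, df_num / 2.0, x)

def extra_ss_f_test(ss_simple, df_simple, ss_complex, df_complex):
    """Compare nested least-squares fits; df = data points minus parameters."""
    df_num = df_simple - df_complex
    f_ratio = ((ss_simple - ss_complex) / df_num) / (ss_complex / df_complex)
    return f_ratio, f_sf(f_ratio, df_num, df_complex)

# Hypothetical example: 20 points, 2-parameter vs. 4-parameter model.
f_ratio, p_value = extra_ss_f_test(ss_simple=120.0, df_simple=18,
                                   ss_complex=80.0, df_complex=16)
print(f"F = {f_ratio:.3f}, P = {p_value:.4f}")
```

With these made-up numbers the F ratio is 4.0 and the P value is about 0.039, so the simpler model would be rejected at the 0.05 threshold. If SciPy is available, `scipy.stats.f.sf(f_ratio, df_num, df_den)` can serve as a cross-check for `f_sf`.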

When you choose Poisson nonlinear regression, Prism does not offer the F test (above) but instead the likelihood ratio test.

The likelihood ratio answers this question: How much more likely would the data have been observed if one model were true than if the other model were true? A P value is computed from the likelihood ratio and the difference in degrees of freedom (df) between the two models.

The P value answers this question:

If the simpler model were correct, in what fraction of experiments (the size of yours) would the likelihood ratio be as large as you observed, or even larger?

If the P value is small, conclude that the simple model (the null hypothesis) is wrong, and accept the more complicated model. Usually the threshold P value is set at its traditional value of 0.05. If the P value is less than 0.05, then you reject the simpler (null) model and conclude that the more complicated model fits significantly better.

The extra sum-of-squares F test is equivalent to the likelihood ratio test when you choose least-squares regression.
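As an illustration of the likelihood ratio test for Poisson regression, the sketch below uses the usual large-sample form: twice the difference in log-likelihood is compared with a chi-square distribution whose degrees of freedom equal the difference in the number of parameters. That form is an assumption here, since the text above does not spell out the formula. The function names, counts, and predicted values are all hypothetical.

```python
import math

def poisson_log_likelihood(counts, predicted):
    """Log-likelihood of observed counts given model-predicted Poisson means."""
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1)
               for y, mu in zip(counts, predicted))

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with integer df."""
    if x <= 0.0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        q, a = math.exp(-h), 1.0              # closed form for df = 2
    else:
        q, a = math.erfc(math.sqrt(h)), 0.5   # closed form for df = 1
    # Step up two degrees of freedom at a time:
    # Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q

def likelihood_ratio_test(loglik_simple, loglik_complex, df_diff):
    """Return the test statistic 2*(lnL_complex - lnL_simple) and its P value."""
    statistic = 2.0 * (loglik_complex - loglik_simple)
    return statistic, chi2_sf(statistic, df_diff)

# Hypothetical counts and the values predicted by two fitted nested models.
counts = [3, 7, 12, 20, 31]
predicted_simple = [5.0, 8.0, 12.5, 19.0, 28.0]    # fewer parameters
predicted_complex = [3.4, 7.1, 12.3, 19.6, 30.6]   # one extra parameter

ll_simple = poisson_log_likelihood(counts, predicted_simple)
ll_complex = poisson_log_likelihood(counts, predicted_complex)
statistic, p_value = likelihood_ratio_test(ll_simple, ll_complex, df_diff=1)
print(f"LR statistic = {statistic:.3f}, P = {p_value:.4f}")
```

The predicted values would normally come from the two fitted models; they are typed in by hand here only to keep the example self-contained.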

The third approach, Akaike's Information Criterion (AIC), is based on information theory and does not use the traditional “hypothesis testing” statistical paradigm. Therefore it does not generate a P value, does not reach conclusions about “statistical significance”, and does not “reject” any model.

The method determines how well the data support each model, taking into account both the goodness-of-fit (sum-of-squares) and the number of parameters in the model. The results are expressed as the probability that each model is correct, with the probabilities summing to 100%. If one model is much more likely to be correct than the other (say, 1% vs. 99%), you will want to choose it. If the difference in probability is not large (say, 40% vs. 60%), you will know that either model might be correct, so you will want to collect more data.
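The following sketch shows one common way to turn sums-of-squares into model probabilities. It assumes the small-sample corrected criterion (AICc) for least-squares fits and converts the AICc values to probabilities with Akaike weights. The formula, the convention of counting one extra parameter for the error variance, the function names, and the example numbers are assumptions made for illustration.

```python
import math

def aicc_least_squares(ss, n_points, n_params):
    """Corrected AIC for a least-squares fit (assumed textbook form)."""
    k = n_params + 1  # assumption: the error variance counts as one more parameter
    return (n_points * math.log(ss / n_points)
            + 2 * k
            + 2 * k * (k + 1) / (n_points - k - 1))

def model_probabilities(aicc_values):
    """Akaike weights: relative probability that each model is correct."""
    best = min(aicc_values)
    weights = [math.exp(-0.5 * (a - best)) for a in aicc_values]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical example: 20 points, 2-parameter vs. 4-parameter model.
aicc_simple = aicc_least_squares(ss=120.0, n_points=20, n_params=2)
aicc_complex = aicc_least_squares(ss=80.0, n_points=20, n_params=4)
p_simple, p_complex = model_probabilities([aicc_simple, aicc_complex])
print(f"simple: {100 * p_simple:.1f}%  complex: {100 * p_complex:.1f}%")
```

With these made-up numbers the more complicated model comes out at roughly 66% and the simpler one at roughly 34%, a split that would not justify a firm choice between them.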

In most cases, the models you want to compare will be 'nested'. This means that one model is a simpler case of the other. For example, a one-phase exponential model is a simpler case of a two-phase exponential model. A three-parameter dose-response curve with a standard Hill slope of 1.0 is a special case of a four-parameter dose-response curve that also finds the best-fit value of the Hill slope.

If the two models are nested, you may use either the F test (or the likelihood ratio test, if you chose Poisson regression) or the AIC approach. The choice is usually a matter of personal preference and tradition. Basic scientists in pharmacology and physiology tend to use the F test. Scientists in fields like ecology and population biology tend to use the AIC approach.

If the models are not nested, then the F test and the likelihood ratio test are not valid, so you should choose the information theory approach. Note that Prism does not test whether the models are nested.

The Compare tab of Prism lets you ask "Do the best-fit values of selected unshared parameters differ between data sets?" or "Does one curve adequately fit all data sets?". Applying the F test or Akaike's method to answer these questions is straightforward. Prism compares the sum-of-squares of two fits.

• In one fit, the model is fit separately to each data set, and the goodness-of-fit of each is quantified with a sum-of-squares. The sum of these sum-of-squares values quantifies the goodness-of-fit of the family of curves fit to all the data sets.

• The other fit is a global fit to all the data sets at once, sharing specified parameters. If you ask Prism whether one curve adequately fits all data sets, then it shares all the parameters.

These two fits are nested (the second is a simpler case of the first, with fewer parameters to fit), so their sums-of-squares can be compared using either the F test or Akaike's method. For the first fit, the value compared is the total of the sum-of-squares values from the separate fits.
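The bookkeeping for this comparison can be sketched as follows, assuming the same model is fit to every data set and the global fit shares all parameters. The function names and numbers are hypothetical. The sketch stops at the F ratio; the P value would come from the upper tail of the F distribution with the two degrees of freedom returned.

```python
def pooled_separate_fits(ss_values, n_points_each, n_params):
    """Total sum-of-squares and df when each data set gets its own fit."""
    total_ss = sum(ss_values)
    total_df = sum(n - n_params for n in n_points_each)
    return total_ss, total_df

def global_vs_separate_f(ss_values, n_points_each, n_params, ss_global):
    """F ratio comparing one shared curve (simpler) with separate curves."""
    ss_separate, df_separate = pooled_separate_fits(ss_values, n_points_each, n_params)
    # The global fit with all parameters shared fits n_params values in total.
    df_global = sum(n_points_each) - n_params
    df_num = df_global - df_separate
    f_ratio = ((ss_global - ss_separate) / df_num) / (ss_separate / df_separate)
    return f_ratio, df_num, df_separate

# Hypothetical example: two data sets of 10 points, a 3-parameter model.
f_ratio, df_num, df_den = global_vs_separate_f(
    ss_values=[30.0, 34.0], n_points_each=[10, 10], n_params=3, ss_global=96.0)
print(f"F = {f_ratio:.3f} with {df_num} and {df_den} degrees of freedom")
```

Here the separate fits use six parameters in total and the global fit uses three, so the numerator has 3 degrees of freedom and the denominator has 14.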

R² is a measure of how well a model fits your data, so it would seem to make sense to choose among competing models by picking the one that has the highest R² or adjusted R². In fact, this works very poorly (1). Don't use this method!

1. Spiess, A.-N. & Neumeyer, N. An evaluation of R² as an inadequate measure for nonlinear models in pharmacological and biochemical research: a Monte Carlo approach. BMC Pharmacol 10, 6 (2010).
