The GraphPad Guide to
Part 1. Introduction to nonlinear regression

Nonlinear regression is a powerful tool for analyzing scientific data, especially in pharmacology and physiology. Because it is a topic ignored by most statistics books, we explain nonlinear regression in more detail than the other analyses performed by Prism. This chapter explains the principles of nonlinear regression, the next explains how easily you can perform nonlinear regression with Prism, and the following chapter helps you interpret the results.

The goal of nonlinear regression is to fit a model to your data. The program finds the best-fit values of the variables in the model (perhaps rate constants, affinities, receptor number, etc.) which you can interpret scientifically. In most cases, the primary goal is to obtain those values and a secondary goal is to draw a graph of the fit curve.

In some situations, your only goal is to draw a curve. You don't care about models or equations, and don't want to obtain best-fit values. You just want a smooth curve through your points, either for artistic reasons or to use as a standard curve. You may still use nonlinear regression in these situations, or you may use these alternatives:
This chapter (and the next two) assumes that your goal is primarily to obtain the best-fit values of the variables, that is, to fit a model to your data.

A note on terminology. A model is a formal presentation of a chemical or physiological idea. To be useful for nonlinear regression, the model must be expressed as an equation that defines Y, the outcome you measure, as a function of X and one or more variables that you want to fit. We use the term variable to refer to the terms in the equation you want to fit. In the context of nonlinear regression, the term variable does not refer to X and Y. Some programs and books use the word parameters rather than variables.

Why you should use nonlinear regression

Linear regression of transformed data is less accurate

Before the age of microcomputers, nonlinear regression was not readily available to most scientists. Instead, scientists transformed their data to make a linear graph, and then analyzed the transformed data with linear regression. Examples include Lineweaver-Burk plots of enzyme kinetic data, Scatchard plots of binding data, and logarithmic plots of kinetic data. These methods are outdated, and should not be used to analyze data.

The problem is that the linear transformation distorts the experimental error. Linear regression assumes that the scatter of points around the line follows a Gaussian distribution and that the standard deviation is the same at every value of X. These assumptions are usually not true with the transformed data. A second problem is that some transformations alter the relationship between X and Y. For example, in a Scatchard plot the value of X (bound) is used to calculate Y (bound/free), and this violates the assumptions of linear regression.

Since the assumptions of linear regression are violated, the results of linear regression are incorrect. The values derived from the slope and intercept of the regression line are not the most accurate determinations of the variables in the model.
Considering all the time and effort you put into collecting data, you want to use the best possible analysis technique. Nonlinear regression produces the most accurate results.
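The way a Scatchard transformation spreads one measurement error across both axes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the values of Bmax and Kd, the ligand concentrations, and the size of the error are invented, and the function names are hypothetical. It shows how a single error in bound moves a point along both X (bound) and Y (bound/free), and does not compare the result with a nonlinear fit.

```python
def binding(x, bmax, kd):
    """Rectangular hyperbola (binding isotherm): bound as a function of free."""
    return bmax * x / (kd + x)


def linear_regression(xs, ys):
    """Ordinary least-squares slope and intercept of a straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def scatchard_estimates(free, bound):
    """Regress bound/free on bound, then convert the line to (Bmax, Kd)."""
    xs = list(bound)                           # X axis: bound
    ys = [b / f for b, f in zip(bound, free)]  # Y axis: bound/free
    slope, intercept = linear_regression(xs, ys)
    kd = -1.0 / slope        # slope of a Scatchard line is -1/Kd
    bmax = intercept * kd    # Y intercept is Bmax/Kd
    return bmax, kd


# Invented values, used only for illustration.
TRUE_BMAX, TRUE_KD = 100.0, 5.0
free = [0.5, 1.0, 2.0, 5.0, 10.0, 20.0, 50.0]
exact = [binding(x, TRUE_BMAX, TRUE_KD) for x in free]

# Error-free data fall exactly on the Scatchard line.
exact_bmax, exact_kd = scatchard_estimates(free, exact)

# One measurement error in bound shifts that point on both axes at once.
noisy = list(exact)
noisy[0] += 3.0
noisy_bmax, noisy_kd = scatchard_estimates(free, noisy)

print("exact data:", exact_bmax, exact_kd)
print("one error: ", noisy_bmax, noisy_kd)
```

With error-free data the straight line recovers the values used to generate the points. After a single error at the lowest concentration, both estimates move, because that one error changes the point's X and Y coordinates together.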
This figure shows the problem of transforming data. The left panel shows data that follows a rectangular hyperbola (binding isotherm). The right panel is a Scatchard plot of the same data. The solid curve on the left was determined by nonlinear regression. The solid line on the right shows how that same curve would look after a Scatchard transformation. The dotted line shows the linear regression fit of the transformed data. The transformation amplified and distorted the scatter, and thus the linear regression fit does not yield the most accurate values for Bmax and Kd.

Transformations can be very useful when used appropriately. When analyzing data, follow these rules:
Although it is usually inappropriate to analyze transformed data, it is often helpful to display data after a linear transform. Many people find it easier to visually interpret transformed data. This makes sense because the human eye and brain evolved to detect edges (lines), not rectangular hyperbolas or exponential decay curves. Even if you analyze your data with nonlinear regression, it may make sense to display transformed data.

Don't relegate scientific decisions to a computer program

The goal of nonlinear regression is to fit a model to your data. The program finds the best-fit values of the variables in the model (perhaps rate constants, affinities, receptor number, etc.) which you can interpret scientifically. Choosing a model is a scientific decision. You should base your choice on your understanding of chemistry or physiology (or genetics, etc.). The choice should not be based solely on the shape of the graph.

Some programs (not available from GraphPad) automatically fit data to hundreds or thousands of equations and then present you with the equation(s) that fit the data best. Using such a program is appealing because it frees you from the need to choose an equation. The problem is that the program has no understanding of the scientific context of your experiment. The equations that fit the data best are unlikely to correspond to scientifically meaningful models. You will not be able to interpret the best-fit values of the variables, and the results are unlikely to be useful for data analysis.

This kind of approach is very useful in three situations: In all three situations, it doesn't matter whether the equation corresponds to a biological, chemical or physical model. What matters is that the equation accurately predicts Y from X within the range of your data. This approach can be useful in some situations. Don't use it when the goal of curve fitting is to fit the data to a model based on chemical, physical, or biological principles.
Don't use a computer program to avoid making a scientific decision.

The results of polynomial regression are often impossible to interpret scientifically

Beware of the term "curve fitting". The term is often used to refer not to nonlinear regression, but rather to polynomial regression. This method fits data to a polynomial equation: $Y = A + BX + CX^2 + DX^3 + \dots$ Programmers prefer polynomial regression, because it is so much easier to program. That's why it is built into so many spreadsheet and graphics programs. But few biological or chemical models are described by polynomial equations, so polynomial regression is of limited usefulness to scientists.

Cubic spline is not a data analysis method

Cubic spline curves are smooth curves that go through every data point. In some cases, a cubic spline curve can look attractive on a graph and work well as a standard curve for interpolation. The curve does not correspond to any single equation (or rather the equation differs for every pair of points), so cubic spline is not useful in data analysis.

How nonlinear regression works

Comparison of linear and nonlinear regression

A line is described by a simple equation that calculates Y from X, slope and intercept. The purpose of linear regression is to find values for the slope and intercept that define the line that comes closest to the data. More precisely, it finds the line that minimizes the sum of the squares of the vertical distances of the points from the line.

The goal of minimizing the sum-of-squares in linear regression can be achieved quite simply. A bit of algebra (shown in many statistics books) derives equations that define the slope and intercept. Put the data in, and the answers come out. There is no chance for ambiguity.

Nonlinear regression is more general. It can fit data to any equation that defines Y as a function of X and one or more variables. It finds the values of those variables that generate the curve that comes closest to the data.
More precisely, the goal is to minimize the sum of the squares of the vertical distances of the points from the curve. Except for a few special cases, it is not possible to directly solve the equation to find the values of the variables that minimize the sum-of-squares. Instead nonlinear regression requires an iterative approach.

Iterations in nonlinear regression

Here are the steps that every nonlinear regression program follows:
Decisions you need to make when fitting curves with nonlinear regression

When you use a program for nonlinear regression, you must make the following decisions.

Choose a model

To use nonlinear regression, you must first define a mathematical model based on theory. The first step is to choose a model. For example, many kinds of binding data are explained by the law of mass action. The next step is to express the model as an equation that defines Y as a function of X and one or more variables. Some programs (not Prism) also let you define the model as a differential equation that defines dY/dX as a function of one or more variables.

Choosing a model is a scientific decision, not a statistical one. The model needs to make sense in scientific terms. You may also fit two different models to your data, and then use statistical methods (an F test) to compare them (discussed in Part 2).

Prepare data for nonlinear regression

When preparing data for nonlinear regression, keep these points in mind:

Estimate initial values

Nonlinear regression is an iterative procedure. The program must start with estimated values for each variable that are in the right "ballpark", say within a factor of five of the actual value. It then adjusts these initial values to improve the fit. It then adjusts the values again and again until the improvement is tiny.

If you have "clean" data that clearly define a curve, then it usually doesn't matter if the initial values are fairly far from the correct values. You'll get the same answer no matter what initial values you use, unless the initial values are very far from correct. Initial values matter more when your data have a lot of scatter, don't span a large enough range of X values to define a full curve, or don't really fit the model. In these cases, you may get different answers depending on which initial values you use. (False minima are discussed in Part 2.)
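The idea of starting from rough initial values and adjusting them until the improvement is tiny can be sketched as follows. This is a minimal illustration, not a description of what Prism does internally: it assumes a one-site binding model, uses the Gauss-Newton method with step halving (one common scheme, chosen here for brevity), and fits invented, scatter-free data so that the right answer is known. All function and variable names are hypothetical.

```python
def model(x, bmax, kd):
    """One-site binding: Y = Bmax * X / (Kd + X)."""
    return bmax * x / (kd + x)


def sum_of_squares(xs, ys, bmax, kd):
    """Sum of squared vertical distances of the points from the curve."""
    return sum((y - model(x, bmax, kd)) ** 2 for x, y in zip(xs, ys))


def fit(xs, ys, bmax, kd, max_iter=100, tol=1e-10):
    """Adjust (bmax, kd) again and again until the improvement is tiny."""
    ss = sum_of_squares(xs, ys, bmax, kd)
    for _ in range(max_iter):
        # Accumulate the 2x2 normal equations from the partial derivatives.
        a11 = a12 = a22 = g1 = g2 = 0.0
        for x, y in zip(xs, ys):
            resid = y - model(x, bmax, kd)
            d_bmax = x / (kd + x)              # dY/dBmax
            d_kd = -bmax * x / (kd + x) ** 2   # dY/dKd
            a11 += d_bmax * d_bmax
            a12 += d_bmax * d_kd
            a22 += d_kd * d_kd
            g1 += d_bmax * resid
            g2 += d_kd * resid
        det = a11 * a22 - a12 * a12
        if det == 0.0:
            break
        step_bmax = (g1 * a22 - g2 * a12) / det
        step_kd = (a11 * g2 - a12 * g1) / det

        # Halve the step until the fit is no worse and Kd stays positive.
        scale = 1.0
        accepted = False
        while scale > 1e-8:
            trial_bmax = bmax + scale * step_bmax
            trial_kd = kd + scale * step_kd
            if trial_kd > 0.0:
                trial_ss = sum_of_squares(xs, ys, trial_bmax, trial_kd)
                if trial_ss <= ss:
                    accepted = True
                    break
            scale /= 2.0
        if not accepted:
            break  # no step improves the fit any further

        improvement = ss - trial_ss
        bmax, kd, ss = trial_bmax, trial_kd, trial_ss
        if improvement <= tol * (ss + tol):
            break  # the improvement is tiny, so stop iterating
    return bmax, kd, ss


# Invented, scatter-free data generated with Bmax = 100 and Kd = 5.
xs = [0.5, 1.0, 2.0, 5.0, 10.0, 20.0, 50.0]
ys = [model(x, 100.0, 5.0) for x in xs]

# Two different starting points, each within a factor of five.
fit_low = fit(xs, ys, bmax=30.0, kd=20.0)
fit_high = fit(xs, ys, bmax=400.0, kd=1.5)
print(fit_low)
print(fit_high)
```

With clean data like these, both starting points should lead to the same best-fit values. With scattered data, or data that do not span enough of the curve, different starting points can end at different answers.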
You'll find it easy to estimate initial values if you have looked at a graph of the data, and understand the model and what all the variables mean. Remember, you just need an estimate. It doesn't have to be very accurate. If you are having problems estimating an initial value:

Prism automatically provides initial values if you choose a built-in equation. If you use a user-defined equation, you can define rules for obtaining initial values from the range of the X and Y values. Once you define these rules, Prism will automatically determine the initial values in the future.

Constants

You don't have to fit every variable in the equation. In many situations it makes sense to fix some of the variables to constant values. For example, you might want to define the bottom plateau of a dose-response curve or an exponential decay curve to equal zero.

Weighting

In general, the goal of nonlinear regression is to find the values of the variables in the model that make the curve come as close as possible to the points. Usually this is done by minimizing the sum of the squares of the vertical distances of the data points from the curve. This is appropriate when you expect that the scatter of points around the curve is Gaussian and unrelated to the Y values of the points. (Note to those who have studied advanced statistics: If those assumptions are true, minimizing the sum-of-squares is equivalent to finding the maximum likelihood estimate of the variables.)

With many experimental protocols, you don't expect the experimental scatter to be the same, on average, for all points. Instead, you expect the experimental scatter to be a constant percentage of the Y value. If this is the case, points with high Y values will have more scatter than points with low Y values. When the program minimizes the sum of squares, points with high Y values will have a larger influence while points with smaller Y values will be relatively ignored.
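A small numerical sketch makes the imbalance concrete. The numbers are invented: three points whose curve values span a hundredfold range, each observed 10% above the curve. For comparison, the same distances are also expressed relative to Y; dividing by the curve's Y value (rather than the observed value) is an assumption of this sketch.

```python
# Invented curve values and observations, each 10% above the curve.
curve_y = [10.0, 100.0, 1000.0]
observed_y = [11.0, 110.0, 1100.0]

# Squared vertical distances, as used in an unweighted sum-of-squares.
absolute_sq = [(obs - fit) ** 2 for obs, fit in zip(observed_y, curve_y)]
total = sum(absolute_sq)
share_of_total = [sq / total for sq in absolute_sq]

# Squared relative distances (assumption: relative to the curve's Y value).
relative_sq = [((obs - fit) / fit) ** 2 for obs, fit in zip(observed_y, curve_y)]

for fit, share, rel in zip(curve_y, share_of_total, relative_sq):
    print(f"Y = {fit:7.1f}  share of sum-of-squares = {share:.4f}  "
          f"squared relative distance = {rel:.4f}")
```

Although every point is off by the same percentage, the point with the highest Y accounts for nearly all of the unweighted sum-of-squares, while the squared relative distances are essentially equal.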
You can get around this problem by minimizing the sum of the squares of the relative distances. This procedure is termed weighting the values by $1/Y^2$. Because it prevents large points from being over-weighted, the term unweighting seems more intuitive. It is also possible to weight the data in other ways. The goal, always, is to end up with a measure of goodness-of-fit that weights all the data points equally.

If you collected replicate Y values at every value of X, there are two ways to analyze the data:

Deciding which approach to use can be difficult. The advantage of the first approach is that you have more data points and thus more degrees of freedom. However, you should only use that approach when the experimental error of each replicate is no more closely related to the other replicates than to other data points. Here are two examples where you should analyze each replicate:

You should not treat each replicate as a separate point when the experimental errors of the replicates are related. You should average the replicates instead, and analyze the averages. Here are two examples where you should average the replicates:

Go to Part 2: Interpreting nonlinear regression results