© 1999 GraphPad Software Inc.

The Prism Guide to Interpreting Statistical Results
This guide is excerpted from Analyzing Data with GraphPad Prism, a book that accompanies the program GraphPad Prism.

Interpreting two-way ANOVA

Introduction to two-way ANOVA

Two-way ANOVA, also called two factor ANOVA, determines how a response is affected by two factors. For example, you might measure a response to three different drugs in both men and women.

Two-way ANOVA simultaneously asks three questions:

  • Does the first factor systematically affect the results? In our example: Are the mean responses the same for all three drugs?
  • Does the second factor systematically affect the results? In our example: Are the mean responses the same for men and women?
  • Do the two factors interact? In our example: Are the differences between drugs the same for men and women? Or equivalently, is the difference between men and women the same for all drugs?

Although the outcome measure (dependent variable) is a continuous variable, each factor must be categorical, for example: male or female; low, medium or high dose; wild type or mutant. ANOVA is not an appropriate test for assessing the effects of a continuous variable, such as blood pressure or hormone level (use a regression technique instead).

Prism can perform ordinary two-way ANOVA, and can account for repeated measures when there is matching on one of the factors (but not both). Prism cannot perform any kind of nonparametric two-way ANOVA.

How two-way ANOVA works

Two-way ANOVA determines how a response is affected by two factors. For example, you might measure a response to three different drugs in both men and women.

The ANOVA table breaks down the overall variability between measurements (expressed as the sum of squares) into four components:

  • Interaction between rows and columns. These are differences between rows that are not the same at each column, equivalent to variation between columns that is not the same at each row.
  • Variability among columns.
  • Variability among rows.
  • Residual or error. Variation among replicates not related to systematic differences between rows and columns.

With repeated measures ANOVA there is a fifth component: variation between subjects.

The ANOVA table shows how the sum of squares is partitioned into the four (or five) components. For each component, the table shows the sum of squares, degrees of freedom, mean square, and F ratio. Each F ratio is the ratio of the mean square for that source of variation to the residual mean square. (With repeated measures ANOVA, the denominator of one F ratio is the mean square for matching rather than the residual mean square.) If the null hypothesis is true, the F ratio is likely to be close to 1.0. If the null hypothesis is not true, the F ratio is likely to be greater than 1.0. The F ratios are not very informative by themselves, but are used to determine P values.
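The partition described above can be sketched in a few lines of Python for the balanced case (the same number of replicates in every row/column pair). This is only an illustration of the textbook formulas, not Prism's own code; the function name `two_way_anova` and the data are invented for the example.

```python
from statistics import mean

def two_way_anova(data):
    """data[r][c] is the list of replicates for row r, column c (balanced design)."""
    n_rows, n_cols, n_rep = len(data), len(data[0]), len(data[0][0])
    values = [y for row in data for cell in row for y in cell]
    grand = mean(values)
    row_means = [mean([y for cell in row for y in cell]) for row in data]
    col_means = [mean([y for row in data for y in row[c]]) for c in range(n_cols)]
    cell_means = [[mean(cell) for cell in row] for row in data]
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]

    ss = {
        "interaction": n_rep * sum(
            (cell_means[r][c] - row_means[r] - col_means[c] + grand) ** 2
            for r, c in cells),
        "rows": n_cols * n_rep * sum((m - grand) ** 2 for m in row_means),
        "columns": n_rows * n_rep * sum((m - grand) ** 2 for m in col_means),
        "residual": sum((y - cell_means[r][c]) ** 2
                        for r, c in cells for y in data[r][c]),
    }
    df = {
        "interaction": (n_rows - 1) * (n_cols - 1),
        "rows": n_rows - 1,
        "columns": n_cols - 1,
        "residual": n_rows * n_cols * (n_rep - 1),
    }
    ms = {k: ss[k] / df[k] for k in ss}
    # Each F ratio divides a mean square by the residual mean square.
    f = {k: ms[k] / ms["residual"] for k in ("interaction", "rows", "columns")}
    return {"ss": ss, "df": df, "ms": ms, "F": f,
            "ss_total": sum((y - grand) ** 2 for y in values)}

# Made-up responses: rows = men, women; columns = three drugs; 3 replicates each.
example = [
    [[10, 12, 11], [14, 15, 13], [20, 19, 21]],
    [[11, 10, 12], [18, 17, 19], [20, 22, 21]],
]
result = two_way_anova(example)
for source in ("interaction", "rows", "columns", "residual"):
    print(source, result["ss"][source], result["df"][source])
```

The four sums of squares add up to the total sum of squares, which is the sense in which the table "partitions" the variability.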

How Prism computes two-way ANOVA

Model I (fixed effects) vs. Model II (random effects) ANOVA

To understand the difference between fixed and random factors, consider an example of comparing responses in three species at three times. If you were interested in those three particular species, then species is considered to be a fixed factor. It would be a random factor if you were interested in differences between species in general, and randomly selected those three species. Time is considered to be a fixed factor if you chose time points to span the interval you are interested in. Time would be a random factor if you picked those three time points at random. Since this is not likely, time is almost always considered to be a fixed factor.

When both row and column variables are fixed factors, the analysis is called Model I ANOVA. When both row and column variables are random factors, the analysis is called Model II ANOVA. When one is random and one is fixed, it is termed mixed effects (Model III) ANOVA. Prism calculates only Model I two-way ANOVA. Since most experiments deal with fixed-factor variables, this is rarely a limitation.

ANOVA from data entered as mean, SD (or SEM) and N

If your data are balanced (same sample size for each condition), you'll get the same results whether you enter raw data or mean, SD (or SEM) and N. If your data are unbalanced, it is impossible to calculate precise results from data entered as mean, SD (or SEM) and N. Instead, Prism uses a simpler method called analysis of "unweighted means". This method is detailed in LD Fisher and G van Belle, Biostatistics, John Wiley, 1993. If sample size is the same in all groups, and in some other special cases, this simpler method gives exactly the same results as obtained by analysis of the raw data. In other cases, however, the results will only be approximately correct. If your data are almost balanced (just one or a few missing values), the approximation is a good one. When data are unbalanced, you should enter individual replicates whenever possible.
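To see why summary data can be enough, note that the residual sum of squares depends on the raw values only through each group's SD and N. The sketch below shows that piece of the arithmetic; the function names are hypothetical and it does not reproduce Prism's unweighted-means calculation.

```python
import math

def sd_from_sem(sem, n):
    """Convert a standard error of the mean back to a standard deviation."""
    return sem * math.sqrt(n)

def residual_from_summary(cells):
    """cells: one (mean, sd, n) tuple per row/column pair.

    Returns (residual sum of squares, residual df, residual mean square).
    """
    ss_resid = sum((n - 1) * sd ** 2 for _, sd, n in cells)
    df_resid = sum(n - 1 for _, _, n in cells)
    return ss_resid, df_resid, ss_resid / df_resid

# Raw values 1, 3, 5 have mean 3 and SD 2; their squared deviations sum to 8,
# which is exactly (n - 1) * SD**2.
print(residual_from_summary([(3, 2.0, 3), (10, 1.0, 3)]))
```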

Two-way ANOVA calculations with missing values

If some values are missing, two-way ANOVA calculations are challenging. Prism uses the method detailed in SA Glantz and BK Slinker, Primer of Applied Regression and Analysis of Variance, McGraw-Hill, 1990. This method converts the ANOVA problem to a multiple regression problem, and then displays the results as ANOVA. Prism performs multiple regression three times, each time presenting columns, rows and interaction to the multiple regression procedure in a different order. Although it calculates each sum-of-squares three times, Prism only displays the sum-of-squares for the factor entered last into the multiple regression equation. These are called Type III sum-of-squares.
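The regression idea can be sketched as follows: the sum-of-squares for a factor "entered last" is the increase in residual sum-of-squares when that factor's columns are dropped from the full model. The code below is a minimal pure-Python illustration that assumes sum-to-zero (effect) coding and at least one value in every row/column pair; the helper names are invented and this is not Prism's implementation.

```python
def effect_code(level, n_levels):
    """Sum-to-zero coding; the last level is -1 in every column."""
    if level == n_levels - 1:
        return [-1.0] * (n_levels - 1)
    return [1.0 if j == level else 0.0 for j in range(n_levels - 1)]

def residual_ss(X, y):
    """Residual sum of squares of the least-squares fit of y on the columns of X."""
    p = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(p)]
    for i in range(p):  # Gaussian elimination with partial pivoting
        pivot = max(range(i, p), key=lambda k: abs(A[k][i]))
        A[i], A[pivot] = A[pivot], A[i]
        for k in range(i + 1, p):
            factor = A[k][i] / A[i][i]
            for j in range(i, p + 1):
                A[k][j] -= factor * A[i][j]
    beta = [0.0] * p
    for i in reversed(range(p)):  # back substitution
        tail = sum(A[i][j] * beta[j] for j in range(i + 1, p))
        beta[i] = (A[i][p] - tail) / A[i][i]
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def type3_ss(observations, n_rows, n_cols):
    """observations: list of (row_index, column_index, value) tuples."""
    y = [v for _, _, v in observations]
    blocks = {"rows": [], "columns": [], "interaction": []}
    for r, c, _ in observations:
        rc, cc = effect_code(r, n_rows), effect_code(c, n_cols)
        blocks["rows"].append(rc)
        blocks["columns"].append(cc)
        blocks["interaction"].append([a * b for a in rc for b in cc])

    def design(terms):
        return [[1.0] + [x for t in terms for x in blocks[t][i]]
                for i in range(len(y))]

    names = list(blocks)
    full = residual_ss(design(names), y)
    # SS for a term = extra residual SS when that term is left out of the full model.
    ss = {t: residual_ss(design([u for u in names if u != t]), y) - full
          for t in names}
    ss["residual"] = full
    return ss

# Made-up unbalanced 2 x 2 data: one value is missing from the last cell.
unbalanced = [(0, 0, 1), (0, 0, 3), (0, 1, 5), (0, 1, 7),
              (1, 0, 2), (1, 0, 4), (1, 1, 10)]
print(type3_ss(unbalanced, 2, 2))
```

With balanced data the three regression-based sums of squares coincide with the ordinary ANOVA values, which is a useful sanity check on any implementation of this approach.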

Prism cannot perform repeated measures two-way ANOVA with missing values.

Two-way ANOVA from unreplicated data

Prism can perform two-way ANOVA even if you have entered only a single replicate for each column/row pair. This kind of data does not let you test for interaction between rows and columns (random variability and interaction can't be distinguished unless you measure replicates). Instead, Prism assumes that there is no interaction, and only tests for row and column effects. If this assumption is not valid, then the P values for row and column effects won't be meaningful.
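As a rough sketch of what the no-interaction assumption means in practice: with one value per row/column pair, whatever is left after removing row and column effects is used as the residual. The function name and data below are invented for illustration and are not taken from Prism.

```python
from statistics import mean

def two_way_anova_unreplicated(table):
    """table[r][c] is the single value for row r, column c."""
    n_rows, n_cols = len(table), len(table[0])
    grand = mean(v for row in table for v in row)
    row_means = [mean(row) for row in table]
    col_means = [mean(row[c] for row in table) for c in range(n_cols)]

    ss_rows = n_cols * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n_rows * sum((m - grand) ** 2 for m in col_means)
    # What would be "interaction" with replicates is treated as residual here.
    ss_resid = sum((table[r][c] - row_means[r] - col_means[c] + grand) ** 2
                   for r in range(n_rows) for c in range(n_cols))
    df_resid = (n_rows - 1) * (n_cols - 1)
    ms_resid = ss_resid / df_resid
    return {
        "ss_rows": ss_rows, "ss_columns": ss_cols, "ss_residual": ss_resid,
        "ss_total": sum((v - grand) ** 2 for row in table for v in row),
        "df_residual": df_resid,
        "F_rows": (ss_rows / (n_rows - 1)) / ms_resid,
        "F_columns": (ss_cols / (n_cols - 1)) / ms_resid,
    }

# Made-up data: 2 rows x 3 columns, one value per cell.
print(two_way_anova_unreplicated([[10, 14, 20], [11, 18, 21]]))
```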

The concept of repeated measures doesn't apply when your data are unreplicated.

Repeated measures two-way ANOVA

Prism computes repeated measures two-way ANOVA calculations using the standard method explained especially well in SA Glantz and BK Slinker, Primer of Applied Regression and Analysis of Variance, McGraw-Hill, 1990.

Post tests following two-way ANOVA

Prism performs post tests following two-way ANOVA using the Bonferroni method as detailed in pages 741-744 and 771 in J Neter, W Wasserman, and MH Kutner, Applied Linear Statistical Models, 3rd edition, Irwin, 1990.

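For each row, Prism calculates

$$ t = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{\mathrm{MS}_{\mathrm{residual}} \left( \frac{1}{N_1} + \frac{1}{N_2} \right)}} $$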

The numerator is the difference between the mean response in the two data sets (usually control and treated) at a particular row (usually dose or time point). The denominator combines the number of replicates in the two groups at that dose with the mean square of the residuals (sometimes called the mean square of the error), which is a pooled measure of variability at all doses.

Statistical significance is determined by comparing the t ratio with the t distribution for the number of df shown in the ANOVA table for MSresidual, applying the Bonferroni correction for multiple comparisons. The Bonferroni correction lowers the P value that you consider to be significant to 0.05 divided by the number of comparisons. This means that if you have five rows of data, the P value has to be less than 0.01 (0.05/5) for any particular row in order to be considered significant with P<0.05. This correction ensures that the 5% probability applies to the entire family of comparisons, and not separately to each individual comparison.
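The arithmetic of the post test can be sketched as below. The t ratio divides the difference between the two means by a standard error built from the residual mean square and the two sample sizes; the function names and numbers are hypothetical, and converting the t ratio to a P value (which needs the t distribution) is left out.

```python
import math

def post_test_t_ratio(mean1, mean2, n1, n2, ms_residual):
    """t ratio comparing two means at one row, using the pooled residual mean square."""
    standard_error = math.sqrt(ms_residual * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error

def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison P value cutoff that keeps the family-wise level at alpha."""
    return alpha / n_comparisons

# Made-up numbers: means 20 and 15, three replicates each, MSresidual = 6.
t = post_test_t_ratio(20, 15, 3, 3, 6.0)   # approximately 2.5
cutoff = bonferroni_threshold(0.05, 5)      # five rows, so five comparisons
print(t, cutoff)
```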

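Confidence intervals at each row are computed using this equation:

$$ \left( \bar{Y}_1 - \bar{Y}_2 \right) \pm t^{*} \sqrt{\mathrm{MS}_{\mathrm{residual}} \left( \frac{1}{N_1} + \frac{1}{N_2} \right)} $$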

The critical value of t is abbreviated t* in that equation (not a standard abbreviation). Its value does not depend on your data, only on your experimental design. It depends on the number of degrees of freedom and the number of rows (number of comparisons).
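A minimal sketch of the interval calculation follows, taking the critical value t* as an input because it must be looked up for your residual degrees of freedom and number of rows. The function name and the numbers, including the value used for t*, are placeholders for illustration only.

```python
import math

def post_test_ci(mean1, mean2, n1, n2, ms_residual, t_star):
    """Confidence interval for the difference between two means at one row."""
    diff = mean1 - mean2
    half_width = t_star * math.sqrt(ms_residual * (1 / n1 + 1 / n2))
    return diff - half_width, diff + half_width

# t_star = 2.5 is a placeholder, NOT a looked-up critical value.
lower, upper = post_test_ci(20, 15, 3, 3, 6.0, t_star=2.5)
print(lower, upper)
```

Because t* grows with the number of comparisons, adding rows widens every interval even when the data at a given row are unchanged.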

Post tests following repeated measures two-way ANOVA use exactly the same equation if the repeated measures are by row. If the repeated measures are by column, use MSsubject rather than MSresidual in both equations above, and use the degrees of freedom for subjects (matching) rather than the residual degrees of freedom.

How to think about results from two-way ANOVA

Two-way ANOVA partitions the overall variance of the outcome variable into three components plus a residual (or error) term.

Interaction

The null hypothesis is that there is no interaction between columns (data sets) and rows. More precisely, the null hypothesis states that any systematic differences between columns are the same for each row and that any systematic differences between rows are the same for each column. If columns represent drugs and rows represent gender, then the null hypothesis is that the differences between the drugs are consistent for men and women.

The P value answers this question: If the null hypothesis is true, what is the chance of randomly sampling subjects and ending up with as much (or more) interaction than you have observed? Often the test of interaction is the most important of the three tests.

If each row represents a time or concentration, there is no interaction if the vertical difference between the curves is the same for all values of X. Some statistics books say that there is no interaction when the curves are "parallel". But that term can be ambiguous. Pharmacologists consider two dose-response curves "parallel" when two drugs have similar effects at very low and very high concentrations, but different (and horizontally parallel) effects at moderate concentrations. Two-way ANOVA of such data would reject the null hypothesis of no interaction, because the difference between Y values in the middle of the curves is very different than the difference at the ends.

If you entered only a single value for each row/column pair, it is impossible to test for interaction between rows and columns. Instead, Prism assumes that there is no interaction, and continues with the other calculations. Depending on your experimental design, this assumption may or may not make sense. The assumption cannot be tested without replicate values.

Note: If the interaction is statistically significant, it is difficult to interpret the row and column effects. Statisticians often recommend ignoring the tests of row and column effects when there is a significant interaction.

Column factor

The null hypothesis is that the mean of each column (totally ignoring the rows) is the same in the overall population, and that all differences we see between column means are due to chance. If columns represent different drugs, the null hypothesis is that all the drugs produced the same effect. The P value answers this question: If the null hypothesis is true, what is the chance of randomly obtaining column means as different (or more so) than you have observed?

Row factor

The null hypothesis is that the mean of each row (totally ignoring the columns) is the same in the overall population, and that all differences we see between row means are due to chance. If the rows represent gender, the null hypothesis is that the mean response is the same for men and women. The P value answers this question: If the null hypothesis is true, what is the chance of randomly obtaining row means as different (or more so) than you have observed?

Subject (matching)

For repeated measures ANOVA, Prism tests the null hypothesis that the matching was not effective. You expect a low P value if the repeated measures design was effective in controlling for variability between subjects. If the P value is high, reconsider your decision to use repeated measures ANOVA.

How to think about post tests following two-way ANOVA

If you have two data sets (columns), Prism can perform post tests to compare the two means from each row.

For each row, Prism reports the 95% confidence interval for the difference between the two means. These confidence intervals adjust for multiple comparisons, so you can be 95% certain that all the intervals contain the true difference between means.

For each row, Prism also reports the P value testing the null hypothesis that the two means are really identical. Again, the P value computations take into account multiple comparisons. If there really are no differences, there is a 5% chance that any one (or more) of the P values will be less than 0.05. The 5% probability applies to the entire family of comparisons, not to each individual P value.

If the difference is statistically significant

If the P value for a post test is small, then it is unlikely that the difference you observed is due to a coincidence of random sampling. You can reject the idea that those two populations have identical means.

Because of random variation, the difference between the group means in this experiment is unlikely to equal the true difference between population means. There is no way to know what that true difference is. With most post tests (but not the Newman-Keuls test), Prism presents the uncertainty as a 95% confidence interval for the difference between all (or selected) pairs of means. You can be 95% sure that this interval contains the true difference between the two means.

To interpret the results in a scientific context, look at both ends of the confidence interval and ask whether they represent a difference between means that would be scientifically important or scientifically trivial.

| Lower confidence limit | Upper confidence limit | Conclusion |
|---|---|---|
| Trivial difference | Trivial difference | Although the true difference is not zero (since the P value is low), the true difference between means is tiny and uninteresting. The treatment had an effect, but a small one. |
| Trivial difference | Important difference | Since the confidence interval ranges from a difference that you think is biologically trivial to one you think would be important, you can't reach a strong conclusion from your data. You can conclude that the means are different, but you don't know whether the size of that difference is scientifically trivial or important. You'll need more data to draw a clear conclusion. |
| Important difference | Important difference | Since even the low end of the confidence interval represents a difference large enough to be considered biologically important, you can conclude that there is a difference between treatment means and that the difference is large enough to be scientifically relevant. |
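The three cases for a statistically significant difference can be written as a small decision rule. In this sketch, `important` is a hypothetical threshold: the smallest difference you would regard as scientifically important, which only you can choose. The function name and labels are invented for the example.

```python
def interpret_significant(lower, upper, important):
    """Classify a confidence interval that excludes zero (a significant difference)."""
    if lower * upper <= 0:
        raise ValueError("interval includes zero; the difference is not significant")
    small, large = sorted((abs(lower), abs(upper)))
    if large < important:
        return "real but trivial difference"
    if small < important:
        return "difference exists, size unclear: need more data"
    return "scientifically important difference"

print(interpret_significant(0.1, 0.4, important=1.0))
print(interpret_significant(0.5, 3.0, important=1.0))
print(interpret_significant(2.0, 5.0, important=1.0))
```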

If the difference is not statistically significant

If the P value from a post test is large, the data do not give you any reason to conclude that the means of these two groups differ. Even if the true means were equal, you would not be surprised to find means this far apart just by coincidence. This is not the same as saying that the true means are the same. You just don't have evidence that they differ.

How large could the true difference really be?  Because of random variation, the difference between the group means in this experiment is unlikely to equal the true difference between population means. There is no way to know what that true difference is. Prism presents the uncertainty as a 95% confidence interval (except with the Newman-Keuls test). You can be 95% sure that this interval contains the true difference between the two means. When the P value is larger than 0.05, the 95% confidence interval will start with a negative number (representing a decrease) and go up to a positive number (representing an increase).

To interpret the results in a scientific context, look at both ends of the confidence interval for each pair of means, and ask whether those differences would be scientifically important or scientifically trivial.

| Lower confidence limit | Upper confidence limit | Conclusion |
|---|---|---|
| Trivial decrease | Trivial increase | You can reach a crisp conclusion. Either the means really are the same or they are different by a trivial amount. At most, the true difference between means is tiny and uninteresting. |
| Trivial decrease | Large increase | You can't reach a strong conclusion. The data are consistent with the treatment causing a trivial decrease, no change, or a large increase. To reach a clear conclusion, you need to repeat the experiment with more subjects. |
| Large decrease | Trivial increase | You can't reach a strong conclusion. The data are consistent with a trivial increase, no change, or a decrease that may be large enough to be important. You can't make a clear conclusion without repeating the experiment with more subjects. |
| Large decrease | Large increase | You can't reach any conclusion. Repeat the experiment with a much larger sample size. |
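The four cases for a difference that is not statistically significant can be sketched the same way. Again `important` is a hypothetical threshold that you must supply yourself, and the function name and labels are invented for illustration.

```python
def interpret_not_significant(lower, upper, important):
    """Classify a confidence interval that spans zero (difference not significant)."""
    if not lower < 0 < upper:
        raise ValueError("interval must run from a decrease to an increase")
    large_decrease = -lower >= important
    large_increase = upper >= important
    if large_decrease and large_increase:
        return "no conclusion: use a much larger sample"
    if large_decrease:
        return "no strong conclusion: a large decrease is still possible"
    if large_increase:
        return "no strong conclusion: a large increase is still possible"
    return "crisp conclusion: any difference is trivial"

print(interpret_not_significant(-0.2, 0.3, important=1.0))
print(interpret_not_significant(-0.2, 4.0, important=1.0))
```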

Problems with post tests following two-way ANOVA

Post tests are often used to compare dose-response curves or time course curves. Using two-way ANOVA in this way presents two problems. One problem is that ANOVA treats different doses (or time points) exactly as it deals with different species or different drugs. ANOVA ignores the fact that doses or time points come in order. You could jumble the doses in any order, and get exactly the same ANOVA results. However, you did the experiment to observe a trend, so you should be cautious about interpreting results from an analysis method that doesn't recognize trends.
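The claim that jumbling the doses changes nothing is easy to check numerically. The sketch below computes the sums of squares for a balanced design, then recomputes them with the rows (doses) shuffled; the data and function name are made up for the demonstration.

```python
from statistics import mean

def sums_of_squares(data):
    """data[r][c] = replicates (balanced). Returns (rows, columns, interaction, residual)."""
    n_rows, n_cols, n_rep = len(data), len(data[0]), len(data[0][0])
    grand = mean(y for row in data for cell in row for y in cell)
    row_m = [mean(y for cell in row for y in cell) for row in data]
    col_m = [mean(y for row in data for y in row[c]) for c in range(n_cols)]
    cell_m = [[mean(cell) for cell in row] for row in data]
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    ss_rows = n_cols * n_rep * sum((m - grand) ** 2 for m in row_m)
    ss_cols = n_rows * n_rep * sum((m - grand) ** 2 for m in col_m)
    ss_inter = n_rep * sum(
        (cell_m[r][c] - row_m[r] - col_m[c] + grand) ** 2 for r, c in cells)
    ss_resid = sum((y - cell_m[r][c]) ** 2 for r, c in cells for y in data[r][c])
    return ss_rows, ss_cols, ss_inter, ss_resid

# Rows are four increasing doses; columns are control and treated; 2 replicates.
doses_in_order = [
    [[1.0, 1.2], [1.1, 1.5]],
    [[1.1, 1.3], [2.0, 2.4]],
    [[0.9, 1.2], [3.1, 3.3]],
    [[1.0, 1.4], [3.9, 4.4]],
]
jumbled = [doses_in_order[i] for i in (2, 0, 3, 1)]

same = all(abs(a - b) < 1e-9
           for a, b in zip(sums_of_squares(doses_in_order), sums_of_squares(jumbled)))
print(same)  # prints True: the ordering of the doses carries no information for ANOVA
```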

Another problem with the ANOVA approach is that it is hard to interpret the results. Knowing at which doses or time points the treatment had a statistically significant effect doesn't always help you understand the biology of the system, and rarely helps you design new experiments. Some scientists like to ask which is the lowest dose (or time) at which the effect of the treatment is statistically significant. The post tests give you the answer, but the answer depends on sample size. Run more subjects, or more doses or time points for each curve, and the answer will change. Rather than two-way ANOVA, consider using linear or nonlinear regression to fit the curve to a model and then compare the fits.

Checklist. Is two-way ANOVA the right test for these data?

Before accepting the results of any statistical test, first think carefully about whether you chose an appropriate test. Before accepting results from a two-way ANOVA, ask yourself these questions:

Question

Discussion

Are the populations distributed according to a Gaussian distribution?

Two-way ANOVA assumes that your replicates are sampled from Gaussian distributions. While this assumption is not too important with large samples, it is important with small sample sizes, especially with unequal sample sizes. Prism does not test for violations of this assumption. If you really don't think your data are sampled from a Gaussian distribution (and no transform will make the distribution Gaussian), you should consider performing nonparametric two-way ANOVA. Prism does not offer this test.

ANOVA also assumes that all sets of replicates have the same SD overall, and that any differences between SDs are due to random sampling.

Are the data matched?

Standard two-way ANOVA works by comparing the differences among group means with the pooled standard deviations of the groups. If the data are matched, then you should choose repeated measures ANOVA instead. If the matching is effective in controlling for experimental variability, repeated measures ANOVA will be more powerful than regular ANOVA.

Are the "errors" independent?

The term "error" refers to the difference between each value and the mean of all the replicates. The results of two-way ANOVA only make sense when the scatter is random - that whatever factor caused a value to be too high or too low affects only that one value. Prism cannot test this assumption. You must think about the experimental design. For example, the errors are not independent if you have six replicates, but these were obtained from two animals in triplicate. In this case, some factor may cause all values from one animal to be high or low.  See The need for independent samples.

Do you really want to compare means?

Two-way ANOVA compares the means. It is possible to have a tiny P value - clear evidence that the population means are different - even if the distributions overlap considerably. In some situations - for example, assessing the usefulness of a diagnostic test - you may be more interested in the overlap of the distributions than in differences between means.

Are there two factors?

One-way ANOVA compares three or more groups defined by one factor. For example, you might compare a control group with a drug treatment group and a group treated with drug plus antagonist. Or you might compare a control group with five different drug treatments. Prism has a separate analysis for one-way ANOVA.

Some experiments involve more than two factors. For example, you might compare three different drugs in men and women at four time points. There are three factors in that experiment: drug treatment, gender and time. These data need to be analyzed by three-way ANOVA, also called three factor ANOVA. Prism does not perform three-way ANOVA.

Are both factors "fixed" rather than "random"?

Prism performs Model I ANOVA, also known as fixed-effects ANOVA. This tests for differences among the means of the particular groups you have collected data from. Different calculations are needed if you randomly selected groups from an infinite (or at least large) number of possible groups, and want to reach conclusions about differences among ALL the groups, even the ones you didn't include in this experiment. See Model I (fixed effects) vs. Model II (random effects) ANOVA.

The circularity assumption in two-way repeated measures ANOVA

Repeated measures ANOVA assumes that the random error truly is random. A random factor that causes a measurement in one subject to be a bit high (or low) should have no effect on the next measurement in the same subject. This assumption is called circularity or sphericity. It is closely related to another term you may encounter, compound symmetry.

Repeated measures ANOVA is quite sensitive to violations of the assumption of circularity. If the assumption is violated, the P value will be too low. You'll violate this assumption when the repeated measurements are made too close together so that random factors that cause a particular value to be high (or low) don't wash away or dissipate before the next measurement.  To avoid violating the assumption, wait long enough between treatments so the subject is essentially the same as before the treatment. Also randomize the order of treatments, when possible.
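One informal way to look at this assumption uses its usual formal statement, which the text above does not spell out: under sphericity, the variance of the difference between any two repeated measurements is the same for every pair of conditions. The sketch below just tabulates those variances for made-up data; it is a descriptive check with an invented function name, not a formal test and not something Prism reports.

```python
from itertools import combinations
from statistics import variance

def difference_variances(subjects):
    """subjects[s][k] = measurement of subject s under condition (or time) k.

    Returns the sample variance of the within-subject differences for each
    pair of conditions. Roughly equal values are consistent with sphericity.
    """
    n_conditions = len(subjects[0])
    return {
        (a, b): variance([s[a] - s[b] for s in subjects])
        for a, b in combinations(range(n_conditions), 2)
    }

# Made-up data: three subjects, each measured under three conditions.
measurements = [[1, 2, 4], [2, 4, 5], [3, 3, 9]]
for pair, var in difference_variances(measurements).items():
    print(pair, var)
```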

You only have to worry about the assumption of circularity when your experiment truly is a repeated measures experiment, with measurements from a single subject. You don't have to worry about circularity with randomized block experiments, where you used a matched set of subjects (or a matched set of experiments).