Q&A: Multiple comparisons tests


If the overall ANOVA finds a significant difference among groups, am I certain to find a significant post test?

If one-way ANOVA reports a P value of <0.05, you reject the null hypothesis that all the data come from populations with the same mean. In this case, it seems to make sense that at least one of the post tests will find a significant difference between pairs of means. But this is not necessarily true.

It is possible that the overall mean of group A and group B combined differs significantly from the combined mean of groups C, D and E. Perhaps the mean of group A differs from the mean of groups B through E. Scheffe's post test detects differences like these (but this test is not offered by Prism). If the overall ANOVA P value is less than 0.05, then Scheffe's test will definitely find a significant difference somewhere, provided you look at the right comparison (also called a contrast). The post tests offered by Prism only compare pairs of group means, and it is quite possible for the overall ANOVA to reject the null hypothesis that all group means are the same, yet for the post tests to find no significant difference between any two group means.
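The link between the overall F test and Scheffe's contrasts can be sketched numerically. The example below is a minimal illustration, not anything Prism computes: the five groups of data are hypothetical, and the function names (`anova_parts`, `scheffe_f`) are made up for this sketch. It assumes the usual textbook form of Scheffe's statistic, in which the squared contrast divided by its variance and by (k - 1) is compared against the same critical F value as the overall ANOVA.

```python
from statistics import mean

# Hypothetical data for five groups (A-E); not taken from this page.
groups = {
    "A": [12.1, 13.4, 11.8, 12.9],
    "B": [12.6, 13.0, 12.2, 13.5],
    "C": [14.8, 15.1, 14.2, 15.6],
    "D": [14.9, 15.4, 14.4, 15.0],
    "E": [15.2, 14.7, 15.8, 14.9],
}


def anova_parts(groups):
    """Return group means, group sizes, grand mean, MS within and overall F."""
    means = {g: mean(v) for g, v in groups.items()}
    sizes = {g: len(v) for g, v in groups.items()}
    n_total = sum(sizes.values())
    grand = sum(sum(v) for v in groups.values()) / n_total
    ss_between = sum(sizes[g] * (means[g] - grand) ** 2 for g in groups)
    ss_within = sum((x - means[g]) ** 2 for g, v in groups.items() for x in v)
    k = len(groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return means, sizes, grand, ms_within, ms_between / ms_within


def scheffe_f(contrast, means, sizes, ms_within):
    """Scheffe F ratio for a contrast whose coefficients sum to zero."""
    k = len(means)
    estimate = sum(contrast[g] * means[g] for g in means)
    variance = ms_within * sum(contrast[g] ** 2 / sizes[g] for g in means)
    return estimate ** 2 / variance / (k - 1)


means, sizes, grand, ms_within, f_overall = anova_parts(groups)

# Contrast from the text: A and B combined versus C, D and E combined.
ab_vs_cde = {"A": 1 / 2, "B": 1 / 2, "C": -1 / 3, "D": -1 / 3, "E": -1 / 3}
f_ab_vs_cde = scheffe_f(ab_vs_cde, means, sizes, ms_within)

# Coefficients n_i * (mean_i - grand mean) define the contrast with the
# largest possible Scheffe F ratio.
best = {g: sizes[g] * (means[g] - grand) for g in groups}
f_best = scheffe_f(best, means, sizes, ms_within)

print(f"overall ANOVA F         = {f_overall:.3f}")
print(f"Scheffe F, AB vs CDE    = {f_ab_vs_cde:.3f}")
print(f"Scheffe F, best contrast = {f_best:.3f}")
```

The best contrast reproduces the overall F ratio exactly, which is why Scheffe's method finds a significant contrast whenever the overall ANOVA is significant, and never otherwise. A pairwise comparison is just one particular contrast, so it can fall short of that maximum.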

If the overall ANOVA finds no significant difference among groups, are the post test results valid?

You may find it surprising, but all the post tests offered by Prism are valid even if the overall ANOVA did not find a significant difference among means. It is certainly possible that the post tests of Bonferroni, Tukey, Dunnett, or Newman-Keuls can find significant differences even when the overall ANOVA showed no significant differences among groups. These post tests are more focussed, so they have power to find differences between groups even when the overall ANOVA is not significant.

"An unfortunate common practice is to pursue multiple comparisons only when the null hypothesis of homogeneity is rejected." (Hsu, page 177)

There are two exceptions, both for tests that Prism does not offer. Scheffe's test is intertwined with the overall F test: if the overall ANOVA has a P value greater than 0.05, then no post test using Scheffe's method will find a significant difference. The other exception is Fisher's Least Significant Difference (LSD) test. In its original form (called the restricted Fisher's LSD test), the post tests are performed only if the overall ANOVA finds a statistically significant difference among groups. But this LSD test is outmoded and no longer recommended.

Are the results of the overall ANOVA useful at all? Or should I only look at post tests?

ANOVA tests the overall null hypothesis that all the data come from groups that have identical means. If that is your experimental question -- do the data provide convincing evidence that the means are not all identical -- then ANOVA is exactly what you want. More often, your experimental questions are more focussed and are answered by multiple comparison tests (post tests). In these cases, you can safely ignore the overall ANOVA results and jump right to the post test results.

Note that the multiple comparison calculations all use the mean-square result from the ANOVA table. So even if you don't care about the value of F or the P value, the post tests still require that the ANOVA table be computed.
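As a rough illustration of where that mean square comes from, the sketch below builds a one-way ANOVA table by hand for the three-group example values that appear later on this page. The function name `one_way_anova_table` and the dictionary keys are made up for this sketch; it is not Prism's code. The `ms_within` entry (the residual mean square) is the quantity the post tests reuse.

```python
from statistics import mean

# Three-group example values that appear on this page.
data = {
    "Group 1": [34, 38, 29],
    "Group 2": [43, 45, 56],
    "Group 3": [48, 49, 47],
}


def one_way_anova_table(data):
    """Return the sums of squares, df, mean squares and F of a one-way ANOVA."""
    k = len(data)
    n_total = sum(len(values) for values in data.values())
    grand_mean = sum(sum(values) for values in data.values()) / n_total

    ss_between = 0.0
    ss_within = 0.0
    for values in data.values():
        group_mean = mean(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((x - group_mean) ** 2 for x in values)

    df_between = k - 1
    df_within = n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within  # residual mean square used by post tests
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f": ms_between / ms_within,
    }


table = one_way_anova_table(data)
for name, value in table.items():
    print(f"{name:>11}: {value:.4f}")
```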

The q or t ratio

Each Bonferroni comparison is reported with its t ratio. Each comparison with the Tukey, Dunnett, or Newman-Keuls post test is reported with a q ratio. We include it so people can check our results against textbooks or other programs. The value of q won't help you interpret the results.

For a historical reason (but no logical reason), the q ratio reported by the Tukey (and Newman-Keuls) test and the one reported by Dunnett's test differ by a factor of the square root of 2, so cannot be directly compared.
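The factor of the square root of 2 can be seen in a small calculation. This sketch assumes the common textbook conventions: the Tukey q ratio divides the difference between means by the standard error of a single mean, while the Dunnett ratio (like the t ratio) divides by the standard error of the difference. The function names are hypothetical, and the data are the three-group example values from this page.

```python
from math import sqrt
from statistics import mean

# Three-group example values that appear on this page.
data = {
    "Group 1": [34, 38, 29],
    "Group 2": [43, 45, 56],
    "Group 3": [48, 49, 47],
}


def residual_mean_square(data):
    """Within-groups (residual) mean square from a one-way ANOVA."""
    ss_within = 0.0
    for values in data.values():
        group_mean = mean(values)
        ss_within += sum((x - group_mean) ** 2 for x in values)
    df_within = sum(len(values) for values in data.values()) - len(data)
    return ss_within / df_within


def q_tukey(a, b, ms):
    """Ratio on the studentized-range scale (assumed Tukey convention)."""
    return abs(mean(a) - mean(b)) / sqrt(ms / 2 * (1 / len(a) + 1 / len(b)))


def q_dunnett(a, b, ms):
    """Ratio on the t scale (assumed Dunnett convention)."""
    return abs(mean(a) - mean(b)) / sqrt(ms * (1 / len(a) + 1 / len(b)))


ms = residual_mean_square(data)
qt = q_tukey(data["Group 1"], data["Group 2"], ms)
qd = q_dunnett(data["Group 1"], data["Group 2"], ms)

print(f"Tukey-style q   = {qt:.4f}")
print(f"Dunnett-style q = {qd:.4f}")
print(f"ratio           = {qt / qd:.4f}")  # square root of 2
```

Because each ratio is compared against its own table of critical values, the difference in scale does not change which comparisons are significant.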

Significance

“Statistically significant” is not the same as “scientifically important”. Before interpreting the P value or confidence interval, you should think about the size of the difference you are looking for. How large a difference would you consider to be scientifically important? How small a difference would you consider to be scientifically trivial? Use scientific judgment and common sense to answer these questions. Statistical calculations cannot help, as the answers depend on the context of the experiment.

Compared to comparing two groups with a t test, is it always harder to find a 'significant' difference when I use a post test following ANOVA?

Post tests control for multiple comparisons. The significance level doesn't apply to each comparison, but rather to the entire family of comparisons. In general, this makes it harder to reach significance. This is really the main point of multiple comparisons, as it reduces the chance of being fooled by differences that are due entirely to random sampling.
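A short calculation shows why a family-wide significance level is stricter. The sketch below makes the simplifying assumption that the comparisons are independent (pairwise comparisons that share groups are not exactly independent, so treat the numbers as approximate). It uses the Bonferroni rule of dividing the family-wide level by the number of comparisons; the function names are made up for this sketch.

```python
def familywise_error(alpha_per_comparison, n_comparisons):
    """Chance of at least one false positive among independent comparisons."""
    return 1 - (1 - alpha_per_comparison) ** n_comparisons


def bonferroni_threshold(alpha_family, n_comparisons):
    """Per-comparison threshold that keeps the family-wide level at alpha_family."""
    return alpha_family / n_comparisons


for n in (1, 3, 6, 10):
    uncorrected = familywise_error(0.05, n)
    per_comparison = bonferroni_threshold(0.05, n)
    corrected = familywise_error(per_comparison, n)
    print(
        f"{n:2d} comparisons: uncorrected family error {uncorrected:.3f}, "
        f"Bonferroni threshold {per_comparison:.4f}, "
        f"corrected family error {corrected:.3f}"
    )
```

With three comparisons each tested at 0.05, the chance of at least one false positive is about 14% under these assumptions; testing each at 0.05/3 brings the family-wide chance back to just under 5%.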

But post tests do more than set a stricter threshold of significance. They also use the information from all of the groups, even when comparing just two, to get a better measure of variation. Since the scatter is determined from more data, there are more degrees of freedom in the calculations, and this usually offsets some of the increased strictness mentioned above.

In some cases, the effect of increasing the df overcomes the effect of controlling for multiple comparisons. In these cases, you may find a 'significant' difference in a post test where you wouldn't find it doing a simple t test. In the example below, comparing groups 1 and 2 by unpaired t test yields a two-tailed P value of 0.0122. If we set our threshold of 'significance' for this example to 0.01, the results are not 'statistically significant'. But if you compare all three groups with one-way ANOVA, and follow with a Tukey post test, the difference between groups 1 and 2 is statistically significant at the 0.01 significance level.

Group 1    Group 2    Group 3
34         43         48
38         45         49
29         56         47
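The effect of pooling the scatter from all groups can be sketched by computing the same comparison (group 1 versus group 2) twice: once with the scatter estimated from those two groups only, as an unpaired t test would, and once with the residual mean square from all three groups, as a post test would. This is only an illustration with hypothetical function names. It reports the ratio and its degrees of freedom, not P values or critical values, which would need the appropriate distribution tables.

```python
from math import sqrt
from statistics import mean

# Example values from the table above.
group1 = [34, 38, 29]
group2 = [43, 45, 56]
group3 = [48, 49, 47]


def pooled_mean_square(*groups):
    """Pooled within-group variance and its degrees of freedom."""
    ss = 0.0
    for values in groups:
        group_mean = mean(values)
        ss += sum((x - group_mean) ** 2 for x in values)
    df = sum(len(values) for values in groups) - len(groups)
    return ss / df, df


def t_ratio(a, b, ms):
    """Difference between means divided by the SE of that difference."""
    return (mean(b) - mean(a)) / sqrt(ms * (1 / len(a) + 1 / len(b)))


# Scatter estimated from the two groups being compared (unpaired t test).
ms_pair, df_pair = pooled_mean_square(group1, group2)
t_pair = t_ratio(group1, group2, ms_pair)

# Scatter estimated from all three groups (as a post test does).
ms_all, df_all = pooled_mean_square(group1, group2, group3)
t_all = t_ratio(group1, group2, ms_all)

print(f"two groups only: ratio = {t_pair:.3f}, df = {df_pair}")
print(f"all three groups: ratio = {t_all:.3f}, df = {df_all}")
```

In these data group 3 has very little scatter, so including it lowers the pooled variance and raises the ratio while also adding degrees of freedom. With other data the pooled variance could just as well go up.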

Why don't post tests report exact P values?

Multiple comparison tests tell you the significance level of a comparison, but do not report P values. There are two reasons for this:

The critical values for most post tests (Bonferroni is an exception) come from tables that are difficult to compute. Prism simply reads the values from a table stored with the program, so it can only bracket the P value as less than, or greater than, a few key values (0.05, 0.01).

There is also a second, conceptual issue. The probabilities associated with post tests apply to the entire family of comparisons. So it makes sense to pick a threshold and ask which comparisons are "significant" at the proposed significance level. It makes less sense, perhaps no sense, to compute a P value for each individual comparison.

Is it enough to notice whether or not two sets of error bars overlap?

If two SE error bars overlap, you can be sure that a post test comparing those two groups will find no statistical significance. However, if two SE error bars do not overlap, you can't tell whether a post test will, or will not, find a statistically significant difference.

If you plot SD error bars, rather than SE error bars, the fact that they do (or don't) overlap does not let you reach any conclusion about statistical significance.
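The rule for SE error bars can be written as a small check. The sketch below is only an illustration with hypothetical function names, applied to the three-group example values from this page. It returns the conclusion the rule allows, and deliberately says nothing more when the bars do not overlap.

```python
from math import sqrt
from statistics import mean, stdev


def sem(values):
    """Standard error of the mean."""
    return stdev(values) / sqrt(len(values))


def se_bars_overlap(a, b):
    """True if the intervals mean +/- SE of the two groups overlap."""
    low_a, high_a = mean(a) - sem(a), mean(a) + sem(a)
    low_b, high_b = mean(b) - sem(b), mean(b) + sem(b)
    return low_a <= high_b and low_b <= high_a


def what_the_bars_tell_you(a, b):
    """Apply the rule of thumb for SE error bars."""
    if se_bars_overlap(a, b):
        return "bars overlap: a post test will not find a significant difference"
    return "bars do not overlap: no conclusion, run the post test"


# Three-group example values that appear on this page.
group1 = [34, 38, 29]
group2 = [43, 45, 56]
group3 = [48, 49, 47]

print("1 vs 2:", what_the_bars_tell_you(group1, group2))
print("2 vs 3:", what_the_bars_tell_you(group2, group3))
```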



Copyright (c) 2007 GraphPad Software Inc. All rights reserved.
URL: http://www.graphpad.com/help/Prism5/Prism5Help.html?stat_results_post_tests.htm