Key concepts: multiple comparisons

Review of the meaning of P value and alpha

Interpreting an individual P value is easy. Assuming the null hypothesis is true, the P value is the probability that random subject selection alone would result in a difference in sample means (or a correlation or an association...) at least as large as that observed in your study.

Alpha is a threshold that you set. If the P value is less than alpha, you deem the comparison “statistically significant”. With alpha set to the traditional 0.05, this means that if the null hypothesis is true, there is a 5% chance of randomly selecting subjects such that you erroneously infer a treatment effect in the population based on the difference observed between samples.

Multiple comparisons

Many scientific studies generate more than one P value. Some studies in fact generate hundreds of P values.

Interpreting multiple P values is difficult. If you test several independent null hypotheses and leave the threshold at 0.05 for each comparison, the chance of obtaining at least one “statistically significant” result is greater than 5% (even if all null hypotheses are true). This graph shows the problem. The probability (as a percentage) on the Y axis is computed from N, the number of independent comparisons on the X axis, using this equation: $100(1.00 - 0.95^N)$.
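The equation above, with N read as an exponent, can be evaluated directly. The following is a minimal sketch; the function name and the chosen values of N are illustrative, and the calculation assumes the comparisons are independent and that every null hypothesis is true.

```python
def percent_chance_of_any_significant(n_comparisons, alpha=0.05):
    """Percent chance of at least one P value below alpha among
    n_comparisons independent comparisons, all null hypotheses true.

    Implements 100 * (1.00 - 0.95**N) for the default alpha of 0.05.
    """
    return 100.0 * (1.0 - (1.0 - alpha) ** n_comparisons)


# Tabulate the probability for a few illustrative values of N.
for n in (1, 3, 10, 13, 50):
    print(f"N = {n:2d}: {percent_chance_of_any_significant(n):.1f}%")
```

With N = 1 the function returns the per-comparison alpha itself (5%), and the value climbs quickly as N grows.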

Remember the unlucky number 13. If you perform 13 independent experiments, your chances are about 50:50 of obtaining at least one “significant” P value (<0.05) just by chance, even if every null hypothesis is true.
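The 50:50 figure for 13 experiments can also be checked by simulation. The sketch below is only an illustration: it assumes that each P value is uniformly distributed between 0 and 1 when its null hypothesis is true (which holds for tests on continuous data) and that the 13 experiments are independent. The function name, trial count and seed are arbitrary choices.

```python
import random


def simulate_chance_of_any_significant(n_tests=13, alpha=0.05,
                                       n_trials=100_000, seed=0):
    """Fraction of simulated studies with at least one P value below alpha.

    Each simulated study draws n_tests independent P values, uniform on
    [0, 1), which models the case where every null hypothesis is true.
    """
    rng = random.Random(seed)
    studies_with_a_hit = 0
    for _ in range(n_trials):
        if any(rng.random() < alpha for _ in range(n_tests)):
            studies_with_a_hit += 1
    return studies_with_a_hit / n_trials


simulated = simulate_chance_of_any_significant()
exact = 1.0 - 0.95 ** 13
print(f"simulated: {simulated:.3f}   exact: {exact:.3f}")
```

The simulated fraction should land close to the exact value of 1 - 0.95^13, a little under one half.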

 

Example

Let's consider an example. You compare control and treated animals, and you measure the level of three different enzymes in the blood plasma. You perform three separate t tests, one for each enzyme, and use the traditional cutoff of alpha=0.05 for declaring each P value to be significant. Even if the treatment doesn't actually do anything, there is a 14% chance that one or more of your t tests will be “statistically significant”. To keep the overall chance of a false “significant” conclusion at 5%, you need to lower the threshold for each t test to 0.0170.

If you compare 10 different enzyme levels with 10 t tests, the chance of obtaining at least one “significant” P value by chance alone, even if the treatment really does nothing, is 40%. Unless you correct for the multiple comparisons, it is easy to be fooled by the results.

Now let's say you test ten different enzymes, at three time points, in two species, with four pretreatments. That design alone gives 240 combinations, so you can make lots of comparisons, and you are almost certain to find that some of them are “significant”, even if all the null hypotheses are really true.
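The numbers in this example can be reproduced with a few lines of arithmetic. The sketch below assumes independent comparisons. The per-test threshold is obtained by solving 1 - (1 - threshold)^N = 0.05 for the threshold, an adjustment commonly called the Šidák correction; the text above does not name the method, so treat that label, the function names, and the count of one comparison per enzyme, time point, species and pretreatment combination as assumptions.

```python
def chance_of_any_false_positive(n_comparisons, alpha=0.05):
    """Chance of at least one P value below alpha among n_comparisons
    independent comparisons when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_comparisons


def per_comparison_threshold(n_comparisons, overall_alpha=0.05):
    """Per-comparison threshold that keeps the overall chance of any
    false positive at overall_alpha (assumes independent comparisons)."""
    return 1.0 - (1.0 - overall_alpha) ** (1.0 / n_comparisons)


# Three enzymes, three t tests.
print(f"{chance_of_any_false_positive(3):.0%}")   # 14%
print(f"{per_comparison_threshold(3):.4f}")       # 0.0170

# Ten enzymes, ten t tests.
print(f"{chance_of_any_false_positive(10):.0%}")  # 40%

# Assumed: one comparison for each combination of ten enzymes,
# three time points, two species and four pretreatments.
n_combinations = 10 * 3 * 2 * 4
print(n_combinations, chance_of_any_false_positive(n_combinations))
```

For the 240-comparison design the chance of at least one false positive is so close to 1 that it is effectively a certainty.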

You can only account for multiple comparisons when you know about all the comparisons made by the investigators. If you report only “significant” differences, without reporting the total number of comparisons, others will not be able to properly evaluate your results. Ideally, you should plan all your analyses before collecting data, and then report all the results.

Distinguish between studies that test a hypothesis and studies that generate a hypothesis. Exploratory analyses of large databases can generate hundreds or thousands of P values, and scanning these can generate intriguing research hypotheses. But you can't test hypotheses using the same data that prompted you to consider them. You need to test your new hypotheses with fresh data.

 



Copyright (c) 2007 GraphPad Software Inc. All rights reserved.
URL: http://www.graphpad.com/help/Prism5/Prism5Help.html?beware_of_multiple_comparisons.htm