
Intuitive Biostatistics: Multiple Comparisons

This is chapter 13 of Intuitive Biostatistics (ISBN 0-19-508607-4) by Harvey Motulsky. Copyright © 1995 by Oxford University Press Inc. All rights reserved. You may order the book from GraphPad Software with a software purchase, from any academic bookstore, or from amazon.com.

Each example you've encountered so far was designed to answer one question. If you look through any medical journal, you'll discover that few papers present a single P value. It is far more common to see papers that present several, or several dozen, P values. It's not easy to interpret multiple P values.

COINCIDENCES

If your results are really due to a coincidence (the null hypothesis is true), the P value tells you how rare that coincidence would be. Interpreting multiple P values, therefore, is similar to interpreting multiple coincidences.

Example 13.1

In 1991, President Bush and his wife Barbara both developed hyperthyroidism due to Graves' disease. Could it be a coincidence, or did something cause Graves' disease in both? What is the chance that the president and his wife would both develop Graves' disease just by chance? It's hard to calculate that probability exactly, but it is less than 1 in a million. Because this would have been such a rare coincidence, they looked hard for a cause in the food, water, or air. No cause was ever found.

Was it really a one in a million coincidence? The problem with this "probability" is that the event had already happened before anyone thought to calculate the probability. To use a gambling analogy, the bets were placed after the ball stopped spinning.

A more appropriate question might be "what is the probability that a prominent person and his or her spouse would both develop the same disease this year?" By including all prominent people and all possible diseases, the probability is much higher. The coincidence no longer seems so strange. You could expand the question to include the next few years, and the answer would be higher still.

Example 13.2

Five children in a particular school got leukemia last year. Is that a coincidence? Or does the clustering of cases suggest the presence of an environmental toxin that caused the disease? That's a very difficult question to answer. It is tempting to estimate the answer to the question "what is the probability that five children in this particular school would all get leukemia this particular year?" You could calculate (or at least estimate) the answer to that question if you knew the overall incidence rates of leukemia among children and the number of children enrolled in the school. The answer will be tiny. Everyone intuitively knows that and so is alarmed by the cluster of cases.

But you've asked the wrong question once you've already observed the cluster of cases. The school only came to your attention because of the cluster of cases, so you need to consider all the other schools and other diseases. The right question is: "what is the probability that five children in any school would develop the same severe disease in the same year?" This is a harder question to answer, because you have to define the population of schools (this city or this state?), the time span you care about (one year or one decade?), and the severity of diseases to include (does asthma count?). Clearly the answer to this question is much higher than the answer to the previous one. When clusters occur, it is always worth investigating for known toxins and staying alert to other findings that might suggest a real problem. But most disease clusters are due to coincidence. It is surprising to find a cluster of one particular disease in one particular place at any one particular time. But chance alone will cause many clusters of various diseases in various places at various times.

Example 13.3

You go to a casino, and watch the roulette wheel. You notice that in the first 38 spins of a roulette wheel, the ball landed on red 25 times. By chance you'd only expect the ball to land on red 18 times (and 18 times on black, and once each on 0 and 00). Using the binomial distribution, you can calculate that the chance of having the ball land on red 25 or more times in the first 38 spins is about 2%. Are you going to place your bets on red?

A rare coincidence? Not really. You would have been just as surprised if the ball had landed 25 times on black, on an odd number, on a small number (1 to 18), on a large number (19 to 36), etc. The chance of seeing any one (or more) of these "rare" coincidences is far greater than 2%. It is only fair to calculate the probability of a rare coincidence if you define the coincidence before it happens.
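The two calculations in this example can be sketched in Python. The first part computes the exact binomial tail for 25 or more reds in 38 spins. The second part is a simulation built on assumptions that are not in the text: a wheel with 38 equally likely pockets, the standard red/black layout (the RED set), and six even-money categories standing in for the outcomes that would have seemed equally surprising. All function and variable names are illustrative.

```python
import random
from math import comb

def binomial_tail(n, k, p):
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance of 25 or more reds in 38 spins (18 of the 38 pockets are red).
p_red = binomial_tail(38, 25, 18 / 38)

# Pockets are coded 1..36; the codes 0 and 37 stand in for the green 0 and 00.
RED = {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36}
NUMBERS = range(1, 37)
CATEGORIES = {
    "red": {n for n in NUMBERS if n in RED},
    "black": {n for n in NUMBERS if n not in RED},
    "odd": {n for n in NUMBERS if n % 2 == 1},
    "even": {n for n in NUMBERS if n % 2 == 0},
    "low (1-18)": set(range(1, 19)),
    "high (19-36)": set(range(19, 37)),
}

def chance_of_any_streak(trials=20_000, spins=38, threshold=25, seed=0):
    """Estimate how often at least one category comes up `threshold` or more times."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        outcomes = [rng.randrange(38) for _ in range(spins)]
        if any(sum(o in pockets for o in outcomes) >= threshold
               for pockets in CATEGORIES.values()):
            hits += 1
    return hits / trials

p_any = chance_of_any_streak()
print(f"25 or more reds: {p_red:.3f}")
print(f"25 or more in any of the six categories: {p_any:.3f}")
```

The second number is several times larger than the first, which is the point of the example: the more outcomes you would have counted as surprising, the less surprising it is to see one of them.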

MULTIPLE INDEPENDENT QUESTIONS

Example 13.4

Hunter and colleagues investigated whether vitamin supplementation could reduce the risk of breast cancer (reference 1). The investigators sent dietary questionnaires to over 100,000 nurses in 1980. From the questionnaires, they determined the intake of vitamins A, C, and E and divided the women into quintiles for each vitamin (i.e., the first quintile contains 20% of the women who consumed the smallest amount). They then followed these women for 8 years to determine the incidence rate of breast cancer. Using a test called the chi-square test for trend (which will be briefly discussed in Chapter 29) the investigators calculated a P value to test this null hypothesis: There is no linear trend between vitamin intake quintile and the incidence of breast cancer. There would be a linear trend if increasing vitamin intake was associated with increasing (or decreasing) incidence of breast cancer. There would not be a linear trend if the lowest and highest quintiles had a low incidence of breast cancer compared to the three middle quintiles. The authors determined a different P value for each vitamin. For vitamin C, P = 0.60; for vitamin E, P = 0.07; for vitamin A, P = 0.001.

Interpreting each P value is easy: If the null hypothesis is true, the P value is the chance that random selection of subjects would result in a linear trend as large as (or larger than) the one observed in this study. If the null hypothesis is true, there is a 5% chance of randomly selecting subjects such that the trend is statistically significant. Here we are testing three independent null hypotheses (one for each vitamin tested). If all three null hypotheses are true, the chance that one or more of the P values will be significant is greater than 5%.

Table 13.1 shows what happens when you test several independent null hypotheses. If you leave the threshold at 0.05 for each comparison, the chance of obtaining a "statistically significant" result by chance is greater than 5%. If you want to leave the chance of randomly obtaining a statistically significant result (in any of the comparisons) at 5%, you need to set a stricter (lower) threshold for each comparison. The table shows this value as Alpha*: you conclude that a difference is statistically significant only if the P value is less than Alpha*.

Table 13.1. Probability of Small P Value When Testing Many Null Hypotheses

N = number of independent null hypotheses
P* = probability of obtaining one or more P values less than 0.05 by chance
Alpha* = threshold for each comparison that keeps the overall risk of Type I error equal to 0.05

N        2       3       4       5       6       7       8       9       10      20      50      100
P*       10%     14%     19%     23%     26%     30%     34%     37%     40%     64%     92%     99%
Alpha*   0.0253  0.0170  0.0127  0.0102  0.0085  0.0073  0.0064  0.0057  0.0051  0.0026  0.0010  0.0005

P* = 100 (1.00 - 0.95^{N})

Alpha* = 1.00 - 0.95^{1/N}

This table assumes that you have set alpha to its usual value of 0.05.

To calculate this table for other values, simply replace "0.95" in the two equations with "(1 - alpha)".

In this example, we are testing three null hypotheses. If we used the traditional cutoff of alpha = 0.05 for declaring each P value to be significant, the table shows that there would be a 14% chance of observing one or more significant P values, even if all three null hypotheses were true. To keep the overall chance at 5%, we need to lower our threshold for significance from 0.050 to 0.0170. With this criterion, the relationship between vitamin A intake and the incidence of breast cancer is statistically significant. The intakes of vitamins C and E are not significantly related to the incidence of breast cancer.
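The two equations under Table 13.1 are easy to evaluate directly. The sketch below recomputes the table rows and then applies the stricter threshold to the three vitamin P values quoted above. The function names are illustrative, and the calculation assumes, as the table does, that the null hypotheses are independent.

```python
def chance_of_any_significant(n, alpha=0.05):
    """P*: chance that one or more of n independent true nulls gives P < alpha."""
    return 1 - (1 - alpha) ** n

def per_comparison_threshold(n, alpha=0.05):
    """Alpha*: per-comparison threshold that keeps the overall chance at alpha."""
    return 1 - (1 - alpha) ** (1 / n)

# Recompute the rows of Table 13.1.
for n in (2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100):
    print(f"{n:>3}  {chance_of_any_significant(n):4.0%}  {per_comparison_threshold(n):.4f}")

# Apply the stricter threshold to the three P values from Example 13.4.
p_values = {"vitamin C": 0.60, "vitamin E": 0.07, "vitamin A": 0.001}
threshold = per_comparison_threshold(len(p_values))
significant = [name for name, p in p_values.items() if p < threshold]
print(f"threshold = {threshold:.4f}, significant: {significant}")
```

Passing a different alpha to either function gives the table for other significance levels.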

If you don't have access to the table, here is a quick way to approximate the value in the bottom row: Simply divide 0.05 by the number of comparisons. For this example, the threshold is 0.05/3 or 0.017, the same as the value in the table to three decimal places. If you tested seven hypotheses, the shortcut method calculates a threshold of 0.05/7 or 0.0071 (which is close to the exact value of 0.0073 in the table). The shortcut always gives a slightly stricter threshold than the exact value, and it stays close even with many comparisons: for 100 comparisons it gives 0.05/100 = 0.0005, the same as the table value to four decimal places.
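The shortcut of dividing 0.05 by the number of comparisons is commonly known as the Bonferroni correction. As a rough check, the sketch below prints the shortcut next to the exact value from the Alpha* equation for a few values of N. The function names are illustrative.

```python
def exact_threshold(n, alpha=0.05):
    """Exact per-comparison threshold from the Alpha* equation."""
    return 1 - (1 - alpha) ** (1 / n)

def shortcut_threshold(n, alpha=0.05):
    """Shortcut: divide alpha by the number of comparisons."""
    return alpha / n

for n in (3, 7, 10, 20, 100):
    print(f"N={n:>3}  exact={exact_threshold(n):.4f}  shortcut={shortcut_threshold(n):.4f}")
```

For every N the shortcut is a little smaller (stricter) than the exact value, so using it never raises the overall chance of a false positive above 5%.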

MULTIPLE GROUPS

Example 13.5

Hetland and coworkers were interested in hormonal changes in women runners (reference 2). Among their investigations, they measured the level of luteinizing hormone (LH) in nonrunners, recreational runners, and elite runners. Because hormone levels were not Gaussian, the investigators transformed their data to the logarithm of concentration and performed all analyses on the transformed data. Although this sounds a bit dubious, it is a good thing to do, as it makes the population closer to a Gaussian distribution. The data are shown in Table 13.2.

Table 13.2. LH Levels in Three Groups of Women

Group                   log(LH) ± SEM    N
Nonrunners              0.52 ± 0.027     88
Recreational runners    0.38 ± 0.034     89
Elite runners           0.40 ± 0.049     28

The null hypothesis is that the mean concentration of LH is the same in all three populations. How do you calculate a P value testing this null hypothesis? Your first thought might be to calculate three t tests: one to compare nonrunners with recreational runners, another to compare nonrunners with elite runners, and yet another to compare recreational runners with elite runners. The problem with this approach is that it is difficult to interpret the P values. As you include more groups in the study, you increase the chance of observing one or more significant P values by chance. If the null hypothesis were true (all three populations have the same mean), there is a 5% chance that each particular t test would yield a significant P value. But with three comparisons, the chance that any one (or more) of them will be significant is far higher than 5%.
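A small simulation makes the inflation concrete. The sketch below draws three groups from the same Gaussian population, runs the three ordinary t tests, and counts how often at least one of them is "significant". The group size of 50, the number of trials, and the function names are arbitrary choices for illustration, not taken from the study.

```python
import random

T_CRITICAL = 1.9845  # two-sided 5% critical value of t with 98 degrees of freedom

def summarize(xs):
    """Return the mean and sample variance of a list of numbers."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v

def pooled_t(a, b, n):
    """Ordinary two-sample t statistic for two groups of equal size n."""
    (mean_a, var_a), (mean_b, var_b) = a, b
    pooled_var = (var_a + var_b) / 2
    return (mean_a - mean_b) / (pooled_var * 2 / n) ** 0.5

def family_error_rate(trials=2000, n=50, seed=1):
    """Fraction of simulated studies in which any of the 3 t tests is significant."""
    rng = random.Random(seed)
    false_alarms = 0
    for _ in range(trials):
        # All three groups come from the same population: the null is true.
        stats = [summarize([rng.gauss(0, 1) for _ in range(n)]) for _ in range(3)]
        pairs = [(0, 1), (0, 2), (1, 2)]
        if any(abs(pooled_t(stats[i], stats[j], n)) > T_CRITICAL for i, j in pairs):
            false_alarms += 1
    return false_alarms / trials

rate = family_error_rate()
print(f"Chance that one or more of the three t tests is significant: {rate:.1%}")
```

The three comparisons share data, so they are not independent and the rate does not follow the Table 13.1 formula exactly, but it is still well above 5%.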

The authors used a test called one-way analysis of variance (ANOVA) to calculate a single P value answering this question: If the null hypothesis were true, what is the chance of randomly selecting subjects with means as far apart, or further, than observed in this study? The P value is determined from the scatter among means, the standard deviation (SD) within each group, and the size of the samples.

The answer is P = 0.0039. If the null hypothesis were true, there is only a 0.39% chance of randomly picking subjects and ending up with mean values so different. The authors concluded that the null hypothesis was probably not true.
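The chapter does not show the authors' calculation. As a sketch only, the code below runs a classical one-way ANOVA from the summary values in Table 13.2, assuming those rounded means, SEMs, and sample sizes are all that is needed. The function names are illustrative, and the closed-form tail probability used here is valid only when there are exactly three groups (two numerator degrees of freedom).

```python
# (mean of log(LH), SEM, N) for each group, copied from Table 13.2
groups = {
    "Nonrunners": (0.52, 0.027, 88),
    "Recreational runners": (0.38, 0.034, 89),
    "Elite runners": (0.40, 0.049, 28),
}

def anova_from_summary(groups):
    """Return the F ratio and its degrees of freedom from group summaries."""
    total_n = sum(n for _, _, n in groups.values())
    k = len(groups)
    grand_mean = sum(mean * n for mean, _, n in groups.values()) / total_n
    # Scatter among the group means.
    ss_between = sum(n * (mean - grand_mean) ** 2 for mean, _, n in groups.values())
    # Scatter within groups: SD = SEM * sqrt(n), so variance = SEM^2 * n.
    ss_within = sum((n - 1) * sem**2 * n for _, sem, n in groups.values())
    df_between, df_within = k - 1, total_n - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

def f_tail_three_groups(f_ratio, df_within):
    """Upper-tail P value of F; this closed form holds only when df_between is 2."""
    return (1 + 2 * f_ratio / df_within) ** (-df_within / 2)

f_ratio, df_between, df_within = anova_from_summary(groups)
p_value = f_tail_three_groups(f_ratio, df_within)
print(f"F({df_between}, {df_within}) = {f_ratio:.2f}, P = {p_value:.4f}")
```

Under these assumptions the result agrees closely with the reported P value, and the code shows the three ingredients the text lists: scatter among means, SD within each group, and sample size.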

Analysis of variance makes exactly the same assumptions as the t test. The subjects must be representative of a larger population. The data within each population must be distributed according to a Gaussian distribution with equal SDs. Each subject must be selected independently.

Next you want to know which group differs from which other group. But you shouldn't perform three ordinary t tests to find out. Instead, analysis of variance is followed by special tests designed for multiple comparisons. These tests are all named after the statistician(s) who developed them (Tukey, Newman-Keuls, Dunnett, Dunn, Duncan, and Bonferroni). The idea of all the tests is that if the global null hypothesis is true, there is only a 5% chance that any one or more of the comparisons will be statistically significant. The differences between the methods relate to the assumptions you are willing to make and to the number of comparisons you are interested in. You will learn a little bit about the differences between these tests in Chapter 30.

Using Tukey's test to compare each group with each other group with an overall alpha = 0.05, we find that there is a statistically significant difference between the nonrunners and the recreational runners, but not between the nonrunners and the elite runners or between the recreational and elite runners.

MULTIPLE MEASUREMENTS TO ANSWER ONE QUESTION

Example 13.4 Continued

Recall that this study examined the possible relationship between vitamin intake and the incidence of breast cancer. You've already seen that they found a significant relationship between vitamin A intake and incidence of breast cancer. In addition to the three main analyses already mentioned, the authors analyzed their data in many other ways. If we focus only on vitamin A, the authors separately analyzed total vitamin A intake and preformed vitamin A. They separately analyzed two overlapping time periods (1980 to 1988 and 1984 to 1988). For each analysis, they analyzed the data using a simple test and again using a fancier test to adjust for known factors that influence the incidence of breast cancer (for example, the number of children, age at first birth, age at menarche, family history of breast cancer). With two measures of vitamin A, two time periods, and two analytical methods they generated eight P values.

These eight P values don't test eight independent null hypotheses, so you shouldn't use the methods presented earlier in Table 13.1. The null hypotheses are interrelated: they are sort of measuring the same thing, but not quite. There are no good methods for dealing with this situation. Basically you need to look at the collection of P values and get an overall feel for what is going on. In this example, all four P values from the 1980 to 1988 analyses were tiny. The association between vitamin A intake and protection from breast cancer was substantial regardless of whether they looked at total or preformed vitamin A and whether or not they adjusted for other risk factors. In the 1984 to 1988 study, the associations all went in the same direction (more vitamin A = less breast cancer) but the P values were a bit larger. Putting all this together, the evidence is fairly persuasive.

A similar kind of problem comes up frequently in analyzing clinical trials. In many clinical trials, investigators measure clinical outcome using a variety of criteria. For example, in a study of a drug to treat sepsis, the main outcome is whether the patient died. Investigators may also collect further information on the patients who survived: how long they were in the intensive care unit, how long they required mechanical ventilation, and how many days they required treatment with vasopressors. All these outcomes are really measuring the same thing: how long the patient was severely ill. These data can lead to multiple P values but should not be corrected, as shown in previous sections, because the null hypotheses are not independent. To a large degree the various outcomes measure the same thing.

Although clinical studies often measure several outcomes, statistical methods don't deal with this situation very well. You should not make any formal corrections for multiple comparisons. Instead, you should informally integrate all the data before reaching any conclusions.

MULTIPLE SUBGROUPS

After analyzing the data in many studies, it is tempting to look at subgroups. Separately analyze the subjects by age group. Separately analyze the patients with severe disease and mild disease. The problem with doing separate analyses of subgroups is that the chance of making a Type I error (finding a statistically significant difference by chance) goes up.

Example 13.6

This problem was illustrated in a simulated study by Lee and coworkers (reference 3). They pretended to compare survival following two "treatments" for coronary artery disease. They studied a group of patients with coronary artery disease whom they randomly divided into two groups. In a real study, they would give the two groups different treatments, and compare survival. In this simulated study, they treated the subjects identically but analyzed the data as if the two random groups actually represented two distinct treatments. As expected, the survival of the two groups was indistinguishable.

They then divided the patients into six groups depending on whether they had disease in one, two, or three coronary arteries, and depending on whether the heart ventricle contracted normally or not. Since these are variables that are expected to affect survival of the patients, it made sense to evaluate the response to "treatment" separately in each of the six subgroups. Whereas they found no substantial difference in five of the subgroups, they found a striking result among the sickest patients. The patients with three-vessel disease who also had impaired ventricular contraction had much better survival under treatment B than treatment A. The difference between the two survival curves was statistically significant with a P value less than 0.025.

If this were a real study, it would be tempting to conclude that treatment B is superior for the sickest patients, and to recommend treatment B to those patients in the future. But this was not a real study, and the two "treatments" reflected only random assignment of patients. The two treatments were identical, so the observed difference was definitely due to chance. It is not surprising that the authors found one low P value out of six comparisons. Referring to Table 13.1, there is a 26% chance that one or more of six independent comparisons will have a P value less than 0.05, even if all null hypotheses are true. To reduce the overall chance of a Type I error to 0.05, you'd need to reduce alpha to 0.0085 when comparing six groups.
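The 26% figure can be checked with a quick simulation. The sketch below relies on a standard fact that the text does not state: when a null hypothesis is true, the P value from a continuous test is uniformly distributed between 0 and 1. It treats the six subgroup comparisons as independent, as Table 13.1 does. The function name and trial count are illustrative.

```python
import random

def chance_of_false_alarm(n_subgroups=6, alpha=0.05, trials=100_000, seed=2):
    """Fraction of simulated studies with at least one subgroup P value below alpha."""
    rng = random.Random(seed)
    hits = sum(
        # Under a true null hypothesis, each P value is uniform on (0, 1).
        any(rng.random() < alpha for _ in range(n_subgroups))
        for _ in range(trials)
    )
    return hits / trials

simulated = chance_of_false_alarm()
expected = 1 - 0.95 ** 6  # the Table 13.1 formula with N = 6
print(f"simulated: {simulated:.3f}, formula: {expected:.3f}")
```

Roughly one simulated study in four produces a "significant" subgroup even though no treatment differs from any other.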

This is a difficult problem that comes up frequently. Beware of analyses of multiple subgroups as you are very likely to encounter small P values, even if all null hypotheses are true.

MULTIPLE COMPARISONS AND DATA DREDGING

In all the examples you've encountered in this chapter, you've been able to account for multiple comparisons because you know about all the comparisons the investigators made. You will be completely misled (and will reach the wrong conclusion) if the investigator made many comparisons but only published the few that were significant. If the null hypothesis is true, a low P value means that a rare coincidence has occurred. But you can't evaluate the rarity of a coincidence unless you know how many different comparisons were made. As you've seen, if you test lots of null hypotheses the chance of observing one or more "significant" P values is far higher than 5%. If you test 100 independent null hypotheses that are all true, for example, you have a 99% chance of obtaining at least one significant P value. You will be completely misled if the investigators show you the significant P values but don't tell you about the others.

To avoid this situation, investigators should follow these rules:

    • Analyses should be planned before the data are collected.
    • All planned analyses should be completed and reported.

These rules are usually followed religiously for large formal clinical trials, especially when the data will be reviewed by the Food and Drug Administration. However, those rules are often ignored in more informal preliminary studies and in laboratory research. In many cases, the investigator never thought about how to perform the analyses until after perusing the data. Often looking at the data suggests new hypotheses to test.

It is difficult to know how to deal with analyses that don't follow those rules. If the investigators didn't decide what hypotheses to test until after they looked at the data, then they have implicitly performed many tests. When you read the paper, you need to figure out how many hypotheses the investigators really tested. Look at the number of variables, the number of groups, the number of time points, and the number of adjustments. Big studies can easily generate dozens or hundreds of P values. If the investigators implicitly tested many hypotheses, they are apt to find "significant" differences fairly often. To make sense of this kind of study, you need to look at the overall pattern of results and not interpret any individual P values too strongly.

You should always distinguish studies that test a hypothesis from studies that generate a hypothesis. Exploratory analyses of large databases can generate hundreds of P values, and scanning these can generate intriguing research hypotheses. After a hypothesis is generated, however, it must be tested on a different set of data. Some investigators use half the data for exploration to define one or more hypotheses, and then test the hypotheses with the other half of the data. This is a terrific approach if plenty of data are available.
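The split-half approach is simple to set up, as long as the split is made at random and before anyone looks at the data. A minimal sketch, using a hypothetical list of subject identifiers and an illustrative function name:

```python
import random

def split_for_exploration(records, seed=0):
    """Randomly split records into an exploration half and a confirmation half."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]

subject_ids = list(range(1, 201))  # hypothetical subject identifiers
explore, confirm = split_for_exploration(subject_ids)
print(len(explore), len(confirm))
```

Hypotheses are generated only from the exploration half. The confirmation half stays untouched until the hypotheses and analyses have been fixed, so its P values can be interpreted as planned tests.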

SUMMARY

Most scientific studies generate more than one P value. Some studies generate hundreds of P values. Interpreting multiple P values is difficult. If you make many comparisons, you expect some to have small P values just by chance. Therefore your interpretation of a small P value should be different when the P value is one of many. You'll encounter multiple P values in several situations: asking many independent questions, comparing multiple groups, measuring multiple end points, and reanalyzing data for multiple subgroups. You need to take into account the number of P values generated when interpreting the results. You'll be misled if the investigators calculated many P values, but only showed you the small ones.

References

1. DJ Hunter, JE Manson, GA Colditz, et al. A prospective study of the intake of vitamins C, E and A and the risk of breast cancer. N Engl J Med 329:234-240, 1993.
2. ML Hetland, J Haarbo, C Christiansen, T Larsen. Running induces menstrual disturbances but bone mass is unaffected, except in amenorrheic women. Am J Med 95:53-60, 1993.
3. KL Lee, JF McNeer, CF Starmer, PJ Harris, RA Rosati. Clinical judgment and statistics. Lessons from a simulated randomized trial in coronary artery disease. Circulation 61:509-515, 1980.
