Using simulations to compute the false discovery rate.
What is the FDR?
The false discovery rate (FDR) is the answer to these equivalent questions:
- If a result is statistically significant, what is the probability that the null hypothesis is really true?
- Of all experiments that reach a statistically significant conclusion, what fraction are false positives (Type I errors)?
In other words, the false discovery rate is the number of false positive results divided by all of the positive results.
The FDR is distinct from the significance level, which is the answer to these equivalent questions:
- If the null hypothesis is true, what is the probability that a particular experiment will happen to collect data that generate a P value low enough to reject that null hypothesis?
- Of all experiments you could conduct when the null hypothesis is actually true, in what fraction will you reach a conclusion that the results are statistically significant?
In other words, the significance level is the expected number of false positive results divided by the number of all experiments in which the null hypothesis is true.
A situation where the FDR is easy to compute
Imagine a situation where you are screening a set of drugs for which you have some background information, but not much. Based on prior experience and knowledge of the drug library you are screening, you estimate (before collecting any data) that 10% of these drugs will actually work. This value (10%) is called the prior probability. You choose a sample size for testing each drug that gives a power of 80%, and you set the threshold for statistical significance at its traditional value of P<0.05. What will happen if you test 1,000 such drugs?
- Of the 100 drugs that really work, we will obtain a statistically significant result in 80 (because our experimental design has 80% power).
- Of the 900 drugs that are really ineffective, we expect to obtain a statistically significant result in 5% (because we set α equal to 0.05). In other words, we expect 5% × 900, or 45, false positives.
- In total, we expect to find a statistically significant effect in 80 + 45 = 125 of the drugs. Of these, 45/125 = 36% actually don't work. In these cases the finding is bogus, a coincidence of random sampling, a Type I error, a false discovery. The answer, 36%, is called the false discovery rate (FDR). As you can see from this example, the FDR depends on the context of the experiment (the prior probability of "success", 10% in this example), the power of the experiment, and your definition of statistical significance.
So even a P value less than 0.05 provides less evidence than most people would guess.
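The bookkeeping in the drug-screening example reduces to a single ratio: false positives divided by all significant results. The following is a minimal sketch of that arithmetic; the function name `false_discovery_rate` and its parameter names are illustrative choices, not something defined in this article.

```python
def false_discovery_rate(prior, power, alpha):
    """Expected fraction of significant results that are false positives.

    prior: fraction of tested hypotheses where the effect is real
    power: probability of a significant result when the effect is real
    alpha: probability of a significant result when the null is true
    """
    true_positives = prior * power          # e.g. 100 drugs * 0.80 = 80 per 1,000
    false_positives = (1 - prior) * alpha   # e.g. 900 drugs * 0.05 = 45 per 1,000
    return false_positives / (true_positives + false_positives)


# The example above: 10% prior, 80% power, significance threshold 0.05.
fdr = false_discovery_rate(prior=0.10, power=0.80, alpha=0.05)
print(f"FDR = {fdr:.0%}")  # 45 false positives out of 125 significant results
```

Changing `prior` in this sketch shows how strongly the FDR depends on context: the same power and threshold give a much lower FDR when most of the drugs tested really work.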
But wait, it gets worse! The FDR when P is just a tiny bit less than 0.05
Those calculations above were for all P values less than 0.05. What if we look only at analyses that lead to a P value just a tiny bit less than 0.05? There is no straightforward way to calculate the result, but it is very easy to simulate using Prism.
Using StatMate, I found that a sample size of 17 in each group gives 80% power to detect a difference of 1.0 between means when the SD = 1.0. I set up this simulation in Prism, and compared the two groups with an unpaired t test.
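As a rough cross-check on the sample size of 17, the sketch below uses the textbook normal-approximation formula for comparing two means, plus a commonly used small-sample correction term. This is not necessarily the method StatMate uses; the function name `n_per_group` and the choice of formula are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def n_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-group comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    d = effect / sd                                 # standardized difference
    n_normal = 2 * (z_alpha + z_power) ** 2 / d ** 2
    # Add z_alpha^2 / 4 to allow for using a t test rather than a z test.
    return math.ceil(n_normal + z_alpha ** 2 / 4)


n = n_per_group(effect=1.0, sd=1.0)
print(n)
```

For a difference of 1.0 with SD = 1.0, the uncorrected formula gives about 15.7 and the correction brings it to about 16.7, which rounds up to the 17 per group used here.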
Let's simulate 1,000,000 analyses. In 100,000 of these, the drug really works (because we are assuming a prior probability of 10%). I used Prism's Monte Carlo analysis to repeat the simulation and analysis 100,000 times, using different values (sampled from Gaussian distributions) each time. The Monte Carlo analysis saved the P value for the analysis of each simulated data set. Of all these simulations, 81% had a P value less than 0.05, about what you'd expect if the design had 80% power. Using Prism's frequency distribution analysis, I asked how many P values were between 0.045 and 0.050. It turns out that there are 1380 simulations with P values in this range.
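The same kind of simulation can be sketched outside Prism. The code below is an illustration using only the Python standard library, not a reproduction of Prism's Monte Carlo analysis: it draws two Gaussian groups of 17 with means 1.0 apart and SD = 1.0, computes a pooled-variance unpaired t statistic, and counts how often P < 0.05 and how often 0.045 < P < 0.05. Rather than computing a P value for every data set, it converts the two P thresholds into critical |t| values once. All function names are hypothetical, and it runs 20,000 simulations instead of 100,000 to keep the pure-Python run short.

```python
import math
import random


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=400):
    """Two-sided P value, integrating the density with Simpson's rule."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * total * h / 3.0


def critical_t(p, df):
    """|t| whose two-sided P value equals p, found by bisection."""
    lo, hi = 0.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_sided_p(mid, df) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def unpaired_t(a, b):
    """Pooled-variance (equal SD) unpaired t statistic."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    ss = sum((x - m1) ** 2 for x in a) + sum((x - m2) ** 2 for x in b)
    pooled_var = ss / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def simulate(n_sims, true_diff, rng, n=17, sd=1.0):
    """Return (count with P < 0.05, count with 0.045 < P < 0.05)."""
    df = 2 * n - 2
    t_05 = critical_t(0.05, df)     # |t| above this means P < 0.05
    t_045 = critical_t(0.045, df)   # |t| below this means P > 0.045
    n_sig = n_window = 0
    for _ in range(n_sims):
        a = [rng.gauss(0.0, sd) for _ in range(n)]
        b = [rng.gauss(true_diff, sd) for _ in range(n)]
        t = abs(unpaired_t(a, b))
        if t > t_05:
            n_sig += 1
            if t < t_045:
                n_window += 1
    return n_sig, n_window


rng = random.Random(1)   # fixed seed so the run is repeatable
n_sims = 20_000
n_sig, n_window = simulate(n_sims, true_diff=1.0, rng=rng)
print(f"P < 0.05:         {n_sig / n_sims:.1%}")
print(f"0.045 < P < 0.05: {n_window / n_sims:.2%}")
```

The first fraction should land near the design power of 80%, and the second is the simulated counterpart of the 1380-in-100,000 count reported above. Exact values vary with the seed and the number of simulations.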
In the other 900,000 simulations, the null hypothesis is true. Rather than wait for so many simulations, I ran 100,000 simulations and then multiplied the counts by 9. I set up the simulation to generate random data in two columns with all values sampled from Gaussian distributions with the same mean. Of the 100,000 such simulations, 4946 (4.946%) had P values less than 0.05. This is what you'd expect when you repeatedly test data where the null hypothesis is true. Of these, 500 had a P value between 0.045 and 0.050.
Multiply this last value by 9 to get the number you'd get if we did 900,000 simulations. There would be 4500 drugs where the P value is between 0.045 and 0.050 out of 900,000 simulations where the null hypothesis is true.
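The null-hypothesis counts follow from the fact that P values are uniformly distributed when the null hypothesis is true, so about 5% fall below 0.05 and about 0.5% fall in any window of width 0.005. The sketch below illustrates this with a deliberately simplified setup: it treats the SD as known and uses a z test instead of the unpaired t test used in this article, so that the P value can be computed with the standard library alone. The function names and the 40,000-run size are assumptions for illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_test_p(a, b, sd):
    """Two-sided P value for a difference in means when the SD is known."""
    n1, n2 = len(a), len(b)
    diff = sum(a) / n1 - sum(b) / n2
    z = diff / (sd * math.sqrt(1 / n1 + 1 / n2))
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def simulate_null(n_sims, rng, n=17, sd=1.0):
    """Both groups share the same mean; count P < 0.05 and 0.045 <= P < 0.05."""
    n_sig = n_window = 0
    for _ in range(n_sims):
        a = [rng.gauss(0.0, sd) for _ in range(n)]
        b = [rng.gauss(0.0, sd) for _ in range(n)]
        p = z_test_p(a, b, sd)
        if p < 0.05:
            n_sig += 1
            if p >= 0.045:
                n_window += 1
    return n_sig, n_window


rng = random.Random(2)   # fixed seed so the run is repeatable
n_sims = 40_000
n_sig, n_window = simulate_null(n_sims, rng)

# Rescale the window count to the 900,000 null experiments in the text.
scaled_window = n_window / n_sims * 900_000
print(f"P < 0.05:          {n_sig / n_sims:.2%}")
print(f"0.045 <= P < 0.05: {n_window / n_sims:.2%}")
print(f"Scaled to 900,000: about {scaled_window:.0f}")
```

The scaled count should come out in the neighborhood of the 4500 quoted above, with run-to-run scatter because only a few hundred simulated P values fall in the window.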
Now let's figure out the FDR. Out of one million simulations, there were 100,000 cases where the drug really worked. Of these, 1380 had a P value between 0.045 and 0.050. There were 900,000 simulations where the drug did not work (the null hypothesis was true), and 4500 of these had P values between 0.045 and 0.050. So in one million simulations, there are 1380 + 4500 = 5880 P values between 0.045 and 0.050. Of these, 4500/5880 = 76.5% are false positives. In other words, restricting ourselves to a situation where the P value was just a bit under 0.05 and the prior probability is 10%, the false discovery rate is 76.5%.
Wow. You'd think that a P value a bit less than 0.05 is pretty strong evidence that a drug worked. But in fact, there is a really good chance (about 77%) that the drug doesn't work and that the result was a Type I error (false positive) due to random sampling.
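The final tally can be written out explicitly from the counts reported above: 1380 P values in the window from the 100,000 simulations where the drug works, and 500 from the 100,000 null simulations, scaled by 9. This is a minimal sketch; the function name `windowed_fdr` is a hypothetical helper, not part of Prism.

```python
def windowed_fdr(true_hits, null_hits_per_run, null_scale):
    """FDR among results whose P value falls in a narrow window.

    true_hits: count in the window when the effect is real
    null_hits_per_run: count in the window from one batch of null simulations
    null_scale: how many such batches the null simulations stand in for
    """
    false_hits = null_hits_per_run * null_scale
    return false_hits / (false_hits + true_hits)


# Counts for 0.045 < P < 0.050 as reported in the text.
fdr_window = windowed_fdr(true_hits=1380, null_hits_per_run=500, null_scale=9)
print(f"False positives in window: {500 * 9}")
print(f"All results in window:     {500 * 9 + 1380}")
print(f"FDR in window:             {fdr_window:.1%}")
```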
This article and these simulations were inspired by a paper by David Colquhoun (2014): An investigation of the false discovery rate and the misinterpretation of P values. [Not yet published]