The method of False Discovery Rate and "Significance" Asterisks: Generally not meant to be mixed!
The (shorter) reason why this is a problem
With the release of Prism 9.4.0, we introduced the ability to use the Pairwise Comparisons feature with the results of the Multiple t test analysis. Using this feature, a summary of the P value calculated for each of the individual t tests is displayed on the graph (either numerically, as “ns”, or as one or more asterisks). As long as the analysis uses the right set of options, there’s nothing to worry about. Specifically, the analysis should use the statistical hypothesis testing method for correcting multiple comparisons (this method specifies a cutoff - or alpha - for P values to be considered “significant” or not). Moreover, the alpha specified for these comparisons should be set to 0.05 to ensure that the displayed results are accurate.
The main problem with the Multiple t test analysis is that - by default - it relies on the method of controlling the False Discovery Rate to correct for multiple comparisons. These results have a very different interpretation than corrections that use statistical hypothesis testing. As such, the summaries displayed by the Pairwise Comparisons graphing feature likely don’t make any sense. This also applies to other analyses like one- or two-way ANOVA that use the FDR method to correct for multiple comparisons (however, this is not the default method for these analyses, and so less likely to be encountered).
I don't have a lot of time, what do I do?
The simple answer is this: if your analysis uses the method of controlling the False Discovery Rate (FDR) to correct for multiple comparisons, do not use the Pairwise Comparisons graph feature to display asterisks or P values on your graph. Instead, you should do one of the following:
- Display the numeric q values (the dialog states “P values”: this will be updated in the future)
- Remove the pairwise comparison lines and summaries
- Change the analysis to use a P value threshold (statistical hypothesis testing)
Prism uses a couple of different wordings for the analysis option indicating that a P value threshold (statistical hypothesis testing) method will be used. These include:
- Correct for multiple comparisons using statistical hypothesis testing
- Set threshold for P value (or adjusted P value)
As long as you choose one of these analysis options as the method to correct for multiple comparisons, you should be fine to continue using the “Pairwise Comparisons” graphing feature*.
* Technically you also need to ensure that you use an alpha value of 0.05. Read on to get more information on the specifics of this problem.
The (longer) reason why this is a problem
Why do we correct for multiple comparisons?
Many of you are probably at least somewhat familiar with the multiple comparisons problem. Basically, whenever we compare two groups, there is a possibility that we incorrectly believe there is a difference between them when in reality there is not. This is due to random sampling from the population, and this sort of mistake is called a “Type I error” (or a false positive). In statistics lingo, we would say that we had “rejected the null hypothesis of no difference when it was true.” Traditionally, we set a limit for the chance to make this type of error at 5% (or 0.05 as a fraction), and we call this limit alpha.
For a single comparison, we can calculate a P value which tells us the probability of observing a difference as great as or greater than what we observed if the null hypothesis is true. If the P value is smaller than the specified alpha, then we can reject the null hypothesis and declare the comparison “statistically significant”. The problem is that if we make multiple comparisons, the chance of encountering at least one Type I error grows quickly: for n independent comparisons, it is 1 − (1 − alpha)^n. Sticking with the traditional alpha of 5%, the chance of making at least one Type I error with a single comparison is (unsurprisingly) 5%. However, with ten comparisons, the chance of making at least one Type I error increases to about 40%. By the time we reach 40 comparisons, that chance has grown to roughly 87%.
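The arithmetic behind these numbers is simple to check. The sketch below is purely illustrative (it is not Prism output), evaluating 1 − (1 − alpha)^n for a few values of n:

```python
# Chance of at least one Type I error ("familywise" error) across
# n independent comparisons, each tested at alpha = 0.05.
alpha = 0.05

for n in (1, 10, 40):
    chance = 1 - (1 - alpha) ** n
    print(f"{n:2d} comparisons: {chance:.0%}")  # 5%, 40%, 87%
```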
So we need methods to protect against this problem of multiple comparisons. In Prism, there are two primary methods:
- Correction using statistical hypothesis testing
- Correction using the method of False Discovery Rate
Statistical hypothesis testing or setting a threshold for P values
This is likely the most familiar technique used to correct for multiple comparisons. Basically, the idea is that the value of alpha could be decreased based on the number of comparisons being made to ensure that the total chance of making at least one Type I error wouldn’t exceed the original specified alpha value (typically 5% or 0.05). For each comparison, the calculated P value would be compared to this (smaller) adjusted alpha value to determine if the null hypothesis should be rejected. In Prism, this adjusted alpha value isn’t reported. Instead, Prism uses a mathematically equivalent approach of calculating adjusted P values and comparing them to the unadjusted alpha value (in this approach, adjusted P values are larger than their un-adjusted counterparts). The end result is the same: if the (adjusted) P value is smaller than the original specified alpha value, then you would reject the null hypothesis that there is no difference between the groups.
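Prism offers several corrections of this type (Bonferroni, Holm-Šídák, and others). As a minimal sketch of the adjusted-P-value idea, here is the simplest of them, the Bonferroni correction, in Python; the P values are made-up numbers for illustration, and this is not Prism's internal code:

```python
# Bonferroni correction: either shrink alpha to alpha/m, or
# (equivalently) inflate each P value to min(1, m * p) and compare
# the adjusted P value to the original, unadjusted alpha.
def bonferroni_adjusted(p_values):
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]

alpha = 0.05
p_values = [0.004, 0.020, 0.030]  # illustrative numbers only
for p, p_adj in zip(p_values, bonferroni_adjusted(p_values)):
    verdict = "significant" if p_adj < alpha else "ns"
    print(f"P = {p}, adjusted P = {p_adj:.3f}: {verdict}")
```

Note that the adjusted P values are larger than their unadjusted counterparts, exactly as described above; only the first comparison remains significant after adjustment.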
This method is sometimes referred to as controlling the false positive rate. The table below can help us understand why this is the case:
"Statistically significant" comparisons |
"Not significant" comparisons |
Total | |
Comparisons where the null hypothesis is true | A | B | A+B |
Comparisons where the null hypothesis is false | C | D | C+D |
Total | A+C | B+D | A+B+C+D |
Using the table above, we can see that the total number of comparisons in which the null hypothesis is true is given by “A+B”. Of this group, there may be some comparisons that we deem “statistically significant” even though the null hypothesis is true (a Type I error); these are given by “A” in the table. This value of A is the total number of “false positives” that we have. But how many false positives should we consider to be too many? That ultimately depends on how many comparisons there are for which the null hypothesis is actually true (“A+B” in the table above). One false positive out of three comparisons where the null hypothesis is true might be too many, while one out of a thousand is probably acceptable. Since we can’t know how many comparisons have a true null hypothesis, we instead define alpha as A/(A+B). This is the fraction of comparisons in which the null hypothesis is true that we (incorrectly) identify as “statistically significant”. In other words, this is the false positive rate!
Methods of correcting for multiple comparisons using statistical hypothesis testing are designed to ensure that the false positive rate across all comparisons doesn’t exceed the specified value of alpha. Traditionally, this value of alpha is set to 5% (or 0.05, or 5 out of 100). This is the basis on which the Pairwise Comparisons graph feature was implemented: it presents summaries for the P values calculated for each comparison using the following definitions:
- “ns” means P ≥ 0.05
- “✱” means P < 0.05
- “✱✱” means P < 0.01
- “✱✱✱” means P < 0.001
- “✱✱✱✱” means P < 0.0001
Another way to think about these summaries is to say that a comparison with the “✱” summary would be significant at an alpha level of 0.05, a comparison with the “✱✱” summary would be significant at an alpha level of 0.01, and so on.
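These definitions amount to a simple threshold lookup. A sketch of that lookup in Python (an illustration of the thresholds listed above, not Prism's internal code):

```python
# Map a P value to the summary used by the Pairwise Comparisons
# feature, checking the strictest threshold first.
def p_value_summary(p):
    for stars, cutoff in (("****", 0.0001), ("***", 0.001),
                          ("**", 0.01), ("*", 0.05)):
        if p < cutoff:
            return stars
    return "ns"

print(p_value_summary(0.0300))  # *
print(p_value_summary(0.0007))  # ***
print(p_value_summary(0.0600))  # ns
```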
Using the False Discovery Rate method to correct for multiple comparisons
Although the statistical hypothesis testing approach may be more familiar, other methods have been developed to correct for multiple comparisons. One such approach is to control the so-called “False Discovery Rate”. Without diving into the mathematical details of this method, let’s try to understand its main objective by looking at the table from the previous section again:
"Statistically significant" comparisons |
"Not significant" comparisons |
Total | |
Comparisons where the null hypothesis is true | A | B | A+B |
Comparisons where the null hypothesis is false | C | D | C+D |
Total | A+C | B+D | A+B+C+D |
For statistical hypothesis testing, we were interested in controlling the “false positive” rate. This meant looking at the number of “statistically significant” comparisons in which the null hypothesis was true (A in the table above) and the total number of comparisons in which the null hypothesis was true (A+B in the table above). The ratio of A/(A+B) can be considered the “false positive rate”, and this method is meant to ensure that this rate does not exceed a specified value (called alpha, generally set to 5% or 0.05).
In contrast, the “False Discovery Rate” (or FDR from here on out) is meant to control a slightly different ratio. In the table above, the total number of comparisons deemed “statistically significant” is given by “A+C”. For some of these comparisons, the null hypothesis will be false (C), while for others the null hypothesis will be true (A). As before, we can’t know how many comparisons for which the null is true. Instead, we can specify a value for the ratio of A/(A+C) which we’ll call Q (note the capitalization). This ratio is called the “false discovery rate” and represents the fraction of comparisons deemed “statistically significant” in which the null hypothesis is actually true. Some common values for Q include 1% (0.01) and 5% (0.05).
Just like with significance testing, methods have been developed to account for the number of comparisons being made when determining which comparisons meet the specified criterion. Note that the term “statistical significance” really doesn’t apply when using the FDR correction method. So how does it work? Basically, you start by calculating the P values for all comparisons. Next, sort the P values from smallest to largest. Each P value is then compared to a threshold that’s calculated using the specified Q value, the total number of comparisons, and that P value’s position in the sorted list (first, second, third, etc.). Find the largest P value that falls below its threshold: that P value, and every smaller P value in the list, is considered a “Discovery”.
In Prism, the individual threshold value for each P value isn’t reported. Instead, a mathematically equivalent process is presented. After sorting the P values, a new q value (note the capitalization) is calculated. This q value can be considered an “FDR adjusted” P value, and is calculated using the same formula as what is used to calculate the individual threshold values (just rearranged to solve for a different variable). After calculating these q values for the ordered P values, you can simply compare the q values to the specified Q value (as a fraction). If q < Q, then the comparison is considered a “Discovery”, otherwise it is not. The end result of the FDR method is that no more than Q percent of the comparisons identified as “Discoveries” will actually be “False Discoveries”.
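One common implementation of this idea is the Benjamini–Hochberg procedure (Prism also offers other FDR procedures, so this is a sketch of the general method, not Prism's internal code; the P values below are made-up):

```python
# Benjamini-Hochberg style q values: sort the m P values, compute
# m * p / rank for each, then sweep from the largest rank downward
# taking a running minimum so that q values never decrease as the
# P values increase.
def fdr_q_values(p_values):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0                     # also caps q values at 1
    for rank in range(m, 0, -1):          # largest P value first
        i = order[rank - 1]
        running_min = min(running_min, m * p_values[i] / rank)
        q[i] = running_min
    return q

Q = 0.05
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]  # illustrative only
for p, q in zip(p_values, fdr_q_values(p_values)):
    print(p, round(q, 4), "Discovery" if q < Q else "not a discovery")
```

With these numbers, only the first two comparisons are "Discoveries" at Q = 0.05, even though four of the five raw P values are below 0.05.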
This method is commonly known to be much more powerful than statistical hypothesis testing when the number of comparisons is extremely large (thousands of comparisons), but has also been demonstrated to be more powerful even with small to moderate numbers of comparisons. As a result, this is the default method that Prism uses when performing the Multiple t test analysis as well as the Analyze a stack of P values analysis. The FDR method is also available when correcting for multiple comparisons following one-, two-, and three-way ANOVA. However, as will be explained in more detail in a later section, this method of multiple comparison correction does not rely on the concept of alpha in the same way that statistical hypothesis testing does, and so it does not make sense to use multiple asterisks to summarize these comparisons (the way that the Pairwise Comparisons graphing feature does).
A simple description for "False Positive Rate" and "False Discovery Rate"
Controlling the “False Positive Rate” and controlling the “False Discovery Rate” may sound very similar at first. However, these two methods address different issues:
- Controlling the “False Positive Rate” (using statistical hypothesis testing and alpha) ensures that of the comparisons in which no difference truly exists, less than a given percent (alpha) will be identified as different
- Controlling the “False Discovery Rate” (using FDR and Q) ensures that of the comparisons in which a difference was identified, less than a given percent (Q) will be comparisons in which no difference truly exists
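In terms of the A/B/C/D counts from the table, the two rates can be written out directly. A toy numeric example (the counts here are hypothetical, chosen only to show that the two rates differ):

```python
# Counts from the table: A = true-null comparisons flagged as
# significant, B = true-null comparisons not flagged, C = false-null
# comparisons flagged, D = false-null comparisons not flagged.
A, B, C, D = 2, 38, 18, 42            # hypothetical counts

false_positive_rate = A / (A + B)     # flagged fraction of true nulls
false_discovery_rate = A / (A + C)    # true-null fraction of flagged

print(false_positive_rate)   # 0.05
print(false_discovery_rate)  # 0.1
```

Note that the same table gives a 5% false positive rate but a 10% false discovery rate; the two quantities are controlled by different methods (alpha vs. Q) and are not interchangeable.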
The problem of FDR and the Pairwise Comparisons graphing feature
In the previous section, we explored how comparisons were identified as “Discoveries” or not when using the FDR method of multiple comparison correction. This method used the defined value of Q which represents the limit for the “false discovery rate” across the comparisons. This value is distinct from the concept of alpha, which represents the limit for the “false positive rate” across the comparisons.
The problem is that (currently), the Pairwise Comparisons graphing feature in Prism only uses asterisk methods for summarizing P values that rely on a defined value of alpha (using the traditional threshold of 0.05). As shown above, asterisks used in this feature are interpreted as follows:
- “ns” means P ≥ 0.05
- “✱” means P < 0.05
- “✱✱” means P < 0.01
- “✱✱✱” means P < 0.001
- “✱✱✱✱” means P < 0.0001
However, using this sort of notation on a graph of data that utilized the FDR method of multiple comparison correction is at best unhelpful, and at worst misleading to viewers of the graph. There are a couple of reasons for this. First, the value of interest from the FDR method is the q value which is compared to the Q value (note the capitalization of each of these) to determine if the comparison should be considered a “Discovery” or not. Despite being used in a somewhat similar fashion, these q values are not true P values, and cannot be interpreted in the same way. If you choose to use the Pairwise Comparisons graphing feature, Prism tries to apply these same thresholds to the calculated q values for each comparison. However, the results will almost certainly be uninterpretable. For example, using the default Q value of 1% (or 0.01), a comparison with a q value of 0.0110 would not be considered a “Discovery” (because q > Q). However, using the pairwise comparisons graphing feature, this comparison would be given the summary “✱”. This would suggest to a viewer that this comparison has met the criteria of the method, even though this is clearly not the case.
The second (and more general) problem is that the interpretation of P value summaries that use multiple asterisks/stars for various thresholds generally suggest that an alpha value of 0.05 is the threshold for “significance”. Since this is a very common value for alpha, this is often not an issue. However, consider the situation where alpha was specified as 0.10. In this case, a comparison with a P value of 0.058 would be considered “Significant” since the P value is smaller than the specified alpha value. However, the P value summary (according to the list above) would be “ns” (generally read “not significant”), giving the impression to the reader that this comparison did not meet the “significance” criterion. These two results are in conflict with each other! In reality, “ns” simply means “P ≥ 0.05”, but that isn’t how this summary is typically interpreted, and it doesn’t provide any useful information about how this P value actually compares to the specified alpha threshold.
A better solution for graphing multiple comparisons
A better solution for graphing multiple comparisons results using asterisks/stars would be to consider both the P value (or q value) as well as the specified threshold against which it’s being compared (either alpha or Q). If P < alpha (or if q < Q), the comparison is assigned a single asterisk, otherwise it is given the summary “ns” for “not significant” (or “nd” for “not a discovery”). This would ensure that the summaries displayed on a graph would be both correct and informative, regardless of the multiple comparisons correction method that was used in the analysis of the data.
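Such a summary rule is easy to state precisely. A sketch of the proposed logic (a hypothetical function, not Prism's implementation), applied to the examples discussed above:

```python
# Compare each value to its own specified threshold, regardless of
# which correction method produced it: P vs. alpha, or q vs. Q.
def threshold_summary(value, threshold, fdr=False):
    if value < threshold:
        return "*"                      # significant / a discovery
    return "nd" if fdr else "ns"        # not a discovery / not significant

print(threshold_summary(0.058, 0.10))             # * (P < alpha = 0.10)
print(threshold_summary(0.0110, 0.01, fdr=True))  # nd (q > Q = 0.01)
```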
This change will be made to Prism in a future version of the software, and will likely be the default P value or q value summary method offered by the software. If the FDR method of multiple comparison correction is used, this will be the only summary method offered. If statistical hypothesis testing is used, then it will depend on the specified value of alpha. If the traditional value of 0.05 is used, the existing summary methods using multiple asterisks/stars will still be available. However, if any other value for alpha is used, the single asterisk (“significant” vs. “not significant” only) method will be used to ensure that the results displayed on the graph of the data are both clear and informative.
Keywords: fdr significance asterisks pairwise comparisons false discovery rate