Interpreting statistical significance
Statistical hypothesis testing
Much of statistical reasoning was developed in the context of quality control where you need a definite yes or no answer from every analysis. Do you accept or reject the batch? The logic used to obtain the answer is called hypothesis testing.
First, define a threshold P value before you do the experiment. Ideally, you should set this value based on the relative consequences of missing a true difference or falsely finding a difference. In practice, the threshold value (called alpha) is almost always set to 0.05 (an arbitrary value that has been widely adopted).
Next, define the null hypothesis. If you are comparing two means, the null hypothesis is that the two populations have the same mean. In most circumstances, the null hypothesis is the opposite of the experimental hypothesis, which is that the two populations have different means.
Now, perform the appropriate statistical test to compute the P value. If the P value is less than the threshold, state that you "reject the null hypothesis" and that the difference is "statistically significant". If the P value is greater than the threshold, state that you "do not reject the null hypothesis" and that the difference is "not statistically significant". You cannot conclude that the null hypothesis is true. All you can do is conclude that you don't have sufficient evidence to reject the null hypothesis.
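The three steps above can be sketched in a few lines of Python. This is only an illustration, not the method any particular program uses: the text does not name a specific test, so the sketch assumes an exact two-sided permutation test for a difference in means, and the function names (`permutation_p_value`, `hypothesis_test`) and the sample values are made up for the example.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation P value for a difference in means.

    Assumption: a permutation test stands in for "the appropriate
    statistical test"; it is practical only for small samples.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    count = 0
    # Under the null hypothesis the group labels are interchangeable,
    # so enumerate every way of assigning n_a pooled values to group A.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            extreme += 1
        count += 1
    return extreme / count


def hypothesis_test(group_a, group_b, alpha=0.05):
    """Apply the threshold chosen before the experiment to the P value."""
    p = permutation_p_value(group_a, group_b)
    # Assumption: a P value exactly equal to alpha is treated as
    # "do not reject"; the text only covers "less than" and "greater than".
    if p < alpha:
        decision = "reject the null hypothesis"
    else:
        decision = "do not reject the null hypothesis"
    return p, decision


# Made-up illustrative measurements, not real data.
control = [4.1, 3.8, 4.4, 4.0, 3.9]
treated = [5.2, 5.6, 4.9, 5.4, 5.1]
p, decision = hypothesis_test(control, treated)
print(f"P = {p:.4f}: {decision}")
```

Note that the second branch returns "do not reject the null hypothesis" rather than "accept": the code, like the logic it mirrors, never concludes that the null hypothesis is true.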
Statistical significance in science
The term significant is seductive, and easy to misinterpret. Using the conventional definition with alpha=0.05, a result is said to be statistically significant when a difference at least as large as the one observed would occur less than 5% of the time if the populations were really identical.
It is easy to read far too much into the word significant because the statistical use of the word has a meaning entirely distinct from its usual meaning. Just because a difference is statistically significant does not mean that it is biologically or clinically important or interesting. Moreover, a result that is not statistically significant (in the first experiment) may turn out to be very important.
If a result is statistically significant, there are two possible explanations:
- The populations are identical, so there really is no difference. By chance, you obtained larger values in one group and smaller values in the other. Finding a statistically significant result when the populations are identical is called making a Type I error. If you define statistically significant to mean "P<0.05", then you'll make a Type I error in 5% of experiments where there really is no difference.
- The populations really are different, so your conclusion is correct. The difference may be large enough to be scientifically interesting. Or it may be tiny and trivial.
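The 5% Type I error rate in the first explanation can be checked with a small simulation. The sketch below makes several assumptions that are not in the text: both groups are drawn from the same normal population with a known standard deviation of 1, a two-sided z test is used so the P value can be computed with the standard library alone, and the names `z_test_p_value` and `type_one_error_rate` are hypothetical.

```python
import random
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def z_test_p_value(group_a, group_b, sigma=1.0):
    """Two-sided P value for a difference in means, assuming known sigma."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    standard_error = sigma * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (mean_a - mean_b) / standard_error
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


def type_one_error_rate(n_experiments=10_000, n_per_group=10,
                        alpha=0.05, seed=0):
    """Fraction of simulated experiments with P < alpha when the two
    populations are identical, so every "significant" result is a
    Type I error."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    false_positives = 0
    for _ in range(n_experiments):
        # Both groups come from the same population: no real difference.
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if z_test_p_value(a, b) < alpha:
            false_positives += 1
    return false_positives / n_experiments
```

Calling `type_one_error_rate()` should return a fraction close to 0.05, and changing `alpha` should move the fraction with it. The simulation says nothing about the second explanation, because it never generates populations that really differ.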
"Extremely significant" results
Intuitively, you may think that P=0.0001 is more statistically significant than P=0.04. Using strict definitions, this is not correct. Once you have set a threshold P value for statistical significance, every result is either statistically significant or is not statistically significant. Degrees of statistical significance are not distinguished. Some statisticians feel very strongly about this. Many scientists are not so rigid, and refer to results as being "barely significant", "very significant" or "extremely significant".
Prism summarizes the P value using the words in the middle column of this table. Many scientists label graphs with the symbols of the third column. These definitions are not entirely standard. If you report the results in this way, you should define the symbols in your figure legend.
| P value | Wording | Summary |
|---|---|---|
| >0.05 | Not significant | ns |
| 0.01 to 0.05 | Significant | * |
| 0.001 to 0.01 | Very significant | ** |
| < 0.001 | Extremely significant | *** |
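If you want to apply these labels consistently in your own scripts, the table can be written as a small lookup function. This is a sketch rather than Prism's actual code: the function name `summarize_p_value` is hypothetical, and because the table does not say which row a value exactly on a boundary belongs to, the boundary handling below is an assumption.

```python
def summarize_p_value(p):
    """Return (wording, symbol) for a P value, following the table.

    Assumed boundary handling (not specified in the table):
    exactly 0.05 counts as "Significant", exactly 0.01 as "Significant",
    and exactly 0.001 as "Very significant".
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("a P value must be between 0 and 1")
    if p < 0.001:
        return "Extremely significant", "***"
    if p < 0.01:
        return "Very significant", "**"
    if p <= 0.05:
        return "Significant", "*"
    return "Not significant", "ns"


# Report the exact P value alongside the summary symbol.
for p in (0.2, 0.04, 0.003, 0.0001):
    wording, symbol = summarize_p_value(p)
    print(f"P = {p}: {wording} ({symbol})")
```

Because these definitions are not entirely standard, a function like this is most useful when the thresholds it encodes are also spelled out in the figure legend.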
Report the actual P value
The concept of statistical hypothesis testing works well for quality control, where you must decide to accept or reject an item or batch based on the results of a single analysis. Experimental science is more complicated than that, because you often integrate many kinds of experiments before reaching conclusions. You don't need to make a significant/not significant decision for every P value. Instead, report exact P values, so you and your readers can interpret them as part of a bigger picture.
The tradition of reporting only "P<0.05" or "P>0.05" began in the days before computers were readily available. Statistical tables were used to determine whether the P value was less than or greater than 0.05, and it would have been very difficult to determine the P value exactly. Today, with most statistical tests, it is very easy to compute exact P values, and you shouldn't feel constrained to only report whether the P value is less than or greater than some threshold value.