| Interpreting P values
What is a P value?
Assume that you've collected data from two samples of animals treated with different drugs. You've measured an enzyme in each animal's plasma, and the means are different. You want to know whether that difference is due to an effect of the drug, that is, whether the two populations have different means. Observing different sample means is not enough to persuade you to conclude that the populations have different means. It is possible that the populations have the same mean (the drugs have no effect on the enzyme you are measuring), and that the difference you observed is simply a coincidence. There is no way you can ever be sure whether the difference you observed reflects a true difference or is just a coincidence of random sampling. All you can do is calculate probabilities.
Statistical calculations can answer this question: If the populations really have the same mean, what is the probability of observing such a large difference (or larger) between sample means in an experiment of this size? The answer to this question is called the P value.
The P value is a probability, with a value ranging from zero to one. If the P value is small, you'll conclude that the difference between sample means is unlikely to be a coincidence. Instead, you'll conclude that the populations have different means.
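To make the definition concrete, here is a minimal Python sketch that computes a P value by exact permutation, which is a different method from the t test. The enzyme values and the function name permutation_p_value are made up for illustration. The sketch asks: if the drug labels were irrelevant (the populations have the same mean), in what fraction of all possible relabelings of the animals would the difference between group means be as large as or larger than the one observed?

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(a, b):
    """Exact two-tail permutation P value for a difference in means.

    Illustrative helper, not a substitute for a t test: it enumerates
    every way of splitting the pooled values into groups of the
    original sizes, so it is only practical for small samples.
    """
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    observed = abs(mean(a) - mean(b))

    as_extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = sum_a / n_a - (total - sum_a) / n_b
        # Small tolerance so floating-point ties count as "as large".
        if abs(diff) >= observed - 1e-12:
            as_extreme += 1
        n_splits += 1
    return as_extreme / n_splits


# Hypothetical plasma enzyme values for two groups of five animals.
drug_a = [12.1, 14.3, 13.8, 15.2, 14.9]
drug_b = [11.0, 12.4, 11.9, 13.1, 12.2]
print(permutation_p_value(drug_a, drug_b))
```

A small result says that a difference this large would rarely arise from random assignment alone. It does not say how likely it is that the drugs differ.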
What is a null hypothesis?
When statisticians discuss P values, they use the term null hypothesis. The null hypothesis simply states that there is no difference between the groups. Using that term, you can define the P value to be the probability of observing a difference as large as or larger than the one you observed if the null hypothesis were true.
Common misinterpretation of a P value
Many people misunderstand P values. If the P value is 0.03, that means that there is a 3% chance of observing a difference as large as or larger than the one you observed even if the two population means are identical (the null hypothesis is true). It is tempting to conclude, therefore, that there is a 97% chance that the difference you observed reflects a real difference between populations and a 3% chance that the difference is due to chance. However, this would be an incorrect conclusion. What you can say is that random sampling from identical populations would lead to a difference smaller than you observed in 97% of experiments and larger than you observed in 3% of experiments. This distinction may be clearer after you read A Bayesian perspective.
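The correct reading can be checked with a small simulation. The Python sketch below rests on assumptions that are not part of the text above: both populations are normal with mean 0 and standard deviation 1, each sample has 10 values, and the threshold is the difference that a z test (known standard deviation) would assign a two-tail P value of 0.03. The function name fraction_at_least is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def fraction_at_least(threshold, n_per_group, n_experiments, seed=0):
    """Fraction of simulated experiments, drawn from two IDENTICAL
    populations, whose sample means differ by at least `threshold`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        diff = sum(a) / n_per_group - sum(b) / n_per_group
        if abs(diff) >= threshold:
            hits += 1
    return hits / n_experiments


n = 10
# With a known SD of 1, the difference between two means of n values
# has standard error sqrt(2 / n). This is the difference whose
# two-tail P value would be 0.03 under that assumption.
threshold = NormalDist().inv_cdf(1 - 0.03 / 2) * sqrt(2 / n)
fraction = fraction_at_least(threshold, n, 20000)
print(fraction)  # should land close to 0.03
```

Every simulated experiment here has a true null hypothesis, yet about 3% of them show a difference at least this large. The 3% describes how often identical populations produce such data. It says nothing about how likely the null hypothesis is in any one experiment.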
One- vs. two-tail P values
When comparing two groups, you must distinguish between one- and two-tail P values.
Start with the null hypothesis that the two populations really are the same and that the observed discrepancy between sample means is due to chance.
Note: This example is for an unpaired t test that compares the means of two groups. The same ideas can be applied to other statistical tests.
The two-tail P value answers this question: Assuming the null hypothesis is true, what is the chance that randomly selected samples would have means as far apart as (or further than) you observed in this experiment with either group having the larger mean?
To interpret a one-tail P value, you must predict which group will have the larger mean before collecting any data. The one-tail P value answers this question: Assuming the null hypothesis is true, what is the chance that randomly selected samples would have means as far apart as (or further than) observed in this experiment with the specified group having the larger mean?
A one-tail P value is appropriate only when previous data, physical limitations or common sense tell you that a difference, if any, can only go in one direction. The issue is not whether you expect a difference to exist - that is what you are trying to find out with the experiment. The issue is whether you should interpret increases and decreases in the same manner.
Choose a one-tail P value only when both of the following are true:
- You must predict which group will have the larger mean (or proportion) before you collect any data.
- If the other group ends up with the larger mean, even if it is quite a bit larger, you must be willing to attribute that difference to chance.
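The difference between the two questions can be sketched in Python. This example uses an exact permutation calculation, not the unpaired t test mentioned above, and the data and the function name one_and_two_tail are invented for illustration. The one-tail count considers only differences in the predicted direction, while the two-tail count accepts a difference in either direction.

```python
from itertools import combinations


def one_and_two_tail(predicted_larger, other):
    """Return (one_tail_p, two_tail_p) from an exact permutation test.

    `predicted_larger` is the group you predicted, before collecting
    data, would have the larger mean. Illustrative sketch for small
    samples only.
    """
    pooled = list(predicted_larger) + list(other)
    n_a, n_b = len(predicted_larger), len(other)
    total = sum(pooled)
    observed = sum(predicted_larger) / n_a - sum(other) / n_b

    one_tail = two_tail = n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = sum_a / n_a - (total - sum_a) / n_b
        # One tail: as far apart or further, in the predicted direction.
        if diff >= observed - 1e-12:
            one_tail += 1
        # Two tails: as far apart or further, in either direction.
        if abs(diff) >= abs(observed) - 1e-12:
            two_tail += 1
        n_splits += 1
    return one_tail / n_splits, two_tail / n_splits


# The predicted group really does have the larger mean.
print(one_and_two_tail([4, 5, 6], [1, 2, 3]))  # (0.05, 0.1)

# Same data, but the prediction was the other way around.
print(one_and_two_tail([1, 2, 3], [4, 5, 6]))  # (1.0, 0.1)
```

In the second call the "wrong" group has the larger mean, so the one-tail P value is as large as it can be even though the groups are clearly separated. That is the price of committing to a direction in advance.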
It is usually best to use a two-tail P value for these reasons:
- The relationship between P values and confidence intervals is easier to understand with two-tail P values.
- Some tests compare three or more groups, which makes the concept of tails inappropriate (more precisely, the P values have many tails). A two-tail P value is more consistent with the P values reported by these tests.
- Choosing a one-tail P value can pose a dilemma. What would you do if you chose to use a one-tail P value, observed a large difference between means, but the "wrong" group had the larger mean? In other words, the observed difference was in the opposite direction to your experimental hypothesis. To be rigorous, you must conclude that the difference is due to chance, even if the difference is huge. While tempting, it is not fair to switch to a two-tail P value or to reverse the direction of the experimental hypothesis. You avoid this situation by always using two-tail P values.
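Why switching is not fair can be shown with a short calculation. The Python sketch below assumes a test statistic that follows a standard normal distribution under the null hypothesis (a z approximation, not the t distribution an unpaired t test would use), and the function name is hypothetical. It computes how often you would declare a "significant" result if you planned a one-tail test at a given threshold but were willing to reverse the direction after seeing the data.

```python
from statistics import NormalDist


def false_positive_rate_if_direction_switched(alpha):
    """Chance of a 'significant' result when the null hypothesis is
    true, if a one-tail test at level `alpha` is applied in whichever
    direction the data happen to point.

    Assumes a standard normal test statistic under the null.
    """
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)   # one-tail cutoff
    upper = 1 - z.cdf(critical)       # predicted direction
    lower = z.cdf(-critical)          # "wrong" direction, accepted after the fact
    return upper + lower


print(round(false_positive_rate_if_direction_switched(0.05), 4))  # 0.1
```

Under these assumptions, a nominal 5% threshold becomes an actual 10% chance of a false positive, which is why the direction must be fixed before the data are collected.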
|