What is a P value?
Why do we need statistical calculations?
When analyzing data, your goal is simple: you wish to make the strongest possible conclusion from limited amounts of data. To do this, you need to overcome two problems:
Statistical analyses are most useful when you are looking for differences that are small compared to experimental imprecision and biological variability. If you only care about large differences, you may follow these aphorisms:
If you need statistics to analyze your experiment, then you've done the wrong experiment.
If your data speak for themselves, don't interrupt!
But in many fields, scientists care about small differences and are faced with large amounts of variability. Statistical methods are necessary.
Population vs. samples
The basic idea of statistics is simple: you want to extrapolate from the data you have collected to make general conclusions. Statistical analyses are based on a simple model. There is a large population of data out there, and you have randomly sampled parts of it. You analyze your sample to make inferences about the population. Consider several situations:
The logic of statistics assumes that your sample is randomly selected from the population, and that you only want to extrapolate to that population. This works perfectly for quality control. When you apply this logic to scientific data, you encounter two problems:
Assumption of independence
It is not enough that your data are sampled from a population. Statistical tests are also based on the assumption that each subject (or each experimental unit) was sampled independently of the rest. The assumption of independence is easiest to understand by studying counterexamples.
P values
Definition of a P value
Consider an experiment where you've measured values in two samples, and the means are different. How sure are you that the population means are different as well? There are two possibilities:
The P value is a probability, with a value ranging from zero to one. It is the answer to this question: If the populations really have the same mean overall, what is the probability that random sampling would lead to a difference between sample means as large as (or larger than) you observed?
How are P values calculated?
There are many methods, and you'll need to read a statistics text to learn about them. The choice of statistical test depends on how you express the results of an experiment (measurement, survival time, proportion, etc.), on whether the treatment groups are paired, and on whether you are willing to assume that measured values follow a Gaussian bell-shaped distribution.
Common misinterpretation of a P value
Many people misunderstand what question a P value answers. If the P value is 0.03, that means that there is a 3% chance of observing a difference as large as you observed even if the two population means are identical. It is tempting to conclude, therefore, that there is a 97% chance that the difference you observed reflects a real difference between populations and a 3% chance that the difference is due to chance. Wrong. What you can say is that random sampling from identical populations would lead to a difference smaller than you observed in 97% of experiments and larger than you observed in 3% of experiments. You have to choose. Would you rather believe in a 3% coincidence? Or that the population means are really different?
"Extremely significant" results
Intuitively, you probably think that P=0.0001 is more statistically significant than P=0.04. Using strict definitions, this is not correct. Once you have set a threshold P value for statistical significance, every result is either statistically significant or is not statistically significant. Some statisticians feel very strongly about this. Many scientists are not so rigid, and refer to results as being "very significant" or "extremely significant" when the P value is tiny.
Often, results are flagged with a single asterisk when the P value is less than 0.05, with two asterisks when the P value is less than 0.01, and three asterisks when the P value is less than 0.001. This is not a firm convention, so you need to check the figure legends when you see asterisks to find the definitions the author used.
When comparing two groups, you must distinguish between one- and two-tail P values. Start with the null hypothesis that the two populations really are the same and that the observed discrepancy between sample means is due to chance.
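The distinction between one- and two-tail P values can be made concrete with a permutation test. The following is only an illustrative sketch, not the method any particular statistics program uses: the function name `permutation_p_values`, its parameters, and the example measurements are all hypothetical. Under the null hypothesis the group labels carry no information, so the code shuffles the pooled values many times and counts how often chance alone produces a difference between sample means at least as large as the observed one.

```python
import random
from statistics import mean


def permutation_p_values(group_a, group_b, n_shuffles=5000, seed=0):
    """Estimate (one_tail, two_tail) P values for a difference between means.

    Hypothetical helper for illustration. The one-tail value counts only
    shuffled differences in the same direction as the observed difference,
    which is appropriate only if that direction was predicted before the
    data were collected.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    sign = 1 if observed >= 0 else -1
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    one_tail_hits = 0  # as large as observed, in the observed direction
    two_tail_hits = 0  # as large as observed, in either direction
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            two_tail_hits += 1
        if sign * diff >= abs(observed):
            one_tail_hits += 1

    return one_tail_hits / n_shuffles, two_tail_hits / n_shuffles


if __name__ == "__main__":
    # Made-up measurements, used only to exercise the function.
    treated = [5.1, 4.9, 6.2, 5.8, 6.0, 5.5]
    control = [4.2, 4.8, 5.0, 4.5, 5.3, 4.4]
    one_tail, two_tail = permutation_p_values(treated, control)
    print(f"one-tail P = {one_tail:.4f}, two-tail P = {two_tail:.4f}")
```

Because every shuffle counted toward the one-tail value is also counted toward the two-tail value, the one-tail P value in this sketch can never exceed the two-tail P value.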
A one-tail P value is appropriate only when previous data, physical limitations or common sense tell you that a difference, if any, can only go in one direction. The issue is not whether you expect a difference to exist - that is what you are trying to find out with the experiment. The issue is whether you should interpret increases and decreases the same. You should only choose a one-tail P value when you believe the following:
It is usually best to use a two-tail P value for these reasons:
Statistical hypothesis testing
The P value is a fraction. In many situations, the best thing to do is report that number to summarize the results of a comparison. If you do this, you can totally avoid the term "statistically significant", which is often misinterpreted. In other situations, you'll want to make a decision based on a single comparison. In these situations, follow the steps of statistical hypothesis testing.
Note that statisticians use the term hypothesis testing very differently than scientists.
Statistical significance
The term significant is seductive, and it is easy to misinterpret it. A result is said to be statistically significant when it would be surprising if the populations were really identical; formally, when the P value is less than a preset threshold value. It is easy to read far too much into the word significant because the statistical use of the word has a meaning entirely distinct from its usual meaning. Just because a difference is statistically significant does not mean that it is important or interesting. And a result that is not statistically significant (in the first experiment) may turn out to be very important. If a result is statistically significant, there are two possible explanations:
There are also two explanations for a result that is not statistically significant:
Confidence intervals
Statistical calculations produce two kinds of results that help you make inferences about the populations from the samples. You've already learned about P values. The second kind of result is a confidence interval.
95% confidence interval of a mean
Although the calculation is exact, the mean you calculate from a sample is only an estimate of the population mean. How good is the estimate? It depends on how large your sample is and how much the values differ from one another. Statistical calculations combine sample size and variability to generate a confidence interval for the population mean. You can calculate intervals for any desired degree of confidence, but 95% confidence intervals are used most commonly. If you assume that your sample is randomly selected from some population, you can be 95% sure that the confidence interval includes the population mean. More precisely, if you generate many 95% CIs from many data sets, you expect the CI to include the true population mean in 95% of the cases and not to include the true mean value in the other 5%. Since you don't know the population mean, you'll never know for sure whether or not your confidence interval contains the true mean.
Other situations
When comparing groups, calculate the 95% confidence interval for the difference between the population means. Again, the interpretation is straightforward. If you accept the assumptions, there is a 95% chance that the interval you calculate includes the true difference between population means. Methods exist to compute a 95% confidence interval for any calculated statistic, for example the relative risk or the best-fit value in nonlinear regression. The interpretation is the same in all cases. If you accept the assumptions of the test, you can be 95% sure that the interval contains the true population value.
Or more precisely, if you repeat the experiment many times, you expect the 95% confidence interval will contain the true population value in 95% of the experiments.
Why 95%?
There is nothing special about 95%. It is just convention that confidence intervals are usually calculated for 95% confidence. In theory, confidence intervals can be computed for any degree of confidence. If you want more confidence, the intervals will be wider. If you are willing to accept less confidence, the intervals will be narrower.
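Both the repeated-experiment interpretation and the link between confidence level and interval width can be checked with a small simulation. The sketch below rests on stated assumptions: the function names, the Gaussian population with mean 10 and standard deviation 2, and the sample size of 50 are all hypothetical, and the interval uses a large-sample normal approximation. For small samples, a critical value from the t distribution would be used instead, which gives somewhat wider intervals.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, confidence=0.95):
    """Approximate confidence interval for the population mean.

    Uses the normal approximation, which is reasonable only for
    fairly large samples (an assumption of this sketch).
    """
    n = len(values)
    sem = stdev(values) / sqrt(n)  # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    center = mean(values)
    return center - z * sem, center + z * sem


def coverage(true_mean=10.0, true_sd=2.0, n=50, n_experiments=2000,
             confidence=0.95, seed=0):
    """Fraction of simulated experiments whose CI contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # One simulated experiment: a random sample from a known population.
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        low, high = mean_confidence_interval(sample, confidence)
        if low <= true_mean <= high:
            hits += 1
    return hits / n_experiments


if __name__ == "__main__":
    print(f"Coverage of 95% intervals: {coverage():.3f}")
    rng = random.Random(42)
    data = [rng.gauss(10.0, 2.0) for _ in range(50)]
    for level in (0.90, 0.95, 0.99):
        low, high = mean_confidence_interval(data, level)
        print(f"{level:.0%} CI: {low:.2f} to {high:.2f}")
```

In a real experiment the population mean is unknown, so this kind of check is possible only in a simulation, where the true mean is set by hand.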