How to compare two means when the groups have different standard deviations.
The t test assumes equal variances
The standard unpaired t test (but not the Welch t test) assumes that the two sets of data are sampled from populations that have identical standard deviations, and thus identical variances, even if their means are distinct.
Testing whether two groups are sampled from populations with equal variances
As part of the t test analysis, Prism tests this assumption using an F test to compare the variances of the two groups. Note that a bug in earlier versions of Prism and InStat gave a P value for the F test that was too small by a factor of two.
Don’t mix up the P value testing for equality of the standard deviations of the groups with the P value testing for equality of the means. The latter P value is the one that answers the question you most likely were thinking about when you chose the t test or one-way ANOVA. The P value that tests for equality of variances answers this question:
If the populations really had identical standard deviations, what is the chance of observing as large a discrepancy among sample standard deviations as occurred in the data (or an even larger discrepancy)?
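As a rough illustration of the question above, here is a minimal Python sketch of a two-sided F test for equal variances. It assumes NumPy and SciPy are available. The function name f_test_equal_variances is made up for this example, and this is not Prism's own implementation. The one-tail probability is doubled so that the P value covers a discrepancy in either direction.

```python
import numpy as np
from scipy import stats


def f_test_equal_variances(x, y):
    """Two-sided F test comparing the variances of two samples.

    Hypothetical helper for illustration only. Like the F test itself,
    it assumes both samples come from Gaussian populations.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Ratio of the two sample variances (n - 1 in each denominator).
    f_ratio = x.var(ddof=1) / y.var(ddof=1)
    f_dist = stats.f(x.size - 1, y.size - 1)
    # Double the smaller tail so the test is two-sided.
    p_two_sided = 2.0 * min(f_dist.cdf(f_ratio), f_dist.sf(f_ratio))
    return f_ratio, p_two_sided


# Made-up data: the second group is twice as spread out as the first.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
f_ratio, p_value = f_test_equal_variances(group_a, group_b)
print(f"F = {f_ratio:.3f}, two-sided P = {p_value:.3f}")
```

Swapping the two groups inverts the F ratio but leaves the two-sided P value unchanged.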
What to do if the variances differ
If the P value is small, you reject the null hypothesis that both groups were sampled from populations with identical standard deviations (and thus identical variances).
Then what?
There are five possible answers.
- Conclude that the populations are different. In many experimental contexts, the finding of different standard deviations is as important as the finding of different means. If the standard deviations are different, then the populations are different regardless of what the t test concludes about differences between the means. Before treating this difference as a problem to work around, think about what it tells you about the data. This may be the most important conclusion from the experiment! Also consider whether the group with the larger standard deviation is heterogeneous. If a treatment was applied to this group, perhaps it only worked on about half of the subjects.
- Transform your data. In many cases, transforming the data can equalize the standard deviations. If that works, you can then run the t test on the transformed results. Logs are especially useful (see Chapter 46 of Intuitive Biostatistics for an example). The log transform is appropriate when data are sampled from a lognormal distribution. In other situations, a reciprocal or square root transform may prove useful. Ideally, of course, the transform should have been planned as part of the experimental design.
- Ignore the result. With equal, or nearly equal, sample sizes (and moderately large samples), the assumption of equal standard deviations is not crucial. The t test works pretty well even with unequal standard deviations. In other words, the t test is robust to violations of that assumption so long as the samples aren’t tiny and the sample sizes aren’t far apart. If you want to use ordinary t tests, run some simulations with the sample sizes you are actually using and the difference in variance you are expecting, to see how far off the t test results are.
- Go back and rerun the t test, checking the option to do the Welch t test that allows for unequal variance. While this sounds sensible, Moser and Stevens (1) have shown that it isn't. If you use the F test to compare variances to decide which t test to use (regular or Welch), you will have increased your risk of a Type I error. Even if the populations are identical, you will conclude that the populations are different more than 5% of the time. Hayes and Cai reach the same conclusion (2). The Welch test must be specified as part of the experimental design.
- Use a permutation test. No GraphPad program offers such a test. The idea is to treat the observed values as a given, and to ask about the distribution of those values between the two groups. Randomly shuffle the values between the two groups, maintaining the original sample sizes. What fraction of those shuffled data sets have a difference between means as large as (or larger than) the one observed? That is the P value. When the populations have different standard deviations, this test still produces reasonably accurate P values (Good, reference below, page 55). The disadvantage of these tests is that they don't readily yield a confidence interval. Learn more in Wikipedia, or Hyperstat.
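The shuffling procedure in the last item above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a reference implementation. The function name, the number of shuffles, the fixed seed and the data are all assumptions made for the example. It uses random shuffles rather than enumerating every possible rearrangement, so the result is an estimate of the permutation P value.

```python
import random
from statistics import mean


def permutation_test_means(group_a, group_b, n_shuffles=5000, seed=0):
    """Estimate a two-sided permutation P value for a difference in means.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    as_extreme = 0
    for _ in range(n_shuffles):
        # Reassign the observed values to the two groups at random,
        # keeping the original sample sizes.
        rng.shuffle(pooled)
        shuffled_diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if shuffled_diff >= observed:
            as_extreme += 1
    # Fraction of shuffles with a difference at least as large as observed.
    return as_extreme / n_shuffles


# Made-up data with clearly different spreads.
control = [4.1, 4.4, 3.9, 4.2, 4.0, 4.3]
treated = [2.5, 7.9, 5.1, 9.4, 3.3, 8.2]
p_value = permutation_test_means(control, treated)
print(f"Permutation P value: {p_value:.4f}")
```

With small samples it is feasible to enumerate every possible rearrangement instead of sampling them, which gives an exact P value.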
What about switching to the nonparametric Mann-Whitney test? At first glance, this seems to be a good solution to the problem of unequal standard deviations. But it isn't! The Mann-Whitney test tests whether the distribution of ranks is different. If you know the standard deviations are different, you already know that the distributions are different. What you may still want to know is whether the means or medians are distinct. But when the groups have different distributions, nonparametric tests do not test whether the medians differ. This is a common misunderstanding.
How to avoid the problem
None of the solutions above are great. It is better to avoid the problem.
One approach to avoiding the problem is to think clearly about the distribution of your data, and transform the data as part of routine data processing. If you know a system creates lognormal data, analyze the logarithms always.
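As a hedged illustration of analyzing logarithms routinely, the sketch below simulates two lognormal groups, compares their standard deviations before and after a log transform, and runs the ordinary t test on the logs. The data, seed and parameter values are invented for the example. It assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical lognormal data: the two groups have the same spread on a
# log scale, but their geometric means differ four-fold.
control = rng.lognormal(mean=np.log(10.0), sigma=0.5, size=30)
treated = rng.lognormal(mean=np.log(40.0), sigma=0.5, size=30)

# On the raw scale, the group with the larger mean also has the larger SD.
raw_sd_ratio = treated.std(ddof=1) / control.std(ddof=1)

# After taking logs, the two SDs should be much closer to each other.
log_control = np.log10(control)
log_treated = np.log10(treated)
log_sd_ratio = log_treated.std(ddof=1) / log_control.std(ddof=1)

# Ordinary (equal variance) unpaired t test on the log-transformed values.
result = stats.ttest_ind(log_treated, log_control, equal_var=True)

# A difference between mean logs back-transforms to a ratio of geometric means.
gm_ratio = 10 ** (log_treated.mean() - log_control.mean())

print(f"SD ratio, raw scale: {raw_sd_ratio:.2f}")
print(f"SD ratio, log scale: {log_sd_ratio:.2f}")
print(f"P value (t test on logs): {result.pvalue:.3g}")
print(f"Ratio of geometric means: {gm_ratio:.2f}")
```

Note that after the transform, the comparison is about the ratio of geometric means rather than the difference between arithmetic means.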
Another solution is to use the unequal variance (Welch) t test routinely. As mentioned above, it is not a good idea to first test for unequal standard deviations, and use that result as the basis to decide whether to use the ordinary or modified (unequal variance, Welch) t test. But does it make sense to always use the modified test? Ruxton suggests that this is the best thing to do (3). You lose some power when the standard deviations are, in fact, equal but gain power in the cases where they are not.
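The kind of simulation suggested earlier can be sketched as follows. Both populations are given the same mean, so every "significant" result is a Type I error. The sketch compares three strategies: always using the ordinary t test, always using the Welch test, and choosing between them based on an F test. The sample sizes, standard deviations, number of simulations and function name are all assumptions chosen for illustration. Substitute the values that match your own experiment.

```python
import numpy as np
from scipy import stats


def type_i_error_rates(n1, sd1, n2, sd2, n_sims=5000, alpha=0.05, seed=0):
    """Estimate how often each strategy rejects when the true means are equal.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    # Each row is one simulated experiment; both populations have mean 0.
    a = rng.normal(0.0, sd1, size=(n_sims, n1))
    b = rng.normal(0.0, sd2, size=(n_sims, n2))

    # Ordinary (pooled variance) and Welch t tests, one per simulated experiment.
    p_student = stats.ttest_ind(a, b, axis=1, equal_var=True).pvalue
    p_welch = stats.ttest_ind(a, b, axis=1, equal_var=False).pvalue

    # Two-sided F test on the ratio of sample variances.
    f_ratio = a.var(axis=1, ddof=1) / b.var(axis=1, ddof=1)
    f_dist = stats.f(n1 - 1, n2 - 1)
    p_f = 2.0 * np.minimum(f_dist.cdf(f_ratio), f_dist.sf(f_ratio))

    # Conditional strategy: Welch if the F test rejects, otherwise ordinary.
    p_conditional = np.where(p_f < alpha, p_welch, p_student)

    return {
        "student": float(np.mean(p_student < alpha)),
        "welch": float(np.mean(p_welch < alpha)),
        "conditional": float(np.mean(p_conditional < alpha)),
    }


# Assumed scenario: the smaller group has the larger standard deviation.
rates = type_i_error_rates(n1=10, sd1=3.0, n2=40, sd2=1.0)
for strategy, rate in rates.items():
    print(f"{strategy:12s} rejects in {rate:.1%} of simulations")
```

In a scenario like this one, where the smaller group has the larger standard deviation, the ordinary t test should reject far more often than the nominal 5%, while the Welch test should stay close to 5%. Try equal sample sizes or equal standard deviations to see how the picture changes.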
The Welch t test makes a strange set of assumptions. What would it mean for two populations to have the same mean but different standard deviations? Why would you want to test for that? Sawilowsky points out that this situation simply doesn't often come up in science (4). I prefer to think about the unequal variance t test as a way to create a confidence interval. Your prime goal is not to ask whether two populations differ, but to quantify how far apart the two means are. The unequal variance t test reports a confidence interval for the difference between two means that is usable even if the standard deviations differ.
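To make the confidence interval concrete, here is a minimal sketch of the standard Welch calculation. The standard error of the difference combines the two sample variances without pooling them, and the degrees of freedom come from the Welch-Satterthwaite approximation. The function name and data are made up for the example. It assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy import stats


def welch_confidence_interval(x, y, confidence=0.95):
    """Confidence interval for mean(x) - mean(y) without assuming equal SDs.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Squared standard error of each sample mean.
    se2_x = x.var(ddof=1) / x.size
    se2_y = y.var(ddof=1) / y.size
    diff = x.mean() - y.mean()
    se_diff = np.sqrt(se2_x + se2_y)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_x + se2_y) ** 2 / (
        se2_x ** 2 / (x.size - 1) + se2_y ** 2 / (y.size - 1)
    )
    t_crit = stats.t.ppf(0.5 + confidence / 2.0, df)
    return diff - t_crit * se_diff, diff + t_crit * se_diff, df


# Made-up data: the second group has twice the standard deviation.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
lower, upper, df = welch_confidence_interval(group_a, group_b)
print(f"Difference between means: 95% CI {lower:.2f} to {upper:.2f} (df = {df:.2f})")
```

The interval is centered on the observed difference between means. It excludes zero exactly when the Welch t test gives a two-sided P value below 0.05.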
References
1. Moser, B.K. and G.R. Stevens. Homogeneity of Variance in the Two Sample Means Test. The American Statistician, 1992;46(1):19-22.
2. Hayes and Cai. Further evaluating the conditional decision rule for comparing two independent means. Br J Math Stat Psychol (2007)
3. Ruxton. The unequal variance t-test is an underused alternative to Student's t-test and the Mann-Whitney U test. Behavioral Ecology (2006) vol. 17 (4) pp. 688
4. S.S. Sawilowsky. Fermat, Schubert, Einstein, and Behrens-Fisher: The Probable Difference Between Two Means With Different Variances. J. Modern Applied Statistical Methods (2002) vol. 1 pp. 461-472