# Interpreting results: Mann-Whitney test

## How it works

The Mann-Whitney test, also called the Wilcoxon rank sum test, is a nonparametric test that compares two unpaired groups. To perform the Mann-Whitney test, Prism first ranks all the values from low to high, paying no attention to which group each value belongs to. The smallest value gets a rank of 1. The largest value gets a rank of n, where n is the total number of values in the two groups. Prism then averages the ranks in each group and reports the two averages. If the mean ranks of the two groups are very different, the P value will be small.
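The ranking step described above can be sketched in a few lines of Python. This is only an illustration, not Prism's code: the function name `mean_ranks` and the sample data are made up, and the sketch assumes there are no tied values.

```python
def mean_ranks(group_a, group_b):
    """Rank the pooled values (1 = smallest) and return each group's mean rank.

    Assumption: no tied values, which keeps the sketch short.
    """
    # Pool the values, remembering which group each one came from.
    pooled = [(v, "a") for v in group_a] + [(v, "b") for v in group_b]
    pooled.sort(key=lambda pair: pair[0])

    rank_sum = {"a": 0, "b": 0}
    for rank, (_, label) in enumerate(pooled, start=1):
        rank_sum[label] += rank

    return rank_sum["a"] / len(group_a), rank_sum["b"] / len(group_b)


# Hypothetical data: every value in group_a is below every value in group_b.
group_a = [1.1, 2.3, 3.8, 4.0]
group_b = [5.2, 6.1, 4.7, 7.9]
print(mean_ranks(group_a, group_b))  # (2.5, 6.5)
```

Here the two mean ranks are as far apart as they can be for two groups of four values, which is the situation that produces the smallest P value.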

## P value

You can't interpret a P value until you know the null hypothesis being tested. For the Mann-Whitney test, the null hypothesis is a bit hard to understand. The null hypothesis is that the distributions of both groups are identical, so that there is a 50% probability that a value randomly selected from one population exceeds a value randomly selected from the other population.

The P value answers this question:

If the groups are sampled from populations with identical distributions, what is the chance that random sampling would result in the mean ranks being as far apart (or more so) as observed in this experiment?

In most cases (including when ties are present), Prism calculates an exact P value (2). If your samples are large (the smaller group has more than 100 values), it computes the P value from a Gaussian approximation. Here, the term Gaussian refers to the distribution of the sum of ranks and does not imply that your data need to follow a Gaussian distribution. The approximation is quite accurate with large samples and is standard (used by all statistics programs).

Note that Prism 6 computes the exact P value much faster than prior versions did, so it uses the exact method with moderate-size data sets for which Prism 5 would have used an approximate method. It computes an exact P value when the smaller sample has 100 or fewer values, and otherwise computes an approximate one (with such large samples, the approximation is excellent).
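As a rough illustration of what a Gaussian approximation of this kind looks like, the sketch below converts U to a Z value using the textbook mean and standard deviation of U under the null hypothesis. It assumes no ties and applies no continuity correction, so it is not necessarily the exact formula Prism uses; the function name `approximate_p_value` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approximate_p_value(u, n_a, n_b):
    """Two-tailed P value from a Gaussian approximation to the distribution of U.

    Assumptions: no tied values and no continuity correction.
    """
    mean_u = n_a * n_b / 2                          # expected U under the null
    sd_u = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)   # SD of U under the null
    z = (u - mean_u) / sd_u
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical example: two groups of 150 values, U somewhat below its expected value.
print(approximate_p_value(9500, 150, 150))
```

Note that only U and the two sample sizes enter the calculation. The data themselves are never assumed to be Gaussian.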

If the P value is small, you can reject the null hypothesis that the difference is due to random sampling, and conclude instead that the populations are distinct.

If the P value is large, the data do not give you any reason to reject the null hypothesis. This is not the same as saying that the two populations are the same. You just have no compelling evidence that they differ. If you have small samples, the Mann-Whitney test has little power. In fact, if the total sample size is seven or less, the Mann-Whitney test will always give a two-tailed P value greater than 0.05 no matter how much the groups differ.
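The limit for tiny samples follows from simple counting. With no ties, the most extreme result possible is complete separation of the two groups, and only 2 of the possible ways of splitting the pooled ranks into the two groups are that extreme (one in each direction). The sketch below, using a hypothetical helper name, computes that smallest possible two-tailed P value.

```python
from math import comb


def smallest_two_tailed_p(n_a, n_b):
    """Smallest exact two-tailed P value possible for group sizes n_a and n_b.

    Assumption: no tied values, so the null distribution is symmetric and
    exactly two arrangements (one per tail) show complete separation.
    """
    return 2 / comb(n_a + n_b, n_a)


# Every way of splitting seven values into two groups stays above 0.05.
for n_a in range(1, 7):
    print(n_a, 7 - n_a, round(smallest_two_tailed_p(n_a, 7 - n_a), 4))

# With eight values split four and four, P can drop below 0.05.
print(round(smallest_two_tailed_p(4, 4), 4))  # 0.0286
```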

## Mann-Whitney U

Prism reports the Mann-Whitney U value, in case you want to compare calculations with those of another program or text. To compute U, pick one value from group A and one value from group B, and record which group has the larger value. Repeat for every possible pairing of a value from group A with a value from group B. Total up the number of times that the value in A is larger than the value in B, and the number of times the value in B is larger than the value in A. The smaller of these two totals is U.

When computing U, the number of comparisons equals the product of the number of values in group A times the number of values in group B. If the null hypothesis is true, then the value of U should be about half that value. If the value of U is much smaller than that, the P value will be small. The smallest possible value of U is zero. The largest possible value is half the product of the number of values in group A times the number of values in group B.
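The pairwise counting described above translates directly into code. The following is a minimal sketch with a hypothetical function name. Counting a tied pair as half a win for each group is a common convention and is an assumption here, since the text above does not say how tied pairs are counted.

```python
def mann_whitney_u(group_a, group_b):
    """Return U, the smaller of the two pairwise 'win' counts.

    Assumption: a tied pair counts as half a win for each group.
    """
    a_wins = 0.0
    b_wins = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                a_wins += 1
            elif b > a:
                b_wins += 1
            else:
                a_wins += 0.5
                b_wins += 0.5
    return min(a_wins, b_wins)


# Hypothetical data: 3 x 4 = 12 comparisons; A is larger 5 times, B 7 times.
print(mann_whitney_u([1, 4, 6], [2, 3, 5, 7]))  # 5.0

# Complete separation gives the smallest possible value.
print(mann_whitney_u([1.1, 2.3, 3.8, 4.0], [5.2, 6.1, 4.7, 7.9]))  # 0.0
```

Because the two counts always add up to the number of comparisons, the smaller one can never exceed half that number.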

## The difference between medians and its confidence interval

The Mann-Whitney test compares the distributions of ranks in two groups. If you assume that both populations have distributions with the same shape (which doesn't have to be Gaussian), it can be viewed as a comparison of two medians. Note that if you don't make this assumption, the Mann-Whitney test does not compare medians.

Prism reports the difference between medians only if you check the box to compare medians (on the Options tab). It reports the difference in two ways. One way is the obvious one -- it subtracts the median of one group from the median of the other group. The other way is to compute the Hodges-Lehmann estimate. Prism systematically computes the difference between each value in the first group and each value in the second group. The Hodges-Lehmann estimate is the median of this set of differences. Many think it is the best estimate for the difference between population medians.
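To make the two estimates concrete, here is a small sketch that computes both. The function name is hypothetical, and the direction of subtraction (second group minus first) is an assumption made for the example.

```python
from statistics import median


def hodges_lehmann(group_a, group_b):
    """Median of all pairwise differences (value in B minus value in A).

    Assumption: differences are taken as B minus A; reversing the direction
    only flips the sign of the estimate.
    """
    differences = [b - a for a in group_a for b in group_b]
    return median(differences)


# Hypothetical data.
group_a = [1, 4, 6]
group_b = [2, 3, 5, 7]

print(median(group_b) - median(group_a))  # 0.0  (simple difference between medians)
print(hodges_lehmann(group_a, group_b))   # 1.0  (Hodges-Lehmann estimate)
```

As the example shows, the two approaches need not give the same answer.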

Prism computes the confidence interval for the difference using the method explained on pages 521-524 of Sheskin (1) and pages 312-313 of Klotz (3). This method is based on the Hodges-Lehmann method.

Since the nonparametric test works with ranks, it is usually not possible to get a confidence interval with exactly 95% confidence. Prism uses the confidence level that is as close as possible to the level you requested, and reports what it is. For example, you might get a 96.2% confidence interval when you asked for a 95% interval. When reporting the confidence interval, you can either report the precise confidence level ("96.2%") or just report the confidence level you requested ("95%"). I think the latter approach is used more commonly.

Prism computes an exact confidence interval when the smaller sample has 100 or fewer values, and otherwise computes an approximate interval. With samples this large, this approximation is quite accurate.

## Tied values in the Mann-Whitney test

The Mann-Whitney test was developed for data that are measured on a continuous scale. Thus you expect every value you measure to be unique. But occasionally two or more values are the same. When the Mann-Whitney calculations convert the values to ranks, these values tie for the same rank, so each is assigned the average of the two (or more) ranks for which they tie.
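The averaging of tied ranks can be sketched as follows. The function name `average_ranks` is made up for this illustration.

```python
def average_ranks(values):
    """Return the rank of each value (1 = smallest), in the original order.

    Tied values share the average of the ranks they jointly occupy.
    """
    # Indices of the values, ordered from the smallest value to the largest.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)

    start = 0
    while start < len(order):
        # Extend 'end' to cover the whole run of values equal to values[order[start]].
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # The run occupies 1-based ranks start+1 .. end+1; share their average.
        shared = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1

    return ranks


# The two 1s tie for ranks 1 and 2, so each gets 1.5.
print(average_ranks([3, 1, 4, 1, 5]))  # [3.0, 1.5, 4.0, 1.5, 5.0]
```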

Prism uses a standard method to correct for ties when it computes U (or the sum of ranks; the two are equivalent).
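The equivalence between U and the sum of ranks can be checked numerically: the number of pairs in which the group A value is larger equals the sum of group A's ranks minus n_A(n_A + 1)/2, where n_A is the number of values in group A. The sketch below uses hypothetical data without ties and a made-up helper name.

```python
def u_from_rank_sum(rank_sum_a, n_a):
    """Number of (A, B) pairs in which the A value is larger, from A's rank sum."""
    return rank_sum_a - n_a * (n_a + 1) / 2


group_a = [1, 4, 6]
group_b = [2, 3, 5, 7]

# Rank the pooled values; using list.index is valid only because there are no ties.
pooled = sorted(group_a + group_b)
rank_sum_a = sum(pooled.index(v) + 1 for v in group_a)

# Count the pairs directly for comparison.
pairs_a_larger = sum(a > b for a in group_a for b in group_b)

print(u_from_rank_sum(rank_sum_a, len(group_a)), pairs_a_larger)  # 5.0 5
```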

Unfortunately, there isn't a standard method to get a P value from these statistics when there are ties. When the smaller sample has 100 or fewer values, Prism 6 computes the exact P value, even with ties (2). It tabulates every possible way to shuffle the data into two groups of the sample sizes actually used, and computes the fraction of those shuffled data sets where the difference between mean ranks was as large or larger than actually observed. When the samples are large (the smaller group has more than 100 values), Prism uses the approximate method, which converts U or sum-of-ranks to a Z value, and then looks up that value on a Gaussian distribution to get a P value.
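The idea behind the exact P value can be shown with a brute-force sketch that literally enumerates every way of splitting the pooled values into two groups of the original sizes. This is only practical for small samples and is not the efficient linked-list algorithm of reference 2. The function names are hypothetical.

```python
from itertools import combinations


def midranks(values):
    """Rank each value (1 = smallest); tied values get the average of their ranks."""
    return [
        sum(w < v for w in values) + (sum(w == v for w in values) + 1) / 2
        for v in values
    ]


def exact_p_value(group_a, group_b):
    """Two-tailed exact P value by enumerating every regrouping of the pooled ranks."""
    n_a = len(group_a)
    ranks = midranks(list(group_a) + list(group_b))
    n = len(ranks)
    total = sum(ranks)

    def mean_rank_gap(rank_sum_a):
        # Absolute difference between the two groups' mean ranks.
        return abs(rank_sum_a / n_a - (total - rank_sum_a) / (n - n_a))

    observed = mean_rank_gap(sum(ranks[:n_a]))

    as_extreme = 0
    arrangements = 0
    for chosen in combinations(range(n), n_a):
        arrangements += 1
        # Small tolerance so floating-point rounding does not hide exact equality.
        if mean_rank_gap(sum(ranks[i] for i in chosen)) >= observed - 1e-9:
            as_extreme += 1

    return as_extreme / arrangements


# Hypothetical data with complete separation: 2 of the 70 arrangements are this extreme.
print(exact_p_value([1.1, 2.3, 3.8, 4.0], [5.2, 6.1, 4.7, 7.9]))  # 2/70, about 0.0286
```

Because the ranks are computed with ties averaged before the enumeration, the same procedure works unchanged when the data contain tied values.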

## Why Prism 6 can report different results than prior versions

There are two reasons why Prism 6 can report different results than prior versions:

 • Exact vs. approximate P values. When samples are small, Prism computes an exact P value. When samples are larger, Prism computes an approximate P value. The results report which method was used. Prism 6 is much (much!) faster at computing exact P values, so it will do so with much larger samples. It does the exact test whenever the smaller group has 100 or fewer values.
 • How to handle ties? If two values are identical, they tie for the same rank. Prism 6, unlike most programs, computes an exact P value even in the presence of ties. Prism 5 and earlier versions always computed an approximate P value, and different approximations were used in different versions.

## References

1. DJ Sheskin, Handbook of parametric and nonparametric statistical procedures, 4th edition, 2007, ISBN 1584888148.

2. Ying Kuen Cheung and Jerome H. Klotz, The Mann-Whitney Wilcoxon distribution using linked lists, Statistica Sinica 7:805-813, 1997.