1. Compute the square of the difference between each value and the sample mean.

2. Add those values up.

3. Divide the sum by n-1. This is called the variance.

4. Take the square root to obtain the standard deviation.
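The four steps above can be sketched in Python as follows. This is a minimal illustration, not Prism's own code; the function name `sample_sd` and the example values are made up for the demonstration.

```python
import math

def sample_sd(values):
    """Sample standard deviation, using n-1 in the denominator."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to compute an SD")
    mean = sum(values) / n
    # Step 1: square of the difference between each value and the sample mean
    squared_diffs = [(x - mean) ** 2 for x in values]
    # Step 2: add those values up
    sum_of_squares = sum(squared_diffs)
    # Step 3: divide by n-1 to get the variance
    variance = sum_of_squares / (n - 1)
    # Step 4: take the square root to get the SD
    return math.sqrt(variance)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical values
print(sample_sd(data))
```

For these eight values the mean is 5 and the sum of squares is 32, so the function returns the square root of 32/7.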

Why divide by n-1 rather than n in the third step above? In step 1, you compute the difference between each value and the mean of those values. You don't know the true mean of the population; all you know is the mean of your sample. Except for the rare cases where the sample mean happens to equal the population mean, the data will be closer to the sample mean than to the true population mean. So the sum of squares you compute in step 2 will probably be a bit smaller (and can't be larger) than it would be if you had used the true population mean in step 1. To make up for this, we divide by n-1 rather than n.
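A small sketch can make this concrete. It draws one sample from a Gaussian population whose mean is known (something you never have with real data) and compares the sum of squares around the sample mean with the sum of squares around the true mean. The population mean and SD, the sample size, the seed, and the helper name `sum_sq` are all arbitrary assumptions chosen for illustration.

```python
import random

random.seed(1)  # arbitrary seed so the run is repeatable

TRUE_MEAN = 100.0  # hypothetical population mean
TRUE_SD = 15.0     # hypothetical population SD

def sum_sq(values, center):
    """Sum of squared differences between each value and `center`."""
    return sum((x - center) ** 2 for x in values)

sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(5)]
sample_mean = sum(sample) / len(sample)

ss_sample = sum_sq(sample, sample_mean)  # what you can actually compute
ss_true = sum_sq(sample, TRUE_MEAN)      # what you would compute if you knew the true mean

print("around sample mean:", ss_sample)
print("around true mean:  ", ss_true)
```

The gap between the two sums equals n times the squared difference between the sample mean and the true mean, which is why the first sum can never exceed the second.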

But why n-1? If you knew the sample mean, and all but one of the values, you could calculate what that last value must be. Statisticians say there are n-1 degrees of freedom.
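The degrees-of-freedom argument can be checked with a few lines of Python. The values below are invented for the example; the point is only that once the mean and n-1 of the values are fixed, the last value is no longer free to vary.

```python
values = [3.0, 7.0, 8.0, 10.0]  # hypothetical data
n = len(values)
mean = sum(values) / n

# Pretend the last value is unknown: only the mean and the other n-1 values are given.
known = values[:-1]
recovered = n * mean - sum(known)

print(recovered)  # matches values[-1]
```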

The n-1 equation is used in the common situation where you are analyzing a sample of data and wish to make more general conclusions. The SD computed this way (with n-1 in the denominator) is your best guess for the value of the SD in the overall population.

If you simply want to quantify the variation in a particular set of data, and don't plan to extrapolate to make wider conclusions, compute the SD using n in the denominator. The resulting SD is the SD of those particular values, but will most likely underestimate the SD of the population from which those points were drawn.
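For readers who work in Python rather than Prism, the standard library's `statistics` module offers both versions: `stdev()` divides by n-1 and `pstdev()` divides by n. The data values here are hypothetical.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical values

sd_sample = statistics.stdev(data)       # n-1 in the denominator
sd_population = statistics.pstdev(data)  # n in the denominator

print("SD with n-1:", sd_sample)
print("SD with n:  ", sd_population)
```

The n version is always the smaller of the two, and the difference shrinks as n grows.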

The goal of science is always to generalize, so the equation with n in the denominator should not be used when analyzing scientific data. The only example I can think of where it might make sense to use n (not n-1) in the denominator is in quantifying the variation among exam scores. But much better would be to show a scatterplot of every score, or a frequency distribution histogram.

Prism always computes the SD using n-1.

The SD quantifies scatter, so clearly you need more than one value! Are two values enough? Many people believe it is not possible to compute an SD from only two values. But that is wrong. The equation that calculates the SD works just fine when you have only duplicate (n=2) data.

Are the results valid? There is no mathematical reason to think otherwise, but I answered the question with simulations. I simulated ten thousand data sets with n=2 and each data point randomly chosen from a Gaussian distribution. Since all statistical tests are actually based on the variance (the square of the SD), I compared the variance computed from the duplicate values with the true variance. The average of the 10,000 variances of simulated data was within 1% of the true variance of the distribution from which the data were simulated. This means that the SD computed from duplicate data is a valid assessment of the scatter in your data. The variance is right on average, but any single estimate is likely to be pretty far from the true value, and with only two values it falls below the true variance more often than above it.
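A simulation along these lines is easy to reproduce. The sketch below is not the original simulation; it assumes a Gaussian population with mean 0 and SD 1 and an arbitrary random seed, and it reports both the average of the simulated variances and how often a single n=2 variance is off by more than a factor of two.

```python
import random
import statistics

random.seed(12345)  # arbitrary seed so the run is repeatable

TRUE_SD = 1.0                 # assumed population SD
TRUE_VARIANCE = TRUE_SD ** 2
N_SIMULATIONS = 10_000

variances = []
for _ in range(N_SIMULATIONS):
    # One simulated data set: duplicate (n=2) values from a Gaussian distribution
    pair = [random.gauss(0.0, TRUE_SD) for _ in range(2)]
    variances.append(statistics.variance(pair))  # uses n-1 in the denominator

mean_variance = sum(variances) / N_SIMULATIONS

# Fraction of single estimates below half or above double the true variance
far_off = sum(
    v < TRUE_VARIANCE / 2 or v > TRUE_VARIANCE * 2 for v in variances
) / N_SIMULATIONS

print("average of simulated variances:", mean_variance)
print("fraction off by more than 2x:  ", far_off)
```

With a different seed the numbers will shift slightly, but the average should stay close to the true variance while a large share of the individual estimates remain far from it.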

Excel can compute the SD from a range of values using the STDEV() function. For example, if you want to know the standard deviation of the values in cells B1 through B10, use this formula in Excel:

=STDEV(B1:B10)

That function computes the SD using n-1 in the denominator. If you want to compute the SD using n in the denominator (see above), use Excel's STDEVP() function.