
© 1999 GraphPad Software Inc.

The Prism Guide to Interpreting Statistical Results
This guide is excerpted from Analyzing Data with GraphPad Prism, a book that accompanies the program GraphPad Prism.

Interpreting correlation results

Introduction to correlation

When two variables vary together, statisticians say that there is a lot of covariation or correlation. The correlation coefficient, r, quantifies the direction and magnitude of correlation.

Correlation is not the same as linear regression, but the two are related. Linear regression finds the line that best predicts Y from X. Correlation quantifies how well X and Y vary together. In some situations, you might want to perform both calculations.

Correlation only makes sense when both X and Y variables are outcomes you measure. If you control X (e.g., time, dose, concentration), don't use correlation; use linear regression instead. See Linear regression.

Correlation calculations do not discriminate between X and Y, but rather quantify the relationship between the two variables. Linear regression does discriminate between X and Y. Linear regression finds the best line that predicts Y from X by minimizing the sum of the squares of the vertical distances of the points from the regression line. The X and Y variables are not symmetrical in the regression calculations. Therefore choose regression, rather than correlation, only if you can clearly define which variable is X and which is Y.
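The asymmetry described above can be checked numerically. The following is a minimal sketch, not Prism's implementation: the data values and the helper names (`sums_of_squares`, `pearson_r`, `regression_slope`) are made up for illustration. It shows that swapping X and Y leaves r unchanged but produces a different regression line.

```python
import math

def sums_of_squares(x, y):
    """Return (Sxx, Syy, Sxy): sums of squared and cross deviations from the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxx, syy, sxy

def pearson_r(x, y):
    """Pearson correlation coefficient; treats x and y symmetrically."""
    sxx, syy, sxy = sums_of_squares(x, y)
    return sxy / math.sqrt(sxx * syy)

def regression_slope(predictor, response):
    """Slope of the least-squares line that predicts response from predictor."""
    sxx, _, sxy = sums_of_squares(predictor, response)
    return sxy / sxx

# Hypothetical measurements, invented for this example.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.5, 6.9]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)                    # same value as r_xy
slope_y_from_x = regression_slope(x, y)   # line predicting Y from X
slope_x_from_y = regression_slope(y, x)   # line predicting X from Y

print("r(X, Y) =", r_xy, " r(Y, X) =", r_yx)
print("slope of Y on X =", slope_y_from_x)
print("slope of X on Y =", slope_x_from_y)
```

If the two regressions described the same line, the second slope would be the reciprocal of the first. It is not (unless r is exactly 1 or -1); instead the product of the two slopes equals r squared.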

How correlation works

Correlation coefficient

The correlation coefficient, r, ranges from -1 to +1. The nonparametric Spearman correlation coefficient, abbreviated rs, has the same range.

Value of r (or rs)    Interpretation
r = 0                 The two variables do not vary together at all.
0 < r < 1             The two variables tend to increase or decrease together.
r = 1.0               Perfect correlation.
-1 < r < 0            One variable increases as the other decreases.
r = -1.0              Perfect negative or inverse correlation.
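To make the two coefficients concrete, here is a minimal sketch that computes both from scratch. It assumes the usual definition of the Spearman coefficient as the Pearson coefficient of the ranks, with tied values sharing their average rank; the guide does not spell out Prism's exact procedure, and the data and function names are invented for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_rs(x, y):
    """Spearman coefficient, computed as the Pearson coefficient of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data, invented for this example.
x = [1, 2, 3, 4, 5, 6]
y_curved = [v ** 3 for v in x]       # always increases with x, but not linearly
y_falling = [10 - 2 * v for v in x]  # exact straight line with negative slope

r_curved = pearson_r(x, y_curved)     # between 0 and 1
rs_curved = spearman_rs(x, y_curved)  # 1: the ranks agree perfectly
r_falling = pearson_r(x, y_falling)   # -1: perfect inverse correlation

print(r_curved, rs_curved, r_falling)
```

The curved data set shows why the two coefficients can differ: Pearson's r measures linear covariation, while the rank-based rs reaches 1 whenever Y always increases with X.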

If r or rs is far from zero, there are four possible explanations:

  • The X variable helps determine the value of the Y variable.
  • The Y variable helps determine the value of the X variable.
  • Another variable influences both X and Y.
  • X and Y don't really correlate at all, and you just happened to observe such a strong correlation by chance. The P value quantifies how often this could occur.
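The third explanation in the list above (another variable influences both X and Y) can be illustrated with a small simulation. This is a sketch under invented assumptions: a hidden variable `z` drives both `x` and `y`, each with independent noise, and neither `x` nor `y` affects the other. The sample size, noise level and seed are arbitrary choices.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 500

# z is the hidden common cause; x and y each equal z plus their own noise.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 0.3) for zi in z]
y = [zi + rng.gauss(0, 0.3) for zi in z]

r_xy = pearson_r(x, y)  # strong, although x and y never influence each other

# Subtract z's contribution (known exactly here only because the data are
# simulated); what remains is independent noise.
x_noise = [xi - zi for xi, zi in zip(x, z)]
y_noise = [yi - zi for yi, zi in zip(y, z)]
r_after_removing_z = pearson_r(x_noise, y_noise)  # close to zero

print("r(x, y) =", r_xy)
print("r after removing z =", r_after_removing_z)
```

With real data the common cause is usually unmeasured, so a large r by itself cannot distinguish this situation from the first two explanations.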

r²

Perhaps the best way to interpret the value of r is to square it to calculate r². Statisticians call this quantity the coefficient of determination, but scientists call it r squared. It has a value that ranges from zero to one, and is the fraction of the variance in the two variables that is shared. For example, if r² = 0.59, then 59% of the variance in X can be explained by variation in Y. Likewise, 59% of the variance in Y can be explained by (or goes along with) variation in X. More simply, 59% of the variance is shared between X and Y.

Prism only calculates an r² value from the Pearson correlation coefficient. It is not appropriate to compute r² from the nonparametric Spearman correlation coefficient.
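The "shared variance" reading of r squared can be verified directly. The sketch below uses made-up data and a hypothetical helper, `fraction_explained`, that fits a least-squares line and reports the fraction of the response variance it accounts for. The result is the same whichever variable is treated as the predictor, and it equals the square of the Pearson coefficient.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def fraction_explained(predictor, response):
    """1 - SS_residual / SS_total for the least-squares line
    that predicts response from predictor."""
    n = len(predictor)
    mean_p, mean_r = sum(predictor) / n, sum(response) / n
    spp = sum((p - mean_p) ** 2 for p in predictor)
    spr = sum((p - mean_p) * (r - mean_r) for p, r in zip(predictor, response))
    slope = spr / spp
    intercept = mean_r - slope * mean_p
    ss_residual = sum((r - (intercept + slope * p)) ** 2
                      for p, r in zip(predictor, response))
    ss_total = sum((r - mean_r) ** 2 for r in response)
    return 1 - ss_residual / ss_total

# Hypothetical measurements, invented for this example.
x = [3.1, 4.7, 5.2, 6.8, 7.4, 9.0, 9.9, 11.3]
y = [12.0, 9.5, 14.1, 13.2, 17.8, 15.0, 19.4, 18.1]

r_squared = pearson_r(x, y) ** 2
y_explained_by_x = fraction_explained(x, y)
x_explained_by_y = fraction_explained(y, x)

print(r_squared, y_explained_by_x, x_explained_by_y)  # three equal values
```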

P value

The P value answers this question: If there really is no correlation between X and Y in the overall population, what is the chance that random sampling would result in a correlation coefficient as far from zero as (or further from zero than) the one observed in this experiment?
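One way to see what this question means is a permutation test: shuffle the Y values so that any real link to X is destroyed, recompute r, and count how often the shuffled r is at least as far from zero as the observed one. This is a sketch of that technique, not the calculation Prism performs (which this guide does not describe); the data, the function name `permutation_p_value`, the number of shuffles and the seed are all assumptions made for illustration.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(x, y, n_shuffles=5000, seed=0):
    """Two-tailed permutation estimate of the P value for r.

    Shuffling y simulates a world with no correlation between x and y.
    The estimate is the fraction of shuffles whose |r| is at least as
    large as the observed |r| (with the common +1 correction so the
    estimate is never exactly zero).
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(y_shuffled)
        if abs(pearson_r(x, y_shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)

# Hypothetical data with a strong linear trend, invented for this example.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.2, 1.9, 3.3, 3.8, 5.1, 6.4, 6.8, 8.3, 8.9, 10.2]

p_strong = permutation_p_value(x, y)
print("estimated P value:", p_strong)
```

A small estimate means that shuffled (uncorrelated) data rarely produce a coefficient as extreme as the observed one.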

How to think about results of linear correlation

Look first at a graph of your data to see how X and Y vary together. Then look at the value of r (or rs), which quantifies the correlation. Finally, look at the P value.

If the P value is small, you can reject the idea that the correlation is a coincidence. Look at the confidence interval for r. You can be 95% sure that the true population r lies somewhere within that range.

If the P value is large, the data do not give you any reason to conclude that the correlation is real. This is not the same as saying that there is no correlation at all. You just have no compelling evidence that the correlation is real and not a coincidence. Look at the confidence interval for r. It will extend from a negative correlation to a positive correlation. If the entire interval consists of values near zero that you would consider biologically trivial, then you have strong evidence that either there is no correlation in the population or that there is only a weak (biologically trivial) association. On the other hand, if the confidence interval contains correlation coefficients that you would consider biologically important, then you cannot draw any strong conclusion from this experiment. To reach a strong conclusion, you'll need data from a larger experiment.
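The role of sample size in the width of the confidence interval can be sketched with the Fisher z transformation, a common approximation for a confidence interval around a Pearson r. Whether Prism uses exactly this method is an assumption; the guide does not say. The function name `fisher_confidence_interval` and the example values of r and n are invented for illustration.

```python
import math
from statistics import NormalDist

def fisher_confidence_interval(r, n, confidence=0.95):
    """Approximate confidence interval for a population Pearson r.

    Uses the Fisher z transformation: atanh(r) is treated as roughly
    Gaussian with standard error 1 / sqrt(n - 3), and the limits are
    transformed back with tanh.
    """
    if n <= 3:
        raise ValueError("need more than 3 pairs of values")
    z = math.atanh(r)
    standard_error = 1 / math.sqrt(n - 3)
    critical = NormalDist().inv_cdf(0.5 + confidence / 2)
    return (math.tanh(z - critical * standard_error),
            math.tanh(z + critical * standard_error))

# Small experiment: the interval spans zero and also includes large
# correlations, so no strong conclusion is possible.
low_small, high_small = fisher_confidence_interval(r=0.20, n=12)

# Large experiment with r near zero: the interval still spans zero,
# but every value in it is close to zero.
low_large, high_large = fisher_confidence_interval(r=0.02, n=1000)

print("n = 12:  ", low_small, high_small)
print("n = 1000:", low_large, high_large)
```

Both intervals include zero, yet only the second supports the conclusion that any correlation in the population is weak.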

Checklist. Is correlation the right analysis for these data?

To check that correlation is an appropriate analysis for these data, ask yourself these questions. Prism cannot help answer them.

Are the subjects independent?

Correlation assumes that any random factor affects only one subject, and not others. You would violate this assumption if you chose half the subjects from one group and half from another. A difference between groups would affect half the subjects and not the other half.

Are X and Y measured independently?

The calculations are not valid if X and Y are intertwined. You'd violate this assumption if you correlate midterm exam scores with overall course score, as the midterm score is one of the components of the overall score.

Were X values measured (not controlled)?

If you controlled X values (e.g., concentration, dose or time), you should calculate linear regression rather than correlation.

Is the covariation linear?

A correlation analysis would not be helpful if Y increases as X increases up to a point, and then Y decreases as X increases further. You might obtain a low value of r even though the two variables are strongly related. The correlation coefficient quantifies linear covariation only.

Are X and Y distributed according to Gaussian distributions?

To accept the P value from standard (Pearson) correlation, the X and Y values must each be sampled from populations that follow Gaussian distributions. Spearman nonparametric correlation does not make this assumption.
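The linearity question in the checklist above is easy to demonstrate. In this minimal sketch, with data invented for illustration, Y is completely determined by X, yet the Pearson coefficient is zero because Y falls and then rises as X increases.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y depends on x exactly, but not along a straight line.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = pearson_r(x, y)
print("r =", r)  # zero, despite the exact relationship between x and y
```

This is why the guide recommends looking at a graph of the data before interpreting r.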