If you know twelve concepts about a given topic you will look like an expert to people who only know two or three.

Scott Adams, creator of Dilbert

When learning statistics, it is easy to get bogged down in the details and lose track of the big picture. Here are the twelve most important concepts in statistical inference.

The whole point of inferential statistics is to extrapolate from limited data to a general conclusion. "Descriptive statistics" simply describes data without reaching any general conclusions. But the challenging aspects of statistics are all about reaching general conclusions from limited data.

The word ‘intuitive’ has two meanings. One meaning is “easy to use and understand.” That was my goal when I wrote Intuitive Biostatistics. The other meaning of ‘intuitive’ is “instinctive, or acting on what one feels to be true even without reason.” Using this definition, statistical reasoning is far from intuitive. When thinking about data, intuition often leads us astray. People frequently see patterns in random data and often jump to unwarranted conclusions. Statistical rigor is needed to make valid conclusions from data.

"Statistics means never having to say you are certain." If a statistical conclusion ever seems certain, you probably are misunderstanding something. The whole point of statistics is to quantify uncertainty.

Every statistical inference is based on a list of assumptions. Don't try to interpret any statistical results until after you have reviewed that list. An assumption behind every statistical calculation is that the data were randomly sampled from, or at least are representative of, a larger population of values that could have been collected. If your data are not representative of a larger set of data you could have collected (but didn't), then statistical inference makes no sense.

Analyzing data requires many decisions. Parametric or nonparametric test? Eliminate outliers or not? Transform the data first? Normalize to external control values? Adjust for covariates? Use weighting factors in regression? All these decisions (and more) should be part of experimental design. When decisions about statistical analysis are made after inspecting the data, it is too easy for statistical analysis to become a high-tech Ouija board -- a method to produce preordained results, rather than an objective method of analyzing data. The new name for this is p-hacking.

Say you've computed the mean of a set of values you've collected, or the proportion of subjects where some event happened. Those values describe the sample you've analyzed. But what about the overall population you sampled from? The true population mean (or proportion) might be higher, or it might be lower. The calculation of a 95% confidence interval takes into account sample size and scatter. Given a set of assumptions, you can be 95% sure that the confidence interval includes the true population value (which you could only know for sure by collecting an infinite amount of data). Of course, there is nothing special about 95% except tradition. Confidence intervals can be computed for any desired degree of confidence. Almost all results -- proportions, relative risks, odds ratios, means, differences between means, slopes, rate constants... -- should be accompanied by a confidence interval.
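To make the idea concrete, here is a minimal sketch of a confidence interval for a mean. It is only an illustration: the data values are made up, the function name `mean_confidence_interval` is hypothetical, and it uses the normal approximation, which is reasonable for large samples but too narrow for small ones (where a critical value from the t distribution would be used instead).

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(values, confidence=0.95):
    """Approximate confidence interval for a population mean.

    Assumption: uses the normal approximation. For small samples a
    t-distribution critical value would give a wider, more honest interval.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    m = mean(values)
    sem = stdev(values) / math.sqrt(n)              # scatter and sample size
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return m - z * sem, m + z * sem

# Made-up sample, purely for illustration.
sample = [4.1, 5.0, 4.7, 5.3, 4.9, 5.1, 4.4, 5.6, 4.8, 5.2]
low, high = mean_confidence_interval(sample)
print(f"mean = {mean(sample):.2f}, 95% CI from {low:.2f} to {high:.2f}")
```

Asking for more confidence (say, `confidence=0.99`) widens the interval, and collecting more data with the same scatter narrows it.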

The logic of a P value seems strange at first. When testing whether two groups differ (different mean, different proportion, etc.), first hypothesize that the two populations are, in fact, identical. This is called the null hypothesis. Then ask: If the null hypothesis were true, how likely would it be to randomly obtain samples where the difference is as large as (or even larger than) actually observed? If the P value is large, your data are consistent with the null hypothesis. If the P value is small, there is only a small chance that random sampling would have created as large a difference as actually observed. This makes you question whether the null hypothesis is true. If you can't identify the null hypothesis, you cannot interpret the P value.
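One way to see this logic in action is a permutation test, sketched below. This is just one of many ways to obtain a P value, swapped in here because it follows the reasoning above almost word for word: if the two populations were identical, the group labels would be arbitrary, so you can shuffle them and count how often the shuffled difference is at least as large as the one observed. The function name `permutation_p_value` and the numbers are hypothetical.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Two-sided permutation P value for a difference between means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        # Under the null hypothesis the labels are interchangeable.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_shuffles

# Made-up measurements for two groups.
control = [12.1, 11.4, 13.0, 12.7, 11.9, 12.3]
treated = [13.2, 12.9, 14.1, 13.5, 12.8, 13.9]
print("P =", permutation_p_value(control, treated))
```

The result is an estimate, so it varies slightly with the number of shuffles and the seed.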

If the P value is less than 0.05 (an arbitrary, but well-accepted threshold), the results are deemed to be statistically significant. That phrase sounds so definitive. But all it means is that, if the null hypothesis were true, the difference (or association or correlation...) you observed (or one even larger) would happen by chance less than 5% of the time. That's it. A tiny effect that is scientifically or clinically trivial can be statistically significant (especially with large samples). That conclusion can also be wrong: when the null hypothesis is true, you'll still reach a conclusion that results are statistically significant 5% of the time just by chance.
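The effect of sample size is easy to check with a little arithmetic. The sketch below assumes a simple two-sample z test with equal group sizes and a known, common standard deviation; the function name `two_sample_z_p_value` and all the numbers are hypothetical. The same small difference between means goes from "not significant" to "significant" just by enlarging the samples.

```python
import math
from statistics import NormalDist

def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided P value for a difference between two means.

    Assumptions: equal group sizes, a known common SD, and the normal
    approximation (a z test rather than a t test).
    """
    se = sd * math.sqrt(2 / n_per_group)  # standard error of the difference
    z = mean_diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same small difference (0.5 units, SD of 10) at growing sample sizes.
for n in (100, 1_000, 10_000, 100_000):
    p = two_sample_z_p_value(mean_diff=0.5, sd=10, n_per_group=n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per group = {n:>7}: P = {p:.4f} ({verdict})")
```

The difference itself never changes, so whether it matters scientifically is a separate question from whether the P value crosses 0.05.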

If a difference is not statistically significant, you can conclude that the observed results are not inconsistent with the null hypothesis. Note the double negative. You cannot conclude that the null hypothesis is true. It is quite possible that the null hypothesis is false, and that there really is a difference between the populations. This is especially a problem with small sample sizes. It makes sense to define a result as being statistically significant or not statistically significant when you need to make a decision based on this one result. Otherwise, the concept of statistical significance adds little to data analysis.
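To see why small samples are a problem, you can estimate how often a study would detect a difference that really exists. The sketch below is an approximation under stated assumptions (two-sample z test, equal group sizes, known common SD); `approximate_power` and the numbers are hypothetical.

```python
import math
from statistics import NormalDist

def approximate_power(true_diff, sd, n_per_group, alpha=0.05):
    """Chance of a statistically significant result when the populations
    really differ by true_diff (two-sided z test, known common SD)."""
    norm = NormalDist()
    se = sd * math.sqrt(2 / n_per_group)
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = abs(true_diff) / se
    # Probability that the test statistic lands beyond either cutoff.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

# A real difference of 5 units with an SD of 10, at several sample sizes.
for n in (10, 30, 100):
    power = approximate_power(true_diff=5, sd=10, n_per_group=n)
    print(f"n per group = {n:>3}: chance of significance = {power:.0%}")
```

With small groups, most such experiments would come out "not statistically significant" even though the null hypothesis is false.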

When many hypotheses are tested at once, the problem of multiple comparisons makes it very easy to be fooled. If 5% of tests will be "statistically significant" by chance, you expect lots of statistically significant results if you test many hypotheses. Special methods can be used to reduce the problem of finding false, but statistically significant, results, but these methods also make it harder to find true effects. Multiple comparisons can be insidious. It is only possible to correctly interpret statistical analyses when all analyses are planned, and all planned analyses are conducted and reported. However, these simple rules are widely broken.
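The arithmetic behind this is short. The sketch below assumes the tests are independent and that every null hypothesis is true; the function names are hypothetical. It also shows the Bonferroni correction, one of the simplest of the special methods mentioned above, which divides the significance threshold by the number of tests.

```python
def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n_tests is 'significant' by chance.

    Assumptions: independent tests, and every null hypothesis is true.
    """
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the overall chance near alpha."""
    return alpha / n_tests

for k in (1, 5, 20, 100):
    print(
        f"{k:>3} tests: chance of at least one false positive = "
        f"{chance_of_any_false_positive(k):.0%}, "
        f"Bonferroni threshold = {bonferroni_threshold(k):.4f}"
    )
```

With 20 independent tests, the chance of at least one false positive is about 64%. The stricter per-test threshold brings that back down, at the price of making real effects harder to detect.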

A statistically significant correlation or association between two variables may indicate that one variable causes the other. But it may just mean that both are influenced by a third variable. Or it may be a coincidence.
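A small simulation shows how a third variable can create a correlation. In this sketch the data are simulated, all names are hypothetical, and neither `x` nor `y` has any effect on the other: both simply depend on `third`. Once the contribution of `third` is subtracted, the correlation essentially disappears.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)
n = 2000
third = [rng.gauss(0, 1) for _ in range(n)]   # the lurking third variable
x = [t + rng.gauss(0, 1) for t in third]      # x depends on third only
y = [t + rng.gauss(0, 1) for t in third]      # y depends on third only

r_raw = pearson_r(x, y)

# Remove the part of x and y that comes from the third variable.
x_adj = [xi - t for xi, t in zip(x, third)]
y_adj = [yi - t for yi, t in zip(y, third)]
r_adjusted = pearson_r(x_adj, y_adj)

print(f"correlation of x and y: {r_raw:.2f}")
print(f"after removing the third variable: {r_adjusted:.2f}")
```

In real data you rarely know which third variable to adjust for, or by how much, which is why a correlation alone cannot establish causation.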

By the time you read a paper, a great deal of selection has occurred. When experiments are successful, scientists continue the project. Lots of other projects get abandoned. When the project is done, scientists are more likely to write up projects that lead to remarkable results, or to keep analyzing the data in various ways to extract a "statistically significant" conclusion. Finally, journals are more likely to publish “positive” studies. If the null hypothesis were true, you would expect a statistically significant result in 5% of experiments. But those 5% are more likely to get published than the other 95%.
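The last point can be checked with a simulation. The sketch below is an illustration under simple assumptions: both groups are always drawn from the same population (so the null hypothesis is true in every experiment), the SD is known to be 1, and a z test is used. The function name `simulate_null_experiments` is hypothetical.

```python
import math
import random
from statistics import NormalDist

def simulate_null_experiments(n_experiments=2000, n_per_group=20,
                              alpha=0.05, seed=1):
    """Fraction of experiments that reach statistical significance when
    both groups come from the same population (null hypothesis true)."""
    rng = random.Random(seed)
    norm = NormalDist()
    se = math.sqrt(2 / n_per_group)  # population SD is 1 by construction
    significant = 0
    for _ in range(n_experiments):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (sum(a) / n_per_group - sum(b) / n_per_group) / se
        p = 2 * (1 - norm.cdf(abs(z)))
        if p < alpha:
            significant += 1
    return significant / n_experiments

fraction = simulate_null_experiments()
print(f"{fraction:.1%} of experiments were 'significant' with no real effect")
```

If only those "significant" experiments were written up, every published result in this simulated literature would be a false positive.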