Intuitive Biostatistics: Survival Curves
This is the preface and introduction of Intuitive Biostatistics (ISBN 0-19-508607-4) by Harvey Motulsky. Copyright © 1995 by Oxford University Press Inc. All rights reserved. You may order the book from GraphPad Software with a software purchase, from any academic bookstore, or from amazon.com.
APPROACH
This book provides a nonmathematical introduction to statistics for medical students, physicians, graduate students, and researchers in the health sciences. So do plenty of other books, but this one has a unique approach.
TOPICS COVERED
In choosing topics to include in this book, I've chosen breadth over depth. This is because so many statistical methods are commonly used in the biomedical literature. Flip through any medical or scientific journal and you'll soon find use of a statistical technique not mentioned in most introductory books. To guide those who read those papers, I included many topics omitted from other books: relative risk and odds ratios, prediction intervals, nonparametric tests, survival curves, multiple comparisons, the design of clinical trials, computing the power of a test, nonlinear regression, and interpretation of lab tests (sensitivity, specificity, etc.). I also briefly introduce multiple regression, logistic regression, proportional hazards regression, randomization tests, and lod scores. Analysis of variance is given less emphasis than usual.
CHAPTERS TO SKIP
As statistics books go, this one is pretty short. But realistically, it is still more than most people want to read about statistics. If you just want to learn the main ideas of statistics, with no detail, read Chapters 1 through 5, 10 through 13, and 19.
This book is for anyone who reads papers in the biomedical literature, not just for people who read clinical studies. Basic scientists may want to skip Chapters 6, 9, 20, 21, 32, and 33, which deal with topics uncommonly encountered in basic research. The other chapters are applicable to both clinicians and basic scientists.
ANALYZING DATA WITH COMPUTER PROGRAMS
We are lucky to live in an era where personal computers are readily available. Although this book gives the equations for many statistical tests, most people will rely on a computer program instead. Unfortunately, most statistics programs are designed for statisticians and are too complicated and too expensive for the average student or scientist.
That's why my company, GraphPad Software, created GraphPad InStat, an inexpensive and extremely easy-to-use statistical program available for DOS and Macintosh computers (a Windows version is coming soon). Although this book shows sample output from InStat, you do not need InStat to follow the examples in this book or to work the problems.
Although spreadsheet programs were originally developed to perform financial calculations, current versions are very versatile and adept at statistical computation. See Appendix 3 to learn how to use Microsoft Excel to perform statistical calculations.
REFERENCES AND ACKNOWLEDGMENTS
I have organized this book in a unique way, but none of the ideas are particularly original. All of the statistical methods are standard and have been discussed in many textbooks. Rather than give the original reference for each method, I have listed textbook references in Appendix 1.
I would like to thank everyone who reviewed various sections of the book in draft form and gave valuable comments, including Jan Agosti, Cedric Garland, Ed Jackson, Arno Motulsky, Paige Searle, and Christopher Sempos. I especially want to thank Harry Frank, whose lengthy comments improved this book considerably. This book would be very different if it weren't for his repeated lengthy reviews. I also want to thank all the students who helped me shape this book over the last five years. Of course, any errors are my own responsibility. Please email comments and suggestions to HMotulsky@graphpad.com.
Introduction to Statistics
There is something fascinating about science. One gets such a wholesale return of conjecture out of a trifling investment of fact.
Mark Twain (Life on the Mississippi, 1883)
This is a book for "consumers" of statistics. The goals are to teach you enough statistics to:
Many statistical books read like cookbooks; they contain the recipes for many statistical tests, and their goal (often unstated) is to train "statistical chefs," able to whip up a P value on a moment's notice. This book is based on the assumption that statistical tests are best calculated by computer programs or by experts. This book, therefore, will not teach you to be a chef, but rather to become an educated connoisseur or critic who can appreciate and criticize what the chef has created. But just as you must learn a bit about the differences between broiling, boiling, baking, and basting to become a connoisseur of fine food, you must learn a bit about probability distributions and null hypotheses to become an educated consumer of the biomedical literature. Hopefully this book will make it relatively painless.
WHY DO WE NEED STATISTICAL CALCULATIONS?
When analyzing data, your goal is simple: You wish to make the strongest possible conclusions from limited amounts of data. To do this, you need to overcome two problems:
MANY KINDS OF DATA CAN BE ANALYZED WITHOUT STATISTICAL ANALYSIS
Statistical calculations are most helpful when you are looking for fairly small differences in the face of considerable biological variability and imprecise measurements. Basic scientists asking fundamental questions can often reduce biological variability by using inbred animals or cloned cells in controlled environments. Even so, there will still be scatter among replicate data points. If you only care about differences that are large compared with the scatter, the conclusions from such studies can be obvious without statistical analysis. In such experimental systems, effects small enough to require statistical analysis are often not interesting enough to pursue. If you are lucky enough to be studying such a system, you may heed the following aphorisms:
Most scientists are not so lucky. In many areas of biology, and especially in clinical research, the investigator is faced with enormous biological variability, is not able to control all relevant variables, and is interested in small effects (say, a 20% change). With such data, it is difficult to distinguish the signal you are looking for from the noise created by biological variability and imprecise measurements. Statistical calculations are necessary to make sense out of such data.
STATISTICAL CALCULATIONS EXTRAPOLATE FROM SAMPLE TO POPULATION
Statistical calculations allow you to make general conclusions from limited amounts of data. You can extrapolate from your data to a more general case. Statisticians say that you extrapolate from a sample to a population. The distinction between sample and population is key to understanding much of statistics. Here are four different contexts where the terms are used.
In biomedical research, we usually assume that the population is infinite, or at least very large compared with our sample. All the methods in this book are based on that assumption. If the population has a defined size, and you have sampled a substantial fraction of the population (>10% or so), then you need to use special methods that are not presented in this book.
WHAT STATISTICAL CALCULATIONS CAN DO
Statistical reasoning uses three general approaches:
Statistical Estimation
The simplest example is calculating the mean of a sample. Although the calculation is exact, the mean you calculate from a sample is only an estimate of the population mean. This is called a point estimate. How good is the estimate? As we will see in Chapter 5, it depends on the sample size and scatter. Statistical calculations combine these to generate an interval estimate (a range of values), known as a confidence interval for the population mean. If you assume that your sample is randomly selected from (or at least representative of) the entire population, then you can be 95% sure that the mean of the population lies somewhere within the 95% confidence interval, and you can be 99% sure that the mean lies within the 99% confidence interval. Similarly, it is possible to calculate confidence intervals for proportions, for the difference or ratio of two proportions or two means, and for many other values.
Statistical Hypothesis Testing
Statistical hypothesis testing helps you decide whether an observed difference is likely to be caused by chance. Various techniques can be used to answer this question: If there is no difference between two (or more) populations, what is the probability of randomly selecting samples with a difference as large or larger than actually observed? The answer is a probability termed the P value. If the P value is small, you conclude that the difference is statistically significant and unlikely to be due to chance.
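To make the two ideas above concrete, here is a minimal Python sketch, not taken from the book or from InStat. It assumes the familiar interval "mean plus or minus a critical t value times the standard error of the mean" for the confidence interval, and it uses an exact permutation (randomization) test as just one of the "various techniques" that can produce a P value. The function names and all the data are hypothetical, invented for illustration, and the critical t value (2.262 for 95% confidence with 9 degrees of freedom) is looked up by hand rather than computed.

```python
import math
import statistics
from itertools import combinations


def mean_confidence_interval(sample, t_crit):
    """Return (low, high) for the population mean: mean +/- t_crit * SEM.

    t_crit is the critical t value for the chosen confidence level and
    len(sample) - 1 degrees of freedom; the caller supplies it.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    return mean - t_crit * sem, mean + t_crit * sem


def permutation_p_value(group_a, group_b):
    """Two-sided P value from an exact permutation test on the difference of means.

    If group labels are irrelevant (no difference between populations), every
    way of splitting the pooled values into two groups is equally likely. The
    P value is the fraction of splits whose difference between means is at
    least as large as the one actually observed.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    as_extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for rounding error
            as_extreme += 1
        n_splits += 1
    return as_extreme / n_splits


# Hypothetical measurements, invented for illustration only.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
low, high = mean_confidence_interval(sample, t_crit=2.262)  # 95%, 9 df
print(f"mean = {statistics.mean(sample)}, 95% CI = {low:.2f} to {high:.2f}")

control = [1, 2, 3, 4, 5]
treated = [6, 7, 8, 9, 10]
p = permutation_p_value(control, treated)
print(f"P = {p:.4f}")
```

In this made-up example the two groups do not overlap at all, so only 2 of the 252 possible splits give a difference as large as the observed one, and the P value is 2/252. Enumerating every split is practical only for small groups; with larger samples a random subset of splits, or a conventional test computed by a statistics program, would be the usual choice.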
Statistical Modeling
Statistical modeling tests how well experimental data fit a mathematical model constructed from physical, chemical, genetic, or physiological principles. The most common form of statistical modeling is linear regression. These calculations determine "the best" straight line through a particular set of data points. More sophisticated modeling methods can fit curves through data points.
WHAT STATISTICAL CALCULATIONS CANNOT DO
In theory, here is how you should apply statistical analysis to a simple experiment:
When applying statistical analysis to real data, scientists confront several problems that limit the validity of statistical reasoning. For example, consider how you would design a study to test whether a new drug is effective in treating patients infected with the human immunodeficiency virus (HIV).
The population you really care about is all patients in the world, now and in the future, who are infected with HIV. Because you can't access that population, you choose to study a more limited population: HIV patients aged 20 to 40 living in San Francisco who come to a university clinic. You may also exclude from the population patients who are too sick, who are taking other experimental drugs, who have taken experimental vaccines, or who are unable to cooperate with the experimental protocol. Even though the population you are working with is defined narrowly, you hope to extrapolate your findings to the wider population of HIV-infected patients.
Randomly sampling patients from the defined population is not practical, so instead you simply attempt to enroll all patients who come to morning clinic during two particular months. This is termed a convenience sample. The validity of statistical calculations depends on the assumption that the results obtained from this convenience sample are similar to those you would have obtained had you randomly sampled subjects from the population.
The variable you really want to measure is survival time, so you can ask whether the drug increases life span. But HIV kills slowly, so it will take a long time to accumulate enough data. As an alternative (or first step), you choose to measure the number of helper (CD4) lymphocytes. Patients infected with HIV have low numbers of CD4 lymphocytes, so you can ask whether the drug increases CD4 cell number (or delays the reduction in CD4 cell count). To save time and expense, you have switched from an important variable (survival) to a proxy variable (CD4 cell count).
Statistical calculations are based on the assumption that the measurements are made correctly. In our HIV example, statistical calculations would not be helpful if the antibody used to identify CD4 cells was not really selective for those cells.
Statistical calculations are most often used to analyze one variable measured in a single experiment, or a series of similar experiments. But scientists usually draw general conclusions by combining evidence generated by different kinds of experiments. To assess the effectiveness of a drug to combat HIV, you might want to look at several measures of effectiveness: reduction in CD4 cell count, prolongation of life, increased quality of life, and reduction in medical costs. In addition to measuring how well the drug works, you also want to quantify the number and severity of side effects. Although your conclusion must be based on all these data, statistical methods are not very helpful in blending different kinds of data. You must use clinical or scientific judgment, as well as common sense.
In summary, statistical reasoning cannot help you overcome these common problems:
You need to combine different kinds of measurements to reach an overall conclusion.
WHY IS IT HARD TO LEARN STATISTICS?
Five factors make it difficult for many students to learn statistics:
ARRANGEMENT OF THIS BOOK
Parts I through V present the basic principles of statistics. To make it easier to learn, I have separated the chapters that explain confidence intervals from those that explain P values. In practice, the two approaches are used in parallel. Basic scientists who don't care to learn about clinical studies may skip Chapters 6 (survival curves) and 9 (case-control studies) without loss of continuity.
Part VI describes the design of clinical studies and discusses how to determine sample size. Basic scientists who don't care to learn about clinical studies can skip most of this part. However, Chapter 22 (sample size) is of interest to all.
Part VII explains the most common statistical tests. Even if you use a computer program to calculate the tests, reading these chapters will help you understand how the tests work. The tests mentioned in this section are described in detail.
Part VIII gives an overview of more advanced statistical tests. These tests are not described in detail, but the chapters provide enough information so that you can be an intelligent consumer of papers that use these tests. The chapters in this section do not follow a logical sequence, so you can pick and choose the topics that interest you. The only exception is that you should read Chapter 31 (multiple regression) before Chapter 32 (logistic regression) or the parts of Chapter 33 (comparing survival curves) dealing with proportional hazards regression.
The statistical principles and tests discussed in this book are widely used, and I do not give detailed references. For more information, refer to the general textbook references listed in Appendix 1.