Intuitive Biostatistics: Comparing Two Survival Curves

This is chapter 33 of Intuitive Biostatistics (ISBN 0-19-508607-4) by Harvey Motulsky. Copyright © 1995 by Oxford University Press Inc. All rights reserved. You may order the book from GraphPad Software with a software purchase, from any academic bookstore, or from amazon.com.
You've already learned (Chapter 6) how to interpret survival curves. Investigators commonly compare two survival curves in order to compare two treatments, and the usual way to do so is the log-rank test. This test calculates a P value testing the null hypothesis that the survival curves are identical in the two populations. If that null hypothesis is true, the P value is the probability of randomly selecting subjects whose survival curves are as different as (or more different than) those actually observed. (You will sometimes see survival curves compared with the method of Mantel-Haenszel rather than the log-rank test. The two methods are essentially equivalent.)
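As a rough illustration of what the log-rank test computes, here is a minimal Python sketch. At each time a death occurs, it compares the observed number of deaths in one group with the number expected if the two survival curves were identical, then combines the results into one chi-square statistic with one degree of freedom. The function name `log_rank_test` and the example data are invented for illustration, and the variance term is the hypergeometric (Mantel-Haenszel) form, which is one common way to do the calculation rather than necessarily the one any particular program uses.

```python
import math

def log_rank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test (illustrative sketch).

    times_*  : follow-up time for each subject
    events_* : 1 if the subject died at that time, 0 if censored
    Returns (chi_square, p_value) with 1 degree of freedom.
    """
    # Pool all subjects as (time, event, group); group 0 is A, group 1 is B.
    subjects = [(t, e, 0) for t, e in zip(times_a, events_a)]
    subjects += [(t, e, 1) for t, e in zip(times_b, events_b)]

    # Only times at which at least one death occurred contribute.
    death_times = sorted({t for t, e, _ in subjects if e == 1})

    observed_a = 0.0  # deaths actually seen in group A
    expected_a = 0.0  # deaths expected in group A under the null hypothesis
    variance = 0.0
    for t in death_times:
        at_risk = [s for s in subjects if s[0] >= t]
        n = len(at_risk)                               # at risk, both groups
        n_a = sum(1 for s in at_risk if s[2] == 0)     # at risk in group A
        d = sum(1 for s in at_risk if s[0] == t and s[1] == 1)
        d_a = sum(1 for s in at_risk
                  if s[0] == t and s[1] == 1 and s[2] == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the number of deaths in group A.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0  # no information to distinguish the groups
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value

# Made-up data: times in months, 1 = event observed, 0 = censored.
chi2, p = log_rank_test([2, 4, 6, 9, 12], [1, 1, 1, 0, 1],
                        [5, 8, 11, 14, 15], [1, 0, 1, 1, 0])
print(f"chi-square = {chi2:.3f}, P = {p:.3f}")
```

For real analyses, an established statistics package is a safer choice than a hand-written routine like this one.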
Example 33.1

Rosman and colleagues investigated whether diazepam would prevent febrile seizures in children (reference 1). They recruited about 400 children who had had at least one febrile seizure. Their parents were instructed to give medication to the children whenever they had a fever. Half were given diazepam and half were given placebo. The investigators analyzed the data in several ways, including survival analysis. Here the term survival is a bit misleading, as they compared time until the first seizure, not time until death. When they compared the times to first seizure with the log-rank test, the placebo-treated subjects tended to have seizures earlier and the P value was 0.06. The difference in survival curves was small. If diazepam were really no more effective than placebo, you'd expect that 6% of experiments this size would find a difference this large or larger. The authors did not reach a conclusion from this analysis because they analyzed the data in a fancier way, which we will discuss later in the chapter.

ASSUMPTIONS OF THE LOG-RANK TEST

The log-rank test depends on these assumptions:

The subjects are randomly sampled from, or at least are representative of, larger populations.

The calculations of the log-rank test are tedious and best left to a computer. The idea is pretty simple. For each time interval, compare the observed number of deaths in each group with the expected number of deaths if the null hypothesis were true. Combine all the observed and expected values into one chi-square statistic and determine the P value from that.

A POTENTIAL TRAP: COMPARING SURVIVAL OF RESPONDERS VERSUS NONRESPONDERS

This approach sounds reasonable but is invalid.

I treated a number of cancer patients with chemotherapy. The treatment seemed to work with some patients because the tumor became smaller. The tumor did not change size in other patients. I plotted separate survival curves for the responders and nonresponders, and compared them with the log-rank test.
The two differ significantly, so I conclude that the treatment prolongs survival.

This analysis is not valid, because you only have one group of patients, not two. Dividing the patients into two groups based on response to treatment is not valid for two reasons:

A patient cannot be defined to be a "responder" unless he or she survived long enough for you to measure the tumor. Any patient who died early in the study was defined to be a nonresponder. In other words, survival influenced which group the patient was assigned to. Therefore you can't learn anything by comparing survival in the two groups.

The general rule is clear: You must define the groups you are comparing (and measure the variables you plan to adjust for) before starting the experimental phase of the study. Be very wary of studies that use data collected during the experimental phase of the study to divide patients into groups or to adjust the data.

WILL ROGERS' PHENOMENON

Assume that you are tabulating survival for patients with a certain type of tumor. You separately track survival of patients whose cancer has metastasized and survival of patients whose cancer remains localized. As you would expect, average survival is longer for the patients without metastases. Now a fancier scanner becomes available, making it possible to detect metastases earlier. What happens to the survival of patients in the two groups?

The group of patients without metastases is now smaller. The patients who are removed from the group are those with small metastases that could not have been detected without the new technology. These patients tend to die sooner than the patients without detectable metastases. By taking away these patients, the average survival of the patients remaining in the "no metastases" group will improve.

What about the other group? The group of patients with metastases is now larger. The additional patients, however, are those with small metastases.
These patients tend to live longer than patients with larger metastases. Thus the average survival of all patients in the "with-metastases" group will improve.

Changing the diagnostic method paradoxically increased the average survival of both groups! Feinstein (reference 2) termed this paradox the Will Rogers' phenomenon after a quote from the humorist Will Rogers ("When the Okies left Oklahoma and moved to California, they raised the average intelligence in both states.").

MULTIPLE REGRESSION WITH SURVIVAL DATA: PROPORTIONAL HAZARDS REGRESSION

Proportional hazards regression applies regression methodology to survival data. This method lets you compare survival in two or more groups after adjusting for other variables.

Example 33.1 Continued (Diazepam and Febrile Seizures)

The investigators performed proportional hazards regression to adjust for differences in age, number of previous febrile seizures, and several other variables. After those adjustments, they found that the relative risk was 0.61 with a 95% CI ranging from 0.39 to 0.94. Compared with subjects treated with placebo, subjects treated with diazepam had only 61% of the risk of having a febrile seizure. This reduction was statistically significant with a P value of 0.027. If diazepam were ineffective, there would be only a 2.7% chance of seeing such a low relative risk in a study of this size. This example shows that the results of proportional hazards regression are easy to interpret, even though the details of the analysis are complicated.

HOW PROPORTIONAL HAZARDS REGRESSION WORKS

A survival curve plots cumulative survival as a function of time. The slope or derivative of the survival curve is the rate of dying in a short time interval. This is termed the hazard. For example, if 20% of patients with a certain kind of cancer are expected to die this year, then the hazard is 20% per year. When comparing two groups, investigators often assume that the ratio of hazard functions is constant over time.
For example, the hazard among treated patients might be one half the hazard in control patients. The death rates change over the course of the study, but at any particular time the treated patients' risk of dying is one half the risk of the control patients. Another way to say this is that the two hazard functions are proportional to one another. This is a reasonable assumption for many clinical situations.

The ratio of hazards is essentially a relative risk. If the ratio is 0.5, then the risk of dying in one group is half the risk of dying in the other group. Proportional hazards regression, also called Cox regression after the person who developed the method, uses regression methods to predict the relative risk based on one or more X variables.

The assumption of proportional hazards is not always reasonable. You would not expect the hazard functions of medical and surgical therapy for cancer to be proportional. You might expect surgical therapy to have the higher hazard at early times (because of deaths during the operation or soon thereafter) and medical therapy to have the higher hazard at later times. In such situations, proportional hazards regression should be avoided or used only over restricted time intervals for which the assumption is reasonable.

Having accepted the proportional hazards assumption, we want to know how the hazard ratio is influenced by treatment or other variables. One thought might be to place the hazard ratio on the left side of a regression equation. It turns out that the results are cleaner when we take the natural logarithm first. So the logarithm of the hazard ratio can be placed on the left side of the multiple regression equation to generate Equation 33.1:

$$\ln(\text{hazard ratio}) = \ln(\text{relative risk}) = X_1 \cdot \ln(RR_1) + X_2 \cdot \ln(RR_2) + \cdots \tag{33.1}$$

The hazard ratio must be defined relative to a baseline group. The baseline group consists of subjects for whom every X variable equals 0.
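To make Equation 33.1 concrete, the following Python sketch evaluates it for a few hypothetical subjects. The function name `hazard_ratio`, the choice of two X variables, and the relative risks 0.5 and 1.5 are all assumptions made up for illustration; none of them come from the diazepam study.

```python
import math

def hazard_ratio(x_values, relative_risks):
    """Evaluate Equation 33.1 and return the hazard ratio vs. baseline.

    x_values       : the subject's X variables (X1, X2, ...)
    relative_risks : the relative risk for each X variable (RR1, RR2, ...)
    """
    # ln(hazard ratio) = X1*ln(RR1) + X2*ln(RR2) + ...
    log_hr = sum(x * math.log(rr) for x, rr in zip(x_values, relative_risks))
    # Take the antilogarithm to get back to the hazard ratio itself.
    return math.exp(log_hr)

# Hypothetical relative risks: X1 = treated (1) or control (0),
# X2 = some other yes/no risk factor.
relative_risks = [0.5, 1.5]

baseline = hazard_ratio([0, 0], relative_risks)  # every X equals 0
treated = hazard_ratio([1, 0], relative_risks)   # treated, risk factor absent
both = hazard_ratio([1, 1], relative_risks)      # treated, risk factor present

print(baseline, treated, both)
```

Because the terms add on the logarithmic scale, the relative risks multiply on the original scale: a subject with both X variables equal to 1 has a hazard ratio of RR1 times RR2.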
You can see in Equation 33.1 that the logarithm of the hazard ratio in the baseline group equals 0, so the hazard ratio (by definition) equals 1.0 (the antilogarithm of 0).

INTERPRETING THE RESULTS OF PROPORTIONAL HAZARDS REGRESSION

To run a proportional hazards regression program, you must first enter the data for each subject. Enter the survival time for each subject, along with a code indicating whether the subject died at that time or was censored at that time (see Chapter 6 for a definition of censoring). Also enter all the X variables for each subject. [The term survival is used generally. The event does not have to be death. Proportional hazards regression can be used with any outcome that happens at most once to each subject.]

Programs that calculate proportional hazards regression compute the best-fit values for each of the relative risks (hazard ratios), along with their 95% CI. If you encounter a program (or publication) that reports β coefficients instead of relative risks, it is easy to convert. The relative risk for variable $X_i$ equals $e^{\beta_i}$.

Programs that calculate proportional hazards regression report several P values. One P value tests the overall null hypothesis that in the overall population all relative risks equal 1.0. In other words, the overall null hypothesis is that none of the X variables influence survival. If that P value is low, you can reject the overall null hypothesis. You can then look at individual P values for each X variable, each testing the null hypothesis that that particular relative risk equals 1.0.

Whenever you review the results of proportional hazards regression, ask yourself these questions:

Are there enough subjects? A general rule of thumb is that there should be 5 to 10 deaths for each X variable. Don't count the subjects; count the number of deaths. With fewer events, it is too easy to be misled by spurious findings.
For example, a study with 1000 patients, 25 of whom die, provides enough data to study the influence of at most 2 to 4 explanatory X variables.

Distinguish between studies that generate hypotheses and studies that test hypotheses. If you study enough variables, some relationships are bound to turn up, and these may be just a coincidence. It is OK to generate hypotheses with this kind of exploratory research, but you need to test the hypothesis with different data.

References

1. NP Rosman, T Colton, J Labazzo, PL Gilbert, NB Gardella, EM Kaye, C Van Bennekom, MR Winter. A controlled trial of diazepam administered during febrile illnesses to prevent recurrence of febrile seizures. N Engl J Med 329:79-84, 1993. The experimental therapy (diazepam) was compared with placebo because there is no standard therapy known to be effective. Phenobarbital was previously used routinely to prevent febrile seizures, but recent evidence has shown that it is not effective.

2. AR Feinstein, DA Sosin, CK Wells. Will Rogers phenomenon. Stage migration and new diagnostic techniques as a source of misleading statistics for survival in cancer. N Engl J Med 312:1604-1609, 1985.