Censored Survival Data with Misclassified Covariates: A Case Study of Breast-Cancer Mortality

Abstract
From a cancer registry in the San Francisco Bay area we obtained survival data for 2,495 women diagnosed with breast cancer at ages 55–64. We relate mortality among these women to the time since diagnosis and to the stage of the disease at diagnosis. We divide the study period, extending through 10 years, into five two-year periods, and for each stage we assume a constant hazard rate during each of these periods. Let λ = (λ_jk) be the J × K matrix of hazard rates for the J = 5 periods and K = 5 stages. The most general model allows λ_jk to vary freely. A plot of maximum likelihood estimates of the hazard rates shows some tendency for increase with stage, but no simple patterns or parallelism across stage. We seek more restrictive models, to get simpler interpretations. The exponential model assumes that although λ_jk may vary with stage, it is constant over the five periods for each stage. This model, which assumes no dependence of hazard rate on time since diagnosis, is quite restrictive, and indeed the likelihood ratio test of the exponential versus the general model rejects it strongly. Not quite as restrictive as the exponential model is a proportional-hazards model, which assumes that the log-hazard rates for the first four stages are parallel. Nevertheless, the likelihood ratio test of this model versus the general model rejects it as well. We explore the possibility that one of the more restrictive models is appropriate but that the bad fit is due to errors in staging. To do so, we replace the aforementioned models with ones that accommodate stage misclassification. Using the EM algorithm to compute maximum likelihood estimates and likelihood ratio statistics, we find that the exponential model is again rejected, but that the proportional-hazards model fits the data. This example shows that simple models with straightforward interpretations might be discarded needlessly if covariate misclassifications are ignored. Simulations support this possibility. When data are generated according to a proportional-hazards model with stage misclassifications, ignoring the misclassification can result in missing the proportional-hazards model. Simulations revealed other points. In particular, large samples are needed to detect classification errors. In addition, misclassification models give hazard-rate estimates that can be much more variable than those of models without misclassification.
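The piecewise-constant hazard setup described above can be illustrated with a minimal sketch. It is not the authors' code and it ignores stage misclassification entirely; it only shows how, under a constant hazard within each period and stage, the maximum likelihood estimate of λ_jk reduces to deaths divided by person-time at risk, and how a likelihood ratio statistic for the exponential model against the general model could be formed. The function names (`piecewise_exposure`, `hazard_mle`, `loglik`, `lr_exponential_vs_general`), the 0-based stage coding, and the toy inputs are assumptions made for illustration; the registry data are not reproduced here.

```python
import numpy as np

def piecewise_exposure(time, event, cuts):
    """Per-subject deaths and person-time in each period (cuts[j], cuts[j+1]].

    Assumes follow-up times lie in [0, cuts[-1]] (administrative censoring
    at the end of the study period).
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)
    cuts = np.asarray(cuts, dtype=float)
    lo, width = cuts[:-1], np.diff(cuts)
    # Person-time in period j is the overlap of [0, t] with that period.
    exposure = np.clip(time[:, None] - lo[None, :], 0.0, width[None, :])
    # A death is attributed to the period containing the observed time.
    period = np.clip(np.searchsorted(cuts, time, side="left") - 1, 0, len(lo) - 1)
    deaths = np.zeros_like(exposure)
    deaths[np.arange(len(time)), period] = event
    return deaths, exposure

def hazard_mle(time, event, stage, cuts, n_stages):
    """Return J x K arrays: deaths D, person-time T, and the MLE D / T."""
    deaths, exposure = piecewise_exposure(time, event, cuts)
    stage = np.asarray(stage)
    n_periods = len(cuts) - 1
    D = np.zeros((n_periods, n_stages))
    T = np.zeros((n_periods, n_stages))
    for k in range(n_stages):  # stages coded 0..K-1 (an assumption of this sketch)
        D[:, k] = deaths[stage == k].sum(axis=0)
        T[:, k] = exposure[stage == k].sum(axis=0)
    lam = np.divide(D, T, out=np.zeros_like(D), where=T > 0)
    return D, T, lam

def loglik(D, T, lam):
    """Piecewise-exponential log-likelihood; 0 * log(0) is treated as 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        log_term = np.where(D > 0, D * np.log(lam), 0.0)
    return float(np.sum(log_term - lam * T))

def lr_exponential_vs_general(D, T):
    """Likelihood ratio statistic; nominal degrees of freedom are K * (J - 1)."""
    lam_general = np.divide(D, T, out=np.zeros_like(D), where=T > 0)
    D_tot, T_tot = D.sum(axis=0), T.sum(axis=0)
    lam_stage = np.divide(D_tot, T_tot, out=np.zeros_like(D_tot), where=T_tot > 0)
    lam_exponential = np.broadcast_to(lam_stage, D.shape)  # constant over periods
    return 2.0 * (loglik(D, T, lam_general) - loglik(D, T, lam_exponential))

# Toy illustration only (three subjects, one stage, three two-year periods).
D, T, lam = hazard_mle(time=[1, 3, 5], event=[1, 0, 1], stage=[0, 0, 0],
                       cuts=[0, 2, 4, 6], n_stages=1)
lr = lr_exponential_vs_general(D, T)
```

With real data the statistic would be compared with a chi-squared reference distribution on K(J − 1) degrees of freedom, which is 20 for J = 5 periods and K = 5 stages. Fitting the misclassification models described in the abstract would require an EM step on top of this and is not attempted in the sketch.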
