Specifying and Implementing Nonparametric and Semiparametric Survival Estimators in Two-Stage (Nested) Cohort Studies With Missing Case Data
- 1 June 2006
- journal article
- Published by Taylor & Francis in Journal of the American Statistical Association
- Vol. 101 (474), 460-471
- https://doi.org/10.1198/016214505000000952
Abstract
Since 1986, we have been studying a cohort of individuals from a region in China with epidemic rates of gastric cardia cancer and have conducted numerous two-stage studies to assess the association of various exposures with this cancer. Two-stage studies are a commonly used statistical design. Stage one involves observing the outcomes and accessible baseline covariate information on all cohort members, and stage two involves using the stage one observations to select a subset of the cohort for measurements of exposures that are difficult to obtain. When the outcomes are censored failure times, such as in our studies, the most common designs used are the case-cohort and nested case-control designs. One limitation of both these designs is that the estimators of the cumulative hazards, and hence survivals and absolute risks, are biased when some cases are missing the stage two measurements. In our experience, such missingness is present in virtually all two-stage studies that (like ours) use biological specimens to obtain exposure measurements. In earlier work we derived and characterized the efficiency of a class of nonparametric and a class of semiparametric cumulative hazard estimators that are unbiased regardless of whether or not all cases are measured. In this article we limit the presentation of the mathematical derivation of these two classes to aspects important to study design and analysis. We analyze data from a two-stage study that we conducted on the association of Helicobacter pylori infection with incident gastric cardia cancers. We discuss the substantive reasons why we deliberately sampled only 25% of the available cancer cases. Through simulations, we demonstrate that substantial variation in precision exists between unbiased estimators within each class, and express the origin of these differences in terms of parameters familiar to investigators. 
We describe how preexistent knowledge about these parameters can be used to increase estimator precision, and detail specific strategies for constructing such estimators. Computer code in R that implements these estimators is available from the authors on request.
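The abstract does not give the form of the estimators, so the following is only a minimal sketch of one generic idea in this area: an inverse-probability-weighted Nelson-Aalen cumulative hazard computed from stage two subjects. It is not the authors' estimator and not their R code. The function name `weighted_nelson_aalen`, the toy data, and the assumed selection probabilities (0.25 for cases, 0.10 for non-cases) are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def weighted_nelson_aalen(times, events, weights):
    """Inverse-probability-weighted Nelson-Aalen cumulative hazard (sketch).

    times   : follow-up time of each stage-two subject
    events  : 1 if the subject failed at that time, 0 if censored
    weights : 1 / (assumed stage-two selection probability) per subject
    Returns a list of (event_time, cumulative_hazard) pairs.
    """
    # Weighted count of failures at each distinct failure time.
    weighted_events = defaultdict(float)
    for t, e, w in zip(times, events, weights):
        if e:
            weighted_events[t] += w

    curve = []
    cumulative_hazard = 0.0
    for t in sorted(weighted_events):
        # Weighted number of subjects still under follow-up at time t.
        at_risk = sum(w for s, w in zip(times, weights) if s >= t)
        cumulative_hazard += weighted_events[t] / at_risk
        curve.append((t, cumulative_hazard))
    return curve


# Hypothetical toy data: two sampled cases and three sampled non-cases.
times = [2, 3, 5, 7, 8]
events = [1, 0, 1, 0, 0]
# Assumed selection probabilities: 0.25 for cases, 0.10 for non-cases.
weights = [1 / 0.25 if e else 1 / 0.10 for e in events]

curve = weighted_nelson_aalen(times, events, weights)
for t, h in curve:
    print(f"t={t}: cumulative hazard = {h:.4f}")
```

With all weights equal to 1 the function reduces to the ordinary Nelson-Aalen estimator, which is a quick sanity check. When some sampled cases are missing their stage two measurement, the weights would have to reflect the probability of being both selected and measured; how best to choose them is the subject of the article and is not addressed by this sketch.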