Molecular diagnosis - Classification, model selection and performance evaluation

1 January 2005

journal article
research article

Vol. 44 (3), 438-443

Abstract

Objectives: We discuss supervised classification techniques applied to medical diagnosis based on gene expression profiles. Our focus lies on strategies of adaptive model selection to avoid overfitting in high-dimensional spaces. Methods: We introduce likelihood-based methods, classification trees, support vector machines and regularized binary regression. For regularization by dimension reduction, we describe feature selection methods: feature filtering, feature shrinkage and wrapper approaches. In small sample-size situations, efficient methods of data re-use are needed to assess the predictive power of a model. We discuss two issues in using cross-validation: the difference between in-loop and out-of-loop feature selection, and estimating model parameters in nested-loop cross-validation. Results: Gene selection does not reduce the dimensionality of the model. Tuning parameters enable adaptive model selection. The feature selection bias is a common pitfall in performance evaluation. Model selection and performance evaluation can be combined by nested-loop cross-validation. Conclusions: Classification of microarrays is prone to overfitting. A rigorous and unbiased assessment of the predictive power of the model is a must.
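
The distinction between in-loop and out-of-loop feature selection, and the idea of nested-loop cross-validation, can be illustrated with a small sketch. It is not taken from the article: the mean-difference filter, the nearest-centroid classifier, the simulated noise data and all function names (`select_features`, `cv_accuracy`, `nested_cv_accuracy`) are assumptions chosen to keep the example short and dependency-free. Because the labels are independent of the features, an honest estimate should hover around chance level, while selecting genes on the full data set before cross-validation is expected to look far better than it should.

```python
import random

def select_features(X, y, k):
    """Filter: rank features by absolute difference of class means, keep the top k."""
    pos = [x for x, lab in zip(X, y) if lab == 1]
    neg = [x for x, lab in zip(X, y) if lab == 0]
    p = len(X[0])
    score = [abs(sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg))
             for j in range(p)]
    return sorted(range(p), key=lambda j: -score[j])[:k]

def predict(X_train, y_train, feats, x):
    """Nearest-centroid rule restricted to the chosen features."""
    dist = {}
    for label in (0, 1):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        centroid = [sum(r[j] for r in rows) / len(rows) for j in feats]
        dist[label] = sum((x[j] - c) ** 2 for j, c in zip(feats, centroid))
    return 1 if dist[1] < dist[0] else 0

def folds(n, n_folds):
    """Interleaved folds: sample i goes to fold i mod n_folds."""
    return [list(range(i, n, n_folds)) for i in range(n_folds)]

def split(X, y, test):
    train = [i for i in range(len(X)) if i not in test]
    return [X[i] for i in train], [y[i] for i in train]

def cv_accuracy(X, y, k, n_folds=5, in_loop=True):
    """Cross-validated accuracy with feature selection inside or outside the loop."""
    # Out-of-loop selection sees the labels of the samples it is later tested on.
    fixed = None if in_loop else select_features(X, y, k)
    correct = 0
    for test in folds(len(X), n_folds):
        Xtr, ytr = split(X, y, test)
        feats = select_features(Xtr, ytr, k) if in_loop else fixed
        correct += sum(predict(Xtr, ytr, feats, X[i]) == y[i] for i in test)
    return correct / len(X)

def nested_cv_accuracy(X, y, candidates, n_folds=5):
    """Outer loop estimates performance; inner loop tunes k on training data only."""
    correct = 0
    for test in folds(len(X), n_folds):
        Xtr, ytr = split(X, y, test)
        best_k = max(candidates, key=lambda k: cv_accuracy(Xtr, ytr, k, n_folds=4))
        feats = select_features(Xtr, ytr, best_k)
        correct += sum(predict(Xtr, ytr, feats, X[i]) == y[i] for i in test)
    return correct / len(X)

# Simulated "expression" data: 20 samples, 1000 pure-noise features, balanced labels.
rng = random.Random(0)
n, p = 20, 1000
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]

out_acc = cv_accuracy(X, y, k=10, in_loop=False)   # biased: selection used all samples
in_acc = cv_accuracy(X, y, k=10, in_loop=True)     # selection repeated in each fold
nested_acc = nested_cv_accuracy(X, y, candidates=[5, 10, 20])
print(f"out-of-loop: {out_acc:.2f}  in-loop: {in_acc:.2f}  nested: {nested_acc:.2f}")
```

In this sketch the gap between the out-of-loop and in-loop estimates is the feature selection bias mentioned in the abstract. The nested variant additionally keeps the choice of the tuning parameter (here the number of selected features) away from the test fold, so the outer-loop accuracy covers the whole model-building procedure, including selection and tuning.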
