Classification based upon gene expression data: bias and precision of error rates

Open Access

28 March 2007

journal article
review article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 23 (11), 1363-1370
https://doi.org/10.1093/bioinformatics/btm117

Abstract

Motivation: Gene expression data offer a large number of potentially useful predictors for the classification of tissue samples into classes, such as diseased and non-diseased. The predictive error rate of classifiers can be estimated using methods such as cross-validation. We have investigated issues of interpretation and potential bias in the reporting of error rate estimates. The issues considered here are optimization and selection biases, sampling effects, measures of misclassification rate, baseline error rates, two-level external cross-validation and a novel proposal for detection of bias using the permutation mean.

Results: Reporting an optimal estimated error rate incurs an optimization bias. Downward bias of 3–5% was found in an existing study of classification based on gene expression data and may be endemic in similar studies. Using a simulated non-informative dataset and two example datasets from existing studies, we show how bias can be detected through the use of label permutations and avoided using two-level external cross-validation. Some studies avoid optimization bias by using single-level cross-validation and a test set, but error rates can be more accurately estimated via two-level cross-validation. In addition to estimating the simple overall error rate, we recommend reporting class error rates plus where possible the conditional risk incorporating prior class probabilities and a misclassification cost matrix. We also describe baseline error rates derived from three trivial classifiers which ignore the predictors.

Availability: R code which implements two-level external cross-validation with the PAMR package, experiment code, dataset details and additional figures are freely available for non-commercial use from http://www.maths.qut.edu.au/profiles/wood/permr.jsp

Contact: i.wood@qut.edu.au

Supplementary information: Supplementary data are available at Bioinformatics online.
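The optimization bias and the two-level external cross-validation mentioned in the abstract can be illustrated with a small simulation. The following is a minimal Python sketch, not the authors' R/PAMR code: the nearest-centroid classifier, the tuning grid over the number of genes, the fold counts, and all function names are assumptions chosen for illustration. The data are pure noise with balanced classes, so no classifier can truly do better than about 50% error. The final loop over permuted labels is one plausible reading of a permutation check and is not necessarily the paper's definition of the permutation mean.

```python
import numpy as np

def nearest_centroid_error(X_tr, y_tr, X_te, y_te, n_genes):
    """Pick the n_genes with the largest centroid gap on the training data
    only (external selection), then classify test samples to the nearest centroid."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    top = np.argsort(-np.abs(c0 - c1))[:n_genes]
    d0 = ((X_te[:, top] - c0[top]) ** 2).sum(axis=1)
    d1 = ((X_te[:, top] - c1[top]) ** 2).sum(axis=1)
    pred = (d1 < d0).astype(int)
    return float(np.mean(pred != y_te))

def make_folds(n, k, rng):
    """Randomly partition indices 0..n-1 into k folds."""
    return np.array_split(rng.permutation(n), k)

def cv_errors(X, y, grid, k, rng):
    """Single-level external CV: mean fold error for each candidate n_genes."""
    folds = make_folds(len(y), k, rng)
    errs = np.zeros((k, len(grid)))
    for i, test in enumerate(folds):
        train = np.setdiff1d(np.arange(len(y)), test)
        for j, g in enumerate(grid):
            errs[i, j] = nearest_centroid_error(X[train], y[train], X[test], y[test], g)
    return errs.mean(axis=0)

def two_level_cv_error(X, y, grid, k_outer, k_inner, rng):
    """Two-level CV: tune n_genes by inner CV on each outer training set,
    then score the tuned classifier on the held-out outer fold."""
    wrong = 0.0
    for test in make_folds(len(y), k_outer, rng):
        train = np.setdiff1d(np.arange(len(y)), test)
        inner = cv_errors(X[train], y[train], grid, k_inner, rng)
        best = grid[int(np.argmin(inner))]
        err = nearest_centroid_error(X[train], y[train], X[test], y[test], best)
        wrong += err * len(test)
    return wrong / len(y)

rng = np.random.default_rng(0)
n, p = 40, 500                      # assumed sizes: 40 samples, 500 noise "genes"
X = rng.normal(size=(n, p))         # non-informative predictors
y = np.repeat([0, 1], n // 2)       # balanced class labels
grid = [1, 5, 20, 100, 500]         # assumed tuning grid (number of genes)

single = cv_errors(X, y, grid, 5, rng)
optimised = float(single.min())     # reporting this minimum is the biased practice
two_level = two_level_cv_error(X, y, grid, 5, 4, rng)

# Average the optimised estimate over label permutations (illustrative check).
perm_optimised = [float(cv_errors(X, rng.permutation(y), grid, 5, rng).min())
                  for _ in range(20)]
perm_mean = float(np.mean(perm_optimised))

print("single-level CV errors per grid value:", single)
print("optimised (minimum) single-level error:", optimised)
print("two-level CV error:", two_level)
print("mean optimised error over permuted labels:", perm_mean)
```

Because the minimum is taken over several noisy estimates, the optimised single-level figure tends to fall below the error that is actually achievable, whereas the two-level estimate scores each held-out fold with a parameter chosen without seeing that fold. The exact numbers depend on the random seed and the assumed settings.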
