Classification based upon gene expression data: bias and precision of error rates
Open Access
- 28 March 2007
- journal article
- review article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 23 (11), 1363-1370
- https://doi.org/10.1093/bioinformatics/btm117
Abstract
Motivation: Gene expression data offer a large number of potentially useful predictors for the classification of tissue samples into classes, such as diseased and non-diseased. The predictive error rate of classifiers can be estimated using methods such as cross-validation. We have investigated issues of interpretation and potential bias in the reporting of error rate estimates. The issues considered here are optimization and selection biases, sampling effects, measures of misclassification rate, baseline error rates, two-level external cross-validation and a novel proposal for detection of bias using the permutation mean.

Results: Reporting an optimal estimated error rate incurs an optimization bias. Downward bias of 3–5% was found in an existing study of classification based on gene expression data and may be endemic in similar studies. Using a simulated non-informative dataset and two example datasets from existing studies, we show how bias can be detected through the use of label permutations and avoided using two-level external cross-validation. Some studies avoid optimization bias by using single-level cross-validation and a test set, but error rates can be more accurately estimated via two-level cross-validation. In addition to estimating the simple overall error rate, we recommend reporting class error rates plus where possible the conditional risk incorporating prior class probabilities and a misclassification cost matrix. We also describe baseline error rates derived from three trivial classifiers which ignore the predictors.

Availability: R code which implements two-level external cross-validation with the PAMR package, experiment code, dataset details and additional figures are freely available for non-commercial use from http://www.maths.qut.edu.au/profiles/wood/permr.jsp

Contact: i.wood@qut.edu.au

Supplementary information: Supplementary data are available at Bioinformatics online.
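
The contrast the abstract draws between reporting an optimized single-level cross-validation error and using two-level external cross-validation can be illustrated with a minimal Python sketch. This is not the authors' R/PAMR code. The toy classifier (gene selection by difference in class means, followed by nearest centroid), the function names and the data dimensions are all assumptions made for illustration. The data are simulated to be non-informative, so an honest error estimate should sit near 50%.

```python
import random
from statistics import mean


def make_folds(indices, k, rng):
    """Shuffle the given sample indices and deal them into k folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def fit(X, y, train, n_genes):
    """Toy stand-in classifier (an assumption, not PAMR): keep the n_genes whose
    class means differ most on the training samples, and store class centroids."""
    classes = sorted({y[i] for i in train})
    n_features = len(X[0])
    centroids = {
        c: [mean(X[i][g] for i in train if y[i] == c) for g in range(n_features)]
        for c in classes
    }

    def spread(g):
        values = [centroids[c][g] for c in classes]
        return max(values) - min(values)

    genes = sorted(range(n_features), key=spread, reverse=True)[:n_genes]
    return genes, centroids


def predict(model, x):
    """Assign x to the class with the nearest centroid on the selected genes."""
    genes, centroids = model
    return min(centroids, key=lambda c: sum((x[g] - centroids[c][g]) ** 2 for g in genes))


def count_errors(X, y, train, test, n_genes):
    """Fit on the training indices and count misclassified test samples."""
    model = fit(X, y, train, n_genes)
    return sum(predict(model, X[i]) != y[i] for i in test)


def cv_error(X, y, indices, n_genes, k, rng):
    """Single-level k-fold CV error rate for one fixed tuning value.
    Gene selection is redone inside every training fold (external CV)."""
    errors = 0
    for fold in make_folds(indices, k, rng):
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        errors += count_errors(X, y, train, fold, n_genes)
    return errors / len(indices)


def two_level_cv_error(X, y, candidates, k_outer, k_inner, rng):
    """Outer folds estimate the error; the tuning value is chosen by an inner
    CV that only ever sees the outer training samples."""
    indices = list(range(len(y)))
    errors = 0
    for fold in make_folds(indices, k_outer, rng):
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        best = min(candidates, key=lambda m: cv_error(X, y, train, m, k_inner, rng))
        errors += count_errors(X, y, train, fold, best)
    return errors / len(indices)


# Simulated non-informative data: the labels are unrelated to the predictors.
rng = random.Random(0)
n_samples, n_features = 40, 200
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]
candidates = [1, 5, 20]  # hypothetical grid for the number of genes kept

optimized = min(cv_error(X, y, range(n_samples), m, 5, rng) for m in candidates)
two_level = two_level_cv_error(X, y, candidates, 5, 5, rng)
print(f"minimum single-level CV error: {optimized:.3f}")
print(f"two-level CV error:            {two_level:.3f}")
```

The first figure is the minimum over several noisy estimates, which is the optimization bias the abstract refers to. The second never lets the held-out outer fold influence the choice of tuning value. With a single small simulated dataset either number can vary considerably, so the comparison is best repeated over many random seeds.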