Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings
Open Access
- 1 January 2007
- Research article
- Published by Springer Nature in EURASIP Journal on Bioinformatics and Systems Biology
- Vol. 2007 (1) , 1-12
- https://doi.org/10.1155/2007/38473
Abstract
The aim of many microarray experiments is to build discriminatory diagnosis and prognosis models. Given the huge number of features and the small number of examples, model validity which refers to the precision of error estimation is a critical issue. Previous studies have addressed this issue via the deviation distribution (estimated error minus true error), in particular, the deterioration of cross-validation precision in high-dimensional settings where feature selection is used to mitigate the peaking phenomenon (overfitting). Because classifier design is based upon random samples, both the true and estimated errors are sample-dependent random variables, and one would expect a loss of precision if the estimated and true errors are not well correlated, so that natural questions arise as to the degree of correlation and the manner in which lack of correlation impacts error estimation. We demonstrate the effect of correlation on error precision via a decomposition of the variance of the deviation distribution, observe that the correlation is often severely decreased in high-dimensional settings, and show that the effect of high dimensionality on error estimation tends to result more from its decorrelating effects than from its impact on the variance of the estimated error. We consider the correlation between the true and estimated errors under different experimental conditions using both synthetic and real data, several feature-selection methods, different classification rules, and three error estimators commonly used (leave-one-out cross-validation, k-fold cross-validation, and .632 bootstrap). Moreover, three scenarios are considered: (1) feature selection, (2) known-feature set, and (3) all features. Only the first is of practical interest; however, the other two are needed for comparison purposes. 
We will observe that the true and estimated errors tend to be much more correlated in the case of a known feature set than with either feature selection or using all features, with the relative correlation between the latter two showing no general trend, but differing for different models.
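The variance decomposition the abstract refers to follows from the usual identity for the deviation: Var(estimated − true) = Var(estimated) + Var(true) − 2ρ·sd(estimated)·sd(true), so a weak correlation ρ inflates the deviation variance even when the estimator's own variance is unchanged. The sketch below illustrates this with a Monte Carlo simulation; the specific settings (a nearest-mean classifier, two spherical Gaussian classes, LOOCV as the estimator, a large held-out set standing in for the true error) are assumptions for illustration, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_mean_error(mu0, mu1, X, y):
    """Error rate of the nearest-class-mean rule on (X, y)."""
    d0 = np.linalg.norm(X - mu0, axis=1)
    d1 = np.linalg.norm(X - mu1, axis=1)
    return float(np.mean((d1 < d0).astype(int) != y))

def loocv_error(X, y):
    """Leave-one-out cross-validation estimate for the nearest-mean rule."""
    n = len(y)
    errs = 0.0
    for i in range(n):
        mask = np.ones(n, dtype=bool)
        mask[i] = False
        Xi, yi = X[mask], y[mask]
        mu0, mu1 = Xi[yi == 0].mean(axis=0), Xi[yi == 1].mean(axis=0)
        errs += nearest_mean_error(mu0, mu1, X[i:i + 1], y[i:i + 1])
    return errs / n

# Assumed synthetic model: two spherical Gaussians in p dimensions,
# separated by `delta` along the first coordinate.
p, n_per_class, trials = 5, 10, 200
delta = np.zeros(p)
delta[0] = 1.0

# Large held-out test set stands in for the "true" error.
m = 2000
Xte = np.vstack([rng.normal(size=(m, p)), rng.normal(size=(m, p)) + delta])
yte = np.r_[np.zeros(m), np.ones(m)].astype(int)

est_errs, true_errs = [], []
for _ in range(trials):
    # Draw a small training sample; both the designed classifier's true
    # error and its LOOCV estimate are random through this sample.
    X = np.vstack([rng.normal(size=(n_per_class, p)),
                   rng.normal(size=(n_per_class, p)) + delta])
    y = np.r_[np.zeros(n_per_class), np.ones(n_per_class)].astype(int)
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    true_errs.append(nearest_mean_error(mu0, mu1, Xte, yte))
    est_errs.append(loocv_error(X, y))

est, true = np.array(est_errs), np.array(true_errs)
rho = np.corrcoef(est, true)[0, 1]

# Variance decomposition of the deviation (estimated minus true error):
# Var(est - true) = Var(est) + Var(true) - 2 * rho * sd(est) * sd(true)
lhs = np.var(est - true)
rhs = np.var(est) + np.var(true) - 2 * rho * est.std() * true.std()
```

Rerunning this with larger `p` (and feature selection on the training sample) is the kind of experiment the paper describes: the point is that `rho` drops in high dimensions, driving `lhs` up even if `np.var(est)` stays modest.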