New feature subset selection procedures for classification of expression profiles
Open Access
- 14 March 2002
- journal article
- research article
- Published by Springer Nature in Genome Biology
Abstract
Methods for extracting useful information from the datasets produced by microarray experiments are currently of great interest. Here we present new methods for finding gene sets that are well suited for distinguishing experiment classes, such as healthy versus diseased tissues. Our methods evaluate genes in pairs and score how well each pair, in combination, distinguishes two experiment classes.

We tested the ability of our pair-based methods to select gene sets that generalize the differences between experiment classes, and compared their performance with that of two standard methods. To assess the ability to generalize class differences, we studied how well the selected gene sets are suited for learning a classifier. We show that the gene sets selected by our methods outperform those selected by the standard methods, in some cases by a large margin, in terms of the cross-validation prediction accuracy of the learned classifier. We show that on two public datasets, accurate diagnoses can be made using only 15-30 genes.

Our results have implications for how to select marker genes and how many gene measurements are needed for diagnostic purposes. When looking for differential expression between experiment classes, it may not be sufficient to consider each gene in isolation. Evaluating combinations of genes reveals interesting information that would not be discovered otherwise. Our results show that class prediction can be improved by taking advantage of this extra information.
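The idea of scoring genes in pairs can be illustrated with a small sketch. The abstract does not state the scoring function, so everything below is an assumption made for illustration: each pair is projected onto the line joining the two class means and scored with a Welch-style t-statistic, and pairs are then picked greedily. The function names (`t_score`, `gene_score`, `pair_score`, `select_genes_by_pairs`) and the toy data are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from statistics import mean, variance
import math


def t_score(group_a, group_b):
    """Absolute Welch-style t-statistic between two lists of numbers."""
    spread = math.sqrt(variance(group_a) / len(group_a)
                       + variance(group_b) / len(group_b))
    gap = abs(mean(group_a) - mean(group_b))
    if spread == 0:
        return math.inf if gap > 0 else 0.0
    return gap / spread


def split_by_class(values, labels):
    """Split one value per sample into the two classes (labels 0 and 1)."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    return a, b


def gene_score(samples, labels, g):
    """Score a single gene on its own."""
    return t_score(*split_by_class([row[g] for row in samples], labels))


def pair_score(samples, labels, i, j):
    """Score genes i and j jointly (assumed rule, not the paper's)."""
    a_i, b_i = split_by_class([row[i] for row in samples], labels)
    a_j, b_j = split_by_class([row[j] for row in samples], labels)
    # Direction joining the two class means in the (gene i, gene j) plane.
    w_i = mean(a_i) - mean(b_i)
    w_j = mean(a_j) - mean(b_j)
    projected = [w_i * row[i] + w_j * row[j] for row in samples]
    return t_score(*split_by_class(projected, labels))


def select_genes_by_pairs(samples, labels, n_genes):
    """Greedily take the best-scoring disjoint pairs, up to n_genes genes."""
    n = len(samples[0])
    ranked = sorted(combinations(range(n), 2),
                    key=lambda p: pair_score(samples, labels, *p),
                    reverse=True)
    chosen = []
    for i, j in ranked:
        if i not in chosen and j not in chosen and len(chosen) + 2 <= n_genes:
            chosen += [i, j]
    return chosen


# Toy data (made up): rows are samples, columns are genes 0, 1 and 2.
# Genes 0 and 1 share a large common trend; only their difference
# separates the classes. Gene 2 is moderately informative by itself.
samples = [
    [0.6, -0.5, 1.0], [2.4, 1.5, 2.0], [4.5, 3.6, 1.5], [6.5, 5.4, 2.5],
    [-0.5, 0.6, 0.0], [1.6, 2.5, 1.0], [3.5, 4.4, 0.5], [5.4, 6.5, 1.5],
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

single = [gene_score(samples, labels, g) for g in range(3)]
best_single = max(range(3), key=lambda g: single[g])
selected = select_genes_by_pairs(samples, labels, n_genes=2)
```

In this toy data, gene 2 has the highest individual score, yet the pair of genes 0 and 1 scores higher jointly than any single gene, so a one-gene-at-a-time ranking would miss it. Estimating how well a selected set generalizes would additionally require learning a classifier and cross-validating it, with the gene selection repeated inside each fold.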