Genetic test bed for feature selection

Open Access

20 January 2006

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 22 (7), 837-842
https://doi.org/10.1093/bioinformatics/btl008

Abstract

Motivation: Given a large set of potential features, such as the set of all gene-expression values from a microarray, it is necessary to find a small subset with which to classify. The task of finding an optimal feature set of a given size is inherently combinatorial because, to assure optimality, all feature sets of that size must be checked. Thus, numerous suboptimal feature-selection algorithms have been proposed. There are strong impediments to evaluating feature-selection algorithms using real data when data are limited, a common situation in genetic classification. The difficulty is compound. First, there are no class-conditional distributions from which to draw data points, only a single small labeled sample. Second, there are no test data with which to estimate the feature-set errors, and one must depend on a training-data-based error estimator. Finally, there is no optimal feature set with which to compare the feature sets found by the algorithms.

Results: This paper describes a genetic test bed for the evaluation of feature-selection algorithms. It begins with a large biological feature-label dataset that is used as an empirical distribution and, using massively parallel computation, finds the top feature sets of various sizes based on a given sample size and classification rule. The user can draw random samples from the data, apply a proposed algorithm, and evaluate the proficiency of the proposed algorithm via three different measures (code provided). A key feature of the test bed is that, once a dataset is input, a single command creates the entire test bed relative to the dataset. The particular dataset used for the first version of the test bed comes from a microarray-based classification study that analyzes a large number of microarrays, prepared with RNA from breast tumor samples from each of 295 patients.

Availability: The software and supplementary material are available at

Contact: edward@ece.tamu.edu
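The workflow described in the abstract (treat a large dataset as an empirical distribution, draw a small sample, run a feature-selection algorithm, and compare its choice with the exhaustively ranked feature sets) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the test bed's actual code: the data are synthetic rather than the 295-patient microarray set, the classification rule is a simple nearest-mean classifier, `filter_select` is a hypothetical stand-in for a "proposed algorithm", and the two reported quantities (`error_excess`, `rank`) are illustrative and are not claimed to be the paper's three measures. The ranking here also uses a single sample, whereas the abstract speaks of top feature sets for a given sample size and classification rule.

```python
import itertools
import math
import random


def make_synthetic_dataset(n_points=295, n_features=8, seed=0):
    """Synthetic stand-in for a large feature-label dataset (not the real data)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_points):
        label = rng.randint(0, 1)
        # Assumption: only features 0 and 1 carry class information.
        x = [rng.gauss(1.5 * label if j < 2 else 0.0, 1.0) for j in range(n_features)]
        data.append((x, label))
    return data


def draw_sample(data, n, rng):
    """Use the dataset as an empirical distribution: draw n points, hold out the rest."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    return [data[i] for i in idx[:n]], [data[i] for i in idx[n:]]


def class_means(train, features):
    """Per-class mean of each selected feature on the training sample."""
    means = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        means[label] = [sum(r[j] for r in rows) / len(rows) for j in features]
    return means


def holdout_error(train, test, features):
    """Nearest-mean classifier designed on `train`; error rate measured on `test`."""
    means = class_means(train, features)
    wrong = 0
    for x, y in test:
        dist = {c: sum((x[j] - m) ** 2 for j, m in zip(features, means[c])) for c in means}
        wrong += min(dist, key=dist.get) != y
    return wrong / len(test)


def filter_select(train, n_features, k):
    """Hypothetical 'proposed algorithm': keep the k features with the largest class-mean gap."""
    all_feats = list(range(n_features))
    means = class_means(train, all_feats)
    gap = [abs(means[1][j] - means[0][j]) for j in all_feats]
    return tuple(sorted(sorted(all_feats, key=lambda j: -gap[j])[:k]))


def evaluate(data, n_features, n=40, k=2, seed=1):
    rng = random.Random(seed)
    train, test = draw_sample(data, n, rng)
    # Exhaustive search: every one of the C(n_features, k) subsets is checked.
    ranked = sorted(itertools.combinations(range(n_features), k),
                    key=lambda s: holdout_error(train, test, s))
    chosen = filter_select(train, n_features, k)
    best_err = holdout_error(train, test, ranked[0])
    chosen_err = holdout_error(train, test, chosen)
    return {"n_subsets": len(ranked), "best": ranked[0], "chosen": chosen,
            "error_excess": chosen_err - best_err, "rank": ranked.index(chosen)}


data = make_synthetic_dataset()
result = evaluate(data, n_features=8)
print(math.comb(8, 2), result)
```

Even in this toy setting the exhaustive step must score all C(8, 2) = 28 subsets; with thousands of gene-expression features the count grows far too quickly for a single machine, which is consistent with the abstract's reliance on massively parallel computation.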
