Variable importance in binary regression trees and forests
Open Access
- 1 January 2007
- journal article
- research article
- Published by Institute of Mathematical Statistics in Electronic Journal of Statistics
- Vol. 1, 519-537
- https://doi.org/10.1214/07-ejs039
Abstract
We characterize and study variable importance (VIMP) and pairwise variable associations in binary regression trees. A key component involves the node mean squared error for a quantity we refer to as a maximal subtree. The theory naturally extends from single trees to ensembles of trees and applies to methods like random forests. This is useful because, although importance values from random forests are used to screen variables (for example, to filter high-throughput genomic data in bioinformatics), very little theory exists about their properties.
This publication has 6 references indexed in Scilit:
- SmcHD1, containing a structural-maintenance-of-chromosomes hinge domain, has a critical role in X inactivation. Nature Genetics, 2008
- Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 2007
- Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 2006
- Screening large-scale association study data: exploiting interactions using random forests. BMC Genomic Data, 2004
- Identifying SNPs predictive of phenotype using random forests. Genetic Epidemiology, 2004
- Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statistical Science, 2001