An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests.
- 1 December 2009
- journal article
- review article
- Published by American Psychological Association (APA) in Psychological Methods
- Vol. 14 (4), 323-348
- https://doi.org/10.1037/a0016973
Abstract
Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Especially random forests, which can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine, and bioinformatics within the past few years. High-dimensional problems are common not only in genetics, but also in some areas of psychological research, where only a few subjects can be measured because of time or cost constraints, yet a large amount of data is generated for each subject. Random forests have been shown to achieve a high prediction accuracy in such applications and to provide descriptive variable importance measures reflecting the impact of each variable in both main effects and interactions. The aim of this work is to introduce the principles of the standard recursive partitioning methods as well as recent methodological improvements, to illustrate their usage for low and high-dimensional data exploration, but also to point out limitations of the methods and potential pitfalls in their practical application. Application of the methods is illustrated with freely available implementations in the R system for statistical computing.