An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests.

1 December 2009

journal article
review article
Published by American Psychological Association (APA) in Psychological Methods

Vol. 14 (4), 323-348
https://doi.org/10.1037/a0016973

Abstract

Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Random forests in particular, which can handle large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine, and bioinformatics within the past few years. High-dimensional problems are common not only in genetics but also in some areas of psychological research, where only a few subjects can be measured because of time or cost constraints, yet a large amount of data is generated for each subject. Random forests have been shown to achieve high prediction accuracy in such applications and to provide descriptive variable importance measures that reflect the impact of each variable in both main effects and interactions. The aim of this work is to introduce the principles of the standard recursive partitioning methods as well as recent methodological improvements, to illustrate their use for low- and high-dimensional data exploration, and to point out limitations of the methods and potential pitfalls in their practical application. Application of the methods is illustrated with freely available implementations in the R system for statistical computing.
