Analyzing bagging

Open Access

1 August 2002

journal article
Published by Institute of Mathematical Statistics in The Annals of Statistics

Vol. 30 (4) , 927-961
https://doi.org/10.1214/aos/1031689014

Abstract

Bagging is one of the most effective computationally intensive procedures to improve on unstable estimators or classifiers, useful especially for high-dimensional data set problems. Here we formalize the notion of instability and derive theoretical results to analyze the variance reduction effect of bagging (or a variant of it), mainly in hard decision problems, which include estimation after testing in regression and decision trees for continuous regression functions and classifiers. Hard decisions create instability, and bagging is shown to smooth such hard decisions, yielding smaller variance and mean squared error. With theoretical explanations, we motivate subagging, based on subsampling, as an alternative aggregation scheme. It is computationally cheaper but still shows approximately the same accuracy as bagging. Moreover, our theory reveals improvements in first order, in line with simulation studies. In particular, we obtain an asymptotic limiting distribution at the cube-root rate for the split point when fitting piecewise constant functions. Denoting sample size by n, it follows that in a cylindric neighborhood of diameter n^{-1/3} of the theoretically optimal split point, the variance and mean squared error reduction of subagging can be characterized analytically. Because of the slow rate, our reasoning also provides an explanation on the global scale for the whole covariate space in a decision tree with finitely many splits.
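The smoothing effect described in the abstract can be illustrated with a small simulation. This is a minimal sketch, not code from the paper: the toy estimator (an indicator of whether the sample mean is at most x), the function names, the standard normal data, and the settings n = 50, 50 aggregation rounds, and subsample size n/2 are all assumptions chosen for illustration. It compares the Monte Carlo variance of the hard indicator with its bagged version (bootstrap resampling with replacement) and its subagged version (subsampling without replacement).

```python
import numpy as np


def hard_indicator(y, x):
    """Hard-decision estimator: 1.0 if the sample mean is at most x, else 0.0."""
    return float(y.mean() <= x)


def aggregate(y, x, rng, n_rounds=50, subsample_size=None):
    """Average the hard indicator over resampled data sets.

    subsample_size=None -> bagging: draw n points with replacement.
    subsample_size=m    -> subagging: draw m < n points without replacement.
    """
    n = len(y)
    values = []
    for _ in range(n_rounds):
        if subsample_size is None:
            idx = rng.integers(0, n, size=n)  # bootstrap sample
        else:
            idx = rng.choice(n, size=subsample_size, replace=False)
        values.append(hard_indicator(y[idx], x))
    return float(np.mean(values))


def simulate(n=50, n_reps=300, x=0.0, seed=0):
    """Monte Carlo variance of the plain, bagged and subagged estimators.

    x = 0 is the true mean here, i.e. the point where the hard decision
    is most unstable (illustrative choice).
    """
    rng = np.random.default_rng(seed)
    plain, bagged, subagged = [], [], []
    for _ in range(n_reps):
        y = rng.standard_normal(n)
        plain.append(hard_indicator(y, x))
        bagged.append(aggregate(y, x, rng))
        subagged.append(aggregate(y, x, rng, subsample_size=n // 2))
    return {
        "plain": float(np.var(plain)),
        "bagged": float(np.var(bagged)),
        "subagged": float(np.var(subagged)),
    }


if __name__ == "__main__":
    for name, variance in simulate().items():
        print(f"{name:9s} variance: {variance:.3f}")
```

In this toy setting the plain indicator behaves like a fair coin flip, while both aggregated versions take values between 0 and 1 and should show clearly smaller variance. Subagging with half the sample refits on smaller data sets, which is the source of the computational saving the abstract mentions.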
