Boosting: An Ensemble Learning Tool for Compound Classification and QSAR Modeling

20 April 2005

journal article
research article
Published by American Chemical Society (ACS) in Journal of Chemical Information and Modeling

Vol. 45 (3) , 786-799
https://doi.org/10.1021/ci0500379

Abstract

A classification and regression tool, J. H. Friedman's Stochastic Gradient Boosting (SGB), is applied to predicting a compound's quantitative or categorical biological activity based on a quantitative description of the compound's molecular structure. Stochastic Gradient Boosting is a procedure for building a sequence of models, for instance regression trees (as in this paper), whose outputs are combined to form a predicted quantity, either an estimate of the biological activity, or a class label to which a molecule belongs. In particular, the SGB procedure builds a model in a stage-wise manner by fitting each tree to the gradient of a loss function: e.g., squared error for regression and binomial log-likelihood for classification. The values of the gradient are computed for each sample in the training set, but only a random sample of these gradients is used at each stage. (Friedman showed that the well-known boosting algorithm, AdaBoost of Freund and Schapire, could be considered as a particular case of SGB.) The SGB method is used to analyze 10 cheminformatics data sets, most of which are publicly available. The results show that SGB's performance is comparable to that of Random Forest, another ensemble learning method, and is generally competitive with or superior to that of other QSAR methods. The use of SGB's variable importance with partial dependence plots for model interpretation is also illustrated.
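The stage-wise procedure described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes squared-error loss, a single numeric descriptor, and one-split trees (stumps) in place of full regression trees. The function names (`fit_stump`, `sgb_fit`, `sgb_predict`) and the parameter defaults are hypothetical and chosen only for illustration. For squared error, the negative gradient at each training sample is simply the current residual, and each stage fits its tree to a random subsample of those residuals.

```python
import math
import random


def fit_stump(xs, targets):
    """Least-squares one-split regression tree on a single numeric feature."""
    pairs = sorted(zip(xs, targets))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate equal feature values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((t - lmean) ** 2 for t in left)
               + sum((t - rmean) ** 2 for t in right))
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, lmean, rmean)
    if best is None:  # all feature values equal: fall back to a constant
        mean = sum(targets) / len(targets)
        return (float("inf"), mean, mean)
    return best[1:]  # (threshold, left value, right value)


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def sgb_fit(xs, ys, n_stages=200, learning_rate=0.1, subsample=0.5, seed=0):
    """Stochastic gradient boosting for squared-error loss (illustrative)."""
    rng = random.Random(seed)
    n = len(xs)
    k = min(n, max(1, int(subsample * n)))  # subsample size per stage
    f0 = sum(ys) / n                        # initial constant model
    preds = [f0] * n
    stumps = []
    for _ in range(n_stages):
        # Negative gradient of 0.5 * (y - F)^2 with respect to F: y - F.
        grads = [y - p for y, p in zip(ys, preds)]
        # Only a random sample of the gradients is used at this stage.
        idx = rng.sample(range(n), k)
        stump = fit_stump([xs[i] for i in idx], [grads[i] for i in idx])
        stumps.append(stump)
        # Shrunken update of the running prediction on all training samples.
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return f0, learning_rate, stumps


def sgb_predict(model, x):
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Toy usage on synthetic data (not one of the paper's data sets).
xs = [i / 10 for i in range(40)]
ys = [math.sin(x) for x in xs]
model = sgb_fit(xs, ys)
train_mse = sum((sgb_predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

The subsample fraction and the learning rate are the two knobs that distinguish this from plain gradient boosting: drawing a fresh random subset at each stage injects randomness into the sequence of trees, while the learning rate shrinks each tree's contribution. For classification, the same loop would be run with the gradient of the binomial log-likelihood in place of the residual.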

Keywords

ENSEMBLE LEARNING
