Boosting: An Ensemble Learning Tool for Compound Classification and QSAR Modeling
- 20 April 2005
- journal article
- research article
- Published by American Chemical Society (ACS) in Journal of Chemical Information and Modeling
- Vol. 45 (3) , 786-799
- https://doi.org/10.1021/ci0500379
Abstract
A classification and regression tool, J. H. Friedman's Stochastic Gradient Boosting (SGB), is applied to predicting a compound's quantitative or categorical biological activity based on a quantitative description of the compound's molecular structure. Stochastic Gradient Boosting is a procedure for building a sequence of models, for instance regression trees (as in this paper), whose outputs are combined to form a predicted quantity, either an estimate of the biological activity, or a class label to which a molecule belongs. In particular, the SGB procedure builds a model in a stage-wise manner by fitting each tree to the gradient of a loss function: e.g., squared error for regression and binomial log-likelihood for classification. The values of the gradient are computed for each sample in the training set, but only a random sample of these gradients is used at each stage. (Friedman showed that the well-known boosting algorithm, AdaBoost of Freund and Schapire, could be considered as a particular case of SGB.) The SGB method is used to analyze 10 cheminformatics data sets, most of which are publicly available. The results show that SGB's performance is comparable to that of Random Forest, another ensemble learning method, and are generally competitive with or superior to those of other QSAR methods. The use of SGB's variable importance with partial dependence plots for model interpretation is also illustrated.
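The stage-wise procedure described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a single numeric descriptor, squared-error loss (whose negative gradient is the ordinary residual), and one-split regression trees ("stumps") in place of full regression trees. The function names (`fit_stump`, `sgb_fit`, `sgb_predict`) and the parameter values (`n_stages`, `learning_rate`, `subsample`) are hypothetical choices made for illustration only.

```python
import random


def fit_stump(xs, targets):
    """Least-squares fit of a one-split regression tree on one feature."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # No valid split (all x identical): fall back to a constant.
        mean = sum(targets) / len(targets)
        return (float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def sgb_fit(xs, ys, n_stages=200, learning_rate=0.1, subsample=0.5, seed=0):
    """Stochastic gradient boosting for squared-error loss (illustrative)."""
    rng = random.Random(seed)
    n = len(xs)
    init = sum(ys) / n                      # stage 0: constant model
    preds = [init] * n
    stumps = []
    k = min(n, max(2, int(subsample * n)))  # size of the random subsample
    for _ in range(n_stages):
        # Negative gradient of 0.5 * (y - F)^2 with respect to F is y - F,
        # computed for every training sample.
        residuals = [y - p for y, p in zip(ys, preds)]
        # Only a random subsample of these gradients is used to fit the tree.
        idx = rng.sample(range(n), k)
        stump = fit_stump([xs[i] for i in idx], [residuals[i] for i in idx])
        stumps.append(stump)
        # Shrunken update of the ensemble prediction on all samples.
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return init, learning_rate, stumps


def sgb_predict(model, x):
    init, learning_rate, stumps = model
    return init + learning_rate * sum(stump_predict(s, x) for s in stumps)


if __name__ == "__main__":
    # Toy data (made up for this sketch): activity jumps when x exceeds 2.
    xs = [i / 10.0 for i in range(40)]
    ys = [1.0 if x > 2.0 else 0.0 for x in xs]
    model = sgb_fit(xs, ys)
    print(sgb_predict(model, 0.5), sgb_predict(model, 3.5))
```

For classification, the same loop would apply with the binomial log-likelihood mentioned in the abstract, so that each stage fits the gradient of that loss rather than the plain residual; that variant is not shown here.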
This publication has 32 references indexed in Scilit:
- SmcHD1, containing a structural-maintenance-of-chromosomes hinge domain, has a critical role in X inactivation. Nature Genetics, 2008
- Similarity to Molecules in the Training Set Is a Good Discriminator for Prediction Accuracy in QSAR. Journal of Chemical Information and Computer Sciences, 2004
- Induction of Decision Trees via Evolutionary Programming. Journal of Chemical Information and Computer Sciences, 2004
- Informative Library Design as an Efficient Strategy to Identify and Optimize Leads: Application to Cyclin-Dependent Kinase 2 Antagonists. Journal of Medicinal Chemistry, 2003
- The support vector machine under test. Neurocomputing, 2003
- Use of Robust Classification Techniques for the Prediction of Human Cytochrome P450 2D6 Inhibition. Journal of Chemical Information and Computer Sciences, 2003
- Use of Recursion Forests in the Sequential Screening Process: Consensus Selection by Multiple Recursion Trees. Journal of Chemical Information and Computer Sciences, 2003
- Decision Forest: Combining the Predictions of Multiple Independent Decision Tree Models. Journal of Chemical Information and Computer Sciences, 2003
- A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 1997
- Atom pairs as molecular features in structure-activity studies: definition and applications. Journal of Chemical Information and Computer Sciences, 1985