One- to Four-Dimensional Kernels for Virtual Screening and the Prediction of Physical, Chemical, and Biological Properties
- 6 March 2007
- journal article
- Published by American Chemical Society (ACS) in Journal of Chemical Information and Modeling
- Vol. 47 (3) , 965-974
- https://doi.org/10.1021/ci600397p
Abstract
Many chemoinformatics applications, including high-throughput virtual screening, benefit from being able to rapidly predict the physical, chemical, and biological properties of small molecules to screen large repositories and identify suitable candidates. When training sets are available, machine learning methods provide an effective alternative to ab initio methods for these predictions. Here, we leverage rich molecular representations including 1D SMILES strings, 2D graphs of bonds, and 3D coordinates to derive efficient machine learning kernels to address regression problems. We further expand the library of available spectral kernels for small molecules developed for classification problems to include 2.5D surface and 3D kernels using Delaunay tetrahedrization and other techniques from computational geometry, 3D pharmacophore kernels, and 3.5D or 4D kernels capable of taking into account multiple molecular configurations, such as conformers. The kernels are comprehensively tested using cross-validation and redundancy-reduction methods on regression problems using several available data sets to predict boiling points, melting points, aqueous solubility, octanol/water partition coefficients, and biological activity, with state-of-the-art results. When sufficient training data are available, 2D spectral kernels in general tend to yield the best and most robust results, exceeding the state of the art. On data sets containing thousands of molecules, the kernels achieve a squared correlation coefficient of 0.91 for aqueous solubility prediction and 0.94 for octanol/water partition coefficient prediction. Averaging over conformations improves the performance of kernels based on the three-dimensional structure of molecules, especially on challenging data sets. Kernel predictors for aqueous solubility (kSOL), LogP (kLOGP), and melting point (kMELT) are available over the Web through: http://cdb.ics.uci.edu.
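To make the idea of a 2D spectral kernel concrete, the following is a minimal sketch, not the authors' implementation. It assumes a molecule is given as a list of atom labels plus a dictionary of labeled bonds (hydrogens suppressed), counts every labeled simple path up to a chosen depth, and compares two molecules with a MinMax or Tanimoto similarity on those path features. The function names `path_features`, `minmax_kernel`, and `tanimoto_kernel`, the input encoding, and the default depth are all assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter


def path_features(atoms, bonds, max_depth=3):
    """Count labeled simple paths with at most max_depth bonds.

    atoms: list of atom labels, e.g. ["C", "C", "O"] (hypothetical encoding)
    bonds: dict mapping (i, j) index pairs to a bond label, e.g. "-" or "="
    Paths are directional, so "C-O" and "O-C" are counted as separate features.
    """
    # Build an undirected adjacency list: atom index -> [(neighbor, bond label)]
    adj = {i: [] for i in range(len(atoms))}
    for (i, j), order in bonds.items():
        adj[i].append((j, order))
        adj[j].append((i, order))

    counts = Counter()

    def walk(node, visited, label):
        counts[label] += 1
        # A path over len(visited) atoms contains len(visited) - 1 bonds.
        if len(visited) > max_depth:
            return
        for nxt, order in adj[node]:
            if nxt not in visited:  # simple paths only: never revisit an atom
                walk(nxt, visited | {nxt}, label + order + atoms[nxt])

    for start in range(len(atoms)):
        walk(start, {start}, atoms[start])
    return counts


def minmax_kernel(a, b):
    """MinMax similarity on path counts: sum of minima over sum of maxima."""
    keys = set(a) | set(b)
    num = sum(min(a[k], b[k]) for k in keys)
    den = sum(max(a[k], b[k]) for k in keys)
    return num / den if den else 0.0


def tanimoto_kernel(a, b):
    """Tanimoto similarity on binary path presence: |A and B| / |A or B|."""
    sa, sb = set(a), set(b)
    union = len(sa | sb)
    return len(sa & sb) / union if union else 0.0


if __name__ == "__main__":
    # Ethanol (C-C-O) and methanol (C-O), heavy atoms only.
    ethanol = path_features(["C", "C", "O"], {(0, 1): "-", (1, 2): "-"}, max_depth=2)
    methanol = path_features(["C", "O"], {(0, 1): "-"}, max_depth=2)
    print(round(minmax_kernel(ethanol, methanol), 3))    # 0.444
    print(round(tanimoto_kernel(ethanol, methanol), 3))  # 0.571
```

A kernel matrix built from such pairwise values could then be passed to any kernel regression method that accepts a precomputed kernel. The 3D and 4D kernels described in the abstract need geometric features and conformer sets, which this sketch does not cover.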