Enrichment of High-Throughput Screening Data with Increasing Levels of Noise Using Support Vector Machines, Recursive Partitioning, and Laplacian-Modified Naive Bayesian Classifiers
- 3 December 2005
- journal article
- research article
- Published by American Chemical Society (ACS) in Journal of Chemical Information and Modeling
- Vol. 46 (1) , 193-200
- https://doi.org/10.1021/ci050374h
Abstract
High-throughput screening (HTS) plays a pivotal role in lead discovery for the pharmaceutical industry. In tandem, cheminformatics approaches are employed to increase the probability of the identification of novel biologically active compounds by mining the HTS data. HTS data is notoriously noisy, and therefore, the selection of the optimal data mining method is important for the success of such an analysis. Here, we describe a retrospective analysis of four HTS data sets using three mining approaches: Laplacian-modified naive Bayes, recursive partitioning, and support vector machine (SVM) classifiers with increasing stochastic noise in the form of false positives and false negatives. All three of the data mining methods at hand tolerated increasing levels of false positives even when the ratio of misclassified compounds to true active compounds was 5:1 in the training set. False negatives in the ratio of 1:1 were tolerated as well. SVM outperformed the other two methods in capturing active compounds and scaffolds in the top 1%. A Murcko scaffold analysis could explain the differences in enrichments among the four data sets. This study demonstrates that data mining methods can add a true value to the screen even when the data is contaminated with a high level of stochastic noise.
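The noise-injection and top-1% evaluation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code. The function names `add_label_noise` and `enrichment_factor` are hypothetical, and the way the ratios are interpreted is an assumption: false positives are counted per original true active, and a false-negative ratio of r relabels the fraction r/(1+r) of the actives as inactive. The enrichment factor used here (hit rate in the top fraction divided by the overall hit rate) is one common way to score "actives captured in the top 1%" and may differ from the paper's exact metric.

```python
import random


def add_label_noise(labels, fp_ratio=0.0, fn_ratio=0.0, seed=0):
    """Return a noisy copy of binary labels (1 = active, 0 = inactive).

    Assumption: fp_ratio is the number of inactives relabeled as active
    per original true active (5.0 corresponds to a 5:1 ratio).
    Assumption: fn_ratio r relabels the fraction r / (1 + r) of the
    actives as inactive (1.0 flips half of them).
    """
    rng = random.Random(seed)
    actives = [i for i, y in enumerate(labels) if y == 1]
    inactives = [i for i, y in enumerate(labels) if y == 0]

    # Cap the number of false positives at the number of available inactives.
    n_fp = min(len(inactives), round(fp_ratio * len(actives)))
    n_fn = round(len(actives) * fn_ratio / (1.0 + fn_ratio))

    noisy = list(labels)
    for i in rng.sample(inactives, n_fp):
        noisy[i] = 1  # false positive: inactive recorded as active
    for i in rng.sample(actives, n_fn):
        noisy[i] = 0  # false negative: active recorded as inactive
    return noisy


def enrichment_factor(scores, labels, top_fraction=0.01):
    """Hit rate among the top-scored fraction divided by the overall hit rate."""
    n_top = max(1, int(len(scores) * top_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits_top = sum(labels[i] for i in ranked[:n_top])
    total_hits = sum(labels)
    if total_hits == 0:
        return 0.0
    return (hits_top / n_top) / (total_hits / len(labels))
```

In a retrospective setup of this kind, a classifier would be trained on the noisy labels and its ranking scored against the clean labels, so that `enrichment_factor` measures how many true actives still reach the top of the list.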