Predicting the Genotoxicity of Secondary and Aromatic Amines Using Data Subsetting To Generate a Model Ensemble
- 30 April 2003
- journal article
- Published by American Chemical Society (ACS) in Journal of Chemical Information and Computer Sciences
- Vol. 43 (3), 949-963
- https://doi.org/10.1021/ci034013i
Abstract
Binary quantitative structure−activity relationship (QSAR) models are developed to classify a data set of 334 aromatic and secondary amine compounds as genotoxic or nongenotoxic based on information calculated solely from chemical structure. Genotoxic endpoints for each compound were determined using the SOS Chromotest in both the presence and absence of an S9 rat liver homogenate. Compounds were considered genotoxic if assay results indicated a positive genotoxicity hit for either the S9 inactivated or S9 activated assay. Each compound in the data set was encoded through the calculation of numerical descriptors that describe various aspects of chemical structure (e.g. topological, geometric, electronic, polar surface area). Furthermore, five additional descriptors that focused on the secondary and aromatic nitrogen atoms in each molecule were calculated specifically for this study. Descriptor subsets were examined using a genetic algorithm search engine interfaced with a k-Nearest Neighbor fitness evaluator to find the most information-rich subsets, which ultimately served as the final predictive models. Models were chosen for their ability to minimize the total number of misclassifications, with special attention given to those models that possessed fewer occurrences of positive toxicity hits being misclassified as nontoxic (false negatives). In addition, a subsetting procedure was used to form an ensemble of models using different combinations of compounds in the training and prediction sets. This was done to ensure that consistent results could be obtained regardless of training set composition. The procedure also allowed for each compound to be externally validated three times by different training set data with the resultant predictions being used in a “majority rules” voting scheme to produce a consensus prediction for each member of the data set. 
The individual models produced an average training set classification rate of 71.6% and an average prediction set classification rate of 67.7%. However, the model ensemble was able to correctly classify the genotoxicity of 72.2% of all prediction set compounds.
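The subsetting and "majority rules" voting idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact subsetting scheme, so the code below assumes a hypothetical one: the data set is shuffled and partitioned three times, so that every compound lands in a prediction set exactly three times, each time predicted by a model trained on different compounds. The function names (`knn_predict`, `consensus_predictions`), the parameters (`n_folds`, `n_rounds`, `k`), and the toy descriptor values are illustrative assumptions, not taken from the paper, and the genetic algorithm descriptor search is omitted.

```python
import random
from collections import Counter


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    `train` is a list of (descriptor_vector, label) pairs; distance is
    squared Euclidean. With binary labels and odd k there are no ties.
    """
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def consensus_predictions(data, n_folds=5, n_rounds=3, k=3, seed=0):
    """Return {compound index: consensus label} from external predictions.

    Assumed scheme (hypothetical): in each of `n_rounds` rounds the data
    are shuffled and split into `n_folds` prediction sets; each prediction
    set is classified by a k-NN model trained on the remaining compounds.
    Every compound therefore receives `n_rounds` external votes.
    """
    rng = random.Random(seed)
    votes = {i: [] for i in range(len(data))}
    for _ in range(n_rounds):
        order = list(range(len(data)))
        rng.shuffle(order)
        for fold in range(n_folds):
            held_out = order[fold::n_folds]
            held_set = set(held_out)
            # Train only on compounds outside the current prediction set.
            train = [data[i] for i in order if i not in held_set]
            for i in held_out:
                votes[i].append(knn_predict(train, data[i][0], k))
    # "Majority rules": the most common of the external votes wins.
    return {i: Counter(v).most_common(1)[0][0] for i, v in votes.items()}


# Toy data (made up for illustration): two well-separated clusters,
# label 0 = nongenotoxic, label 1 = genotoxic.
toy = [((0.1 * x, 0.1 * x), 0) for x in range(10)]
toy += [((5 + 0.1 * x, 5 + 0.1 * x), 1) for x in range(10)]
consensus = consensus_predictions(toy)
```

Using an odd number of rounds with a binary endpoint guarantees that the vote cannot tie, which is presumably one reason three external predictions per compound are convenient for a consensus call.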