Prediction of Aqueous Solubility of Heteroatom-Containing Organic Compounds from Molecular Structure

Abstract
Quantitative structure-property relationships (QSPRs) are used to predict the aqueous solubilities (log S) of heteroatom-containing organic compounds from their molecular structure. Three data sets are examined. Data set 1 contains 176 compounds having one or more nitrogen atoms, some of which also contain oxygen (log S [mol/L] range: −7.41 to 0.96). Data set 2 contains 223 compounds having one or more oxygen atoms and no nitrogen (log S [mol/L] range: −8.77 to 1.57). Data set 3 contains all 399 compounds from sets 1 and 2 (log S [mol/L] range: −8.77 to 1.57). After descriptor generation and feature selection, multiple linear regression (MLR) and computational neural network (CNN) models were developed for aqueous solubility prediction. The best results were obtained with the nonlinear CNN models. Root-mean-square (rms) errors for training on the three data sets ranged from 0.3 to 0.6 log units. All models were validated with external prediction sets, for which the rms errors ranged from 0.6 to 1.5 log units.
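For readers unfamiliar with the error metric, the following is a minimal sketch of how an rms error in log units might be computed from observed and predicted log S values. The function name `rms_error` and the sample values are hypothetical illustrations, not data or code from the paper.

```python
import math


def rms_error(observed, predicted):
    """Root-mean-square error between two equal-length sequences of log S values."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Square each residual, average the squares, then take the square root.
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical log S [mol/L] values, chosen only for illustration.
observed_log_s = [-2.0, -4.5, 0.5, -6.0]
predicted_log_s = [-2.5, -4.0, 0.0, -6.5]

print(rms_error(observed_log_s, predicted_log_s))  # prints 0.5
```

Because log S is a base-10 logarithm, an rms error of 0.5 log units in this sketch corresponds to predictions that are off by roughly a factor of three in solubility.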
