A Difficulty Information Approach to Substituent Selection in QSAR Studies

Abstract
In the development of quantitative structure-activity relationships (QSAR), a small subset of chemical compounds must be chosen for synthesis from a much larger population of potentially bioactive molecules. The ultimate goal of a QSAR study is to identify, at the lowest cost, the most biologically active member of the population. The sample should therefore be selected with regard to both predictive ability and ease of synthesis. This article describes a method, based on information theory, that incorporates both concerns simultaneously into the substituent selection process. The procedure is essentially a generalization of previous algorithms that obtain a D-optimal design by examining prediction variances (Mitchell 1974). Results from applying this difficulty-information approach suggest that the method can achieve large decreases in total synthesis difficulty at the expense of only a moderate decrease in predictive ability.
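The link between D-optimality and prediction variance can be illustrated with a short sketch. For a linear model with information matrix M = X'X, adding a candidate row x multiplies det(M) by 1 + d(x), where d(x) = x' M^{-1} x is the scaled prediction variance at x, so a variance-based search picks the candidate with the largest d(x). The code below is only a minimal illustration under stated assumptions, not the procedure of this article: the function names, the small `ridge` term, and in particular the scoring rule `log(1 + d) - weight * difficulty` are hypothetical choices made here to show how a synthesis-difficulty penalty could enter such a search.

```python
import numpy as np


def prediction_variance(X_design, X_cand, ridge=1e-8):
    """Scaled prediction variance d(x) = x' (X'X)^-1 x for each candidate row.

    The ridge term is an assumption of this sketch; it keeps the matrix
    invertible while the design is still empty or rank deficient.
    """
    p = X_cand.shape[1]
    M = X_design.T @ X_design + ridge * np.eye(p)
    M_inv = np.linalg.inv(M)
    # Row-wise quadratic form x_i' M^-1 x_i.
    return np.einsum("ij,jk,ik->i", X_cand, M_inv, X_cand)


def greedy_select(X, difficulty, n_select, weight=0.0, ridge=1e-8):
    """Greedy forward selection of n_select rows of X.

    Hypothetical score: the gain in log det(X'X), which equals log(1 + d(x)),
    minus weight * difficulty. With weight = 0 this reduces to picking the
    candidate with the largest prediction variance at each step.
    """
    X = np.asarray(X, dtype=float)
    difficulty = np.asarray(difficulty, dtype=float)
    chosen = []
    remaining = list(range(X.shape[0]))
    for _ in range(n_select):
        design = X[np.array(chosen, dtype=int)]
        d = prediction_variance(design, X[remaining], ridge)
        score = np.log1p(d) - weight * difficulty[remaining]
        best = remaining[int(np.argmax(score))]
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Raising `weight` shifts the selection toward compounds that are easier to synthesize, at some cost in the determinant of the information matrix; how the two quantities are actually combined in the difficulty-information criterion is not specified in this abstract.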