An Information‐Theoretic Approach to Descriptor Selection for Database Profiling and QSAR Modeling

Abstract
To rationalize the selection of molecular descriptors for QSAR and other applications, we have adapted the Shannon entropy concept originally developed in digital communication theory. The approach has been extended to facilitate large‐scale analysis of molecular descriptors and their information content in diverse compound databases. This has enabled us to identify descriptors with consistently high information content. Furthermore, it has been possible to select descriptors that are sensitive to systematic property differences between diverse compound collections (synthetic compounds, natural products, drug‐like molecules, or drugs) and to quantify such database‐specific differences. Selection of descriptors based on information content has proved useful for binary QSAR analysis. In this review, we describe the principles of entropy‐based descriptor selection and discuss different applications.
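As a minimal sketch of the underlying idea, the information content of a single descriptor can be estimated by binning its values over a compound collection into a histogram and computing the Shannon entropy of the resulting distribution, H = -Σ p_i log2 p_i. The function name, the equal-width binning scheme, and the explicit value range are illustrative assumptions, not details taken from this abstract.

```python
import math
from collections import Counter


def shannon_entropy(values, n_bins, lo, hi):
    """Shannon entropy (bits) of descriptor values histogrammed into
    n_bins equal-width bins over [lo, hi].

    Hypothetical helper: the binning scheme is an assumption made for
    illustration only.
    """
    if n_bins < 1 or hi <= lo:
        raise ValueError("need n_bins >= 1 and hi > lo")
    width = (hi - lo) / n_bins
    counts = Counter()
    for v in values:
        # Clamp so values at or beyond the range edges fall in the end bins.
        idx = min(max(int((v - lo) / width), 0), n_bins - 1)
        counts[idx] += 1
    total = sum(counts.values())
    # Only occupied bins contribute; empty bins add nothing to the sum.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Usage: a descriptor spread evenly over the bins carries more information
# than one that takes the same value for every compound.
spread = shannon_entropy([0.5, 1.5, 2.5, 3.5], n_bins=4, lo=0.0, hi=4.0)
constant = shannon_entropy([1.0, 1.0, 1.0, 1.0], n_bins=4, lo=0.0, hi=4.0)
```

Under these assumptions, a descriptor whose values spread evenly across all bins reaches the maximum entropy of log2(n_bins), whereas a descriptor that is constant across the collection has zero entropy and would be a poor candidate for selection.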