Analysis and Display of the Size Dependence of Chemical Similarity Coefficients
- 9 April 2003
- journal article
- Published by American Chemical Society (ACS) in Journal of Chemical Information and Computer Sciences
- Vol. 43 (3), 819-828
- https://doi.org/10.1021/ci034001x
Abstract
We discuss the size bias inherent in several chemical similarity coefficients when used for the similarity searching or diversity selection of compound collections. Limits to the upper bounds of 14 standard similarity coefficients are investigated, and the results are used to identify some exceptional characteristics of a few of the coefficients. An additional numerical contribution to the known size bias in the Tanimoto coefficient is identified. Graphical plots with respect to relative bit density are introduced to further assess the coefficients. Our methods reveal the asymmetries inherent in most similarity coefficients that lead to bias in selection, most notably with the Forbes and Russell-Rao coefficients. Conversely, when applied to the recently introduced Modified Tanimoto coefficient, our methods provide support for the view that it is less biased toward molecular size than most. In this work we focus our discussion on fragment-based bit strings, but we demonstrate how our approach can be generalized to continuous representations.
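To make the idea of a size-dependent upper bound concrete, the following is a minimal sketch using the commonly cited textbook forms of three of the coefficients named in the abstract. It is not taken from the article and may differ from the formulations used there. The symbols are assumptions introduced for illustration: `a` and `b` are the numbers of bits set in the two fingerprints, `c` is the number of bits set in both, and `n` is the fingerprint length. The helper `upper_bounds` is hypothetical; it evaluates each coefficient at the largest possible overlap, `c = min(a, b)`.

```python
def tanimoto(a, b, c):
    """Tanimoto (Jaccard) coefficient: c / (a + b - c)."""
    return c / (a + b - c)


def russell_rao(c, n):
    """Russell-Rao coefficient: shared on-bits over fingerprint length."""
    return c / n


def forbes(a, b, c, n):
    """Forbes coefficient: c * n / (a * b)."""
    return c * n / (a * b)


def upper_bounds(a, b, n):
    """Largest attainable value of each coefficient for given bit counts.

    Hypothetical helper: the overlap c cannot exceed the smaller of the
    two bit counts, so each coefficient is evaluated at c = min(a, b).
    """
    c_max = min(a, b)
    return {
        "tanimoto": tanimoto(a, b, c_max),
        "russell_rao": russell_rao(c_max, n),
        "forbes": forbes(a, b, c_max, n),
    }


if __name__ == "__main__":
    n = 1024      # assumed fingerprint length
    query = 100   # assumed number of bits set in the query
    for target in (50, 100, 200, 400):
        print(target, upper_bounds(query, target, n))
```

Under these definitions the best attainable score depends only on the bit counts: the Tanimoto bound is the ratio of the smaller to the larger count, the Russell-Rao bound grows with the smaller count, and the Forbes bound shrinks as the larger count grows.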
This publication has 5 references indexed in Scilit:
- Evaluation of Similarity Measures for Searching the Dictionary of Natural Products Database. Journal of Chemical Information and Computer Sciences, 2003
- A Modification of the Jaccard–Tanimoto Similarity Index for Diverse Selection of Chemical Compounds Using Binary Strings. Technometrics, 2002
- Current trends in lead discovery: are we looking for the appropriate properties? Journal of Computer-Aided Molecular Design, 2002
- The Hidden Component of Size in Two-Dimensional Fragment Descriptors: Side Effects on Sampling in Bioactive Libraries. Journal of Medicinal Chemistry, 1999
- Chemical Similarity Searching. Journal of Chemical Information and Computer Sciences, 1998