Abstract
This paper considers the relationship between the percentage sequence identities of protein chains and the molecular similarities of the ligands they bind. Among a set of alpha-helical proteins from the PDB, it is found that related proteins tend to bind similar ligands. Furthermore, the property of binding similar ligands can be used to define the categories of “like” and “unlike” pairs of protein chains, separated by an approximate cutoff at a sequence identity of 45% or somewhat above. Similarly, the property of binding related protein chains can be used to define “low” and “high” similarity pairs of ligand residues, with a cutoff at a Tanimoto score of 0.70. The ligands bound to two “like” protein chains are five times more likely to be of high similarity than would be expected if protein sequence identity and ligand molecular similarity were independent variables. Nonetheless, given the nature of the PDB, it is unclear whether the same conclusions would be reached with a data set representing an unbiased sample of all protein-ligand complexes in a living cell. The construction of an appropriate data set for such a study represents a significant challenge.
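The two quantities named above, the Tanimoto score and the enrichment relative to independence, can be illustrated with a minimal Python sketch. It assumes that ligands are represented as sets of fingerprint bit indices and that the cutoffs are inclusive; the abstract specifies neither. The function names and the pair counts are hypothetical, chosen only to make the arithmetic visible, and are not the paper's data.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto score of two fingerprint bit sets: |A & B| / |A | B|."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty fingerprints score 0
    return len(a & b) / len(union)


def is_like(identity_pct: float, cutoff: float = 45.0) -> bool:
    """Classify a protein chain pair as "like" (assumed inclusive cutoff)."""
    return identity_pct >= cutoff


def is_high(score: float, cutoff: float = 0.70) -> bool:
    """Classify a ligand pair as "high" similarity (assumed inclusive cutoff)."""
    return score >= cutoff


def enrichment(n_like_high: int, n_like: int, n_high: int, n_total: int) -> float:
    """Observed P(like and high) divided by P(like) * P(high),
    i.e. the ratio to the value expected under independence."""
    return (n_like_high * n_total) / (n_like * n_high)


# Toy fingerprints: 2 shared bits out of 4 distinct bits.
print(tanimoto({1, 2, 3}, {2, 3, 4}))

# Invented counts: 1000 pairs, 100 "like", 50 "high", 25 both.
print(enrichment(25, 100, 50, 1000))
```

With these invented counts, a quarter of the "like" pairs are of high similarity against 5% of all pairs, so the ratio is 5. An enrichment of 1 would correspond to independence between sequence identity and ligand similarity.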