A unified statistical framework for sequence comparison and structure comparison
Open Access
- 26 May 1998
- journal article
- research article
- Published in Proceedings of the National Academy of Sciences
- Vol. 95 (11) , 5913-5920
- https://doi.org/10.1073/pnas.95.11.5913
Abstract
We present an approach for assessing the significance of sequence and structure comparisons by using nearly identical statistical formalisms for both sequence and structure. Doing so involves an all-vs.-all comparison of protein domains [taken here from the Structural Classification of Proteins (scop) database] and then fitting a simple distribution function to the observed scores. By using this distribution, we can attach a statistical significance to each comparison score in the form of a P value, the probability that a better score would occur by chance. As expected, we find that the scores for sequence matching follow an extreme-value distribution. Moreover, the agreement between the P values that we derive from this distribution and those reported by standard programs (e.g., blast and fasta) validates our approach. Structure comparison scores also follow an extreme-value distribution when the statistics are expressed in terms of a structural alignment score (essentially the sum of reciprocated distances between aligned atoms minus gap penalties). We find that the traditional metric of structural similarity, the rms deviation in atom positions after fitting aligned atoms, follows a different distribution of scores and does not perform as well as the structural alignment score. Comparison of the sequence and structure statistics for pairs of proteins known to be distantly related shows that structural comparison is able to detect approximately twice as many distant relationships as sequence comparison at the same error rate. The comparison also indicates that there are very few pairs with significant similarity in terms of sequence but not structure, whereas many pairs have significant similarity in terms of structure but not sequence.
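The idea of fitting an extreme-value distribution to background scores and reading off a P value can be illustrated with a minimal sketch. The abstract does not say how the fit was performed, so everything here is an assumption made for illustration: the Gumbel (type I extreme-value) form, the method-of-moments estimator, the synthetic background scores, and the function names `fit_gumbel_moments`, `p_value`, and `sample_gumbel`. Under these assumptions the P value of a score s is P(S ≥ s) = 1 − exp(−exp(−(s − μ)/β)), with location μ and scale β.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(scores):
    """Estimate Gumbel location (mu) and scale (beta) by the method of moments.

    Assumption: a moment fit stands in for whatever fitting procedure
    the paper actually used, which the abstract does not describe.
    """
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    beta = sd * math.sqrt(6.0) / math.pi   # Gumbel variance = (pi * beta)^2 / 6
    mu = mean - EULER_GAMMA * beta         # Gumbel mean = mu + gamma * beta
    return mu, beta


def p_value(score, mu, beta):
    """Probability that a chance score is at least `score` under the fit."""
    z = (score - mu) / beta
    if z < -700.0:                         # exp(-z) would overflow; P is 1 here
        return 1.0
    # 1 - exp(-exp(-z)), written with expm1 to stay accurate for tiny P values
    return -math.expm1(-math.exp(-z))


def sample_gumbel(rng, mu, beta):
    """Draw one Gumbel variate by inverting the CDF (synthetic data only)."""
    u = rng.random()
    while u == 0.0:                        # avoid log(0)
        u = rng.random()
    return mu - beta * math.log(-math.log(u))


# Synthetic "all-vs.-all" background scores; not data from the paper.
rng = random.Random(0)
background = [sample_gumbel(rng, 10.0, 2.0) for _ in range(20000)]

mu_hat, beta_hat = fit_gumbel_moments(background)
for s in (10.0, 20.0, 30.0):
    print(f"score={s:5.1f}  P={p_value(s, mu_hat, beta_hat):.3g}")
```

With real comparison scores, `background` would hold the scores of pairs believed to be unrelated, and the same `p_value` call would then be applied to any new comparison score, whether it comes from sequence or structure alignment.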