Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements
- 15 July 2001
- journal article
- review article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 29 (14) , 2994-3005
- https://doi.org/10.1093/nar/29.14.2994
Abstract
PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 ± 0.005 to 0.895 ± 0.003. This does not include the benefits from four modifications we included in the 'baseline' version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence's amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.
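The abstract does not say which ROC variant was used, so the following is only a minimal sketch of one common normalized form, the truncated ROC_n score: for each of the first n false positives in a ranked hit list, count the true positives ranked above it, then divide the sum by n times the total number of true positives. The function name `roc_n`, its parameters, and the handling of hit lists with fewer than n false positives are assumptions made for illustration, not details taken from the paper.

```python
from typing import Sequence


def roc_n(ranked_labels: Sequence[bool], n: int, total_true: int) -> float:
    """Truncated ROC score in [0, 1] for a hit list ranked best-first.

    ranked_labels: True marks a true positive, False a false positive.
    n:             number of false positives at which the curve is cut off.
    total_true:    number of true positives that exist for the query.
    """
    if n <= 0 or total_true <= 0:
        raise ValueError("n and total_true must be positive")

    true_seen = 0   # true positives ranked above the current position
    false_seen = 0  # false positives encountered so far
    area = 0        # running sum of true_seen at each false positive

    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n:
                break

    # Assumption: if the list holds fewer than n false positives, the
    # missing ones are treated as ranking below every retrieved hit.
    area += (n - false_seen) * true_seen

    return area / (n * total_true)


if __name__ == "__main__":
    hits = [True, True, False, True, False]
    print(roc_n(hits, n=2, total_true=3))
```

Under this definition a ranking that places every true positive ahead of the first false positive scores 1, and a ranking that places n false positives ahead of any true positive scores 0, which matches the 0 (worst) to 1 (best) range described in the abstract.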