Simple is beautiful: a straightforward approach to improve the delineation of true and false positives in PSI-BLAST searches

Open Access

10 April 2008

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 24 (11), 1339-1343
https://doi.org/10.1093/bioinformatics/btn130

Abstract

Motivation: The deluge of biological information from different genomic initiatives and the rapid advancement in biotechnologies have made bioinformatics tools an integral part of modern biology. Among the widely used sequence alignment tools, BLAST and PSI-BLAST are arguably the most popular. PSI-BLAST, which uses an iterative, profile-based search strategy built on a position-specific score matrix (PSSM), is more sensitive than BLAST in detecting weak homologies, which makes it suitable for remote homolog detection. Many refinements have been made to improve PSI-BLAST, and its computational efficiency and high specificity have been much touted. Nevertheless, corruption of its profile via the incorporation of false positive sequences remains a major challenge.

Results: We have developed a simple and elegant approach to resolve the problem of model corruption in PSI-BLAST searches. We hypothesized that combining results from the first (least-corrupted) profile with results from later (most sensitive) iterations of PSI-BLAST provides a better discriminator for true and false hits. Accordingly, we have derived a formula that utilizes the E-values from these two PSI-BLAST iterations to obtain a figure of merit for rank-ordering the hits. Our verification results based on a ‘gold-standard’ test set indicate that this figure of merit does indeed delineate true positives from false positives better than PSI-BLAST E-values. Perhaps what is most notable about this strategy is that it is simple and straightforward to implement.

Contact: bundschuh@mps.ohio-state.edu
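The rank-ordering idea in the Results can be illustrated with a minimal sketch. The abstract does not state the authors' formula, so the combination used here (a weighted mean of log10 E-values, equivalent to a weighted geometric mean) is only an assumed stand-in, not the published figure of merit. The function names, the `weight` parameter, and the `missing_e` default for hits absent from one iteration are likewise hypothetical.

```python
import math


def figure_of_merit(e_first, e_late, weight=0.5, floor=1e-300):
    """Hypothetical combined score (NOT the paper's formula).

    Returns a weighted mean of the log10 E-values from the first and a
    later PSI-BLAST iteration. Lower values mean more significant hits.
    The floor guards against log10(0) for E-values reported as 0.0.
    """
    log_first = math.log10(max(e_first, floor))
    log_late = math.log10(max(e_late, floor))
    return weight * log_first + (1.0 - weight) * log_late


def rank_hits(first_iter, late_iter, weight=0.5, missing_e=10.0):
    """Rank hit IDs by the combined score, most significant first.

    first_iter and late_iter map hit ID -> E-value. A hit missing from
    one iteration is assigned missing_e (an assumed default).
    """
    hit_ids = set(first_iter) | set(late_iter)
    scored = [
        (
            figure_of_merit(
                first_iter.get(hit, missing_e),
                late_iter.get(hit, missing_e),
                weight,
            ),
            hit,
        )
        for hit in hit_ids
    ]
    return [hit for _, hit in sorted(scored)]


# Made-up E-values for illustration only.
first_iter = {"hitA": 1e-20, "hitB": 1e-2}
late_iter = {"hitA": 1e-30, "hitB": 1e-40, "hitC": 1e-35}
ranking = rank_hits(first_iter, late_iter)
```

In this made-up example, ranking on the later iteration alone would place `hitB` and `hitC` ahead of `hitA`, whereas the combined score favors `hitA`, which was already significant under the first, least-corrupted profile.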
