A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation
Open Access
- 30 May 2008
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLoS Computational Biology
- Vol. 4 (5) , e1000069
- https://doi.org/10.1371/journal.pcbi.1000069
Abstract
Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty (“Forward” scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores (“Viterbi” scores) are Gumbel-distributed with constant λ = log 2, and the high scoring tail of Forward scores is exponential with the same constant λ. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments.
Author Summary
Sequence database searches are a fundamental tool of molecular biology, enabling researchers to identify related sequences in other organisms, which often provides invaluable clues to the function and evolutionary history of genes. The power of database searches to detect more and more remote evolutionary relationships – essentially, to look back deeper in time – has improved steadily, with the adoption of more complex and realistic models. However, database searches require not just a realistic scoring model, but also the ability to distinguish good scores from bad ones – the ability to calculate the statistical significance of scores. For many models and scoring schemes, accurate statistical significance calculations have either involved expensive computational simulations, or not been feasible at all.
Here, I introduce a probabilistic model of local sequence alignment that has readily predictable score statistics for position-specific profile scoring systems, and not just for traditional optimal alignment scores, but also for more powerful log-likelihood ratio scores derived in a full probabilistic inference framework. These results remove one of the main obstacles that have impeded the use of more powerful and biologically realistic statistical inference methods in sequence homology searches.
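The conjectured score distributions in the abstract correspond to closed-form tail probabilities, which the Python sketch below illustrates. It is not the author's implementation. It assumes scores are in bits and takes λ = log 2 from the abstract. The function names, the location parameters `mu` and `tau`, and the database size `n_targets` are hypothetical; the abstract does not say how the locations are obtained, so they are treated here as inputs supplied by the caller.

```python
import math

# Conjectured slope for bit scores (natural log of 2), per the abstract.
LAMBDA = math.log(2.0)


def viterbi_pvalue(score, mu, lam=LAMBDA):
    """P(S > score) for a Gumbel distribution with location mu and slope lam.

    mu is an assumed, externally supplied location parameter.
    """
    exponent = -lam * (score - mu)
    if exponent > 700.0:
        # Score far below mu: the probability is 1 and exp() would overflow.
        return 1.0
    # 1 - exp(-exp(exponent)), written with expm1 so tiny P-values keep precision.
    return -math.expm1(-math.exp(exponent))


def forward_pvalue(score, tau, lam=LAMBDA):
    """P(S > score) for an exponential tail with offset tau and slope lam.

    Only meaningful in the high-scoring tail; at or below tau it is capped at 1.
    tau is an assumed, externally supplied offset.
    """
    if score <= tau:
        return 1.0
    return math.exp(-lam * (score - tau))


def evalue(pvalue, n_targets):
    """Expected number of hits at least this good among n_targets comparisons."""
    return n_targets * pvalue
```

With λ = log 2, each additional bit of score halves the tail probability, so a Forward score 10 bits above `tau` has a P-value of about 2^-10. Multiplying by the number of sequences searched gives the E-value.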