The identification of complete domains within protein sequences using accurate E-values for semi-global alignment
Open Access
- 27 June 2007
- journal article
- research article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 35 (14), 4678-4685
- https://doi.org/10.1093/nar/gkm414
Abstract
The sequencing of complete genomes has created a pressing need for automated annotation of gene function. Because domains are the basic units of protein function and evolution, a gene can be annotated from a domain database by aligning domains to the corresponding protein sequence. Ideally, complete domains are aligned to protein subsequences, in a ‘semi-global alignment’. Local alignment, which aligns pieces of domains to subsequences, is common in high-throughput annotation applications, however. It is a mature technique, with the heuristics and accurate E-values required for screening large databases and evaluating the screening results. Hidden Markov models (HMMs) provide an alternative theoretical framework for semi-global alignment, but their use is limited because they lack heuristic acceleration and accurate E-values. Our new tool, GLOBAL, overcomes some limitations of previous semi-global HMMs: it has accurate E-values and the possibility of the heuristic acceleration required for high-throughput applications. Moreover, according to a standard of truth based on protein structure, two semi-global HMM alignment tools (GLOBAL and HMMer) had comparable performance in identifying complete domains, but distinctly outperformed two tools based on local alignment. When searching for complete protein domains, therefore, GLOBAL avoids disadvantages commonly associated with HMMs, yet maintains their superior retrieval performance.
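To make the notion of semi-global alignment concrete, the following is a minimal sketch of a dynamic-programming scorer in which the whole domain must be aligned but the protein may contribute any subsequence. It is not the GLOBAL tool or its HMM: the function name `semi_global_score`, the linear gap penalty, and the match/mismatch values are all illustrative assumptions, and no E-values are computed.

```python
def semi_global_score(domain, protein, match=2, mismatch=-1, gap=-2):
    """Best score for aligning ALL of `domain` to SOME subsequence of `protein`.

    Illustrative scoring only (hypothetical values, linear gaps).
    Returns (best_score, end), where `end` is the 1-based protein position
    at which the best alignment finishes.
    """
    m = len(protein)
    # Row 0: no domain residues used yet, so the alignment may start
    # anywhere in the protein at no cost (free leading overhang).
    prev = [0] * (m + 1)
    for i in range(1, len(domain) + 1):
        # Column 0: i domain residues aligned against nothing, all gaps.
        cur = [prev[0] + gap] + [0] * m
        for j in range(1, m + 1):
            sub = match if domain[i - 1] == protein[j - 1] else mismatch
            cur[j] = max(
                prev[j - 1] + sub,  # align domain[i-1] with protein[j-1]
                prev[j] + gap,      # domain residue against a gap
                cur[j - 1] + gap,   # protein residue against a gap
            )
        prev = cur
    # Last row: the whole domain is consumed; the alignment may end
    # anywhere in the protein (free trailing overhang).
    best = max(prev)
    return best, prev.index(best)


if __name__ == "__main__":
    print(semi_global_score("ACD", "XXACDXX"))
    print(semi_global_score("ACDEF", "XXACDXX"))
```

In the second call the protein contains only the first three residues of the domain. A local alignment would report the `ACD` fragment at full score, whereas this semi-global scorer has to account for the unmatched `EF` and therefore returns a lower score, which is the distinction between partial and complete domain hits that the abstract draws.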