The identification of complete domains within protein sequences using accurate E-values for semi-global alignment

Open Access

27 June 2007

journal article
research article
Published by Oxford University Press (OUP) in Nucleic Acids Research

Vol. 35 (14) , 4678-4685
https://doi.org/10.1093/nar/gkm414

Abstract

The sequencing of complete genomes has created a pressing need for automated annotation of gene function. Because domains are the basic units of protein function and evolution, a gene can be annotated from a domain database by aligning domains to the corresponding protein sequence. Ideally, complete domains are aligned to protein subsequences, in a ‘semi-global alignment’. In high-throughput annotation applications, however, the common practice is local alignment, which aligns pieces of domains to subsequences. Local alignment is a mature technique, with the heuristics and accurate E-values required for screening large databases and evaluating the screening results. Hidden Markov models (HMMs) provide an alternative theoretical framework for semi-global alignment, but their use is limited because they lack heuristic acceleration and accurate E-values. Our new tool, GLOBAL, overcomes some limitations of previous semi-global HMMs: it has accurate E-values and the possibility of the heuristic acceleration required for high-throughput applications. Moreover, according to a standard of truth based on protein structure, two semi-global HMM alignment tools (GLOBAL and HMMer) performed comparably in identifying complete domains, and both distinctly outperformed two tools based on local alignment. When searching for complete protein domains, therefore, GLOBAL avoids disadvantages commonly associated with HMMs, yet maintains their superior retrieval performance.
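
The distinction the abstract draws between semi-global and local alignment can be illustrated with a small dynamic-programming sketch. This is not the GLOBAL tool or its HMM formulation; it is a plain Needleman-Wunsch variant in which the whole domain must be aligned while the unaligned ends of the protein cost nothing. The function name `semi_global_score` and the scoring values (match, mismatch, linear gap penalty) are assumptions chosen only for illustration.

```python
def semi_global_score(domain: str, protein: str,
                      match: int = 2, mismatch: int = -1,
                      gap: int = -2) -> tuple[int, int]:
    """Align the complete domain to some subsequence of the protein.

    Returns (best score, end position in protein, 1-based). Scoring
    parameters are illustrative assumptions, not values from the paper.
    """
    n, m = len(domain), len(protein)
    # Row 0: nothing of the domain is aligned yet, so the alignment may
    # start at any protein position for free.
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        # Column 0: domain residues aligned against nothing are gaps.
        curr = [prev[0] + gap] + [0] * m
        for j in range(1, m + 1):
            s = match if domain[i - 1] == protein[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s,   # align the two residues
                          prev[j] + gap,     # domain residue vs. gap
                          curr[j - 1] + gap) # protein residue vs. gap
        prev = curr
    # Last row: the whole domain is consumed; the alignment may end at
    # any protein position for free.
    best_end = max(range(m + 1), key=lambda j: prev[j])
    return prev[best_end], best_end
```

With these assumed scores, aligning the domain `"ACGT"` to the protein `"ACG"` incurs a gap penalty for the missing final residue, whereas a local alignment would report only the matching piece `ACG` without penalty. That penalty is what lets a semi-global score favor complete domains over fragments.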
