Homology-extended sequence alignment

Open Access

18 February 2005

journal article
research article
Published by Oxford University Press (OUP) in Nucleic Acids Research

Vol. 33 (3), 816-824
https://doi.org/10.1093/nar/gki233

Abstract

We present a profile–profile multiple alignment strategy that uses database searching to collect homologues for each sequence in a given set, in order to enrich their available evolutionary information for the alignment. For each of the alignment sequences, the putative homologous sequences that score above a pre-defined threshold are incorporated into a position-specific pre-alignment profile. The enriched position-specific profile is used for standard progressive alignment, thereby more accurately describing the characteristic features of the given sequence set. We show that owing to the incorporation of the pre-alignment information into a standard progressive multiple alignment routine, the alignment quality between distant sequences increases significantly and outperforms state-of-the-art methods, such as T-COFFEE and MUSCLE. We also show that although entirely sequence-based, our novel strategy is better at aligning distant sequences when compared with a recent contact-based alignment method. Therefore, our pre-alignment profile strategy should be advantageous for applications that rely on high alignment accuracy such as local structure prediction, comparative modelling and threading.
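The pre-alignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_prealignment_profile`, the `score_threshold` parameter, and the input format (hits already projected onto the query, with `-` where the hit has no residue) are assumptions made for illustration only. The sketch keeps the hits that score at or above the threshold and turns them, together with the query, into per-position residue frequencies.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_prealignment_profile(query, hits, score_threshold=50.0):
    """Return per-position residue frequencies for `query`.

    Hypothetical input format: `hits` is a list of (aligned_sequence, score)
    pairs, where each aligned sequence has the same length as the query and
    uses '-' at positions where the hit has no residue.
    """
    rows = [query]
    for aligned, score in hits:
        if len(aligned) != len(query):
            raise ValueError("each hit must be projected onto the query")
        # Keep only putative homologues that reach the threshold.
        if score >= score_threshold:
            rows.append(aligned)

    profile = []
    for column in zip(*rows):
        # Count standard residues only; gaps and unknown symbols are skipped.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values())
        if total == 0:
            profile.append({})
        else:
            profile.append({aa: n / total for aa, n in counts.items()})
    return profile


example_profile = build_prealignment_profile(
    "ACD",
    [("ACE", 120.0), ("A-D", 75.0), ("WWW", 10.0)],
    score_threshold=50.0,
)
```

In this toy call the third hit falls below the threshold and is ignored, so the last column mixes `D` and `E` from the query and the two accepted hits. A progressive aligner would then compare such profiles column by column in place of the single sequences; the scoring scheme and sequence weighting used in the paper are not given in the abstract and are not reproduced here.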

This publication has 48 references indexed in Scilit:

MUSCLE: multiple sequence alignment with high accuracy and high throughput
Nucleic Acids Research, 2004
A comparison of scoring functions for protein sequence profile alignment
Bioinformatics, 2004
Within the twilight zone: a sensitive profile-profile comparison tool based on information theory
Journal of Molecular Biology, 2002
T-coffee: a novel method for fast and accurate multiple sequence alignment
Journal of Molecular Biology, 2000
Comparison of sequence profiles. Strategies for structural predictions using sequence information
Protein Science, 2000
Dynamic sequence databank searching with templates and multiple alignment
Journal of Molecular Biology, 1998
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
Nucleic Acids Research, 1997
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice
Nucleic Acids Research, 1994
The rapid generation of mutation data matrices from protein sequences
Bioinformatics, 1992
Dictionary of protein secondary structure: Pattern recognition of hydrogen‐bonded and geometrical features
Biopolymers, 1983