An Eulerian Path Approach to Global Multiple Alignment for DNA Sequences
- 1 December 2003
- journal article
- research article
- Published by Mary Ann Liebert Inc in Journal of Computational Biology
- Vol. 10 (6) , 803-819
- https://doi.org/10.1089/106652703322756096
Abstract
With the rapid increase in the dataset of genome sequences, the multiple sequence alignment problem is increasingly important and frequently involves the alignment of a large number of sequences. Many heuristic algorithms have been proposed to improve the speed of computation and the quality of alignment. We introduce a novel approach that is fundamentally different from all currently available methods. Our motivation comes from the Eulerian method for fragment assembly in DNA sequencing that transforms all DNA fragments into a de Bruijn graph and then reduces sequence assembly to a Eulerian path problem. The paper focuses on global multiple alignment of DNA sequences, where entire sequences are aligned into one configuration. Our main result is an algorithm with almost linear computational speed with respect to the total size (number of letters) of sequences to be aligned. Five hundred simulated sequences (averaging 500 bases per sequence and as low as 70% pairwise identity) have been aligned within three minutes on a personal computer, and the quality of alignment is satisfactory. As a result, accurate and simultaneous alignment of thousands of long sequences within a reasonable amount of time becomes possible. Data from an Arabidopsis sequencing project is used to demonstrate the performance.Keywords
This publication has 31 references indexed in Scilit:
- Multiple sequence alignment: Algorithms and applicationsAdvances in Biophysics, 1999
- Base-Calling of Automated Sequencer Traces Using Phred. II. Error ProbabilitiesGenome Research, 1998
- Significant Improvement in Accuracy of Multiple Protein Sequence Alignments by Iterative Refinement as Assessed by Reference to Structural AlignmentsJournal of Molecular Biology, 1996
- A New Algorithm for DNA Sequence AssemblyJournal of Computational Biology, 1995
- Weights for data related by a treeJournal of Molecular Biology, 1989
- A tool for multiple sequence alignment.Proceedings of the National Academy of Sciences, 1989
- Tutorial on large deviations for the binomial distributionBulletin of Mathematical Biology, 1989
- A strategy for the rapid multiple alignment of protein sequencesJournal of Molecular Biology, 1987
- Progressive sequence alignment as a prerequisitetto correct phylogenetic treesJournal of Molecular Evolution, 1987
- Evaluation and improvements in the automatic alignment of protein sequencesProtein Engineering, Design and Selection, 1987