An Eulerian path approach to local multiple alignment for DNA sequences
- 24 January 2005
- journal article
- Published by Proceedings of the National Academy of Sciences in Proceedings of the National Academy of Sciences
- Vol. 102 (5) , 1285-1290
- https://doi.org/10.1073/pnas.0409240102
Abstract
Expensive computation in handling a large number of sequences limits the application of local multiple sequence alignment. We present an Eulerian path approach to local multiple alignment for DNA sequences. The computational time and memory usage of this approach is approximately linear to the total size of sequences analyzed; hence, it can handle thousands of sequences or millions of letters simultaneously. By constructing a De Bruijn graph, most of the conserved segments are amplified as heavy Eulerian paths in the graph, and the original patterns distributed in sequences are recovered even if they do not exist in any single sequence. This approach can accurately detect unknown conserved regions, for both short and long, conserved and degenerate patterns. We further present a Poisson heuristic to estimate the significance of a local multiple alignment. The performance of our method is demonstrated by finding Alu repeats in the human genome. We compare the results with Alus marked byrepeatmasker, where the two programs are in good agreement. Our method is robust under various conditions and superior to other methods in terms of efficiency and accuracy.Keywords
This publication has 29 references indexed in Scilit:
- Identification of common molecular subsequencesPublished by Elsevier ,2004
- Aligning Multiple Genomic Sequences With the Threaded Blockset AlignerGenome Research, 2004
- An Eulerian Path Approach to Global Multiple Alignment for DNA SequencesJournal of Computational Biology, 2003
- LAGAN and Multi-LAGAN: Efficient Tools for Large-Scale Multiple Alignment of Genomic DNAGenome Research, 2003
- A New Algorithm for DNA Sequence AssemblyJournal of Computational Biology, 1995
- Use of the insect/baculovirus expression system to produce lipid-modified ras proteins for electrospray mass spectrometric studiesMethods, 1990
- Basic Local Alignment Search ToolJournal of Molecular Biology, 1990
- Basic local alignment search toolJournal of Molecular Biology, 1990
- A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisonsJournal of Molecular Biology, 1987
- An Extreme Value Theory for Sequence MatchingThe Annals of Statistics, 1986