Error Tolerant Indexing and Alignment of Short Reads with Covering Template Families
- 1 October 2010
- journal article
- research article
- Published by Mary Ann Liebert Inc in Journal of Computational Biology
- Vol. 17 (10) , 1397-1411
- https://doi.org/10.1089/cmb.2010.0005
Abstract
The rapid adoption of high-throughput next generation sequence data in biological research is presenting a major challenge for sequence alignment tools—specifically, the efficient alignment of vast amounts of short reads to large references in the presence of differences arising from sequencing errors and biological sequence variations. To address this challenge, we developed a short read aligner for high-throughput sequencer data that is tolerant of errors or mutations of all types—namely, substitutions, deletions, and insertions. The aligner utilizes a multi-stage approach in which template-based indexing is used to identify candidate regions for alignment with dynamic programming. A template is a pair of gapped seeds, with one used with the read and one used with the reference. In this article, we focus on the development of template families that yield error-tolerant indexing up to a given error-budget. A general algorithm for finding those families is presented, and a recursive construction that creates families with higher error tolerance from ones with a lower error tolerance is developed.Keywords
This publication has 24 references indexed in Scilit:
- Quantification of the yeast transcriptome by single-molecule sequencingNature Biotechnology, 2009
- The diploid genome sequence of an Asian individualNature, 2008
- Mapping short DNA sequencing reads and calling variants using mapping quality scoresGenome Research, 2008
- Single-Molecule DNA Sequencing of a Viral GenomeScience, 2008
- Genome-Wide Mapping of in Vivo Protein-DNA InteractionsScience, 2007
- Multiseed Lossless FiltrationIEEE/ACM Transactions on Computational Biology and Bioinformatics, 2005
- Sensitivity analysis and efficient method for identifying optimal spaced seedsJournal of Computer and System Sciences, 2004
- A fast bit-vector algorithm for approximate string matching based on dynamic programmingJournal of the ACM, 1999
- Gapped BLAST and PSI-BLAST: a new generation of protein database search programsNucleic Acids Research, 1997
- Multiple filtration and approximate pattern matchingAlgorithmica, 1995