Generalized suffix trees for biological sequence data: applications and implementation
- 1 January 1994
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- Vol. 183, 35-44
- https://doi.org/10.1109/hicss.1994.323593
Abstract
This paper addresses applications of suffix trees and generalized suffix trees (GSTs) to biological sequence data analysis. We define a basic set of suffix trees and GST operations needed to support sequence data analysis. While those definitions are straightforward, the construction and manipulation of disk-based GST structures for large volumes of sequence data requires intricate design. GST processing is fast because the structure is content addressable, supporting efficient searches for all sequences that contain particular subsequences. Instead of laboriously searching sequences stored as arrays, we search by walking down the tree. We present a new GST-based sequence alignment algorithm, called GESTALT. GESTALT finds all exact matches in parallel, and uses best-first search to extend them to produce alignments. Our implementation experiences with applications using GST structures for sequence analysis lead us to conclude that GSTs are valuable tools for analyzing biological sequence data.Keywords
This publication has 25 references indexed in Scilit:
- Optimal sequence alignment allowing for long gapsBulletin of Mathematical Biology, 1990
- Consistency of optimal sequence alignmentsBulletin of Mathematical Biology, 1990
- Fast parallel and serial approximate string matchingJournal of Algorithms, 1989
- Estimating the information content of symbol sequences and efficient codesIEEE Transactions on Information Theory, 1989
- The alignment of protein structures in three dimensionsBulletin of Mathematical Biology, 1989
- Locating alignments with k differences for nucleotide and amino acid sequencesBioinformatics, 1988
- Finding approximate patterns in stringsJournal of Algorithms, 1985
- Efficient sequence alignment algorithmsJournal of Theoretical Biology, 1984
- An efficient method for finding repeats in molecular sequencesNucleic Acids Research, 1983
- Some biological sequence metricsAdvances in Mathematics, 1976