Generalized suffix trees for biological sequence data: applications and implementation

1 January 1994

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

Vol. 183, 35-44
https://doi.org/10.1109/hicss.1994.323593

Abstract

This paper addresses applications of suffix trees and generalized suffix trees (GSTs) to biological sequence data analysis. We define a basic set of suffix trees and GST operations needed to support sequence data analysis. While those definitions are straightforward, the construction and manipulation of disk-based GST structures for large volumes of sequence data requires intricate design. GST processing is fast because the structure is content addressable, supporting efficient searches for all sequences that contain particular subsequences. Instead of laboriously searching sequences stored as arrays, we search by walking down the tree. We present a new GST-based sequence alignment algorithm, called GESTALT. GESTALT finds all exact matches in parallel, and uses best-first search to extend them to produce alignments. Our implementation experiences with applications using GST structures for sequence analysis lead us to conclude that GSTs are valuable tools for analyzing biological sequence data.

Keywords

This publication has 25 references indexed in Scilit:

Optimal sequence alignment allowing for long gaps
Bulletin of Mathematical Biology, 1990
Consistency of optimal sequence alignments
Bulletin of Mathematical Biology, 1990
Fast parallel and serial approximate string matching
Journal of Algorithms, 1989
Estimating the information content of symbol sequences and efficient codes
IEEE Transactions on Information Theory, 1989
The alignment of protein structures in three dimensions
Bulletin of Mathematical Biology, 1989
Locating alignments with k differences for nucleotide and amino acid sequences
Bioinformatics, 1988
Finding approximate patterns in strings
Journal of Algorithms, 1985
Efficient sequence alignment algorithms
Journal of Theoretical Biology, 1984
An efficient method for finding repeats in molecular sequences
Nucleic Acids Research, 1983
Some biological sequence metrics
Advances in Mathematics, 1976