A Subquadratic Sequence Alignment Algorithm for Unrestricted Scoring Matrices
- 1 January 2003
- journal article
- Published by Society for Industrial & Applied Mathematics (SIAM) in SIAM Journal on Computing
- Vol. 32 (6) , 1654-1673
- https://doi.org/10.1137/s0097539702402007
Abstract
International audienceGiven two strings of size n over a constant alphabet, the classical algorithm for computing the similarity between two sequences [D. Sankoff and J. B. Kruskal, eds., Time Warps, String Edits, and Macromolecules; Addison-Wesley, Reading, MA, 1983; T. F. Smith and M. S. Waterman, J. Molec. Biol., 147 (1981), pp. 195-197] uses a dynamic programming matrix and compares the two strings in O(n²) time. We address the challenge of computing the similarity of two strings in subquadratic time for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global similarity computations. The speed-up is achieved by dividing the dynamic programming matrix into variable sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an O(n² / log n) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(h n² / log n), where h≤1 is the entropy of the text. We also present an algorithm for comparing two run-length encoded strings of length m and n, compressed into m' and n' runs, respectively, in O(m'n+n'm) complexity. This result extends to all distance or similarity scoring schemes that use an additive gap penaltyKeywords
This publication has 32 references indexed in Scilit:
- Matching for Run-Length Encoded StringsJournal of Complexity, 1999
- Let Sleeping Files Lie: Pattern Matching in Z-Compressed FilesJournal of Computer and System Sciences, 1996
- A space efficient algorithm for finding the best nonoverlapping alignment scoreTheoretical Computer Science, 1995
- An improved algorithm for computing the edit distance of run-length coded stringsInformation Processing Letters, 1995
- Recent Developments in Linear-Space Alignment Methods: A SurveyJournal of Computational Biology, 1994
- Efficient Parallel Algorithms for String Editing and Related ProblemsSIAM Journal on Computing, 1990
- Sequence comparison with mixed convex and concave costsJournal of Algorithms, 1990
- Geometric applications of a matrix-searching algorithmAlgorithmica, 1987
- Transducers and repetitionsTheoretical Computer Science, 1986
- Universal codeword sets and representations of the integersIEEE Transactions on Information Theory, 1975