A Subquadratic Sequence Alignment Algorithm for Unrestricted Scoring Matrices

1 January 2003

journal article
Published by Society for Industrial & Applied Mathematics (SIAM) in SIAM Journal on Computing

Vol. 32 (6) , 1654-1673
https://doi.org/10.1137/s0097539702402007

Abstract

International audienceGiven two strings of size n over a constant alphabet, the classical algorithm for computing the similarity between two sequences [D. Sankoff and J. B. Kruskal, eds., Time Warps, String Edits, and Macromolecules; Addison-Wesley, Reading, MA, 1983; T. F. Smith and M. S. Waterman, J. Molec. Biol., 147 (1981), pp. 195-197] uses a dynamic programming matrix and compares the two strings in O(n²) time. We address the challenge of computing the similarity of two strings in subquadratic time for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global similarity computations. The speed-up is achieved by dividing the dynamic programming matrix into variable sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an O(n² / log n) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(h n² / log n), where h≤1 is the entropy of the text. We also present an algorithm for comparing two run-length encoded strings of length m and n, compressed into m' and n' runs, respectively, in O(m'n+n'm) complexity. This result extends to all distance or similarity scoring schemes that use an additive gap penalty

Keywords

This publication has 32 references indexed in Scilit:

Matching for Run-Length Encoded Strings
Journal of Complexity, 1999
Let Sleeping Files Lie: Pattern Matching in Z-Compressed Files
Journal of Computer and System Sciences, 1996
A space efficient algorithm for finding the best nonoverlapping alignment score
Theoretical Computer Science, 1995
An improved algorithm for computing the edit distance of run-length coded strings
Information Processing Letters, 1995
Recent Developments in Linear-Space Alignment Methods: A Survey
Journal of Computational Biology, 1994
Efficient Parallel Algorithms for String Editing and Related Problems
SIAM Journal on Computing, 1990
Sequence comparison with mixed convex and concave costs
Journal of Algorithms, 1990
Geometric applications of a matrix-searching algorithm
Algorithmica, 1987
Transducers and repetitions
Theoretical Computer Science, 1986
Universal codeword sets and representations of the integers
IEEE Transactions on Information Theory, 1975