Linear approximation of shortest superstrings

1 July 1994

journal article
Published by Association for Computing Machinery (ACM) in Journal of the ACM

Vol. 41 (4) , 630-647
https://doi.org/10.1145/179812.179818

Abstract

We consider the following problem: given a collection of strings s ₁ ,…, s _m , find the shortest string s such that each s _i appears as a substring (a consecutive block) of s . Although this problem is known to be NP-hard, a simple greedy procedure appears to do quite well and is routinely used in DNA sequencing and data compression practice, namely: repeatedly merge the pair of (distinct) strings with maximum overlap until only one string remains. Let n denote the length of the optimal superstring. A common conjecture states that the above greedy procedure produces a superstring of length O(n) (in fact, 2 n ), yet the only previous nontrivial bound known for any polynomial-time algorithm is a recent O(n log n ) result. We show that the greedy algorithm does in fact achieve a constant factor approximation, proving an upper bound of 4 n . Furthermore, we present a simple modified version of the greedy algorithm that we show produces a superstring of length at most 3 n . We also show the superstring problem to be MAXSNP-hard, which implies that a polynomial-time approximation scheme for this problem is unlikely.

Keywords

This publication has 6 references indexed in Scilit:

The Traveling Salesman Problem with Distances One and Two
Mathematics of Operations Research, 1993
Approximation algorithms for the shortest common superstring problem
Information and Computation, 1989
A greedy approximation algorithm for constructing shortest common superstrings
Theoretical Computer Science, 1988
A theory of the learnable
Communications of the ACM, 1984
On finding minimal length superstrings
Journal of Computer and System Sciences, 1980
Uniqueness theorems for periodic functions
Proceedings of the American Mathematical Society, 1965