Reconstructing Strings from Substrings

1 January 1995

journal article
research article
Published by Mary Ann Liebert Inc in Journal of Computational Biology

Vol. 2 (2) , 333-353
https://doi.org/10.1089/cmb.1995.2.333

Abstract

We consider an interactive approach to DNA sequencing by hybridization, where we are permitted to ask questions of the form "is s a substring of the unknown sequence S?", where s is a specific query string. We are not told where s occurs in S, nor how many times it occurs, just whether or not s a substring of S. Our goal is to determine the exact contents of S using as few queries as possible. Through interaction, far fewer queries are necessary than using conventional fixed sequencing by hybridization (SBH) sequencing chips. We provide tight bounds on the complexity of reconstructing unknown strings from substring queries. Our lower bound, which holds even for a stronger model that returns the number of occurrences of s as a substring of S, relies on interesting arguments based on de Bruijn sequences. We also demonstrate that subsequence queries are significantly more powerful than substring queries, matching the information theoretic lower bound. Finally, in certain applications, something may already be known about the unknown string, and hence it can be determined faster than an arbitrary string. We show that building an optimal decision tree is NP-complete, then give an approximation algorithm that gives trees within a constant multiplicative factor of optimal.

Keywords

This publication has 18 references indexed in Scilit:

Decision trees for geometric models
Published by Association for Computing Machinery (ACM) ,1993
DNA Sequencing by Primer Walking with Strings of Contiguous Hexamers
Science, 1992
Light-Directed, Spatially Addressable Parallel Chemical Synthesis
Science, 1991
A novel method for nucleic acid sequence determination
Journal of Theoretical Biology, 1988
Algorithms for finding K-best perfect matchings
Discrete Applied Mathematics, 1987
String overlaps, pattern matching, and nontransitive games
Journal of Combinatorial Theory, Series A, 1981
On finding minimal length superstrings
Journal of Computer and System Sciences, 1980
Constructing optimal binary decision trees is NP-complete
Information Processing Letters, 1976
Matching, Euler tours and the Chinese postman
Mathematical Programming, 1973
Normal Recurring Decimals
Journal of the London Mathematical Society, 1946