Reconstructing Strings from Substrings
- 1 January 1995
- journal article
- research article
- Published by Mary Ann Liebert Inc in Journal of Computational Biology
- Vol. 2 (2) , 333-353
- https://doi.org/10.1089/cmb.1995.2.333
Abstract
We consider an interactive approach to DNA sequencing by hybridization, where we are permitted to ask questions of the form "is s a substring of the unknown sequence S?", where s is a specific query string. We are not told where s occurs in S, nor how many times it occurs, just whether or not s a substring of S. Our goal is to determine the exact contents of S using as few queries as possible. Through interaction, far fewer queries are necessary than using conventional fixed sequencing by hybridization (SBH) sequencing chips. We provide tight bounds on the complexity of reconstructing unknown strings from substring queries. Our lower bound, which holds even for a stronger model that returns the number of occurrences of s as a substring of S, relies on interesting arguments based on de Bruijn sequences. We also demonstrate that subsequence queries are significantly more powerful than substring queries, matching the information theoretic lower bound. Finally, in certain applications, something may already be known about the unknown string, and hence it can be determined faster than an arbitrary string. We show that building an optimal decision tree is NP-complete, then give an approximation algorithm that gives trees within a constant multiplicative factor of optimal.Keywords
This publication has 18 references indexed in Scilit:
- Decision trees for geometric modelsPublished by Association for Computing Machinery (ACM) ,1993
- DNA Sequencing by Primer Walking with Strings of Contiguous HexamersScience, 1992
- Light-Directed, Spatially Addressable Parallel Chemical SynthesisScience, 1991
- A novel method for nucleic acid sequence determinationJournal of Theoretical Biology, 1988
- Algorithms for finding K-best perfect matchingsDiscrete Applied Mathematics, 1987
- String overlaps, pattern matching, and nontransitive gamesJournal of Combinatorial Theory, Series A, 1981
- On finding minimal length superstringsJournal of Computer and System Sciences, 1980
- Constructing optimal binary decision trees is NP-completeInformation Processing Letters, 1976
- Matching, Euler tours and the Chinese postmanMathematical Programming, 1973
- Normal Recurring DecimalsJournal of the London Mathematical Society, 1946