Reconstructing Strings from Substrings

Abstract
We consider an interactive approach to DNA sequencing by hybridization, where we are permitted to ask questions of the form "is s a substring of the unknown sequence S?", where s is a specific query string. We are not told where s occurs in S, nor how many times it occurs, just whether or not s a substring of S. Our goal is to determine the exact contents of S using as few queries as possible. Through interaction, far fewer queries are necessary than using conventional fixed sequencing by hybridization (SBH) sequencing chips. We provide tight bounds on the complexity of reconstructing unknown strings from substring queries. Our lower bound, which holds even for a stronger model that returns the number of occurrences of s as a substring of S, relies on interesting arguments based on de Bruijn sequences. We also demonstrate that subsequence queries are significantly more powerful than substring queries, matching the information theoretic lower bound. Finally, in certain applications, something may already be known about the unknown string, and hence it can be determined faster than an arbitrary string. We show that building an optimal decision tree is NP-complete, then give an approximation algorithm that gives trees within a constant multiplicative factor of optimal.

This publication has 18 references indexed in Scilit: