Biological Evaluation of d2, an Algorithm for High-Performance Sequence Comparison

Abstract
A number of algorithms exist for searching sequence databases for biologically significant similarities based on the primary sequence similarity of aligned sequences. We have determined the biological sensitivity and selectivity of d2, a high-performance comparison algorithm that rapidly determines the relative dissimilarity of large datasets of genetic sequences. d2 uses sequence-word multiplicity as a simple measure of dissimilarity. It is not constrained by the comparison of direct sequence alignments and so can use word contexts to yield new information on relationships. It is extremely efficient, comparing a query of length 884 bases (INS1ECLAC) with 19,540,603 bases of the bacterial division of GenBank (release 76.0) in 51.77 CPU seconds on a Cray Y/MP-48 supercomputer. It is unique in that subsequences (words) of biological interest can be weighted to improve the sensitivity and selectivity of a search over existing methods. We have determined the ability of d2 to detect biologically significant matches between a query and large datasets of DNA sequences while varying parameters such as word-length and window size. We have also determined the distribution of dissimilarity scores within eukaryotic and prokaryotic divisions of GenBank. We have optimized parameters of the d2 program using Cray hardware and present an analysis of the sensitivity and selectivity of the algorithm. A theoretical analysis of the expectation for scores is presented. This work demonstrates that d2 is a unique, sensitive, and selective method of rapid sequence comparison that can detect novel sequence relationships which remain undetected by alternate methodologies.