Biological Evaluation of d², an Algorithm for High-Performance Sequence Comparison

1 January 1994

journal article
research article
Published by Mary Ann Liebert Inc in Journal of Computational Biology

Vol. 1 (3) , 199-215
https://doi.org/10.1089/cmb.1994.1.199

Abstract

A number of algorithms exist for searching sequence databases for biologically significant similarities based on the primary sequence similarity of aligned sequences. We have determined the biological sensitivity and selectivity of d², a high-performance comparison algorithm that rapidly determines the relative dissimilarity of large datasets of genetic sequences. d² uses sequence-word multiplicity as a simple measure of dissimilarity. It is not constrained by the comparison of direct sequence alignments and so can use word contexts to yield new information on relationships. It is extremely efficient, comparing a query of length 884 bases (INS1ECLAC) with 19,540,603 bases of the bacterial division of GenBank (release 76.0) in 51.77 CPU seconds on a Cray Y/MP-48 supercomputer. It is unique in that subsequences (words) of biological interest can be weighted to improve the sensitivity and selectivity of a search over existing methods. We have determined the ability of d² to detect biologically significant matches between a query and large datasets of DNA sequences while varying parameters such as word-length and window size. We have also determined the distribution of dissimilarity scores within eukaryotic and prokaryotic divisions of GenBank. We have optimized parameters of the d² program using Cray hardware and present an analysis of the sensitivity and selectivity of the algorithm. A theoretical analysis of the expectation for scores is presented. This work demonstrates that d² is a unique, sensitive, and selective method of rapid sequence comparison that can detect novel sequence relationships which remain undetected by alternate methodologies.
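
The abstract describes the measure only in words, so the following is a minimal Python sketch of what a word-multiplicity dissimilarity might look like, not the authors' Cray implementation. It assumes the score is a weighted sum of squared differences in word counts between two sequences. The function names (`word_counts`, `d2`, `min_d2_over_windows`), the default weight of 1.0, and the naive way the window is slid along the target only are all illustrative assumptions; the abstract does not specify them.

```python
from collections import Counter


def word_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(a, b, k, weights=None):
    """Weighted sum of squared word-count differences between a and b.

    `weights` is a hypothetical dict mapping a word to its weight;
    words absent from it get weight 1.0 (an assumption of this sketch).
    """
    ca, cb = word_counts(a, k), word_counts(b, k)
    total = 0.0
    # Visit every word seen in either sequence; Counter returns 0 for misses.
    for w in ca.keys() | cb.keys():
        weight = 1.0 if weights is None else weights.get(w, 1.0)
        total += weight * (ca[w] - cb[w]) ** 2
    return total


def min_d2_over_windows(query, target, k, window):
    """Smallest d2 between query and any length-`window` slice of target.

    A naive scan that recomputes counts for each window; a fast
    implementation would update the counts incrementally instead.
    """
    if len(target) <= window:
        return d2(query, target, k)
    return min(
        d2(query, target[i:i + window], k)
        for i in range(len(target) - window + 1)
    )
```

Under these assumptions, `d2("ACGT", "ACGT", 2)` is 0 because the word counts are identical, and `d2("ACGT", "ACGA", 2)` is 2 because the sequences differ in exactly two words (`GT` and `GA`), each counted once. No alignment is computed at any point, which is the property the abstract emphasizes.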

