Biological Evaluation of d2, an Algorithm for High-Performance Sequence Comparison
- 1 January 1994
- journal article
- research article
- Published by Mary Ann Liebert Inc in Journal of Computational Biology
- Vol. 1 (3) , 199-215
- https://doi.org/10.1089/cmb.1994.1.199
Abstract
A number of algorithms exist for searching sequence databases for biologically significant similarities based on the primary sequence similarity of aligned sequences. We have determined the biological sensitivity and selectivity of d2, a high-performance comparison algorithm that rapidly determines the relative dissimilarity of large datasets of genetic sequences. d2 uses sequence-word multiplicity as a simple measure of dissimilarity. It is not constrained by the comparison of direct sequence alignments and so can use word contexts to yield new information on relationships. It is extremely efficient, comparing a query of length 884 bases (INS1ECLAC) with 19,540,603 bases of the bacterial division of GenBank (release 76.0) in 51.77 CPU seconds on a Cray Y/MP-48 supercomputer. It is unique in that subsequences (words) of biological interest can be weighted to improve the sensitivity and selectivity of a search over existing methods. We have determined the ability of d2 to detect biologically significant matches between a query and large datasets of DNA sequences while varying parameters such as word-length and window size. We have also determined the distribution of dissimilarity scores within eukaryotic and prokaryotic divisions of GenBank. We have optimized parameters of the d2 program using Cray hardware and present an analysis of the sensitivity and selectivity of the algorithm. A theoretical analysis of the expectation for scores is presented. This work demonstrates that d2 is a unique, sensitive, and selective method of rapid sequence comparison that can detect novel sequence relationships which remain undetected by alternate methodologies.
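The abstract describes d2 only in outline: word multiplicities as the measure of dissimilarity, optional weights on words of interest, and word-length and window size as tunable parameters. The Python sketch below illustrates one common reading of that idea, namely a weighted sum of squared differences between the word counts of two sequences, minimized over windows of the target. It is an assumption-laden illustration, not the authors' Cray implementation: the function names (word_counts, d2_score, best_window_score), the default weight of 1.0, and the exhaustive window scan are all hypothetical choices made here for clarity.

```python
from collections import Counter
from typing import Dict, Optional


def word_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_score(a: str, b: str, k: int = 3,
             weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted sum of squared word-count differences between a and b.

    Assumption: words absent from `weights` get weight 1.0.
    A score of 0 means the two sequences have identical word counts.
    """
    counts_a, counts_b = word_counts(a, k), word_counts(b, k)
    weights = weights or {}
    words = counts_a.keys() | counts_b.keys()
    # Counter returns 0 for words it has not seen.
    return sum(weights.get(w, 1.0) * (counts_a[w] - counts_b[w]) ** 2
               for w in words)


def best_window_score(query: str, target: str, k: int = 3,
                      window: Optional[int] = None,
                      weights: Optional[Dict[str, float]] = None) -> float:
    """Lowest d2_score between the query and any window of the target.

    Assumption: the window defaults to the query length. Every window
    is rescored from scratch, which is simple but slow.
    """
    window = window or len(query)
    if len(target) <= window:
        return d2_score(query, target, k, weights)
    return min(d2_score(query, target[i:i + window], k, weights)
               for i in range(len(target) - window + 1))


if __name__ == "__main__":
    print(best_window_score("ACGT", "TTTTACGTTTTT", k=2))
    print(d2_score("AAAA", "CCCC", k=2, weights={"AA": 2.0}))
```

Because the score depends only on word counts, no alignment is computed. Rescoring each window independently keeps the sketch short; a practical implementation would presumably update the counts incrementally as the window slides.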