A fast word search algorithm for the representation of sequence similarity in genomic DNA

Open Access

1 January 1994

journal article
Published by Oxford University Press (OUP) in Nucleic Acids Research

Vol. 22 (3) , 404-411
https://doi.org/10.1093/nar/22.3.404

Abstract

Representation of sequence similarity by dot matrix plots is a method widely used for comparing biological sequences. The user is presented with an overall view of similarity between two sequences. Computation of this plot has been reconsidered here. An improvement is proposed through the preprocessing of the data into an automaton recognizing the word structure of a sequence. The main advantage of this approach is to systematically eliminate the repetitions during word comparison. Simple heuristics are also considered to greatly speed up pattern matching. As a result, large sequences are handled very efficiently. This is illustrated by a comparison of large genomic DNA. The algorithm has been implemented in an interactive application on a microcomputer.
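
To make the idea of word-based dot plot computation concrete, here is a minimal Python sketch. It is not the automaton described in the abstract: it swaps in a plain dictionary index of fixed-length words, which is only a rough stand-in for preprocessing one sequence so that each repeated word is stored once. The function names `word_index` and `dot_plot_matches` and the word length `k` are assumptions made for illustration.

```python
from collections import defaultdict


def word_index(seq, k):
    """Map each length-k word of seq to the list of positions where it starts.

    Repeated words share a single dictionary key, so a repeated word
    is looked up once per query rather than compared position by position.
    """
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def dot_plot_matches(seq_a, seq_b, k):
    """Return sorted (i, j) pairs such that seq_a[i:i+k] == seq_b[j:j+k].

    Each pair corresponds to one dot in a word-based dot matrix plot.
    """
    index = word_index(seq_a, k)  # preprocessing step on the first sequence
    matches = []
    for j in range(len(seq_b) - k + 1):
        # One dictionary lookup yields every occurrence of this word in seq_a.
        for i in index.get(seq_b[j:j + k], ()):
            matches.append((i, j))
    matches.sort()
    return matches


if __name__ == "__main__":
    for i, j in dot_plot_matches("ACGTACGT", "TACG", 3):
        print(i, j)
```

In this sketch the first sequence is indexed once, and the second is scanned in a single pass, so the work grows with the sequence lengths plus the number of dots reported instead of with the full matrix size. An automaton over the words of a sequence, as proposed in the article, would play the role that the dictionary plays here.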
