An accurate approximation to the distribution of the length of the longest matching word between two random DNA sequences

1 November 1990

journal article
conference paper
Published by Springer Nature in Bulletin of Mathematical Biology

Vol. 52 (6), 773-784
https://doi.org/10.1007/bf02460808

Abstract

An accurate approximation is derived to the distribution of the length of the longest matching word present between two random DNA sequences of finite length, using only elementary probability arguments. The distribution is shown to be consistent with previous asymptotic results for the mean and variance of longest common words. The application of the distribution to assessing the statistical significance of sequence similarities is considered. It is shown how the distribution can be modified to take account of non-independence of neighbouring bases in real sequences.
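
The quantity studied in the abstract can be illustrated with a small simulation. The sketch below is not the approximation derived in the paper. It assumes independent, equiprobable bases (match probability p = 0.25) and compares the simulated longest common word with the textbook Poisson heuristic P(L < k) ≈ exp(-n m (1 - p) p^k), which ignores edge effects. The function names `longest_common_word`, `poisson_cdf_heuristic`, and `empirical_cdf` are hypothetical and chosen only for this illustration.

```python
import math
import random


def longest_common_word(a: str, b: str) -> int:
    """Length of the longest contiguous word occurring in both a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # match-run lengths ending at the previous base of a
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended one position earlier in both sequences.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def poisson_cdf_heuristic(k: int, n: int, m: int, p: float = 0.25) -> float:
    """Heuristic P(L < k) for sequence lengths n and m.

    Assumption: the number of match runs of length >= k is roughly Poisson
    with mean n * m * (1 - p) * p**k. This is a standard heuristic, not the
    formula from the paper.
    """
    return math.exp(-n * m * (1 - p) * p ** k)


def empirical_cdf(k: int, n: int, m: int, trials: int, seed: int = 0) -> float:
    """Fraction of simulated random sequence pairs with longest word < k."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = "".join(rng.choice("ACGT") for _ in range(n))
        b = "".join(rng.choice("ACGT") for _ in range(m))
        if longest_common_word(a, b) < k:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    n = m = 200
    for k in range(5, 12):
        sim = empirical_cdf(k, n, m, trials=200)
        approx = poisson_cdf_heuristic(k, n, m)
        print(f"k={k:2d}  simulated P(L<k)={sim:.3f}  heuristic={approx:.3f}")
```

Under these assumptions, an observed common word of length k between two real sequences could be judged against 1 - P(L < k). For real sequences with non-independent neighbouring bases, the value of p would need adjusting, as the abstract notes.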
