An accurate approximation to the distribution of the length of the longest matching word between two random DNA sequences
- 1 November 1990
- journal article
- conference paper
- Published by Springer Nature in Bulletin of Mathematical Biology
- Vol. 52 (6), 773-784
- https://doi.org/10.1007/bf02460808
Abstract
An accurate approximation is derived to the distribution of the length of the longest matching word present between two random DNA sequences of finite length, using only elementary probability arguments. The distribution is shown to be consistent with previous asymptotic results for the mean and variance of longest common words. The application of the distribution to assessing the statistical significance of sequence similarities is considered. It is shown how the distribution can be modified to take account of non-independence of neighbouring bases in real sequences.
This publication has 11 references indexed in Scilit:
- Identification of common molecular subsequences. Published by Elsevier, 2004
- Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences, 1990
- A test for the statistical significance of DNA sequence similarities for application in databank searches. Bioinformatics, 1989
- An Extreme Value Theory for Sequence Matching. The Annals of Statistics, 1986
- The statistical distribution of nucleic acid similarities. Nucleic Acids Research, 1985
- An Erdös-Rényi law with shifts. Advances in Mathematics, 1985
- A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Research, 1984
- New approaches for computer analysis of nucleic acid sequences. Proceedings of the National Academy of Sciences, 1983
- Statistical characterization of nucleic acid sequence functional domains. Nucleic Acids Research, 1983
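
The quantity studied in the abstract, the length of the longest word shared by two random DNA sequences, can be estimated empirically by simulation. The sketch below is illustrative only and does not reproduce the paper's approximation, which is not given in this record. It assumes independent, equally likely bases, and the function names (`longest_common_word`, `random_dna`, `simulate`) are hypothetical.

```python
import random
from collections import Counter


def longest_common_word(a: str, b: str) -> int:
    """Length of the longest contiguous word occurring in both a and b."""
    prev = [0] * (len(b) + 1)  # match lengths ending at the previous base of a
    best = 0
    for base_a in a:
        curr = [0] * (len(b) + 1)
        for j, base_b in enumerate(b, start=1):
            if base_a == base_b:
                # Extend the match that ended one position earlier in both sequences.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best:
                    best = curr[j]
        prev = curr
    return best


def random_dna(length: int, rng: random.Random) -> str:
    """Random sequence with independent, equiprobable bases (an assumption)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def simulate(n: int, m: int, trials: int, seed: int = 0) -> dict:
    """Empirical distribution of the longest matching word length."""
    rng = random.Random(seed)
    counts = Counter(
        longest_common_word(random_dna(n, rng), random_dna(m, rng))
        for _ in range(trials)
    )
    return {length: counts[length] / trials for length in sorted(counts)}


if __name__ == "__main__":
    for length, freq in simulate(n=200, m=200, trials=200).items():
        print(length, round(freq, 3))
```

An observed match between two real sequences could then be compared against the upper tail of such an empirical distribution; treating neighbouring bases as independent, as this sketch does, is exactly the simplification the abstract says the paper's distribution can be modified to relax.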