An accurate approximation to the distribution of the length of the longest matching word between two random DNA sequences

Abstract
An accurate approximation to the distribution of the length of the longest matching word between two random DNA sequences of finite length is derived using only elementary probability arguments. The distribution is shown to be consistent with previous asymptotic results for the mean and variance of the length of the longest common word. The application of the distribution to assessing the statistical significance of sequence similarities is considered. It is also shown how the distribution can be modified to account for the non-independence of neighbouring bases in real sequences.
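
To make the quantity under study concrete, the following minimal sketch estimates the distribution of the longest matching word length by simulation. It is an illustration only and not the derivation summarised above. It assumes independent, equiprobable bases, and the names `longest_common_word`, `random_dna` and `empirical_distribution` are hypothetical helpers introduced here.

```python
import random
from collections import Counter


def longest_common_word(a: str, b: str) -> int:
    """Length of the longest contiguous word occurring in both a and b."""
    best = 0
    # prev[j] = length of the common word ending at the previous base of a
    # and at b[j-1]; index 0 is a zero-padding column.
    prev = [0] * (len(b) + 1)
    for base_a in a:
        curr = [0] * (len(b) + 1)
        for j, base_b in enumerate(b, start=1):
            if base_a == base_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def random_dna(length: int, rng: random.Random) -> str:
    """Random sequence with independent, equiprobable bases (an assumption)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def empirical_distribution(n: int, m: int, trials: int, seed: int = 0) -> dict:
    """Estimate P(longest matching word length = k) for sequence lengths n and m."""
    rng = random.Random(seed)
    counts = Counter(
        longest_common_word(random_dna(n, rng), random_dna(m, rng))
        for _ in range(trials)
    )
    return {k: counts[k] / trials for k in sorted(counts)}


if __name__ == "__main__":
    # Print the estimated probability of each observed longest-word length.
    for k, p in empirical_distribution(n=200, m=200, trials=200).items():
        print(k, round(p, 3))
```

An empirical distribution of this kind gives a baseline against which an analytical approximation can be compared. Modelling dependence between neighbouring bases would require replacing `random_dna` with a suitable generator, for example a Markov chain.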