The Erdos-Renyi Law in Distribution, for Coin Tossing and Sequence Matching
Open Access
- 1 June 1990
- journal article
- Published by Institute of Mathematical Statistics in The Annals of Statistics
- Vol. 18 (2) , 539-570
- https://doi.org/10.1214/aos/1176347615
Abstract
We study approximations to the distribution of counts of matches in the best matching segment of specified length when comparing two long sequences of i.i.d. letters. The key tools used are large-deviation inequali- ties and the Chen-Stein method of Poisson approximation. The origin of the problem in molecular biology is indicated. 1. Introduction. A strand of DNA can be represented as a long string of letters from the four-letter alphabet {u, c, g, t). Currently, a large amount of laboratory effort is being expended in the determination and subsequent compilation of genetic information from various organisms. This information consists of listings of these long strings. A natural question arises from comparison of two or more such strings, by biologists' efforts to determine when a comparison detects an unusual congruence shared among the com- pared strings. Such statistical problems are naturally cast in the usual hypoth- esis-testing context, in which we need to compute the tail probability (the biologists' p-value) for a seemingly unusual event. The work we report here is motivated by the scientific desire to compute the sort of tail probabilities of interest to molecular biologists in their evaluation of closely matching regions of different biological sequences. Until recently, the standard tool used in computing tail probabilities was a probabilistic use of the Bonferroni inequalities as pioneered in Watson (1954). Such calculations essentially establish a Poisson approximation for the distribution of counts of weakly dependent rare events. See, for example, the moment calculations in Karlin and Ost (1987) and the discussion in Karlin, Ghandour Ost, Tavare and Korn (1983). Use of the Bonferroni inequalities requires computation of moments of arbitrarily large order; the task is always tedious and frequently technically demanding. A promising alternative to using Bonferroni methods to establish the Poisson approximation for dependent events is to use methods developed in Chen (1975) and Stein (1986). In Arratia, Goldstein and Gordon (1989), the Chen-Stein method of Poisson approximation is generalized to a multivariate context, and various examples relevant to sequence matching are presented. Indeed, the realization that the results of Arratia, Gordon and Waterman (1986) can be obtained without the high-order moment calculations required by Bonferroni methods has enabled us to cope successfully with problemsKeywords
This publication has 0 references indexed in Scilit: