On the Distribution ofK-tuple Matches for Sequence Homology: A Constant Time Exact Calculation of the Variance
- 1 January 1998
- journal article
- research article
- Published by Mary Ann Liebert Inc in Journal of Computational Biology
- Vol. 5 (1) , 87-100
- https://doi.org/10.1089/cmb.1998.5.87
Abstract
We study the distribution of a statistic useful in calculating the significance of the number of k-tuple matches detected in biological sequence homology algorithms. The statistic is Rn,k, the total number of heads in head runs of length k or more in a sequence of iid Bernoulli trials of length n. Calculation of the mean is straightforward. Poisson approximation formulas have been used for the variance because they are simple and powerful. Unfortunately, when p = P(Head) is large, the Poisson approximation no longer works well. In our application, p is large, say.75, and we have turned instead to direct calculation of the variance. Surprisingly, we are able to show that the variance, which is based on the interactions of O(n2) random variables, can be computed in constant time, independent of the length of the sequence and probability p. This result can be used to calculate the mean and variance of a number of other head run statistics in constant time. Additionally, we show how to extend the result to sequences generated by a stationary Markov process where the variance can be calculated in O(n) time.Keywords
This publication has 22 references indexed in Scilit:
- Minisatellite diversity supports a recent African origin for modern humansNature Genetics, 1996
- Friedreich's Ataxia: Autosomal Recessive Disease Caused by an Intronic GAA Triplet Repeat ExpansionScience, 1996
- Distribution Theory of Runs: A Markov Chain ApproachJournal of the American Statistical Association, 1994
- A space efficient algorithm for finding the best non-overlapping alignment scorePublished by Springer Nature ,1994
- On the Number of Overlapping Success Runs in a Sequence of Independent Bernoulli TrialsPublished by Springer Nature ,1993
- Poisson, compound poisson and process approximations for testing statistical significance in sequence comparisonsBulletin of Mathematical Biology, 1992
- An Unstable Triplet Repeat in a Gene Related to Myotonic Muscular DystrophyScience, 1992
- Genetic variation at five trimeric and tetrameric tandem repeat loci in four human population groupsGenomics, 1992
- Poisson Approximation and the Chen-Stein MethodStatistical Science, 1990
- Specific formulae for some success run distributionsStatistics & Probability Letters, 1990