Probabilistic and Statistical Properties of Words: An Overview
- 1 February 2000
- journal article
- research article
- Published by Mary Ann Liebert Inc in Journal of Computational Biology
- Vol. 7 (1-2) , 1-46
- https://doi.org/10.1089/10665270050081360
Abstract
In the following, an overview is given on statistical and probabilistic properties of words, as occurring in the analysis of biological sequences. Counts of occurrence, counts of clumps, and renewal counts are distinguished, and exact distributions as well as normal approximations, Poisson process approximations, and compound Poisson approximations are derived. Here, a sequence is modelled as a stationary ergodic Markov chain; a test for determining the appropriate order of the Markov chain is described. The convergence results take the error made by estimating the Markovian transition probabilities into account. The main tools involved are moment generating functions, martingales, Stein’s method, and the Chen-Stein method. Similar results are given for occurrences of multiple patterns, and, as an example, the problem of unique recoverability of a sequence from SBH chip data is discussed. Special emphasis lies on disentangling the complicated dependence structure between word occurrences, due to self-overlap as well as due to overlap between words. The results can be used to derive approximate, and conservative, confidence intervals for tests.
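The three kinds of count named in the abstract can be illustrated with a small sketch. The abstract does not define them, so the code assumes the definitions commonly used in the word-count literature: an occurrence count includes overlapping occurrences, a clump is a maximal run of occurrences in which each overlaps the next, and a renewal is an occurrence that does not overlap the previously counted renewal. The function names and the example sequence are hypothetical, chosen only for illustration.

```python
def occurrence_starts(seq, word):
    """Start indices of all (possibly overlapping) occurrences of word in seq."""
    m = len(word)
    return [i for i in range(len(seq) - m + 1) if seq[i:i + m] == word]


def count_clumps(starts, m):
    """Number of clumps: maximal runs in which consecutive occurrences overlap.

    Assumed definition: two occurrences of a word of length m overlap
    when their start positions differ by less than m.
    """
    clumps = 0
    prev = None
    for s in starts:
        if prev is None or s - prev >= m:
            clumps += 1  # this occurrence does not overlap the previous one
        prev = s
    return clumps


def count_renewals(starts, m):
    """Number of renewals: scanning left to right, an occurrence is counted
    only if it does not overlap the last occurrence that was counted."""
    renewals = 0
    last = None
    for s in starts:
        if last is None or s - last >= m:
            renewals += 1
            last = s  # only counted occurrences block later ones
    return renewals


seq, word = "ATATATAGGATA", "ATA"   # illustrative data, not from the paper
starts = occurrence_starts(seq, word)
print(starts)                             # [0, 2, 4, 9]
print(len(starts))                        # occurrence count: 4
print(count_clumps(starts, len(word)))    # clump count: 2
print(count_renewals(starts, len(word)))  # renewal count: 3
```

For the self-overlapping word ATA the three counts differ (4, 2, and 3), which is the kind of dependence through self-overlap that the abstract highlights. For a word that cannot overlap itself, all three counts coincide.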