Exceptional Motifs in Different Markov Chain Models for a Statistical Analysis of DNA Sequences
- 1 January 1995
- journal article
- research article
- Published by Mary Ann Liebert Inc in Journal of Computational Biology
- Vol. 2 (3) , 417-437
- https://doi.org/10.1089/cmb.1995.2.417
Abstract
Identifying exceptional motifs is often used for extracting information from long DNA sequences. The two difficulties of the method are the choice of the model that defines the expected frequencies of words and the approximation of the variance of the difference T(W) between the number of occurrences of a word W and its estimate. We consider here different Markov chain models, with either stationary or periodic transition probabilities. We estimate the variance of the difference T(W) by the conditional variance of the number of occurrences of W given the oligonucleotide counts that define the model. Two applications show how to use asymptotically standard normal statistics associated with the counts to describe a given sequence in terms of its outlying words. Sequences of Escherichia coli and of Bacillus subtilis are compared with respect to their exceptional tri- and tetranucleotides. For both bacteria, exceptional 3-words are mainly found in the coding frame. E. coli palindrome counts are analyzed in different models, showing that many overabundant words are one-letter mutations of avoided palindromes.
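To make the idea of an exceptional word concrete, here is a minimal Python sketch of one such statistic. It is an illustration under stated assumptions, not the article's own code: it uses only a stationary Markov chain of the highest order a word of length m allows (order m-2), where the expected count is estimated from the counts of the word's two (m-1)-letter sub-words and its (m-2)-letter core. The variance expression is a commonly quoted approximation for that particular model; the abstract does not spell out a formula, so treat it as an assumption. The function names `count_words` and `word_score` are hypothetical.

```python
from collections import Counter
from math import sqrt


def count_words(seq, k):
    """Count overlapping occurrences of every k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def word_score(seq, word):
    """Return (observed, expected, z) for `word` in `seq`.

    Assumed model: stationary Markov chain of order m-2, m = len(word).
    The variance formula is an assumed approximation for this model only.
    """
    m = len(word)
    if m < 3:
        raise ValueError("word must have at least 3 letters")

    observed = count_words(seq, m)[word]
    sub_counts = count_words(seq, m - 1)
    n_prefix = sub_counts[word[:-1]]                 # first m-1 letters
    n_suffix = sub_counts[word[1:]]                  # last m-1 letters
    n_core = count_words(seq, m - 2)[word[1:-1]]     # middle m-2 letters

    if n_core == 0:
        # The core never occurs, so the word cannot occur either.
        return observed, 0.0, 0.0

    # Estimated count of the word given the shorter-word counts.
    expected = n_prefix * n_suffix / n_core

    # Assumed approximation of the conditional variance of T(W).
    variance = (expected * (n_core - n_prefix) * (n_core - n_suffix)
                / n_core ** 2)

    # T(W) = observed - expected, standardized when the variance is positive.
    z = (observed - expected) / sqrt(variance) if variance > 0 else 0.0
    return observed, expected, z


if __name__ == "__main__":
    toy_sequence = "ACGACTTCGACG"  # toy data, not from the article
    print(word_score(toy_sequence, "ACG"))
```

A strongly positive z suggests an over-represented word and a strongly negative z an avoided one. The periodic (coding-frame) models mentioned in the abstract would need counts kept separately for each codon position, which this sketch does not attempt.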