Significantly lower entropy estimates for natural DNA sequences
- 22 November 2002
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- No. 10680314, p. 151-160
- https://doi.org/10.1109/dcc.1997.581998
Abstract
If DNA were a random string over its alphabet {A,C,G,T}, an optimal code would assign 2 bits to each nucleotide. We imagine DNA to be a highly ordered, purposeful molecule, and might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates. Surprisingly, this has not been the case for many natural DNA sequences, including portions of the human genome. We introduce a new statistical model (compression algorithm), the strongest reported to date, for naturally occurring DNA sequences. Conventional techniques code a nucleotide using only slightly fewer bits (1.90) than one obtains by relying only on the frequency statistics of individual nucleotides (1.95). Our method in some cases increases this gap by more than five-fold (1.66) and may lead to better performance in microbiological pattern recognition applications. One of our main contributions, and the principal source of these improvements, is the formal inclusion of inexact match information in the model. The existence of matches at various distances forms a panel of experts whose predictions are then combined into a single prediction. The structure of this combination is novel and its parameters are learned using expectation maximization (EM).
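The two quantities the abstract compares can be illustrated with a minimal Python sketch. It is not the paper's model: `zero_order_entropy` computes the bits per nucleotide implied by single-nucleotide frequencies alone (the 1.95 baseline is a figure of this kind), and `em_mixture_weights` fits fixed weights for a plain mixture of experts by EM, which is a generic stand-in for the novel combination structure that the abstract does not spell out. All function names and the `expert_probs` input layout are assumptions made for illustration.

```python
import math
from collections import Counter


def zero_order_entropy(seq):
    """Bits per symbol implied by single-symbol frequencies alone."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def em_mixture_weights(expert_probs, iterations=50):
    """Fit fixed mixture weights by EM (a generic sketch, not the paper's structure).

    expert_probs[t][k] is the probability that hypothetical expert k assigned
    to the nucleotide actually observed at position t. At least one expert
    must give a positive probability at every position.
    """
    k = len(expert_probs[0])
    weights = [1.0 / k] * k
    for _ in range(iterations):
        totals = [0.0] * k
        for probs in expert_probs:
            # E-step: responsibility of each expert for this position.
            mix = sum(w * p for w, p in zip(weights, probs))
            for j in range(k):
                totals[j] += weights[j] * probs[j] / mix
        # M-step: new weight is the average responsibility.
        weights = [t / len(expert_probs) for t in totals]
    return weights


def mixture_code_length(expert_probs, weights):
    """Average bits per symbol when coding with the weighted mixture."""
    bits = 0.0
    for probs in expert_probs:
        bits -= math.log2(sum(w * p for w, p in zip(weights, probs)))
    return bits / len(expert_probs)
```

For a sequence in which all four nucleotides are equally frequent, such as `"ACGT"`, `zero_order_entropy` returns 2.0, the random-string figure quoted in the abstract. Skewed frequencies push the value below 2, and a mixture of experts that predicts well pushes `mixture_code_length` lower still.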