Significantly Lower Entropy Estimates for Natural DNA Sequences

1 January 1999

journal article
Published by Mary Ann Liebert Inc in Journal of Computational Biology

Vol. 6 (1), 125-142
https://doi.org/10.1089/cmb.1999.6.125

Abstract

If DNA were a random string over its alphabet {A, C, G, T}, an optimal code would assign two bits to each nucleotide. DNA may be imagined to be a highly ordered, purposeful molecule, and one might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates. Surprisingly, this has not been the case for many natural DNA sequences, including portions of the human genome. We introduce a new statistical model (compression algorithm), the strongest reported to date, for naturally occurring DNA sequences. Conventional techniques code a nucleotide using only slightly fewer bits (1.90) than one obtains by relying only on the frequency statistics of individual nucleotides (1.95). Our method in some cases increases this gap by more than fivefold (1.66) and may lead to better performance in microbiological pattern recognition applications. One of our main contributions, and the principal source of these improvements, is the formal inclusion of inexact match information in the model. The existence of matches at various distances forms a panel of experts whose predictions are then combined into a single prediction. The structure of this combination is novel, and its parameters are learned using Expectation Maximization (EM). Experiments are reported using a wide variety of DNA sequences and compared whenever possible with earlier work. Four reasonable notions for the string distance function used to identify near matches are implemented and experimentally compared. We also report lower entropy estimates for coding regions extracted from a large collection of nonredundant human genes. The conventional estimate is 1.92 bits. Our model produces only slightly better results (1.91 bits) when considering nucleotides, but achieves 1.84-1.87 bits when the prediction problem is divided into two stages: (i) predict the next amino acid based on inexact polypeptide matches, and (ii) predict the particular codon. Our results suggest that matches at the amino acid level play some role, but a small one, in determining the statistical structure of nonredundant coding sequences.
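
The frequency-only baseline and the "more than fivefold" gap quoted in the abstract can be illustrated with a short Python sketch. It is not the authors' model: it only computes the zeroth-order (single-nucleotide frequency) entropy of a string and the ratio of the two gaps implied by the reported figures (1.95, 1.90, 1.66). The function names and the toy sequences are assumptions made for illustration.

```python
import math
from collections import Counter


def zeroth_order_entropy(seq: str) -> float:
    """Bits per symbol using only single-symbol frequencies (hypothetical helper)."""
    if not seq:
        return 0.0
    counts = Counter(seq)
    n = len(seq)
    # Shannon entropy: -sum p * log2(p) over the observed symbols.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def gap_ratio(baseline: float, conventional: float, improved: float) -> float:
    """How many times larger the improved gap is than the conventional gap."""
    return (baseline - improved) / (baseline - conventional)


# A string with uniform nucleotide frequencies costs two bits per nucleotide.
uniform_bits = zeroth_order_entropy("ACGT" * 100)

# A toy skewed string (not real DNA) costs less: A=1/2, C=1/4, G=T=1/8.
skewed_bits = zeroth_order_entropy("AAAACCGT")

# Figures taken from the abstract: baseline 1.95, conventional 1.90, new model 1.66.
reported_ratio = gap_ratio(1.95, 1.90, 1.66)

print(f"uniform: {uniform_bits:.2f} bits, skewed: {skewed_bits:.2f} bits")
print(f"gap ratio from reported figures: {reported_ratio:.1f}")
```

With the reported figures, the conventional gap is 0.05 bits and the new gap is 0.29 bits, a ratio of about 5.8, which is consistent with the "more than fivefold" claim.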
