Extendable words in nucleotide sequences

Abstract
Previous statistical analyses revealed several peculiarities of nucleotide sequences that preclude their description by existing models and thus allow one to distinguish DNA and RNA sequences from random A, T, G, C-texts. These peculiarities arise from the unusual distribution of certain words in nucleotide sequences: while the distribution of (most) words is consistent with Markov models of small orders, the distribution of certain words cannot be described by any previous model (anomalies in the distribution of homonucleotide, homopurine, and homopyrimidine runs, complementary and mirror palindromes, and non-stationary words). In this work we introduce a probabilistic approach that is partly motivated by analogy with linguistics. We also describe another important feature of DNA/RNA sequences: anomalies in the distribution of words of poor nucleotide composition. We show that some classes of these words are the major obstacle to a simple Markov description of nucleotide sequences.
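
To make the notion of consistency with a low-order Markov model concrete, the following is a minimal sketch of one common way to compare an observed word count with the count expected under a Markov model of a given order. It is not taken from this work and is not the probabilistic approach introduced here; the function names `count_words` and `expected_count` are hypothetical, and the estimator (a product of counts of overlapping subwords of length order + 1, divided by counts of their overlaps of length equal to the order) is assumed for illustration only.

```python
from collections import Counter


def count_words(seq, length):
    """Count overlapping occurrences of every word of the given length."""
    return Counter(seq[i:i + length] for i in range(len(seq) - length + 1))


def expected_count(seq, word, order):
    """Estimate the expected count of `word` under a Markov model.

    Assumption (illustrative, not from the paper): the expectation is built
    from the observed counts of subwords of length order + 1, divided by the
    observed counts of their overlaps of length `order`.
    """
    m = len(word)
    if not 1 <= order <= m - 2:
        raise ValueError("order must satisfy 1 <= order <= len(word) - 2")

    long_counts = count_words(seq, order + 1)
    short_counts = count_words(seq, order)

    # Product of counts of all subwords of length order + 1 inside the word.
    numerator = 1.0
    for i in range(m - order):
        numerator *= long_counts[word[i:i + order + 1]]

    # Product of counts of the interior overlaps of length `order`.
    denominator = 1.0
    for i in range(1, m - order):
        denominator *= short_counts[word[i:i + order]]

    return numerator / denominator if denominator else 0.0
```

For a sequence `seq` and a word such as a homonucleotide run, comparing `count_words(seq, len(word))[word]` with `expected_count(seq, word, order)` gives a rough indication of whether that word is over- or under-represented relative to the chosen model order.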
