Extendable words in nucleotide sequences

1 April 1992

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 8 (2) , 129-135
https://doi.org/10.1093/bioinformatics/8.2.129

Abstract

Previous statistical analyses revealed several peculiarities of nucleotide sequences that preclude their description by existing models and thus allow one to distinguish DNA and RNA sequences from random texts over the alphabet A, T, G, C. These peculiarities are a consequence of the unusual distribution of certain words in nucleotide sequences: while the distribution of most words is consistent with low-order Markov models, the distribution of certain words cannot be described by any previous model (anomalies in the distribution of homonucleotide, homopurine and homopyrimidine runs, complementary and mirror palindromes, and non-stationary words). In this work we introduce a probabilistic approach that is partly motivated by analogy with linguistics. We also describe another important feature of DNA/RNA sequences: anomalies in the distribution of words of poor nucleotide composition. We show that some classes of these words are the major obstacle to a simple Markov description of nucleotide sequences.
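
To make the notion of a word being "consistent with a Markov model" concrete, the following is a minimal sketch, not the authors' method. It assumes the standard first-order Markov estimate, in which the expected count of a word is the product of its overlapping dinucleotide counts divided by the product of its interior mononucleotide counts. The function names (`count_words`, `expected_count_markov1`, `observed_count`) are hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_words(seq, k):
    """Count all overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def observed_count(seq, word):
    """Number of (overlapping) occurrences of word in seq."""
    return count_words(seq, len(word))[word]


def expected_count_markov1(seq, word):
    """Expected count of word under a first-order Markov model.

    Assumption (not taken from the article): the model is estimated
    from the sequence's own dinucleotide and mononucleotide counts, so
    E(w1..wk) = prod N(w_i w_{i+1}) / prod N(w_j) over interior letters j.
    """
    if len(word) < 2:
        raise ValueError("word must contain at least two letters")
    mono = count_words(seq, 1)
    di = count_words(seq, 2)
    expected = 1.0
    # Numerator: counts of every overlapping dinucleotide in the word.
    for i in range(len(word) - 1):
        expected *= di[word[i:i + 2]]
    # Denominator: counts of the interior letters of the word.
    for i in range(1, len(word) - 1):
        if mono[word[i]] == 0:
            return 0.0
        expected /= mono[word[i]]
    return expected
```

Comparing `observed_count(seq, word)` with `expected_count_markov1(seq, word)` for, say, a long homonucleotide run such as `"AAAAAA"` gives a rough indication of whether that word departs from the first-order model. A proper analysis would also need a variance estimate, which this sketch omits.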
