A model of DNA sequence evolution

Abstract
Statistical studies of gene populations on the purine/pyrimidine alphabet have shown that the mean occurrence probability of the i-motif YRY(N)iYRY (R=purine, Y=pyrimidine, N=R or Y) is not uniform when i varies in the range [1,99], but presents a maximum at i=6 in the following populations: protein coding genes of eukaryotes, prokaryotes, chloroplasts and mitochondria, and also viral introns, ribosomal RNA genes and transfer RNA genes (Arquès and Michel, 1987b, J. theor. Biol. 128, 457–461). From the “universality” of this observation, we suggested that the oligonucleotide YRY(N)6 is a primitive one and that it has a central function in DNA sequence evolution (Arquès and Michel, 1987b, J. theor. Biol. 128, 457–461). Following this idea, we introduce the concept of a model of DNA sequence evolution, which is validated according to a scheme presented in three parts. In the first part, using the latest version of the gene database, the YRY(N)6YRY preferential occurrence (maximum at i=6) is confirmed for the populations mentioned above and is extended to some newly analysed populations: chloroplast introns, chloroplast 5′ regions, mitochondrial 5′ regions and small nuclear RNA genes. In addition, the YRY(N)6YRY preferential occurrence and the periodicities are used to classify 18 gene populations. In the second part, we demonstrate that several statistical features characterizing different gene populations (in particular the YRY(N)6YRY preferential occurrence and the periodicities) can be retrieved from a simple Markov model based on the mixing of the two oligonucleotides YRY(N)6 and YRY(N)3 and on the percentages of RYR and YRY in the unspecified trinucleotides (N)3 of YRY(N)6 and YRY(N)3. Several properties are identified; they prove in particular that the oligonucleotide mixing is an independent process and that several different features are functions of a single parameter.
In the third part, confronting the model with real data shows a strong correlation between reality and simulation concerning the presence of large alternating purine/pyrimidine stretches and of periodicities. The model also contributes to a greater understanding of biological reality, e.g. the presence or absence of large alternating purine/pyrimidine stretches can be explained as a simple consequence of the mixing of two particular oligonucleotides. Finally, we believe that such an approach is the first step toward a unified model of DNA sequence evolution allowing the molecular understanding of both the origin of life and the current biological reality.
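As an illustration of the i-motif statistic discussed above, the following minimal Python sketch maps sequences onto the purine/pyrimidine alphabet and estimates how often YRY is followed, i bases later, by YRY. It is not the authors' procedure: the function names (to_ry, imotif_frequency, mean_imotif_profile), the per-position normalisation and the treatment of short sequences are assumptions made for this example only.

```python
PURINES = set("AG")        # R
PYRIMIDINES = set("CTU")   # Y (U included so RNA sequences also work)


def to_ry(seq):
    """Map a nucleotide sequence onto the R/Y alphabet.

    Any symbol that is neither a purine nor a pyrimidine becomes '?',
    which can never match a motif (an assumption of this sketch).
    """
    out = []
    for base in seq.upper():
        if base in PURINES:
            out.append("R")
        elif base in PYRIMIDINES:
            out.append("Y")
        else:
            out.append("?")
    return "".join(out)


def imotif_frequency(ry, i):
    """Fraction of start positions at which YRY(N)iYRY occurs in an R/Y string."""
    span = 3 + i + 3                      # YRY + i unspecified bases + YRY
    positions = len(ry) - span + 1
    if positions <= 0:
        return 0.0
    hits = sum(
        1
        for k in range(positions)
        if ry[k:k + 3] == "YRY" and ry[k + 3 + i:k + 6 + i] == "YRY"
    )
    return hits / positions


def mean_imotif_profile(sequences, max_i=99):
    """Mean i-motif frequency over a population, for i = 1..max_i.

    Sequences too short to contain the motif for a given i are left out
    of the mean for that i (a hypothetical convention).
    """
    ry_seqs = [to_ry(s) for s in sequences]
    profile = {}
    for i in range(1, max_i + 1):
        usable = [ry for ry in ry_seqs if len(ry) >= i + 6]
        if usable:
            profile[i] = sum(imotif_frequency(ry, i) for ry in usable) / len(usable)
        else:
            profile[i] = 0.0
    return profile
```

Applied to a population of genes, the value of i that maximises the returned profile can be compared with the maximum at i=6 reported above; the sketch itself makes no claim about what that maximum will be for any given data set.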