Statistical method for predicting protein coding regions in nucleic acid sequences

1 November 1987

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 3 (4), 287-295
https://doi.org/10.1093/bioinformatics/3.4.287

Abstract

Protein coding regions of a genome fragment can be predicted mathematically by studying variations in its statistical properties or by searching for the signals characteristic of the junctions between coding and non-coding regions. We propose here a new statistical method using correspondence analysis. This method does not use any reference codon set but takes into account the homogeneity of codon usage along the studied genome fragment. The method has been compared with previously published methods, especially the ‘codon usage method’ of Staden, and two examples are presented here. Applications to the analysis of prokaryotic operons and eukaryotic split genes are also discussed. Use of the method has also revealed two structures not previously described: (i) in the human prt gene, a strong triplet structure exists in a non-coding region; (ii) in the human tp-a, codon usage is not uniform between the different exons.
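The abstract gives no algorithmic details, so the following is only a minimal sketch of the general idea, not the authors' procedure. It counts codons in sliding windows for each of the three reading frames and applies a textbook correspondence analysis (singular value decomposition of the standardized residuals) to the resulting table. The function names, the window size and the step are assumptions chosen for illustration, and the sequence is assumed to contain enough unambiguous A, C, G, T bases that no window is empty.

```python
from itertools import product

import numpy as np

# The 64 codons in a fixed order, used as table columns.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}


def codon_count_table(seq, window=300, step=150):
    """Count codons per (window start, reading frame).

    Window size and step are illustrative assumptions,
    not values taken from the paper.
    """
    seq = seq.upper()
    rows, labels = [], []
    for start in range(0, len(seq) - window + 1, step):
        for frame in range(3):
            counts = np.zeros(len(CODONS))
            # Read complete triplets that fit inside the window.
            for i in range(start + frame, start + window - 2, 3):
                idx = CODON_INDEX.get(seq[i:i + 3])
                if idx is not None:  # skip triplets with ambiguous bases
                    counts[idx] += 1
            rows.append(counts)
            labels.append((start, frame))
    return np.array(rows), labels


def correspondence_analysis(table, n_axes=2):
    """Standard correspondence analysis of a contingency table.

    Returns the row principal coordinates on the first n_axes axes
    and the inertia (squared singular value) carried by each axis.
    """
    table = table[:, table.sum(axis=0) > 0]  # drop codons never observed
    if np.any(table.sum(axis=1) == 0):
        raise ValueError("every row needs at least one counted codon")
    P = table / table.sum()          # correspondence matrix
    r = P.sum(axis=1)                # row masses
    c = P.sum(axis=0)                # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)  # standardized residuals
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    return row_coords[:, :n_axes], sv[:n_axes] ** 2
```

In such a sketch, each row is one window read in one frame, so windows whose codon usage resembles each other fall close together on the leading axes, without any reference codon table being supplied. How the positions on those axes are turned into a coding/non-coding call is not specified in the abstract and is left open here.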

This publication has 6 references indexed in Scilit:

Nucleotide distribution and the recognition of coding regions in DNA sequences: An information theory approach
Journal of Theoretical Biology, 1985
Application of learning techniques to splicing site recognition
Biochimie, 1985
Conservation of RNA secondary structures in two intron families including mitochondrial-, chloroplast- and nuclear-encoded members
The EMBO Journal, 1983
Nucleotide sequence of bacteriophage λ DNA
Journal of Molecular Biology, 1982
Single base substitution in an intron of oxidase gene compensates splicing defects of the cytochrome b gene
Nature, 1982
Method to determine the reading frame of a protein from the purine/pyrimidine genome sequence and its possible evolutionary justification
Proceedings of the National Academy of Sciences, 1981