Detection of new genes in a bacterial genome using Markov models for three gene classes

11 September 1995

journal article
research article
Published by Oxford University Press (OUP) in Nucleic Acids Research

Vol. 23 (17) , 3554-3562
https://doi.org/10.1093/nar/23.17.3554

Abstract

We further investigated the statistical features of the three classes of Escherichia coli genes that have been previously delineated by factorial correspondence analysis and dynamic clustering methods. A phased Markov model for a nucleotide sequence of each gene class was developed and employed for gene prediction using the GeneMark program. The protein-coding region prediction accuracy was determined for class-specific Markov models of different orders when the programs implementing these models were applied to gene sequences from the same or other classes. It is shown that at least two training sets and two program versions derived for different classes of E. coli genes are necessary in order to achieve a high accuracy of coding region prediction for uncharacterized sequences. Some annotated E. coli genes from Class I and Class III are shown to be spurious, whereas many open reading frames (ORFs) that have not been annotated in GenBank as genes are predicted to encode proteins. The amino acid sequences of the putative products of these ORFs initially did not show similarity to already known proteins. However, conserved regions have been identified in several of them by screening the latest entries in protein sequence databases and applying methods for motif search, while some of the other new genes have been identified in independent experiments.
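
As a rough illustration of the idea of a phased, class-specific Markov model, the following minimal Python sketch estimates transition probabilities separately for each of the three codon positions and then asks which class model assigns a sequence the highest log-likelihood. It is not the GeneMark implementation: the function names, the pseudocount smoothing, the uniform fallback for unseen contexts, and the toy training sequences are all assumptions made for this example, and the sketch omits the non-coding model and the scoring of both strands that a real gene finder would need.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_phased_markov(seqs, order=1, pseudocount=1.0):
    """Estimate P(base | previous `order` bases, codon phase).

    Assumes every training sequence starts in frame, so index i has phase i % 3.
    Returns a list of three tables (one per phase): context -> {base: prob}.
    """
    counts = [
        defaultdict(lambda: dict.fromkeys(ALPHABET, pseudocount)) for _ in range(3)
    ]
    for seq in seqs:
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip positions containing ambiguous symbols such as N.
            if base in ALPHABET and all(c in ALPHABET for c in context):
                counts[i % 3][context][base] += 1
    model = []
    for phase_counts in counts:
        table = {}
        for context, row in phase_counts.items():
            total = sum(row.values())
            table[context] = {b: n / total for b, n in row.items()}
        model.append(table)
    return model


def log_likelihood(seq, model, order=1, frame=0):
    """Log-probability of `seq` given that seq[frame] is a first codon position.

    Contexts never seen in training fall back to a uniform 0.25 (an assumption).
    """
    total = 0.0
    for i in range(order, len(seq)):
        phase = (i - frame) % 3
        context = seq[i - order:i]
        p = model[phase].get(context, {}).get(seq[i], 0.25)
        total += math.log(p)
    return total


def best_class(seq, class_models, order=1):
    """Name of the class whose phased model scores `seq` highest."""
    return max(
        class_models,
        key=lambda name: log_likelihood(seq, class_models[name], order),
    )


# Toy, hypothetical training sets standing in for two gene classes.
class_models = {
    "A": train_phased_markov(["ATGGCTGCTGCTGCTGCTTAA"]),
    "B": train_phased_markov(["ATGAAAAAAAAAAAAAAATAA"]),
}
query = "GCTGCTGCTGCT"
predicted = best_class(query, class_models)
```

Comparing `log_likelihood` across the values `frame=0, 1, 2` for one model gives a crude indication of the reading frame, and comparing across class models mirrors, in a very simplified way, the abstract's point that a model trained on one gene class may score genes from another class poorly.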
