How to Interpret an Anonymous Bacterial Genome: Machine Learning Approach to Gene Identification

Open Access

1 November 1998

journal article
Published by Cold Spring Harbor Laboratory in Genome Research

Vol. 8 (11) , 1154-1171
https://doi.org/10.1101/gr.8.11.1154

Abstract

In this report we address the problem of accurate statistical modeling of DNA sequences, either coding or noncoding, for a bacterial species whose genome (or a large portion of it) has been sequenced but not yet characterized experimentally. Availability of these models is critical for successful solution of the genome annotation task by statistical methods of gene finding. We present a method, GeneMark-Genesis, which learns the parameters of Markov models of protein-coding and noncoding regions from anonymous bacterial genomic sequence. These models are subsequently used in the GeneMark and GeneMark.hmm gene-finding programs. Although there is basically one model of a noncoding region for a given genome, several models of protein-coding regions are automatically obtained by GeneMark-Genesis. The diversity of protein-coding models reflects the diversity of oligonucleotide compositions, particularly the diversity of codon usage strategies observed in genes from one and the same genome. In the simplest and most important case, there are just two gene models: typical and atypical. We show that the atypical model allows one to predict genes that escape identification by the typical model. Many genes predicted by the atypical model appear to be horizontally transferred genes. The early versions of GeneMark-Genesis were used for annotating the genomes of Methanococcus jannaschii and Helicobacter pylori. We report the results of accuracy testing of the full-scale version of GeneMark-Genesis on 10 completely sequenced bacterial genomes. Interestingly, the GeneMark.hmm program that employed the typical and atypical models defined by GeneMark-Genesis was able to predict 683 new atypical genes, with 176 of them confirmed by similarity search.
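The general idea of scoring a DNA fragment with separate Markov models for coding and noncoding sequence can be illustrated with a small sketch. This is not the GeneMark-Genesis procedure, whose training steps the abstract does not spell out. It is a generic fixed-order Markov chain with additive smoothing, and every name in it (`train_markov`, `log_odds`, the toy sequences, the choice of order 2) is an assumption made for illustration. It also assumes sequences contain only uppercase A, C, G and T.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"


def train_markov(sequences, order=2, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0
    model = {}
    # Enumerate every possible context so unseen ones get a uniform distribution.
    for ctx in ("".join(p) for p in product(ALPHABET, repeat=order)):
        total = sum(counts[ctx].values()) + pseudocount * len(ALPHABET)
        model[ctx] = {b: (counts[ctx][b] + pseudocount) / total for b in ALPHABET}
    return model


def log_likelihood(seq, model, order=2):
    """Sum of log transition probabilities along the sequence."""
    return sum(
        math.log(model[seq[i - order:i]][seq[i]]) for i in range(order, len(seq))
    )


def log_odds(seq, coding_model, noncoding_model, order=2):
    """Positive values favor the coding model, negative the noncoding model."""
    return log_likelihood(seq, coding_model, order) - log_likelihood(
        seq, noncoding_model, order
    )


# Toy training sets (hypothetical, not real genomic data).
coding_examples = ["ATGGCGGCGGCGGCGGCG", "ATGGCCGCGGCGGCCGCG"]
noncoding_examples = ["ATATATATTTAATATATA", "TTATATAATATATTATAT"]

coding_model = train_markov(coding_examples)
noncoding_model = train_markov(noncoding_examples)

gc_rich_score = log_odds("GCGGCGGCG", coding_model, noncoding_model)
at_rich_score = log_odds("ATATATATA", coding_model, noncoding_model)
print(gc_rich_score > 0, at_rich_score < 0)
```

Under the same assumptions, the typical and atypical gene models mentioned in the abstract would each be a separate table of this kind, and a fragment could be scored against each of them in turn.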
