How to Interpret an Anonymous Bacterial Genome: Machine Learning Approach to Gene Identification
Open Access
- 1 November 1998
- journal article
- Published by Cold Spring Harbor Laboratory in Genome Research
- Vol. 8 (11) , 1154-1171
- https://doi.org/10.1101/gr.8.11.1154
Abstract
In this report we address the problem of accurate statistical modeling of DNA sequences, either coding or noncoding, for a bacterial species whose genome (or a large portion of it) has been sequenced but not yet characterized experimentally. Availability of these models is critical for successful solution of the genome annotation task by statistical methods of gene finding. We present the method, GeneMark-Genesis, which learns the parameters of Markov models of protein-coding and noncoding regions from anonymous bacterial genomic sequence. These models are subsequently used in the GeneMark and GeneMark.hmm gene-finding programs. Although there is basically one model of a noncoding region for a given genome, several models of protein-coding regions are automatically obtained by GeneMark-Genesis. The diversity of protein-coding models reflects the diversity of oligonucleotide compositions, particularly the diversity of codon usage strategies observed in genes from one and the same genome. In the simplest and most important case, there are just two gene models, typical and atypical. We show that the atypical model allows one to predict genes that escape identification by the typical model. Many genes predicted by the atypical model appear to be horizontally transferred genes. The early versions of GeneMark-Genesis were used for annotating the genomes of Methanococcus jannaschii and Helicobacter pylori. We report the results of accuracy testing of the full-scale version of GeneMark-Genesis on 10 completely sequenced bacterial genomes. Interestingly, the GeneMark.hmm program that employed the typical and atypical models defined by GeneMark-Genesis was able to predict 683 new atypical genes, with 176 of them confirmed by similarity search.
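To illustrate the general idea of scoring a sequence against separate Markov models of coding and noncoding DNA, the following is a minimal sketch and not the GeneMark-Genesis procedure itself. It assumes a plain fixed-order (here second-order) homogeneous Markov chain with add-one smoothing, and it trains on tiny made-up sequences. The function names, the model order, the pseudocount and the toy training data are all illustrative assumptions that do not come from the paper.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"


def train_markov(seqs, order=2, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) with additive smoothing.

    Assumption: a simple homogeneous Markov chain, used only for illustration.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1.0
    model = {}
    for ctx in map("".join, product(ALPHABET, repeat=order)):
        total = sum(counts[ctx][b] for b in ALPHABET) + pseudocount * len(ALPHABET)
        model[ctx] = {b: (counts[ctx][b] + pseudocount) / total for b in ALPHABET}
    return model


def log_likelihood(seq, model, order=2):
    """Sum of log transition probabilities along the sequence."""
    return sum(
        math.log(model[seq[i - order:i]][seq[i]]) for i in range(order, len(seq))
    )


def log_odds(seq, coding, noncoding, order=2):
    """Positive values favor the coding model, negative the noncoding model."""
    return log_likelihood(seq, coding, order) - log_likelihood(seq, noncoding, order)


# Toy, made-up training sets (hypothetical; real training uses genomic sequence).
coding_model = train_markov(["GCGGCGGCGGCGGCGGCG", "GCGGCGGCGGCG"])
noncoding_model = train_markov(["ATATATATATATATAT", "TATATATATATA"])

print(log_odds("GCGGCGGCG", coding_model, noncoding_model) > 0)
print(log_odds("ATATATATA", coding_model, noncoding_model) < 0)
```

In the same spirit, a second coding model trained on a different subset of sequences could stand in for an "atypical" gene model, and a candidate region could be scored against each model in turn. How GeneMark-Genesis actually derives its typical and atypical models is described in the paper and is not reproduced here.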