Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training
- 29 August 2008
- journal article
- Published by Cold Spring Harbor Laboratory in Genome Research
- Vol. 18 (12) , 1979-1990
- https://doi.org/10.1101/gr.081612.108
Abstract
We describe a new ab initio algorithm, GeneMark-ES version 2, that identifies protein-coding genes in fungal genomes. The algorithm does not require a predetermined training set to estimate parameters of the underlying hidden Markov model (HMM). Instead, the anonymous genomic sequence in question is used as an input for iterative unsupervised training. The algorithm extends our previously developed method tested on genomes of Arabidopsis thaliana, Caenorhabditis elegans, and Drosophila melanogaster. To better reflect features of fungal gene organization, we enhanced the intron submodel to accommodate sequences with and without branch point sites. This design enables the algorithm to work equally well for species with the kinds of variations in splicing mechanisms seen in the fungal phyla Ascomycota, Basidiomycota, and Zygomycota. Upon self-training, the intron submodel switches on in several steps to reach its full complexity. We demonstrate that the accuracy of the algorithm, at both the exon and the whole-gene level, compares favorably with the accuracy of gene finders that employ supervised training. Application of the new method to known fungal genomes indicates substantial improvement over existing annotations. By eliminating the effort necessary to build comprehensive training sets, the new algorithm can streamline and accelerate the process of annotation in a large number of fungal genome sequencing projects.
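The idea of iterative unsupervised training can be illustrated with a deliberately tiny sketch. It is not the GeneMark-ES procedure, whose HMM includes exon, intron, and intergenic submodels that the abstract only summarizes. The sketch substitutes a two-state HMM (noncoding and coding) with single-nucleotide emissions and uses Viterbi training (hard-assignment re-estimation): label the sequence with the current parameters, re-estimate the parameters from those labels, and repeat until the labeling stops changing. The function names, the heuristic starting parameters, the fixed `stay` probability, and the pseudocount are all assumptions made for illustration.

```python
import math

STATES = (0, 1)      # 0 = noncoding, 1 = coding (toy two-state model, an assumption)
ALPHABET = "ACGT"    # the sketch assumes a non-empty sequence over A, C, G, T only


def viterbi(seq, emit, stay=0.99):
    """Return the most likely state path under fixed, symmetric transitions."""
    log_stay, log_switch = math.log(stay), math.log(1.0 - stay)
    score = [math.log(0.5) + math.log(emit[s][seq[0]]) for s in STATES]
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for ch in seq[1:]:
        new_score, pointers = [], []
        for s in STATES:
            candidates = [
                score[p] + (log_stay if p == s else log_switch) for p in STATES
            ]
            best = max(STATES, key=lambda p: candidates[p])
            pointers.append(best)
            new_score.append(candidates[best] + math.log(emit[s][ch]))
        score = new_score
        back.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def reestimate(seq, path, pseudocount=1.0):
    """Re-estimate emission probabilities from the current labeling."""
    counts = [{c: pseudocount for c in ALPHABET} for _ in STATES]
    for ch, s in zip(seq, path):
        counts[s][ch] += 1.0
    return [
        {c: counts[s][c] / sum(counts[s].values()) for c in ALPHABET}
        for s in STATES
    ]


def self_train(seq, max_iter=20):
    """Alternate labeling and re-estimation until the labeling is stable."""
    # Heuristic start (an assumption): the coding state is slightly GC-rich.
    emit = [
        {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
        {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    ]
    path = None
    for _ in range(max_iter):
        new_path = viterbi(seq, emit)
        if new_path == path:
            break
        path = new_path
        emit = reestimate(seq, path)
    return path, emit
```

Calling `self_train` on a sequence returns one state label per nucleotide together with the emission table learned from that sequence alone; no annotated training set is supplied at any point, which is the property the abstract emphasizes.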