ZCURVE: a new system for recognizing protein-coding genes in bacterial and archaeal genomes

15 March 2003

journal article
research article
Published by Oxford University Press (OUP) in Nucleic Acids Research

Vol. 31 (6) , 1780-1789
https://doi.org/10.1093/nar/gkg254

Abstract

A new system, ZCURVE 1.0, for finding protein-coding genes in bacterial and archaeal genomes has been proposed. The current algorithm, which is based on the Z curve representation of the DNA sequences, lays stress on the global statistical features of protein-coding genes by taking the frequencies of bases at three codon positions into account. In ZCURVE 1.0, since only 33 parameters are used to characterize the coding sequences, it gives better consideration to both typical and atypical cases, whereas in Markov-model-based methods, e.g. Glimmer 2.02, thousands of parameters are trained, which may result in less adaptability. To compare the performance of the new system with that of Glimmer 2.02, both systems were run, respectively, for 18 genomes not annotated by the Glimmer system. Comparisons were also performed for predicting some function-known genes by both systems. Consequently, the average accuracy of both systems is well matched; however, ZCURVE 1.0 has more accurate gene start prediction, lower additional prediction rate and higher accuracy for the prediction of horizontally transferred genes. It is shown that the joint applications of both systems greatly improve gene-finding results. For a typical genome, e.g. Escherichia coli, the system ZCURVE 1.0 takes ~2 min on a Pentium III 866 PC without any human intervention. The system ZCURVE 1.0 is freely available at: http://tubic.tju.edu.cn/Zcurve_B/.
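The abstract does not give formulas, so the following is only a minimal sketch of how base frequencies at the three codon positions can be turned into Z curve components. It assumes the standard Z curve transform (purine/pyrimidine, amino/keto and weak/strong hydrogen-bond contrasts) and covers only the nine single-base components; how the remaining parameters of the 33 are constructed is not stated here and is not reproduced. The function names `codon_position_frequencies` and `z_curve_components` are hypothetical and are not taken from the ZCURVE software.

```python
from collections import Counter

BASES = "ACGT"


def codon_position_frequencies(orf):
    """Return base frequencies at codon positions 1, 2 and 3.

    The sequence is assumed to be in frame (its first base is codon
    position 1). An incomplete trailing codon is ignored, and characters
    other than A, C, G, T are left out of the totals.
    """
    seq = orf.upper()
    usable = len(seq) - len(seq) % 3
    freqs = []
    for phase in range(3):
        counts = Counter(seq[phase:usable:3])
        total = sum(counts[b] for b in BASES)
        if total == 0:
            raise ValueError("sequence has no complete codon of A/C/G/T bases")
        freqs.append({b: counts[b] / total for b in BASES})
    return freqs


def z_curve_components(orf):
    """Return nine values: (x, y, z) for each of the three codon positions.

    x = (a + g) - (c + t)   purine vs. pyrimidine
    y = (a + c) - (g + t)   amino vs. keto
    z = (a + t) - (g + c)   weak vs. strong hydrogen bonding
    """
    features = []
    for f in codon_position_frequencies(orf):
        a, c, g, t = (f[b] for b in BASES)
        features.extend([
            (a + g) - (c + t),
            (a + c) - (g + t),
            (a + t) - (g + c),
        ])
    return features


if __name__ == "__main__":
    # Toy in-frame sequence, used only to show the shape of the output.
    print(z_curve_components("ATGGCCAAATTTGGGTAA"))
```

Each component lies in the range [-1, 1], so a candidate open reading frame maps to a short fixed-length vector regardless of its length. That fixed, small feature count is the property the abstract contrasts with the thousands of trained parameters in Markov-model-based methods.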
