ZCURVE: a new system for recognizing protein-coding genes in bacterial and archaeal genomes
- 15 March 2003
- journal article
- research article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 31 (6) , 1780-1789
- https://doi.org/10.1093/nar/gkg254
Abstract
A new system, ZCURVE 1.0, for finding protein-coding genes in bacterial and archaeal genomes has been proposed. The current algorithm, which is based on the Z curve representation of the DNA sequences, lays stress on the global statistical features of protein-coding genes by taking the frequencies of bases at three codon positions into account. In ZCURVE 1.0, since only 33 parameters are used to characterize the coding sequences, it gives better consideration to both typical and atypical cases, whereas in Markov-model-based methods, e.g. Glimmer 2.02, thousands of parameters are trained, which may result in less adaptability. To compare the performance of the new system with that of Glimmer 2.02, both systems were run, respectively, for 18 genomes not annotated by the Glimmer system. Comparisons were also performed for predicting some function-known genes by both systems. Consequently, the average accuracy of both systems is well matched; however, ZCURVE 1.0 has more accurate gene start prediction, lower additional prediction rate and higher accuracy for the prediction of horizontally transferred genes. It is shown that the joint applications of both systems greatly improve gene-finding results. For a typical genome, e.g. Escherichia coli, the system ZCURVE 1.0 takes approximately 2 min on a Pentium III 866 PC without any human intervention. The system ZCURVE 1.0 is freely available at: http://tubic.tju.edu.cn/Zcurve_B/.
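To make concrete the idea of characterizing a coding sequence by the base frequencies at its three codon positions, the following is a minimal Python sketch. It applies the standard Z curve transform (purine vs. pyrimidine, amino vs. keto, weak vs. strong hydrogen bonding) to the frequencies of A, C, G and T at each codon position, which yields nine numbers. The abstract does not spell out how the 33 parameters are composed, so this sketch should be read as covering only the single-base part of such a feature vector, not as the ZCURVE 1.0 implementation. The function name `codon_position_z_components` is hypothetical and chosen for illustration.

```python
from collections import Counter


def codon_position_z_components(orf: str) -> list[float]:
    """Return nine Z curve components: (x, y, z) for codon positions 1, 2 and 3.

    Illustrative sketch only; not the ZCURVE 1.0 code.
    """
    seq = orf.upper()
    if len(seq) == 0 or len(seq) % 3 != 0:
        raise ValueError("sequence length must be a positive multiple of 3")

    components: list[float] = []
    for pos in range(3):
        bases = seq[pos::3]          # every base at this codon position
        counts = Counter(bases)
        n = len(bases)
        # Frequencies of each base at this codon position
        a, c, g, t = (counts[b] / n for b in "ACGT")
        components.extend([
            (a + g) - (c + t),       # x: purines vs. pyrimidines
            (a + c) - (g + t),       # y: amino vs. keto bases
            (a + t) - (g + c),       # z: weak vs. strong hydrogen bonding
        ])
    return components
```

For example, `codon_position_z_components("ATGATGATG")` returns `[1.0, 1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]`, because position 1 is always A, position 2 always T and position 3 always G. A classifier would then separate coding from non-coding open reading frames in the space spanned by such components; that step is not shown here.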