Recognition of protein coding genes in the yeast genome at better than 95% accuracy based on the Z curve
Open Access
- 15 July 2000
- journal article
- research article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 28 (14) , 2804-2814
- https://doi.org/10.1093/nar/28.14.2804
Abstract
The Z curve is a three-dimensional space curve constituting the unique representation of a given DNA sequence in the sense that each can be uniquely reconstructed from the other. Based on the Z curve, a new protein coding gene-finding algorithm specific for the yeast genome at better than 95% accuracy has been proposed. Six cross-validation tests were performed to confirm the above accuracy. Using the new algorithm, the number of protein coding genes in the yeast genome is re-estimated. The estimate is based on the assumption that the unknown genes have similar statistical properties to the known genes. It is found that the number of protein coding genes in the 16 yeast chromosomes is ≤5645, significantly smaller than the 5800–6000 which is widely accepted, and much larger than the 4800 estimated by another group recently. The mitochondrial genes were not included in the above estimate. A codingness index called the YZ score (YZ ∈ [0,1]) is proposed to recognize protein coding genes in the yeast genome. Among the ORFs annotated in the MIPS (Munich Information Centre for Protein Sequences) database, those recognized as non-coding by the present algorithm are listed in this paper in detail. The criterion for a coding or non-coding ORF is simply decided by YZ > 0.5 or YZ < 0.5, respectively. The YZ scores for all the ORFs annotated in the MIPS database have been calculated and are available on request by sending email to the corresponding author.
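The one-to-one correspondence between a DNA sequence and its Z curve can be illustrated with a minimal Python sketch. The abstract does not give the coordinate definitions, so the sketch assumes the standard ones from the Z curve literature: x counts purines (A, G) against pyrimidines (C, T), y counts amino bases (A, C) against keto bases (G, T), and z counts weakly bonded bases (A, T) against strongly bonded ones (C, G). The function names `z_curve` and `sequence_from_z_curve` are hypothetical, and this is not the gene-finding algorithm of the paper, only the underlying sequence representation.

```python
# Per-base step (dx, dy, dz), assuming the standard Z-curve definition:
#   x: purine (A, G) minus pyrimidine (C, T)
#   y: amino (A, C) minus keto (G, T)
#   z: weak H-bond (A, T) minus strong H-bond (C, G)
STEPS = {
    "A": (1, 1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
    "T": (-1, -1, 1),
}
# Each base has a distinct step, so the mapping can be inverted.
BASE_OF_STEP = {step: base for base, step in STEPS.items()}


def z_curve(seq):
    """Return the cumulative (x, y, z) points of the Z curve for seq."""
    points = []
    x = y = z = 0
    for base in seq.upper():
        dx, dy, dz = STEPS[base]  # raises KeyError on non-ACGT symbols
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


def sequence_from_z_curve(points):
    """Reconstruct the DNA sequence from its Z-curve points."""
    bases = []
    previous = (0, 0, 0)
    for point in points:
        # The difference between consecutive points identifies the base.
        step = tuple(c - p for c, p in zip(point, previous))
        bases.append(BASE_OF_STEP[step])
        previous = point
    return "".join(bases)


if __name__ == "__main__":
    curve = z_curve("ATGC")
    print(curve)
    print(sequence_from_z_curve(curve))
```

Because every base moves the curve by a different step, the differences between consecutive points recover the sequence exactly, which is the sense in which each can be reconstructed from the other.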