Distinguishing protein-coding and noncoding genes in the human genome
- 4 December 2007
- journal article
- research article
- Published in Proceedings of the National Academy of Sciences
- Vol. 104 (49) , 19428-19433
- https://doi.org/10.1073/pnas.0709013104
Abstract
Although the Human Genome Project was completed 4 years ago, the catalog of human protein-coding genes remains a matter of controversy. Current catalogs list a total of ≈24,500 putative protein-coding genes. It is broadly suspected that a large fraction of these entries are functionally meaningless ORFs present by chance in RNA transcripts, because they show no evidence of evolutionary conservation with mouse or dog. However, there is currently no scientific justification for excluding ORFs simply because they fail to show evolutionary conservation: the alternative hypothesis is that most of these ORFs are actually valid human genes that reflect gene innovation in the primate lineage or gene loss in the other lineages. Here, we reject this hypothesis by carefully analyzing the nonconserved ORFs, specifically their properties in other primates. We show that the vast majority of these ORFs are random occurrences. The analysis yields, as a by-product, a major revision of the current human catalogs, cutting the number of protein-coding genes to ≈20,500. Specifically, it suggests that nonconserved ORFs should be added to the human gene catalog only if there is clear evidence of an encoded protein. It also provides a principled methodology for evaluating future proposed additions to the human gene catalog. Finally, the results indicate that there has been relatively little true innovation in mammalian protein-coding genes.