Genome annotation errors in pathway databases due to semantic ambiguity in partial EC numbers
Open Access
- 21 July 2005
- journal article
- research article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 33 (13), 4035-4039
- https://doi.org/10.1093/nar/gki711
Abstract
We report on a new type of systematic annotation error in genome and pathway databases that results from the misinterpretation of partial Enzyme Commission (EC) numbers such as '1.1.1.-'. This error results in the assignment of genes annotated with a partial EC number to many or all biochemical reactions that are annotated with the same partial EC number. That inference is faulty because of the ambiguous nature of partial EC numbers. We have observed this type of error in multiple databases, including KEGG, VIMSS and IMG, all of which assign genes to KEGG pathways. The Escherichia coli subset of the KEGG database exhibits this error for 6.8% of its gene-reaction assignments. For example, KEGG contains 17 reactions that are annotated with EC 1.1.1.-. A group of three E.coli genes, b1580 (putative dehydrogenase, NAD(P)-binding, starvation-sensing protein), b3787 (UDP-N-acetyl-D-mannosaminuronic acid dehydrogenase) and b0207 (2,5-diketo-D-gluconate reductase B), is assigned to 15 of those reactions, despite experimental evidence indicating different single functions for two of the three genes. Furthermore, the databases (DBs) are internally inconsistent in that the description of gene functions for genes with partial EC numbers is inconsistent with the activities implied by reactions to which the genes were assigned. We infer that these inconsistencies result from the processing used to match gene products to reactions within KEGG's metabolic pathways. These errors affect scientists who use these DBs as online encyclopedias and they affect bioinformaticists who use these DBs to train and validate newly developed algorithms.
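The faulty inference described in the abstract can be illustrated with a minimal Python sketch. It is not the matching procedure used by KEGG or any other database; the gene names, reaction identifiers and function names below are hypothetical, chosen only to show how treating a partial EC number as an ordinary identifier links one gene to every reaction that carries the same partial number, and how refusing to match on partial numbers avoids that.

```python
def is_partial(ec: str) -> bool:
    """Return True if any of the EC number's fields is the '-' placeholder."""
    return "-" in ec.split(".")


def naive_assign(gene_ec: dict, reaction_ec: dict) -> dict:
    """Match genes to reactions by plain string equality of EC numbers.

    This reproduces the error pattern: '1.1.1.-' is compared as if it
    named one specific activity.
    """
    return {
        gene: sorted(r for r, rec in reaction_ec.items() if rec == ec)
        for gene, ec in gene_ec.items()
    }


def strict_assign(gene_ec: dict, reaction_ec: dict) -> dict:
    """Match only on complete EC numbers; partial ones yield no assignment."""
    return {
        gene: [] if is_partial(ec)
        else sorted(r for r, rec in reaction_ec.items() if rec == ec)
        for gene, ec in gene_ec.items()
    }


# Hypothetical toy data (not taken from any database).
reaction_ec = {
    "rxn_1": "1.1.1.-",
    "rxn_2": "1.1.1.-",
    "rxn_3": "1.1.1.-",
    "rxn_4": "1.1.1.1",
}
gene_ec = {"gene_a": "1.1.1.-", "gene_c": "1.1.1.1"}

naive = naive_assign(gene_ec, reaction_ec)
strict = strict_assign(gene_ec, reaction_ec)

print(naive["gene_a"])   # gene_a is linked to every '1.1.1.-' reaction
print(strict["gene_a"])  # no assignment is made from a partial EC number
```

In this sketch a partial EC number is treated as "activity not fully classified" rather than as a shared label, so two entries that both carry '1.1.1.-' are not assumed to describe the same reaction.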