Applying negative rule mining to improve genome annotation

Open Access

21 July 2007

journal article
Published by Springer Nature in BMC Bioinformatics

Vol. 8 (1), 261
https://doi.org/10.1186/1471-2105-8-261

Abstract

Background: Unsupervised annotation of proteins by software pipelines suffers from very high error rates. Spurious functional assignments are usually caused by unwarranted homology-based transfer of information from existing database entries to the new target sequences. We have previously demonstrated that data mining in large sequence annotation databanks can help identify annotation items that are strongly associated with each other, and that exceptions from strong positive association rules often point to potential annotation errors. Here we investigate the applicability of negative association rule mining to revealing erroneously assigned annotation items.

Results: Almost all exceptions from strong negative association rules are connected to at least one wrong attribute in the feature combination making up the rule. The fraction of annotation features flagged by this approach as suspicious is strongly enriched in errors and constitutes about 0.6% of the whole body of the similarity-transferred annotation in the PEDANT genome database. Positive rule mining does not identify two thirds of these errors. The approach based on exceptions from negative rules is much more specific than positive rule mining, but its coverage is significantly lower.

Conclusion: Mining of both negative and positive association rules is a potent tool for finding significant trends in protein annotation and flagging doubtful features for further inspection.
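To make the idea of "exceptions from strong negative association rules" concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes that each database entry can be reduced to a set of annotation items, and it treats a rule "A implies not B" as strong when both items are frequent and nearly all entries carrying A lack B. The function name `negative_rules`, the thresholds `min_support` and `min_confidence`, and the toy feature names are all hypothetical choices made for illustration; the abstract does not specify how the rules were scored.

```python
from itertools import permutations


def negative_rules(entries, min_support=5, min_confidence=0.9):
    """Find rules "A => not B" that hold for most, but not all, entries with A.

    entries maps an entry ID to its set of annotation items.
    Returns a list of (a, b, confidence, exception_ids) tuples.
    Thresholds are illustrative assumptions, not values from the paper.
    """
    items = sorted(set().union(*entries.values()))
    count = {item: sum(item in feats for feats in entries.values()) for item in items}
    rules = []
    for a, b in permutations(items, 2):
        # Both items must be frequent; otherwise their non-co-occurrence
        # says little about whether they exclude each other.
        if count[a] < min_support or count[b] < min_support:
            continue
        # Exceptions are the entries that carry both items despite the rule.
        exceptions = sorted(
            pid for pid, feats in entries.items() if a in feats and b in feats
        )
        confidence = 1 - len(exceptions) / count[a]
        if confidence >= min_confidence and exceptions:
            rules.append((a, b, confidence, exceptions))
    return rules


# Toy data (invented): 19 entries carry only "kinase", 19 carry only
# "ribosomal", and a single entry carries both.
entries = {f"k{i}": {"kinase"} for i in range(19)}
entries.update({f"r{i}": {"ribosomal"} for i in range(19)})
entries["suspect"] = {"kinase", "ribosomal"}

rules = negative_rules(entries)
flagged = {pid for _, _, _, exceptions in rules for pid in exceptions}
print(flagged)
```

In this toy example the single entry that carries both items is the only one flagged for inspection. The exhaustive pairwise scan is chosen for readability; a real annotation databank would presumably need a proper frequent-itemset miner and rules over larger feature combinations.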
