Automating document classification for the Immune Epitope Database

Open Access

26 July 2007

journal article
Published by Springer Nature in BMC Bioinformatics

Vol. 8 (1) , 269
https://doi.org/10.1186/1471-2105-8-269

Abstract

Background: The Immune Epitope Database contains information on immune epitopes curated manually from the scientific literature. As in similar projects in other knowledge domains, significant effort is spent on identifying which articles are relevant for this purpose.

Results: Here we report our experience in automating this process using Naïve Bayes classifiers trained on 20,910 abstracts classified by domain experts. We improved on the basic classifier performance by a) utilizing information stored in PubMed beyond the abstract itself, b) applying standard feature selection criteria, and c) extracting domain-specific feature patterns that, for example, identify peptide sequences. We have integrated the classifier into the curation process, where it determines whether an abstract is clearly relevant, clearly irrelevant, or cannot be classified with certainty, in which case the abstract is classified manually. Testing this classification scheme on an independent dataset, we achieve 95% sensitivity and specificity in the 51.1% of abstracts that were automatically classified.

Conclusion: By implementing text classification, we have sped up the reference selection process without sacrificing the sensitivity or specificity of the human expert classification. This study provides both practical recommendations for users of text classification tools and a large dataset that can serve as a benchmark for tool developers.
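The three-way decision described in the abstract (clearly relevant, clearly irrelevant, or left for manual review) can be illustrated with a minimal sketch. It is not the authors' implementation: the peptide pattern, the `__PEPTIDE__` token, the Laplace smoothing, and the log-odds cutoffs of ±2.0 are all assumptions chosen for illustration, and the training sentences are invented toy data rather than IEDB abstracts.

```python
import math
import re
from collections import Counter

# Assumed pattern: a run of 8 or more one-letter amino acid codes.
PEPTIDE_RE = re.compile(r"\b[ACDEFGHIKLMNPQRSTVWY]{8,}\b")


def tokenize(text):
    """Lowercase word tokens, plus one synthetic token if a peptide-like string occurs."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if PEPTIDE_RE.search(text):
        tokens.append("__PEPTIDE__")
    return tokens


class NaiveBayes:
    """Two-class multinomial Naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        # labels are booleans: True = relevant, False = irrelevant
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = Counter(labels)
        for doc, label in zip(docs, labels):
            self.counts[label].update(tokenize(doc))
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        self.totals = {c: sum(self.counts[c].values()) for c in (True, False)}
        return self

    def log_odds(self, doc):
        """Log of P(relevant | doc) / P(irrelevant | doc); unseen tokens are ignored."""
        score = math.log(self.n_docs[True] / self.n_docs[False])
        v = len(self.vocab)
        for tok in tokenize(doc):
            if tok not in self.vocab:
                continue
            p_rel = (self.counts[True][tok] + self.alpha) / (self.totals[True] + self.alpha * v)
            p_irr = (self.counts[False][tok] + self.alpha) / (self.totals[False] + self.alpha * v)
            score += math.log(p_rel / p_irr)
        return score


def triage(model, doc, lower=-2.0, upper=2.0):
    """Auto-classify only when the score is decisive; the cutoffs are hypothetical."""
    score = model.log_odds(doc)
    if score >= upper:
        return "relevant"
    if score <= lower:
        return "irrelevant"
    return "manual"


# Invented toy training data, not taken from the study.
train_docs = [
    "T cell epitope SIINFEKL binds MHC class I",
    "antibody epitope mapping of viral peptide GILGFVFTL",
    "crystal structure of a bacterial enzyme",
    "clinical trial of a statin in heart disease",
]
train_labels = [True, True, False, False]
model = NaiveBayes().fit(train_docs, train_labels)

for text in ["epitope peptide SIINFEKL", "clinical trial of a statin", "zebra"]:
    print(text, "->", triage(model, text))
```

Widening the gap between `lower` and `upper` sends more abstracts to manual review and should make the automatic decisions more reliable; narrowing it does the opposite. In practice the cutoffs would be tuned on held-out expert-labelled data.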
