Automatic document classification of biological literature

Open Access

7 August 2006

journal article
Published by Springer Nature in BMC Bioinformatics

Vol. 7 (1) , 370
https://doi.org/10.1186/1471-2105-7-370

Abstract

Background: Document classification is a widespread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature.

Results: We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first applies a classifier trained with a support vector machine, then a novel, phrase-based clustering algorithm. The clustering step autonomously creates cluster labels that are descriptive and understandable by humans. On a standard test set (Reuters-21578), this clustering engine outperformed previously published results (F-value of 0.55 vs. 0.49) while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.

Conclusion: We have demonstrated a simple method to classify biological documents that improves on current methods. Although the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
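The abstract compares clustering quality by F-value (0.55 vs. 0.49) but does not say how that value is computed. The sketch below shows one definition that is common in the document-clustering literature: each reference class is matched to the cluster that gives it the highest F-score, and those best scores are averaged with weights proportional to class size. That this is the measure used in the paper is an assumption, and the function name clustering_f_measure and its arguments are hypothetical.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Class-size-weighted best-match F-measure for a flat clustering.

    Assumed definition (not confirmed by the abstract): for every
    reference class, take the best F-score over all clusters, then
    average these scores weighted by class size.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("inputs must have the same length")
    n = len(true_labels)
    if n == 0:
        raise ValueError("inputs must be non-empty")

    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    # Number of documents shared by each (class, cluster) pair.
    overlap = Counter(zip(true_labels, cluster_ids))

    total = 0.0
    for cls, cls_size in class_sizes.items():
        best = 0.0
        for clu, clu_size in cluster_sizes.items():
            shared = overlap[(cls, clu)]
            if shared == 0:
                continue
            precision = shared / clu_size  # fraction of the cluster in this class
            recall = shared / cls_size     # fraction of the class in this cluster
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (cls_size / n) * best
    return total


if __name__ == "__main__":
    # Toy data: four documents, two reference classes, two clusters.
    print(clustering_f_measure(["a", "a", "b", "b"], [0, 0, 0, 1]))
```

A perfect clustering scores 1.0 under this definition, and splitting or merging classes lowers the score, so a higher value means the clusters agree more closely with the reference categories.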
