Automatic document classification of biological literature
Open Access
- 7 August 2006
- journal article
- Published by Springer Nature in BMC Bioinformatics
- Vol. 7 (1) , 370
- https://doi.org/10.1186/1471-2105-7-370
Abstract
Background: Document classification is a widespread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature.
Results: We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first uses a support vector machine-trained classifier, followed by a novel, phrase-based clustering algorithm. This clustering step autonomously creates cluster labels that are descriptive and understandable by humans. This clustering engine performed better on a standard test set (Reuters 21578) than previously published results (F-value of 0.55 vs. 0.49), while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
Conclusion: We have demonstrated a simple method to classify biological documents that improves on current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
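The abstract compares clustering quality by F-value but does not say how that value is computed. As a rough illustration only, the sketch below assumes one common definition for clustering: the harmonic mean of precision and recall, taken for the best-matching cluster of each true class and averaged by class size. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def f_value(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def clustering_f_value(true_labels, cluster_ids):
    """Class-size-weighted best-match F-value of a clustering.

    Assumed definition (not confirmed by the abstract): for each true
    class, take the best F-value over all clusters, then average those
    best scores weighted by the size of each class.
    """
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    # Number of documents shared by each (class, cluster) pair.
    overlap = Counter(zip(true_labels, cluster_ids))

    total = 0.0
    for cls, cls_size in class_sizes.items():
        best = 0.0
        for clu, clu_size in cluster_sizes.items():
            shared = overlap[(cls, clu)]
            precision = shared / clu_size  # share of the cluster in this class
            recall = shared / cls_size     # share of the class in this cluster
            best = max(best, f_value(precision, recall))
        total += cls_size / n * best
    return total


# Hypothetical toy data: six documents, two true topics, two clusters.
true_labels = ["gene", "gene", "gene", "cell", "cell", "cell"]
cluster_ids = [0, 0, 1, 1, 1, 1]
score = clustering_f_value(true_labels, cluster_ids)
```

Under this assumed definition, a clustering that reproduces the true classes exactly scores 1.0, so the reported values of 0.55 and 0.49 would both indicate substantial disagreement with the reference topics.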