Automatic assignment of biomedical categories: toward a generic approach
Open Access
- 15 November 2005
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 22 (6) , 658-664
- https://doi.org/10.1093/bioinformatics/bti783
Abstract
Motivation: We report on the development of a generic text categorization system designed to automatically assign biomedical categories to any input text. Unlike usual automatic text categorization systems, which rely on data-intensive models extracted from large sets of training data, our categorizer is largely data-independent.
Methods: To evaluate the robustness of our approach, we test the system on two different biomedical terminologies: the Medical Subject Headings (MeSH) and the Gene Ontology (GO). Our lightweight categorizer, based on two ranking modules, combines a pattern matcher and a vector space retrieval engine, and uses both stems and linguistically motivated indexing units.
Results and Conclusion: Results show the effectiveness of phrase indexing for both GO and MeSH categorization, but we observe that the categorization power of the tool depends on the controlled vocabulary: precision at high ranks ranges from above 90% for MeSH to [...]
Contact: Patrick.Ruch@sim.hcuge.ch
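The abstract describes a categorizer that merges two ranking modules, a pattern matcher and a vector space retrieval engine. The following is a minimal sketch of that general idea, not the authors' implementation: the function names (`rank_categories`, `tfidf_vectors`), the smoothed tf-idf weighting, the linear `weight` parameter, and the toy category list are all assumptions made for illustration, and the stemming and phrase-based indexing units mentioned in the abstract are left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return tf-idf vectors (term -> weight) for token lists, plus the idf table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf (an assumption; the paper's weighting scheme is not given here).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_categories(text, categories, weight=0.5):
    """Rank category terms by a weighted sum of two scores.

    pattern score: 1.0 if the whole term occurs as a contiguous phrase in the text.
    vector score:  cosine similarity between the text and the term in tf-idf space.
    `weight` (hypothetical) is the share given to the pattern score.
    """
    tokens = tokenize(text)
    normalized = " ".join(tokens)
    cat_tokens = [tokenize(c) for c in categories]
    vectors, idf = tfidf_vectors(cat_tokens)
    # Words that appear in no category term get zero weight in the query vector.
    query = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokens).items()}

    scored = []
    for cat, ctoks, vec in zip(categories, cat_tokens, vectors):
        phrase = " ".join(ctoks)
        found = phrase and re.search(r"\b" + re.escape(phrase) + r"\b", normalized)
        pattern_score = 1.0 if found else 0.0
        vector_score = cosine(query, vec)
        scored.append((weight * pattern_score + (1 - weight) * vector_score, cat))

    # Highest combined score first; ties are broken alphabetically.
    return [cat for _, cat in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


if __name__ == "__main__":
    toy_categories = ["protein kinase activity", "kinase binding", "nucleus", "lipid transport"]
    sentence = "The protein shows protein kinase activity in the cell nucleus."
    for category in rank_categories(sentence, toy_categories):
        print(category)
```

In this sketch, an exact phrase match lifts a category above terms that only share individual words with the input, while the vector space score still orders the categories that do not match literally.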