Combining NLP and probabilistic categorisation for document and term selection for Swiss-Prot medical annotation
Open Access
- 3 July 2003
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 19 (suppl_1) , i91-i94
- https://doi.org/10.1093/bioinformatics/btg1011
Abstract
Motivation: Searching for relevant publications for manual database annotation is a tedious task. In this paper, we apply a combination of Natural Language Processing (NLP) and probabilistic classification to re-rank documents returned by PubMed according to their relevance to Swiss-Prot annotation, and to identify significant terms in the documents.

Results: With a Probabilistic Latent Categoriser (PLC) we obtained 69% recall and 59% precision for relevant documents in a representative query. As the PLC technique provides the relative contribution of each term to the final document score, we used the symmetric Kullback-Leibler divergence to determine the most discriminating words for Swiss-Prot medical annotation. This information should allow curators to understand classification results better. It is also valuable for fine-tuning the linguistic pre-processing of documents, which in turn can improve the overall classifier performance.

Availability: The medical annotation dataset is available from the authors upon request.

Contact: Pavel.Dobrokhotov@isb-sib.ch; Cyril.Goutte@xrce.xerox.com
*To whom correspondence should be addressed.
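
The use of the symmetric Kullback-Leibler divergence to rank discriminating words can be illustrated with a minimal sketch. It is not the authors' PLC implementation, whose details the abstract does not give. It assumes two smoothed unigram term distributions, one estimated from relevant documents and one from irrelevant documents, and scores each term by its additive share of the symmetric divergence, (p_w - q_w) * log(p_w / q_w). The toy documents, the add-one smoothing, and the function names are all hypothetical.

```python
import math
from collections import Counter


def term_distribution(docs, vocab, alpha=1.0):
    """Smoothed unigram distribution over vocab (add-alpha smoothing is an assumption)."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def symmetric_kl_contributions(p, q):
    """Per-term share of KL(p||q) + KL(q||p); each share is non-negative."""
    return {w: (p[w] - q[w]) * math.log(p[w] / q[w]) for w in p}


# Hypothetical tokenised documents, invented for illustration only.
relevant = [["mutation", "disease", "patient"],
            ["mutation", "variant", "disease"]]
irrelevant = [["structure", "binding", "crystal"],
              ["binding", "expression", "patient"]]

vocab = sorted({tok for doc in relevant + irrelevant for tok in doc})
p = term_distribution(relevant, vocab)
q = term_distribution(irrelevant, vocab)

contrib = symmetric_kl_contributions(p, q)
ranked = sorted(contrib, key=contrib.get, reverse=True)

for term in ranked:
    print(f"{term:12s} {contrib[term]:.4f}")
```

Terms that are much more frequent in one class than in the other rise to the top of the ranking, while a term that is equally frequent in both classes (here "patient") contributes nothing. A curator could read such a list as a rough explanation of why a document was scored as relevant.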