BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature

Open Access

3 December 2009

journal article
Published by Springer Nature in BMC Bioinformatics

Vol. 10 (S15), S7
https://doi.org/10.1186/1471-2105-10-s15-s7

Abstract

Background: Text mining tools are becoming essential for automatically processing large quantities of biological literature for knowledge discovery and information curation. Abbreviation recognition is related to named entity recognition (NER) and can be viewed as the task of recognizing pairs of a term and its corresponding abbreviation in free text. Successfully identifying an abbreviation and its corresponding definition is not only a prerequisite for indexing the terms of text databases to return articles of related interest, but also a building block for improving existing gene mention tagging and gene normalization tools.

Results: Our approach to abbreviation recognition (AR) is based on machine learning and exploits a novel set of rich features to learn rules from training data. Tested on the AB3P corpus, our system achieved an F-score of 89.90% with 95.86% precision at 84.64% recall, higher than the result of the best-performing existing AR system. We also annotated a new corpus of 1200 PubMed abstracts derived from the BioCreative II gene normalization corpus. On this corpus, our system achieved an F-score of 86.20% with 93.52% precision at 79.95% recall, which also outperforms all tested systems.

Conclusion: By applying our system to extract all short form-long form pairs from all available PubMed abstracts, we have constructed BIOADI. Mining BIOADI reveals many interesting trends in biomedical research. We also provide offline AR software in the download section at http://bioagent.iis.sinica.edu.tw/BIOADI/.
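The F-scores reported above are consistent with the balanced F-measure, the harmonic mean of precision and recall. The abstract does not name the exact variant, so the balanced form is an assumption here. A minimal Python sketch of that arithmetic follows; the function name `f_score` and the `reported` table are illustrative and not part of the BIOADI software.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-measure (harmonic mean of precision and recall).

    Inputs and output are percentages, matching the abstract's notation.
    Assumption: the paper reports the balanced (beta = 1) F-score.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs quoted in the abstract, keyed by corpus.
reported = {
    "AB3P corpus": (95.86, 84.64),
    "BioCreative II-derived corpus": (93.52, 79.95),
}

for corpus, (p, r) in reported.items():
    print(f"{corpus}: F = {f_score(p, r):.2f}%")
# AB3P corpus: F = 89.90%
# BioCreative II-derived corpus: F = 86.20%
```

Both computed values match the figures quoted in the abstract to two decimal places, and in each case the F-score sits closer to the lower recall than to the higher precision.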
