A scalable machine-learning approach to recognize chemical names within large text databases

Open Access

26 September 2006

journal article
Published by Springer Nature in BMC Bioinformatics

Vol. 7 (2), S3
https://doi.org/10.1186/1471-2105-7-S2-S3

Abstract

The use or study of chemical compounds permeates almost every scientific field, and in each of them the amount of textual information is growing rapidly. Chemical names need to be identified accurately within text for a number of informatics efforts, such as database curation, report summarization, tagging of named entities and keywords, and the development or curation of reference databases. A first-order Markov Model (MM) was evaluated for its ability to distinguish chemical names from words, yielding approximately 93% recall in recognizing chemical terms and approximately 99% precision in rejecting non-chemical terms on smaller test sets. However, because the total number of false-positive events increases with the number of words analyzed, the scalability of name recognition was measured by processing 13.1 million MEDLINE records. Precision ranged from 54.7% to 100%, depending upon the cutoff score used, and averaged 82.7% for the approximately 1.05 million putative chemical terms extracted. The extracted chemical terms were analyzed to estimate the number of spelling variants per term, which correlated with the total number of times the chemical name appeared in MEDLINE. This variability in term construction was found to affect both information retrieval and term mapping when using PubMed and Ovid.
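
As an illustration of the general idea, a first-order Markov model over characters can score a word by comparing how likely its letter-to-letter transitions are under a model trained on chemical names versus one trained on ordinary words. The sketch below is a minimal Python version under stated assumptions: the abstract does not specify the training sets, smoothing scheme, or score normalization, so the add-one smoothing, the per-transition log-odds score, the toy word lists, and all names (`CharMarkovModel`, `chemical_score`, `is_chemical`, `cutoff`) are hypothetical and not the authors' implementation.

```python
import math
from collections import Counter, defaultdict

START, END = "^", "$"  # hypothetical word-boundary symbols


class CharMarkovModel:
    """First-order Markov model over characters with add-one smoothing."""

    def __init__(self, words, alphabet):
        # Possible next symbols: every letter in the alphabet plus END.
        self.n_next = len(set(alphabet)) + 1
        self.counts = defaultdict(Counter)
        for word in words:
            symbols = [START, *word.lower(), END]
            for prev, nxt in zip(symbols, symbols[1:]):
                self.counts[prev][nxt] += 1

    def log_prob(self, word):
        """Sum of smoothed log transition probabilities for the word."""
        symbols = [START, *word.lower(), END]
        total = 0.0
        for prev, nxt in zip(symbols, symbols[1:]):
            row = self.counts.get(prev, Counter())
            total += math.log((row[nxt] + 1) / (sum(row.values()) + self.n_next))
        return total


def chemical_score(word, chem_model, text_model):
    """Per-transition log-odds of the chemical model versus the text model."""
    n_transitions = len(word) + 1
    return (chem_model.log_prob(word) - text_model.log_prob(word)) / n_transitions


def is_chemical(word, chem_model, text_model, cutoff=0.0):
    """Accept the word as a putative chemical term if its score exceeds the cutoff."""
    return chemical_score(word, chem_model, text_model) > cutoff


# Toy training data (assumed for illustration only; far too small to be reliable).
chem_words = ["methanol", "ethanol", "benzene", "toluene", "chloroform", "cyclohexane"]
text_words = ["the", "patient", "study", "results", "treatment", "analysis"]
alphabet = set("".join(chem_words + text_words))

chem_model = CharMarkovModel(chem_words, alphabet)
text_model = CharMarkovModel(text_words, alphabet)

for word in ["propanol", "patients"]:
    print(word, round(chemical_score(word, chem_model, text_model), 3))
```

Raising `cutoff` accepts fewer words, which trades recall for precision; this is the sense in which precision depends on the cutoff score. With training lists this small, individual scores are not meaningful, and short common words can easily be misclassified.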
