Automatically annotating documents with normalized gene lists

Open Access

24 May 2005

journal article
research article
Published by Springer Nature in BMC Bioinformatics

Vol. 6 (S1), S13
https://doi.org/10.1186/1471-2105-6-s1-s13

Abstract

Background

Document gene normalization is the problem of creating a list of unique identifiers for the genes mentioned within a document. Automating this process has many potential applications in both information extraction and database curation systems. Here we present two separate solutions to this problem. The first is based primarily on standard pattern matching and information extraction techniques. The second, more novel solution uses a statistical classifier to recognize valid gene matches from a list of known gene synonyms.

Results

We compare the results of the two systems, analyze their merits, and argue that the classification-based system is preferable for many reasons, including performance, simplicity, and robustness. Our best systems attain balanced precision and recall in the range of 74%–92%, depending on the organism.
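To make the second approach concrete, the following is a minimal sketch, not the authors' system. It assumes a toy synonym lexicon with made-up identifiers (`GENE-1` and so on), and it uses a hand-written rule, `toy_accept`, as a stand-in for the trained statistical classifier. All function names and the example sentence are hypothetical. It shows the general shape of the pipeline: look up known synonyms in the text, filter the candidate matches, emit the unique identifiers, and score the list against a gold standard.

```python
import re
from typing import Callable, Dict, List, NamedTuple, Set, Tuple


class Candidate(NamedTuple):
    synonym: str
    gene_id: str
    start: int
    end: int


# Toy lexicon (assumption): lowercase synonym -> made-up gene identifiers.
LEXICON: Dict[str, Set[str]] = {
    "wingless": {"GENE-1"},
    "wg": {"GENE-1"},
    "hedgehog": {"GENE-2"},
    "we": {"GENE-3"},  # synonym that collides with an ordinary English word
}


def find_candidates(text: str, lexicon: Dict[str, Set[str]]) -> List[Candidate]:
    """Return every whole-word, case-insensitive lexicon hit in the text."""
    candidates = []
    for synonym, gene_ids in lexicon.items():
        pattern = r"\b" + re.escape(synonym) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            for gene_id in sorted(gene_ids):
                candidates.append(Candidate(synonym, gene_id, m.start(), m.end()))
    return candidates


def toy_accept(text: str, cand: Candidate) -> bool:
    """Hand-written stand-in for a trained classifier (not a learned model)."""
    if len(cand.synonym) >= 4:
        return True
    # Short synonyms are kept only when a cue word follows the match.
    following = text[cand.end:cand.end + 12].lower().lstrip()
    return following.startswith(("gene", "protein", "mutant"))


def normalize(
    text: str,
    lexicon: Dict[str, Set[str]],
    accept: Callable[[str, Candidate], bool],
) -> List[str]:
    """Return the sorted list of unique identifiers for accepted matches."""
    kept = {c.gene_id for c in find_candidates(text, lexicon) if accept(text, c)}
    return sorted(kept)


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Score a predicted identifier list against a gold-standard list."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


DOC = "We show that the wg gene interacts with Hedgehog signalling."
```

Calling `normalize(DOC, LEXICON, toy_accept)` keeps the matches for "wg" and "Hedgehog" and drops the spurious match on the sentence-initial "We". Accepting every candidate instead would add `GENE-3` to the list, which is the kind of false positive the filtering step is meant to remove. In a real system the rule would be replaced by a classifier trained on labeled matches.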
