Gene symbol disambiguation using knowledge-based profiles

Open Access

21 February 2007

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 23 (8) , 1015-1022
https://doi.org/10.1093/bioinformatics/btm056

Abstract

Motivation: The ambiguity of biomedical entities, particularly of gene symbols, is a major challenge for text-mining systems in the biomedical domain. Existing knowledge sources, such as Entrez Gene and the MEDLINE database, contain information about the characteristics of a particular gene that could be used to disambiguate gene symbols.

Results: For each gene, we create a profile containing different types of information automatically extracted from related MEDLINE abstracts and readily available annotated knowledge sources. We apply the gene profiles to the disambiguation task via an information retrieval method, which ranks candidate gene profiles by their similarity to the context in which the ambiguous gene symbol is mentioned. The gene profile with the highest similarity score is then chosen as the correct sense. We evaluated the method on three automatically generated test sets, one each for mouse, fly and yeast. The highest precision achieved was 93.9% for mouse, 77.8% for fly and 89.5% for yeast.

Availability: The testing data sets and disambiguation programs are available at http://www.dbmi.columbia.edu/~hux7002/gsd2006

Contact: friedman@dbmi.columbia.edu
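The ranking step described in the Results can be illustrated with a minimal sketch. The abstract does not specify the term weighting or similarity measure, so this example assumes plain bag-of-words term counts and cosine similarity. The function names (`tokenize`, `cosine`, `disambiguate`), the gene identifiers and the toy profile texts are all hypothetical and are not taken from the paper or its released programs.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(context, profiles):
    """Rank candidate gene profiles by similarity to the context.

    Returns the best-scoring gene id and the full ranking as a list of
    (gene_id, score) pairs, highest score first.
    """
    if not profiles:
        raise ValueError("at least one candidate profile is required")
    ctx = Counter(tokenize(context))
    ranked = sorted(
        ((gene_id, cosine(ctx, Counter(tokenize(text))))
         for gene_id, text in profiles.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[0][0], ranked


# Hypothetical toy profiles for two genes that share one ambiguous symbol.
profiles = {
    "gene_A": "kinase phosphorylation cell cycle mitosis",
    "gene_B": "membrane transport ion channel neuron",
}
context = ("The symbol was reported to regulate mitosis through "
           "phosphorylation of cell cycle proteins.")

best, ranked = disambiguate(context, profiles)
print(best)
for gene_id, score in ranked:
    print(gene_id, round(score, 3))
```

In a realistic setting the profile text would come from the knowledge sources named in the abstract rather than from hand-written strings, and a weighting scheme such as TF-IDF would probably replace raw counts. Both choices are assumptions of this sketch.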
