Gene symbol disambiguation using knowledge-based profiles
Open Access
- 21 February 2007
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 23 (8) , 1015-1022
- https://doi.org/10.1093/bioinformatics/btm056
Abstract
Motivation: The ambiguity of biomedical entities, particularly of gene symbols, is a big challenge for text-mining systems in the biomedical domain. Existing knowledge sources, such as Entrez Gene and the MEDLINE database, contain information concerning the characteristics of a particular gene that could be used to disambiguate gene symbols.
Results: For each gene, we create a profile with different types of information automatically extracted from related MEDLINE abstracts and readily available annotated knowledge sources. We apply the gene profiles to the disambiguation task via an information retrieval method, which ranks the similarity scores between the context where the ambiguous gene is mentioned and the candidate gene profiles. The gene profile with the highest similarity score is then chosen as the correct sense. We evaluated the method on three automatically generated testing sets of mouse, fly and yeast organisms, respectively. The method achieved the highest precision of 93.9% for the mouse, 77.8% for the fly and 89.5% for the yeast.
Availability: The testing data sets and disambiguation programs are available at http://www.dbmi.columbia.edu/~hux7002/gsd2006
Contact: friedman@dbmi.columbia.edu
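The ranking step described in the Results can be illustrated with a minimal sketch. It assumes raw term-frequency vectors and cosine similarity, which the abstract does not specify; the authors' actual profile features, term weighting and similarity measure may differ. The function names (`tokenize`, `cosine`, `rank_profiles`), the candidate identifiers and the profile texts are all hypothetical and serve only as an illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(vec_a, vec_b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * vec_b[term] for term, count in vec_a.items() if term in vec_b)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_profiles(context, profiles):
    """Return (gene_id, score) pairs sorted by similarity to the context.

    `profiles` maps a candidate gene identifier to its profile text.
    The first element of the result is the chosen sense.
    """
    context_vec = Counter(tokenize(context))
    scored = [
        (gene_id, cosine(context_vec, Counter(tokenize(profile_text))))
        for gene_id, profile_text in profiles.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical profiles for two genes that share one ambiguous symbol.
profiles = {
    "candidate_1": "kinase phosphorylation signal transduction cell cycle",
    "candidate_2": "retina photoreceptor eye development vision",
}
context = "The mutant showed defects in photoreceptor differentiation in the developing eye."

ranked = rank_profiles(context, profiles)
best_gene, best_score = ranked[0]
print(best_gene, round(best_score, 3))
```

In this toy input the context shares the terms "photoreceptor" and "eye" with the second profile and no terms with the first, so `candidate_2` is ranked first. A realistic system would build profiles from MEDLINE abstracts and annotated sources, as the abstract describes, and would likely apply stemming and a weighting scheme rather than raw counts.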