Quantitative Assessment of Dictionary-based Protein Named Entity Tagging
Open Access
- 1 September 2006
- journal article
- Published by Oxford University Press (OUP) in Journal of the American Medical Informatics Association
- Vol. 13 (5), 497-507
- https://doi.org/10.1197/jamia.m2085
Abstract
Objective: Natural language processing (NLP) approaches have been explored to manage and mine information recorded in biological literature. A critical step in biological literature mining is biological named entity tagging (BNET), which identifies names mentioned in text and normalizes them to entries in biological databases. The aim of this study was to provide a quantitative assessment of the complexity of BNET on protein entities through BioThesaurus, a thesaurus of gene/protein names for UniProt knowledgebase (UniProtKB) entries that was acquired using online resources.

Methods: We evaluated the complexity from several perspectives: ambiguity (i.e., the number of genes/proteins represented by one name), synonymy (i.e., the number of names associated with the same gene/protein), and coverage (i.e., the percentage of gene/protein names in text included in the thesaurus). We also normalized names in BioThesaurus and obtained each measure twice, once before normalization and once after.

Results: The current version of BioThesaurus has over 2.6 million names, or 2.1 million normalized names, covering more than 1.8 million UniProtKB entries. The average synonymy is 3.53 (2.86 after normalization), the average ambiguity is 2.31 before normalization and 2.32 after, and the coverage is 94.0% based on the BioCreAtIvE data set comprising MEDLINE abstracts containing genes/proteins.

Conclusion: The study indicated that names for genes/proteins are highly ambiguous and that there are usually multiple names for the same gene or protein. It also demonstrated that most gene/protein names appearing in text can be found in BioThesaurus.
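The three measures defined in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy name/entry pairs, the entry identifiers, the `normalize` rule (lowercasing and dropping non-alphanumeric characters), and the choice to average ambiguity over names and synonymy over entries. The abstract does not specify the normalization rules or the exact averaging used in the study, so this is not the authors' procedure.

```python
from collections import defaultdict

# Hypothetical toy thesaurus as (name, entry_id) pairs; not BioThesaurus data.
PAIRS = [
    ("CLIM1", "E1"),
    ("CLIM-1", "E1"),
    ("clim1", "E2"),
    ("LDB2", "E2"),
    ("p53", "E3"),
]


def normalize(name):
    """Assumed normalization: lowercase and drop non-alphanumeric characters."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def average_ambiguity(pairs):
    """Mean number of distinct entries that each name refers to."""
    entries_by_name = defaultdict(set)
    for name, entry in pairs:
        entries_by_name[name].add(entry)
    return sum(len(e) for e in entries_by_name.values()) / len(entries_by_name)


def average_synonymy(pairs):
    """Mean number of distinct names attached to each entry."""
    names_by_entry = defaultdict(set)
    for name, entry in pairs:
        names_by_entry[entry].add(name)
    return sum(len(n) for n in names_by_entry.values()) / len(names_by_entry)


def coverage(text_names, pairs):
    """Fraction of names found in text that also appear in the thesaurus."""
    known = {name for name, _ in pairs}
    return sum(1 for n in text_names if n in known) / len(text_names)


# Measure twice: once on raw names, once on normalized names.
normalized_pairs = [(normalize(name), entry) for name, entry in PAIRS]
print("ambiguity (raw, normalized):",
      average_ambiguity(PAIRS), average_ambiguity(normalized_pairs))
print("synonymy (raw, normalized):",
      average_synonymy(PAIRS), average_synonymy(normalized_pairs))
print("coverage:", coverage(["CLIM1", "p53", "XYZ9"], PAIRS))
```

On this toy data, normalization merges "CLIM1", "CLIM-1", and "clim1" into one name, so average synonymy drops while average ambiguity rises. The abstract reports the same direction of change for BioThesaurus, though with a far smaller rise in ambiguity.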