NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition

Open Access

18 December 2006

journal article
research article
Published by Springer Nature in BMC Bioinformatics

Vol. 7 (S5), S11
https://doi.org/10.1186/1471-2105-7-s5-s11

Abstract

Biomedical named entity recognition (Bio-NER) is a challenging problem because biomedical named entities of the same category (e.g., proteins and genes) generally do not follow one standard nomenclature. They have many irregularities and sometimes appear in ambiguous contexts. In recent years, machine-learning (ML) approaches have become increasingly common and now represent the cutting edge of Bio-NER technology. This paper addresses three problems faced by ML-based Bio-NER systems. First, most ML approaches employ singleton features that comprise one linguistic property (e.g., the current word is capitalized) and at least one class tag (e.g., B-protein, the beginning of a protein name). However, such features may be insufficient in cases where multiple properties must be considered. Adding conjunction features that contain multiple properties can be beneficial, but it would be infeasible to include all conjunction features in an NER model since memory resources are limited and some features are ineffective. To resolve this problem, we use a sequential forward search algorithm to select an effective set of features. Second, variations in the numerical parts of biomedical terms (e.g., "2" in the biomedical term IL2) cause data sparseness and generate many redundant features. To address this, we apply numerical normalization, which replaces all numerals in a term with one representative numeral to help classify named entities. Third, the assignment of NE tags does not depend solely on the target word's closest neighbors, but may depend on words outside the context window (e.g., a context window of five consists of the current word plus two preceding and two subsequent words). We use global patterns generated by the Smith-Waterman local alignment algorithm to identify such structures and modify the results of our ML-based tagger. We call this pattern-based post-processing.
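The numerical normalization step can be illustrated with a minimal Python sketch. The abstract does not say which representative numeral is used or whether each digit or each run of digits is replaced. The function name `normalize_numerals`, the default numeral "1", and the choice to collapse each run of digits are therefore assumptions made only for illustration.

```python
import re

# Assumption: each maximal run of ASCII digits is collapsed to one numeral.
_DIGIT_RUN = re.compile(r"[0-9]+")


def normalize_numerals(token: str, representative: str = "1") -> str:
    """Replace every run of digits in a token with one representative numeral.

    Hypothetical helper: the abstract only states that numerals are replaced
    by a single representative numeral, not how this is implemented.
    """
    return _DIGIT_RUN.sub(representative, token)


if __name__ == "__main__":
    # IL2, IL4 and IL12 collapse to the same surface form.
    for term in ["IL2", "IL4", "IL12", "kinase"]:
        print(term, "->", normalize_numerals(term))
```

Under these assumptions, terms that differ only in their numerical parts map to one form, so they share features instead of each producing its own sparse feature.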
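Sequential forward search for feature selection can be sketched as a greedy loop: start with no features, add the candidate that improves a score the most, and stop once no candidate helps. The abstract does not state the scoring criterion or the stopping rule, so the `score` callback, the `min_gain` threshold, and the toy weights below are all hypothetical. In a real system the score would likely come from training and evaluating the tagger on held-out data.

```python
from typing import Callable, FrozenSet, Iterable, List


def sequential_forward_selection(
    candidates: Iterable[str],
    score: Callable[[FrozenSet[str]], float],
    min_gain: float = 0.0,
) -> List[str]:
    """Greedily add the feature that raises `score` the most.

    Stops when the best remaining candidate improves the score by no more
    than `min_gain` (an assumed stopping rule, not taken from the paper).
    """
    selected: List[str] = []
    remaining = list(candidates)
    best = score(frozenset(selected))
    while remaining:
        # Score each candidate when added to the current selection.
        trials = [(score(frozenset(selected + [c])), c) for c in remaining]
        top_score, top = max(trials, key=lambda pair: pair[0])
        if top_score - best <= min_gain:
            break
        selected.append(top)
        remaining.remove(top)
        best = top_score
    return selected


if __name__ == "__main__":
    # Toy stand-in for a held-out evaluation: each feature has a fixed weight.
    weights = {"cap&prev_word": 0.3, "suffix&pos": 0.2, "len&cap": 0.0, "noise": -0.1}
    toy_score = lambda feats: sum(weights[f] for f in feats)
    print(sequential_forward_selection(weights, toy_score))
```

The greedy search evaluates only a small fraction of all possible feature subsets, which matches the motivation given in the abstract: including every conjunction feature would be infeasible.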
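The Smith-Waterman algorithm named above finds the best-scoring local alignment between two sequences. The sketch below applies it to token sequences, assuming simple match, mismatch, and gap scores chosen for illustration. The abstract does not describe how aligned regions are turned into global patterns or how those patterns modify the tagger output, so that part is left out.

```python
from typing import List, Optional, Sequence, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def smith_waterman(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 2,      # assumed scoring parameters, not from the paper
    mismatch: int = -1,
    gap: int = -1,
) -> Tuple[int, List[Pair]]:
    """Return the best local alignment score and the aligned token pairs."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0, h[i - 1][j - 1] + sub, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    pairs: List[Pair] = []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            pairs.append((a[i - 1], None))  # gap in b
            i -= 1
        else:
            pairs.append((None, b[j - 1]))  # gap in a
            j -= 1
    pairs.reverse()
    return best, pairs


if __name__ == "__main__":
    s1 = "the IL1 gene is expressed in T cells".split()
    s2 = "expression of the IL1 gene is induced".split()
    print(smith_waterman(s1, s2))
```

Because the alignment is local, the shared region can sit anywhere in either sentence, which is how such patterns can capture structures that extend beyond a fixed context window.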
