Gene mention normalization and interaction extraction with context models and sentence motifs
Open Access
- 1 September 2008
- journal article
- research article
- Published by Springer Nature in Genome Biology
- Vol. 9 (S2), S14
- https://doi.org/10.1186/gb-2008-9-s2-s14
Abstract
Background: The goal of text mining is to make the information conveyed in scientific publications accessible to structured search and automatic analysis. Two important subtasks of text mining are entity mention normalization - to identify biomedical objects in text - and extraction of qualified relationships between those objects. We describe a method for identifying genes and relationships between proteins.
Results: We present solutions to gene mention normalization and extraction of protein-protein interactions. For the first task, we identify genes by using background knowledge on each gene, namely annotations related to function, location, disease, and so on. Our approach currently achieves an f-measure of 86.4% on the BioCreative II gene normalization data. For the extraction of protein-protein interactions, we pursue an approach that builds on classical sequence analysis: motifs derived from multiple sequence alignments. The method achieves an f-measure of 24.4% (micro-average) in the BioCreative II interaction pair subtask.
Conclusion: For gene mention normalization, our approach outperforms strategies that rely only on matching gene names against dictionaries, without invoking further knowledge on each gene. Motifs derived from alignments of sentences are successful at identifying protein interactions in text; the approach we present in this report is fully automated and performs similarly to systems that require human intervention at one or more stages.
Availability: Our methods for gene, protein, and species identification, and for extraction of protein-protein interactions, are available as part of the BioCreative Meta Services (BCMS); see http://bcms.bioinfo.cnio.es/.
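The abstract reports a micro-averaged f-measure for the interaction pair subtask. As a reminder of what that figure summarizes, the following minimal sketch computes the balanced f-measure (harmonic mean of precision and recall) and contrasts micro-averaging, which pools counts over all documents before scoring, with macro-averaging, which scores each document and then takes the mean. The function names and the per-document counts are hypothetical and chosen only for illustration; they are not taken from the paper or from the BioCreative II evaluation scripts.

```python
from typing import List, Tuple

# (true positives, false positives, false negatives) for one document
Counts = Tuple[int, int, int]


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced f-measure: harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0  # no correct predictions: precision and/or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def micro_f_measure(counts: List[Counts]) -> float:
    """Pool the counts over all documents, then compute one f-measure."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f_measure(tp, fp, fn)


def macro_f_measure(counts: List[Counts]) -> float:
    """Compute an f-measure per document, then take the unweighted mean."""
    return sum(f_measure(*c) for c in counts) / len(counts)


# Hypothetical counts for three documents (illustrative only).
docs: List[Counts] = [(8, 2, 2), (1, 3, 4), (0, 1, 2)]

micro = micro_f_measure(docs)
macro = macro_f_measure(docs)
print(f"micro-averaged f-measure: {micro:.4f}")
print(f"macro-averaged f-measure: {macro:.4f}")
```

With these made-up counts the two averages differ noticeably, because micro-averaging lets documents with many annotated interactions dominate the score, while macro-averaging weights every document equally.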