Benchmarking of the 2010 BioCreative Challenge III text-mining competition by the BioGRID and MINT interaction databases

Open Access

3 October 2011

journal article
research article
Published by Springer Nature in BMC Bioinformatics

Vol. 12 (S8), 1-8
https://doi.org/10.1186/1471-2105-12-s8-s8

Abstract

The vast amount of data published in the primary biomedical literature represents a challenge for the automated extraction and codification of individual data elements. Biological databases that rely solely on manual extraction by expert curators are unable to comprehensively annotate the information dispersed across the entire biomedical literature. The development of efficient tools based on natural language processing (NLP) systems is essential for the selection of relevant publications, identification of data attributes and partially automated annotation. One of the tasks of the BioCreative 2010 Challenge III was devoted to the evaluation of NLP systems developed to identify articles for curation and extraction of protein-protein interaction (PPI) data. The BioCreative 2010 competition addressed three tasks: gene normalization, article classification and interaction method identification. The BioGRID and MINT protein interaction databases both participated in the generation of the test publication set for gene normalization, annotated the development and test sets for article classification, and curated the test set for interaction method classification. These test datasets served as a gold standard for the evaluation of data extraction algorithms. The development of efficient tools for extraction of PPI data is a necessary step to achieve full curation of the biomedical literature. NLP systems can in the first instance facilitate expert curation by refining the list of candidate publications that contain PPI data; more ambitiously, NLP approaches may be able to directly extract relevant information from full-text articles for rapid inspection by expert curators. Close collaboration between biological databases and NLP systems developers will continue to facilitate the long-term objectives of both disciplines.
