Automatic reconstruction of a bacterial regulatory network using Natural Language Processing

Open Access

7 August 2007

journal article
research article
Published by Springer Nature in BMC Bioinformatics

Vol. 8 (1) , 1-11
https://doi.org/10.1186/1471-2105-8-293

Abstract

Manual curation of biological databases, an expensive and labor-intensive process, is essential for high-quality integrated data. In this paper we report the implementation of a state-of-the-art Natural Language Processing system that creates computer-readable networks of regulatory interactions directly from different collections of abstracts and full-text papers. Our major aim is to understand how automatic annotation using text-mining techniques can complement manual curation of biological databases.

We implemented a rule-based system to generate networks from different sets of documents dealing with regulation in Escherichia coli K-12. Performance evaluation is based on RegulonDB, the most comprehensive manually curated transcriptional regulation database for any organism, 45% of which we were able to recreate automatically. Our automated analysis also found some new interactions, either in papers not yet curated or in papers where the interactions were missed during the manual filtering and review of the literature. We also put forward a novel Regulatory Interaction Markup Language, better suited than SBML for simultaneously representing data of interest to biologists and text miners.

Manual curation of the output of automatic text processing is a good way to complement a more detailed review of the literature, either for validating what has already been annotated or for discovering facts and information that might have been overlooked at the triage or curation stages.
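As a rough illustration of the workflow the abstract describes (extract interactions with rules, then compare them with a curated reference), the following Python sketch applies a single regular-expression rule to a few sentences and measures how much of a reference set it recovers. It is a toy under stated assumptions, not the system reported in the paper: the verb list, the rule, the example sentences, and the four-entry reference set standing in for RegulonDB are all hypothetical.

```python
import re

# Hypothetical trigger verbs mapped to an effect label (not the paper's lexicon).
VERBS = {
    "activates": "activation",
    "induces": "activation",
    "represses": "repression",
    "inhibits": "repression",
}

# One toy rule: "<Regulator> <verb> [the] [expression|transcription of] <gene>".
# Regulators are assumed to start with a capital letter (LacI, CRP); targets are
# assumed to follow the E. coli gene style of three lowercase letters plus capitals.
RULE = re.compile(
    r"\b(?P<regulator>[A-Z][A-Za-z0-9]{2,})\s+"
    r"(?P<verb>" + "|".join(VERBS) + r")\s+"
    r"(?:the\s+)?(?:(?:expression|transcription)\s+of\s+)?(?:the\s+)?"
    r"(?P<target>[a-z]{3}[A-Z]+)\b"
)


def extract(sentences):
    """Return the set of (regulator, target, effect) triples matched by RULE."""
    found = set()
    for sentence in sentences:
        for m in RULE.finditer(sentence):
            found.add((m.group("regulator"), m.group("target"), VERBS[m.group("verb")]))
    return found


def recall(predicted, reference):
    """Fraction of reference interactions that were recovered automatically."""
    return len(predicted & reference) / len(reference)


# Illustrative sentences, written for this sketch rather than taken from a corpus.
SENTENCES = [
    "LacI represses transcription of lacZ in the absence of lactose.",
    "CRP activates the expression of araBAD.",
    "FNR induces narG under anaerobic conditions.",
    "Fur binds iron.",
]

# Toy stand-in for a curated database; not actual RegulonDB content.
REFERENCE = {
    ("LacI", "lacZ", "repression"),
    ("CRP", "araBAD", "activation"),
    ("AraC", "araBAD", "activation"),
    ("Fur", "fepA", "repression"),
}

predicted = extract(SENTENCES)
coverage = recall(predicted, REFERENCE)
# Extracted triples absent from the reference are candidates for manual review.
candidates = predicted - REFERENCE

print(f"recovered {coverage:.0%} of the reference set")
print("candidates for curation:", sorted(candidates))
```

The set difference at the end mirrors the abstract's second use of automatic output: triples found in text but absent from the curated set are handed to a curator rather than accepted outright.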
