Mining physical protein-protein interactions from the literature

Open Access

1 September 2008

journal article
Published by Springer Nature in Genome Biology

Vol. 9 (S2) , S12
https://doi.org/10.1186/gb-2008-9-s2-s12

Abstract

Background: Deciphering physical protein-protein interactions is fundamental to elucidating both the functions of proteins and biological processes. The development of high-throughput experimental technologies such as yeast two-hybrid screening has produced an explosion in interaction data. Because manual curation is costly in both time and money, there is an urgent need for text-mining tools to facilitate the extraction of such information. The BioCreative (Critical Assessment of Information Extraction systems in Biology) challenge evaluation provided common standards and shared evaluation criteria to enable comparisons among different approaches.

Results: During the benchmark evaluation of BioCreative 2006, all of our results ranked in the top three places. In the task of filtering articles irrelevant to physical protein interactions, our method achieved a precision of 75.07%, a recall of 81.07%, and an AUC (area under the receiver operating characteristic curve) of 0.847. In the task of identifying protein mentions and normalizing them to molecule identifiers, our method was competitive among the submitted runs, with a precision of 34.83%, a recall of 24.10%, and an F₁ score of 28.5%. In extracting protein interaction pairs, our profile-based method was competitive on the SwissProt-only subset (precision = 36.95%, recall = 32.68%, and F₁ score = 30.40%) and on the entire dataset (30.96%, 29.35%, and 26.20%, respectively). From the biologist's point of view, however, these findings are far from satisfactory. The error analysis presented in this report provides insight into how performance could be improved: about three-quarters of false negatives were due to protein normalization problems (532/698), and about one-quarter were due to problems with correctly extracting interactions for this system.

Conclusion: We present a text-mining framework to extract physical protein-protein interactions from the literature. Three key issues are addressed, namely filtering irrelevant articles, identifying protein names and normalizing them to molecule identifiers, and extracting protein-protein interactions. Our system was among the top three performers in the benchmark evaluation of BioCreative 2006. The tool will be helpful for manual interaction curation and can greatly facilitate the process of extracting protein-protein interactions.
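
As a quick check on the arithmetic in the Results, the following minimal Python sketch recomputes the normalization F₁ score from the reported precision and recall, and the share of false negatives attributed to normalization. The function name `f1_score` is illustrative and not taken from the paper. The sketch assumes F₁ is the plain harmonic mean of precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Protein normalization task, values as reported in the abstract (percent).
norm_f1 = f1_score(34.83, 24.10)

# Share of false negatives attributed to normalization problems (532 of 698).
fn_norm_share = 532 / 698 * 100

print(f"normalization F1: {norm_f1:.1f}%")
print(f"false negatives from normalization: {fn_norm_share:.1f}%")
```

This prints 28.5% and 76.2%, matching the reported F₁ score and the "three-quarters" figure. The reported interaction-pair F₁ scores are not the harmonic mean of the reported precision and recall. That would be consistent with scores averaged per article, though the abstract does not state how they were aggregated.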
