Discovering patterns to extract protein–protein interactions from the literature: Part II
Open Access
- 12 May 2005
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 21 (15), 3294-3300
- https://doi.org/10.1093/bioinformatics/bti493
Abstract
Motivation: An enormous number of protein–protein interaction relationships are buried in millions of research articles published over the years, and the number is growing. Rediscovering them automatically is a challenging bioinformatics task. Solutions to this problem also reach far beyond bioinformatics.
Results: We study a new approach that involves automatically discovering English expression patterns, optimizing them and using them to extract protein–protein interactions. In a sister paper, we described how to generate English expression patterns related to protein–protein interactions, and this approach alone has already achieved precision and recall rates significantly higher than those of other automatic systems. This paper continues to present our theory, focusing on how to improve the patterns. A minimum description length (MDL)-based pattern-optimization algorithm is designed to reduce and merge patterns. This has significantly increased generalization power, and hence the recall and precision rates, as confirmed by our experiments.
Availability: http://spies.cs.tsinghua.edu.cn
Contact: zxy-dcs@tsinghua.edu.cn
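The abstract only names the MDL-based optimization, so the following is a minimal sketch of the general idea and not the authors' algorithm. It assumes that patterns and sentences are tuples of tokens, that `*` is a one-token wildcard, and that a fixed number of bits per token is a fair cost model. The function names (`matches`, `description_length`, `merge`, `optimize`) and the toy data are hypothetical. The sketch greedily deletes or merges patterns whenever doing so shortens the two-part code length L(patterns) + L(sentences | patterns).

```python
import math

WILDCARD = "*"  # assumed notation: matches exactly one token


def matches(pattern, sentence):
    """True if pattern and sentence agree token by token ('*' matches any token)."""
    return len(pattern) == len(sentence) and all(
        p == WILDCARD or p == s for p, s in zip(pattern, sentence)
    )


def description_length(patterns, sentences, vocab_size):
    """Two-part code length in bits: cost of the patterns plus cost of the data given them."""
    bits_per_token = math.log2(vocab_size)
    model_bits = sum(len(p) for p in patterns) * bits_per_token
    index_bits = math.log2(len(patterns)) if len(patterns) > 1 else 0.0
    data_bits = 0.0
    for s in sentences:
        hit = next((p for p in patterns if matches(p, s)), None)
        if hit is None:
            data_bits += len(s) * bits_per_token  # no pattern applies: spell it out
        else:
            # pattern index plus the tokens that fill the wildcard slots
            data_bits += index_bits + hit.count(WILDCARD) * bits_per_token
    return model_bits + data_bits


def merge(a, b):
    """Generalize two equal-length patterns; positions that differ become '*'."""
    if len(a) != len(b):
        return None
    return tuple(x if x == y else WILDCARD for x, y in zip(a, b))


def optimize(patterns, sentences, vocab_size):
    """Greedily delete or merge patterns while the description length decreases."""
    current = list(dict.fromkeys(patterns))  # drop duplicates, keep order
    best = description_length(current, sentences, vocab_size)
    improved = True
    while improved:
        improved = False
        candidates = []
        for i, p in enumerate(current):
            candidates.append(current[:i] + current[i + 1:])  # delete pattern i
            for j in range(i + 1, len(current)):
                merged = merge(p, current[j])
                if merged is not None:
                    others = [q for k, q in enumerate(current) if k not in (i, j)]
                    candidates.append(list(dict.fromkeys(others + [merged])))
        for cand in candidates:
            bits = description_length(cand, sentences, vocab_size)
            if bits < best:
                best, current, improved = bits, cand, True
    return current, best


# Toy data (invented for illustration): one pattern per interaction verb.
toy = [
    ("PROT", "binds", "PROT"),
    ("PROT", "activates", "PROT"),
    ("PROT", "inhibits", "PROT"),
]
optimized, bits = optimize(toy, toy, vocab_size=16)
```

On this toy input the three verb-specific patterns collapse into the single generalized pattern `("PROT", "*", "PROT")`, which illustrates how merging can raise recall. The sketch has no notion of negative examples, so it does not model the precision side that the abstract also reports.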
This publication has 16 references indexed in Scilit:
- Discovering patterns to extract protein–protein interactions from full texts. Bioinformatics, 2004
- The Database of Interacting Proteins: 2004 update. Nucleic Acids Research, 2004
- Extraction of protein interaction information from unstructured text using a context-free grammar. Bioinformatics, 2003
- Accomplishments and challenges in literature data mining for biology. Bioinformatics, 2002
- Robust Relational Parsing over Biomedical Literature: Extracting Inhibit Relation. Pacific Symposium on Biocomputing, 2001
- GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 2001
- Mining literature for protein–protein interactions. Bioinformatics, 2001
- BIND--The Biomolecular Interaction Network Database. Nucleic Acids Research, 2001
- Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Transactions on Information Theory, 2000
- Modeling by shortest data description. Automatica, 1978