Automatic Extraction of Protein Point Mutations Using a Graph Bigram Association
Open Access
- 2 February 2007
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLoS Computational Biology
- Vol. 3 (2) , e16
- https://doi.org/10.1371/journal.pcbi.0030016
Abstract
Protein point mutations are an essential component of the evolutionary and experimental analysis of protein structure and function. While many manually curated databases attempt to index point mutations, most experimentally generated point mutations and the biological impacts of the changes are described in the peer-reviewed published literature. We describe an application, Mutation GraB (Graph Bigram), that identifies, extracts, and verifies point mutations from biomedical literature. The principal problem of point mutation extraction is to link the point mutation with its associated protein and organism of origin. Our algorithm uses a graph-based bigram traversal to identify these relevant associations and exploits the Swiss-Prot protein database to verify this information. The graph bigram method differs from other models for point mutation extraction in that it incorporates frequency and positional data of all terms in an article to drive the point mutation–protein association. Our method was tested on 589 articles describing point mutations from the G protein–coupled receptor (GPCR), tyrosine kinase, and ion channel protein families. We evaluated our graph bigram metric against a word-proximity metric for term association on datasets of full-text literature in these three protein families. Our testing shows that the graph bigram metric achieves a higher F-measure for the GPCRs (0.79 versus 0.76), protein tyrosine kinases (0.72 versus 0.69), and ion channel transporters (0.76 versus 0.74). Importantly, in situations where more than one protein can be assigned to a point mutation and disambiguation is required, the graph bigram metric achieves a precision of 0.84 compared with the word distance metric precision of 0.73. We believe the graph bigram search metric to be a significant improvement over previous search metrics for point mutation extraction and to be applicable to text-mining applications that require the association of words.
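For reference, the F-measure figures quoted above combine precision and recall. The abstract does not state the weighting, so the definition below assumes the common balanced form (F1), with TP, FP, and FN denoting true positives, false positives, and false negatives among extracted mutation–protein pairs:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F = \frac{2\,P\,R}{P + R}
```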
In biological research, new information is often presented in the form of peer-reviewed published journal articles. Despite the best efforts of electronic database curators, a majority of this information is still found only in textual form, and thus excluded from direct computational analysis. One such type of information that is abundant in scientific literature is protein point mutations. We seek to extract protein point mutation examples from the literature and to associate them with a unique protein name and species of origin in a standardized protein database. To do this, we have created an application that searches for and retrieves full-text articles from publishers, and then identifies point mutation terms, protein name terms, and organism name terms within the articles. We describe Mutation GraB, an application that combines a graph shortest-distance search with word bigram analysis to find significant associations between these terms in the text. This graph bigram search metric was found to be reasonably effective at identifying correct protein point mutation pairs and represents a good compromise between accuracy and broad applicability. The application can be applied to a large set of journal literature from a protein family to generate a database of point mutations.
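The idea of combining word bigrams with a graph shortest-distance search can be sketched roughly as follows. This is a minimal illustration, not the Mutation GraB implementation: the function names, the edge weighting (1 divided by the bigram count), and the example tokens are all assumptions made here for clarity.

```python
import heapq
from collections import Counter, defaultdict


def build_bigram_graph(tokens):
    """Build an undirected graph whose nodes are tokens and whose edges
    join tokens that appear next to each other (bigrams).
    Assumed weighting: edge cost = 1 / bigram count, so frequent
    neighbours are 'closer'. The paper's actual metric may differ."""
    counts = Counter()
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            counts[frozenset((a, b))] += 1
    graph = defaultdict(dict)
    for pair, n in counts.items():
        a, b = tuple(pair)
        graph[a][b] = 1.0 / n
        graph[b][a] = 1.0 / n
    return graph


def shortest_cost(graph, source, target):
    """Dijkstra's algorithm: lowest total edge cost from source to target,
    or infinity if the two terms are not connected."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, w in graph[node].items():
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return float("inf")


def best_protein(tokens, mutation, candidates):
    """Pick the candidate protein term with the cheapest path to the
    mutation term (hypothetical helper for illustration)."""
    graph = build_bigram_graph(tokens)
    costs = {p: shortest_cost(graph, mutation, p) for p in candidates}
    return min(costs, key=costs.get), costs


# Illustrative token stream; the mutation and context are made up.
tokens = ["ADRB2", "D113A", "reduced", "binding", "unlike", "DRD2"]
protein, costs = best_protein(tokens, "D113A", ["ADRB2", "DRD2"])
print(protein, costs)
```

Because edge costs shrink as a bigram recurs, a protein name that repeatedly appears near a mutation term across the article ends up closer in the graph than one that merely happens to sit nearby once, which is the intuition behind using frequency as well as position.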