A Statistical Approach to Scanning the Biomedical Literature for Pharmacogenetics Knowledge

Open Access

23 November 2004

journal article
Published by Oxford University Press (OUP) in Journal of the American Medical Informatics Association

Vol. 12 (2) , 121-129
https://doi.org/10.1197/jamia.m1640

Abstract

Objective: Biomedical databases summarize current scientific knowledge, but they generally require years of laborious curation, much of it spent identifying pertinent articles and data in the voluminous biomedical literature. Useful information embedded in this literature is difficult to extract manually, and automated intelligent text analysis tools are becoming increasingly essential to assist in these curation activities. The goal of the authors was to develop an automated method to identify articles in Medline citations that contain pharmacogenetics data pertaining to gene–drug relationships.

Design: The authors built and evaluated several candidate statistical models that characterize pharmacogenetics articles in terms of word usage and the profile of Medical Subject Headings (MeSH) used in those articles. The best-performing model was used to scan the entire Medline article database (11 million articles) to identify candidate pharmacogenetics articles.

Results: A sampling of the articles identified from scanning Medline was reviewed by a pharmacologist to assess the precision of the method. The authors' approach identified 4,892 pharmacogenetics articles in the literature with 92% precision. Their automated method took a fraction of the time that would be expected to accumulate these articles manually. The authors have built a Web resource (http://pharmdemo.stanford.edu/pharmdb/main.spy) to provide access to their results.

Conclusion: A statistical classification approach can screen the primary literature for pharmacogenetics articles with high precision. Such methods may assist curators in acquiring pertinent literature in building biomedical databases.
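To make the design concrete, the following is a minimal sketch of one way to classify citations from word usage plus MeSH headings and to measure precision on a reviewed sample. It is not the authors' model: the abstract does not say which statistical model performed best, so a multinomial naive Bayes classifier with add-one smoothing is used here purely as an illustration. The names `features`, `NaiveBayes`, and `precision`, the `mesh=` feature prefix, the class labels, and the toy training citations are all hypothetical.

```python
import math
import re
from collections import Counter


def features(title, mesh_terms):
    """Lowercased word tokens plus MeSH headings tagged with a 'mesh=' prefix.

    The prefix (an assumption of this sketch) keeps a MeSH heading distinct
    from the same word appearing in free text.
    """
    words = re.findall(r"[a-z0-9]+", title.lower())
    return words + ["mesh=" + m.lower() for m in mesh_terms]


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc)
        self.vocab = set()
        for c in self.classes:
            self.vocab.update(self.counts[c])
        return self

    def log_score(self, doc, c):
        total = sum(self.counts[c].values())
        score = math.log(self.doc_counts[c] / sum(self.doc_counts.values()))
        for tok in doc:
            if tok in self.vocab:  # tokens never seen in training are skipped
                score += math.log(
                    (self.counts[c][tok] + 1) / (total + len(self.vocab))
                )
        return score

    def predict(self, doc):
        return max(self.classes, key=lambda c: self.log_score(doc, c))


def precision(predicted, actual, positive):
    """Fraction of items flagged as `positive` that a reviewer confirmed."""
    flagged = [a for p, a in zip(predicted, actual) if p == positive]
    return sum(a == positive for a in flagged) / len(flagged) if flagged else 0.0


# Hypothetical toy citations; real training data would be curated articles.
train = [
    (features("CYP2D6 polymorphism alters codeine metabolism",
              ["Pharmacogenetics", "Cytochrome P-450 CYP2D6"]), "pgx"),
    (features("TPMT genotype and thiopurine drug toxicity",
              ["Pharmacogenetics", "Polymorphism, Genetic"]), "pgx"),
    (features("Surgical outcomes after knee replacement",
              ["Arthroplasty"]), "other"),
    (features("Dietary fiber intake and colon health",
              ["Dietary Fiber"]), "other"),
]
model = NaiveBayes().fit([d for d, _ in train], [y for _, y in train])

candidate = features("VKORC1 genotype predicts warfarin drug dose",
                     ["Pharmacogenetics"])
print(model.predict(candidate))
```

In this sketch, precision corresponds to the review step in the Results: a reviewer labels a sample of the flagged citations, and `precision` reports the share that were confirmed.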
