Ranking the whole MEDLINE database according to a large training set using text indexing

Open Access

24 March 2005

journal article
research article
Published by Springer Nature in BMC Bioinformatics

Vol. 6 (1) , 75
https://doi.org/10.1186/1471-2105-6-75

Abstract

Background: The MEDLINE database contains over 12 million references to scientific literature, with about three quarters of recent articles including an abstract of the publication. Keyword queries are useful for human users who need to obtain small selections of entries. However, particular analyses of the literature or database developments may require a complete ranking of all references in MEDLINE by their relevance to a topic of interest. This report describes a method that produces such a ranking from the differences in word content between MEDLINE entries related to a topic and the whole of MEDLINE, in a computational time appropriate for an article search query engine.

Results: We tested the ability of our system to retrieve MEDLINE references relevant to the subject of stem cells. We took advantage of the existing annotation of references with terms from the MeSH hierarchical vocabulary (Medical Subject Headings, developed at the National Library of Medicine). A training set of 81,416 references was constructed by selecting entries annotated with the MeSH term "stem cells" or one of the child terms in its subtree. Frequencies of all nouns, verbs, and adjectives in the training set were computed, and the ratios of word frequencies in the training set to those in the entire MEDLINE were used to score references. Self-consistency of the algorithm, benchmarked with a test set containing the training set and an equal number of references randomly selected from MEDLINE, was better using nouns (79%) than adjectives (73%) or verbs (70%). Evaluation of the system with 6,923 references not used for training, of which 204 were relevant to stem cells according to a human expert, indicated a recall of 65% at a precision of 65%.

Conclusion: This strategy appears to be useful for predicting the relevance of MEDLINE references to a given concept. The method is simple and can be used with any user-defined training set. The choice of the part of speech of the words used for classification has important effects on performance. Lists of words, scripts, and additional information are available from the web address http://www.ogic.ca/projects/ks2004/.
