Ranking the whole MEDLINE database according to a large training set using text indexing
Open Access
- 24 March 2005
- journal article
- research article
- Published by Springer Nature in BMC Bioinformatics
- Vol. 6 (1) , 75
- https://doi.org/10.1186/1471-2105-6-75
Abstract
Background: The MEDLINE database contains over 12 million references to scientific literature, with about 3/4 of recent articles including an abstract of the publication. Retrieval of entries using keyword queries is useful for human users who need to obtain small selections. However, particular analyses of the literature or database developments may need a complete ranking of all the references in the MEDLINE database by their relevance to a topic of interest. This report describes a method that produces this ranking using the differences in word content between MEDLINE entries related to a topic and the whole of MEDLINE, in a computational time appropriate for an article search query engine.

Results: We tested the ability of our system to retrieve MEDLINE references that are relevant to the subject of stem cells. We took advantage of the existing annotation of references with terms from the MeSH hierarchical vocabulary (Medical Subject Headings, developed at the National Library of Medicine). A training set of 81,416 references was constructed by selecting entries annotated with the MeSH term stem cells or any child term in its subtree. Frequencies of all nouns, verbs, and adjectives in the training set were computed, and the ratios of word frequencies in the training set to those in the entire MEDLINE were used to score references. Self-consistency of the algorithm, benchmarked with a test set containing the training set and an equal number of references randomly selected from MEDLINE, was better using nouns (79%) than adjectives (73%) or verbs (70%). The evaluation of the system with 6,923 references not used for training, containing 204 articles relevant to stem cells according to a human expert, indicated a recall of 65% for a precision of 65%.

Conclusion: This strategy appears to be useful for predicting the relevance of MEDLINE references to a given concept. The method is simple and can be used with any user-defined training set. The choice of the part of speech of the words used for classification has important effects on performance. Lists of words, scripts, and additional information are available from the web address http://www.ogic.ca/projects/ks2004/.
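The scoring idea described in the abstract, weighting words by the ratio of their frequency in the training set to their frequency in the whole collection, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say how per-word ratios are combined into a reference score, so the mean of log ratios used here is an assumption, as are the function names (`tokenize`, `doc_frequencies`, `word_weights`, `score`) and the toy data. Part-of-speech filtering (nouns, verbs, adjectives) is omitted for brevity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (no POS filtering)."""
    return re.findall(r"[a-z]+", text.lower())


def doc_frequencies(docs):
    """Fraction of documents in which each word occurs at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(set(tokenize(doc)))
    return {word: n / len(docs) for word, n in counts.items()}


def word_weights(training_docs, background_docs):
    """Log of (training-set frequency / background frequency) for each word."""
    train = doc_frequencies(training_docs)
    back = doc_frequencies(background_docs)
    return {w: math.log(f / back[w]) for w, f in train.items() if w in back}


def score(doc, weights):
    """Mean weight of the distinct known words in a document.

    Assumption: the combination rule is not given in the abstract;
    averaging log ratios is just one plausible choice.
    """
    words = [w for w in set(tokenize(doc)) if w in weights]
    return sum(weights[w] for w in words) / len(words) if words else 0.0


# Hypothetical toy data: the background collection contains the training set.
training = [
    "stem cells differentiate into neurons",
    "hematopoietic stem cells renew",
]
background = training + [
    "protein kinase binds the receptor",
    "cells express the receptor",
]

weights = word_weights(training, background)
candidates = [
    "protein kinase binds the receptor",
    "stem cells renew in culture",
    "cells express the receptor",
]
# Rank candidate references from most to least topic-like.
ranked = sorted(candidates, key=lambda d: score(d, weights), reverse=True)
```

In this sketch a word that is twice as frequent in the training set as in the background receives a weight of log 2, and a reference containing no word seen in training scores 0. Because the weights are computed once, scoring each reference is a single pass over its words, which is what makes ranking a very large collection feasible.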