Corpus-based Statistical Screening for Phrase Identification
Open Access
- 1 September 2000
- journal article
- Published by Oxford University Press (OUP) in Journal of the American Medical Informatics Association
- Vol. 7 (5), 499-511
- https://doi.org/10.1136/jamia.2000.0070499
Abstract
Purpose: The authors study the extraction of useful phrases from a natural language database by statistical methods. The aim is to leverage human effort by providing preprocessed phrase lists with a high percentage of useful material.
Method: The approach is to develop six different scoring methods that are based on different aspects of phrase occurrence. The emphasis here is not on lexical information or syntactic structure but rather on the statistical properties of word pairs and triples that can be obtained from a large database.
Measurements: The Unified Medical Language System (UMLS) incorporates a large list of humanly acceptable phrases in the medical field as a part of its structure. The authors use this list of phrases as a gold standard for validating their methods. A good method is one that ranks the UMLS phrases high among all phrases studied. Measurements are 11-point average precision values and precision-recall curves based on the rankings.
Result: The authors find that each of the six scoring methods proves effective in identifying UMLS-quality phrases in a large subset of MEDLINE. These methods are applicable both to word pairs and word triples. All six methods are optimally combined to produce composite scoring methods that are more effective than any single method. The quality of the composite methods appears sufficient to support the automatic placement of hyperlinks in text at the site of highly ranked phrases.
Conclusion: Statistical scoring methods provide a promising approach to the extraction of useful phrases from a natural language database for the purpose of indexing or providing hyperlinks in text.
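To make the evaluation measure concrete, the following is a minimal sketch of 11-point average precision for a ranked phrase list scored against a gold-standard set. It assumes the conventional information-retrieval definition (interpolated precision averaged over recall levels 0.0, 0.1, ..., 1.0); the abstract does not say which variant the authors used. The function name, the example phrases, and the choice to compute recall only over gold phrases that appear in the ranking are illustrative assumptions, not details from the paper.

```python
def eleven_point_average_precision(ranked_phrases, gold_phrases):
    """Interpolated 11-point average precision of a ranking.

    ranked_phrases: candidate phrases, best-scored first (assumed unique).
    gold_phrases:   phrases accepted as correct (e.g. a UMLS-style list).
    Assumption: recall is measured against the gold phrases that occur
    among the candidates, not against the whole gold list.
    """
    gold = set(gold_phrases)
    total_relevant = len(gold & set(ranked_phrases))
    if total_relevant == 0:
        raise ValueError("no gold-standard phrase occurs in the ranking")

    # Record (recall, precision) each time a gold phrase is encountered.
    hits = 0
    points = []
    for rank, phrase in enumerate(ranked_phrases, start=1):
        if phrase in gold:
            hits += 1
            points.append((hits / total_relevant, hits / rank))

    # Interpolated precision at a recall level is the best precision
    # achieved at that recall or any higher recall.
    interpolated = []
    for i in range(11):
        level = i / 10
        reachable = [p for r, p in points if r >= level - 1e-12]
        interpolated.append(max(reachable, default=0.0))

    return sum(interpolated) / 11


# Hypothetical usage: two gold phrases ranked 1st and 3rd among four candidates.
ranking = ["blood pressure", "of the", "heart rate", "in a"]
gold = {"blood pressure", "heart rate"}
score = eleven_point_average_precision(ranking, gold)
```

A scoring method that places every gold phrase ahead of every other candidate scores 1.0 under this measure, and the score drops as non-gold phrases move ahead of gold ones.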