Corpus-based stemming using cooccurrence of word variants
- 1 January 1998
- journal article
- Published by Association for Computing Machinery (ACM) in ACM Transactions on Information Systems
- Vol. 16 (1), 61-81
- https://doi.org/10.1145/267954.267957
Abstract
Stemming is used in many information retrieval (IR) systems to reduce variant word forms to common roots. It is one of the simplest applications of natural-language processing to IR and one of the most effective in terms of user acceptance and consistency, though it yields only small retrieval improvements. Current stemming techniques do not, however, reflect the language use in specific corpora, and this can lead to occasional serious retrieval failures. We propose a technique for using corpus-based word variant cooccurrence statistics to modify or create a stemmer. The experimental results generated using English newspaper and legal text and Spanish text demonstrate the viability of this technique and its advantages relative to conventional approaches that only employ morphological rules.
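To make the idea of modifying a stemmer with cooccurrence statistics concrete, the following is a minimal sketch and not the article's actual procedure. It assumes a hypothetical aggressive stemmer (`prefix_stem`, which simply keeps the first five letters), document-level cooccurrence counts, a Dice coefficient as the association score, and an arbitrary threshold of 0.3; the abstract specifies none of these. Word variants that share a stem stay in the same class only if they are linked by pairs that cooccur often enough in the corpus.

```python
from collections import defaultdict
from itertools import combinations


def prefix_stem(word):
    """Hypothetical stand-in for an aggressive stemmer: keep the first 5 letters."""
    return word[:5]


def refine_classes(docs, stem=prefix_stem, threshold=0.3):
    """Split stem classes so that only variants that cooccur stay together.

    docs is a list of token lists. The Dice score and the threshold are
    illustrative assumptions, not values taken from the article.
    """
    df = defaultdict(int)       # number of documents containing each word
    pair_df = defaultdict(int)  # number of documents containing both words of a pair

    for doc in docs:
        vocab = sorted(set(doc))
        by_stem = defaultdict(list)
        for word in vocab:
            df[word] += 1
            by_stem[stem(word)].append(word)
        # Only pairs that the stemmer would conflate are candidates.
        for words in by_stem.values():
            for a, b in combinations(words, 2):
                pair_df[(a, b)] += 1

    # Union-find over words: merge a pair when its Dice score is high enough.
    parent = {word: word for word in df}

    def find(word):
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for (a, b), n_ab in pair_df.items():
        dice = 2 * n_ab / (df[a] + df[b])
        if dice >= threshold:
            parent[find(a)] = find(b)

    classes = defaultdict(set)
    for word in df:
        classes[find(word)].add(word)
    return sorted(sorted(members) for members in classes.values())


# Toy corpus: the stand-in stemmer maps all four "polic..." words to one stem.
docs = [
    "police officers policing the city".split(),
    "police and policing budgets".split(),
    "the policy and policies of the bank".split(),
    "new policy replaces old policies".split(),
]
refined = refine_classes(docs)
```

On this toy corpus the stand-in stemmer would conflate "police", "policing", "policy", and "policies", while the refined classes keep "police" with "policing" and "policy" with "policies", because the two groups never appear in the same document.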