The automatic identification of stop words

1 February 1992

journal article
Published by SAGE Publications in Journal of Information Science

Vol. 18 (1), 45-55
https://doi.org/10.1177/016555159201800106

Abstract

A stop word may be identified as a word that has the same likelihood of occurring in those documents not relevant to a query as in those documents relevant to the query. In this paper we show how the concept of relevance may be replaced by the condition of being highly rated by a similarity measure. Thus it becomes possible to identify the stop words in a collection by automated statistical testing. We describe the nature of the statistical test as it is realized with a vector retrieval methodology based on the cosine coefficient of document-document similarity. As an example, this technique is then applied to a large MEDLINE subset in the area of biotechnology. The initial processing of this database involves a 310-word stop list of common non-content terms. Our technique is then applied and 75% of the remaining terms are identified as stop words. We compare retrieval with and without the removal of these stop words and find that, of the top twenty documents retrieved in response to a random query document, seventeen are the same on average for the two methods. We also examine the differences and conclude that where the user prefers one method over the other, the new method with the reduced term set is favored about three times out of four.
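The idea in the abstract can be illustrated with a minimal sketch. Each document in turn serves as a query, the other documents are ranked by the cosine coefficient, and a term is flagged when it occurs about as often among the highly rated documents as among the rest. The abstract does not state which statistical test the authors use, so the two-proportion z-test here is an assumption, as are the `top_k` cut-off, the threshold of 1.96, the choice to leave the tested term out of the similarity computation, and all function names. This is not the authors' implementation.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine coefficient between two sparse term-count vectors (dicts)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_proportion_z(k1, n1, k2, n2):
    """z statistic for the difference between proportions k1/n1 and k2/n2."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (k1 / n1 - k2 / n2) / se


def without(vec, term):
    """Copy of a vector with one term left out."""
    return {t: w for t, w in vec.items() if t != term}


def stop_word_candidates(docs, top_k=2, z_threshold=1.96):
    """Flag terms whose occurrence rate in highly rated documents is not
    significantly different from their rate in the remaining documents.

    Assumptions (not from the paper): "highly rated" means the top_k
    nearest neighbours, and the test is a two-proportion z-test.
    """
    vectors = [Counter(d.split()) for d in docs]
    vocab = set().union(*vectors)
    candidates = set()
    for term in vocab:
        k_top = n_top = k_rest = n_rest = 0
        for q, q_vec in enumerate(vectors):
            if term not in q_vec:
                continue  # only queries that actually contain the term
            # Leave the tested term out so it cannot influence the ranking.
            q_masked = without(q_vec, term)
            others = [i for i in range(len(vectors)) if i != q]
            ranked = sorted(
                others,
                key=lambda i: cosine(q_masked, without(vectors[i], term)),
                reverse=True,
            )
            top, rest = ranked[:top_k], ranked[top_k:]
            k_top += sum(term in vectors[i] for i in top)
            n_top += len(top)
            k_rest += sum(term in vectors[i] for i in rest)
            n_rest += len(rest)
        if n_top == 0 or n_rest == 0:
            continue  # not enough data to compare the two groups
        if abs(two_proportion_z(k_top, n_top, k_rest, n_rest)) < z_threshold:
            candidates.add(term)
    return candidates


# Hypothetical toy corpus: two topics, with "the" present in every document.
docs = [
    "the gene clone vector",
    "the gene clone plasmid",
    "the gene vector plasmid",
    "the protein enzyme assay",
    "the protein enzyme kinase",
    "the protein assay kinase",
]
candidates = stop_word_candidates(docs, top_k=2)
print(sorted(candidates))
```

In this toy corpus "the" appears in every document, so its rate is identical in both groups and it is flagged, whereas a topic-bearing term such as "gene" is concentrated among the nearest neighbours and is not. A corpus this small has no statistical weight; it only shows the mechanics.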
