Techniques for the measurement of clustering tendency in document retrieval systems

1 December 1987

journal article
Published by SAGE Publications in Journal of Information Science

Vol. 13 (6) , 361-365
https://doi.org/10.1177/016555158701300607

Abstract

The use of automatic classification techniques has been suggested as a means of increasing the effectiveness of document retrieval systems; however, the automatic generation of a classification requires a large amount of computation, and it is thus of importance to know whether this computation will result in material increases in retrieval performance. This paper describes three methods - the overlap test, the nearest neighbour test and the density test - which can be used to measure the degree of clustering tendency in a set of documents. It is shown that the three tests are not in complete agreement with each other in their evaluation of the degree of clustering tendency present in seven document test collections. A comparison of the predicted degree of clustering tendency with the relative effectiveness of cluster and non-cluster searches suggests that the density test gives the most useful results; it also has the advantage that it does not require query and relevance data and can thus be used in a predictive manner when a document collection is to be processed for the first time.
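
The abstract does not define the density test, so the following is only an illustrative sketch. It assumes that density is the number of document-term postings divided by the product of the number of documents and the number of distinct index terms, which would explain why no query or relevance data is needed. The function name `term_density` and the toy collection are hypothetical and are not taken from the paper.

```python
def term_density(documents):
    """Fraction of non-zero cells in the document-term matrix.

    Assumed definition (not stated in the abstract):
    postings / (number of documents * number of distinct terms).
    """
    # Treat each document as a set of index terms; repeated terms count once.
    docs = [set(d) for d in documents]
    vocabulary = set().union(*docs) if docs else set()
    if not docs or not vocabulary:
        return 0.0
    # A posting is one (document, term) pair.
    postings = sum(len(d) for d in docs)
    return postings / (len(docs) * len(vocabulary))


# Hypothetical toy collection: three documents indexed by four distinct terms.
toy_collection = [
    ["cluster", "retrieval", "search"],
    ["cluster", "search"],
    ["index", "retrieval"],
]
density = term_density(toy_collection)
```

Under this assumption the computation uses only the indexing data, so it could be run before any queries or relevance judgements exist for a collection.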
