Document clustering with committees

11 August 2002

proceedings article
Published by Association for Computing Machinery (ACM)

p. 199-206
https://doi.org/10.1145/564376.564412

Abstract

Document clustering is useful in many information retrieval tasks: document browsing, organization and viewing of retrieval results, generation of Yahoo-like hierarchies of documents, etc. The general goal of clustering is to group data elements such that the intra-group similarities are high and the inter-group similarities are low. We present a clustering algorithm called CBC (Clustering By Committee) that is shown to produce higher-quality clusters in document clustering tasks than several well-known clustering algorithms. It initially discovers a set of tight clusters (high intra-group similarity), called committees, that are well scattered in the similarity space (low inter-group similarity). The union of the committees is only a subset of all elements. The algorithm then proceeds by assigning each element to its most similar committee. Evaluating cluster quality has always been a difficult task. We present a new evaluation methodology based on the edit distance between output clusters and manually constructed classes (the answer key). This evaluation measure is more intuitive and easier to interpret than previous evaluation measures.
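
The final assignment step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how documents are represented or how similarity to a committee is measured, so the sketch assumes sparse term-weight vectors, a committee centroid computed as the mean of its members, and cosine similarity. The function names (`cosine`, `centroid`, `assign_to_committees`) and the toy data are hypothetical and are not taken from the paper. Committee discovery itself is not shown; the committees are supplied by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Mean of a non-empty list of sparse vectors (assumed committee representation)."""
    total = {}
    for vec in vectors:
        for term, w in vec.items():
            total[term] = total.get(term, 0.0) + w
    n = len(vectors)
    return {term: w / n for term, w in total.items()}


def assign_to_committees(docs, committees):
    """Assign every document to its most similar committee.

    docs:       {doc_id: sparse term-weight vector}
    committees: {committee_name: [doc_id, ...]}, a subset of all documents
    """
    centroids = {
        name: centroid([docs[d] for d in members])
        for name, members in committees.items()
    }
    clusters = {name: [] for name in committees}
    for doc_id, vec in docs.items():
        # Pick the committee whose centroid is most similar to this document.
        best = max(centroids, key=lambda name: cosine(vec, centroids[name]))
        clusters[best].append(doc_id)
    return clusters


# Toy data: two hand-picked committees covering four of six documents.
docs = {
    "d1": {"hockey": 2.0, "goal": 1.0},
    "d2": {"hockey": 1.0, "goal": 2.0},
    "d3": {"stock": 2.0, "market": 1.0},
    "d4": {"stock": 1.0, "market": 2.0},
    "d5": {"goal": 1.0, "team": 1.0},
    "d6": {"market": 1.0, "price": 1.0},
}
committees = {"sports": ["d1", "d2"], "finance": ["d3", "d4"]}
clusters = assign_to_committees(docs, committees)
```

In this toy run the committee members stay in their own clusters, while the two documents outside any committee (`d5` and `d6`) join the committee whose centroid shares terms with them. This mirrors the abstract's point that the committees cover only a subset of the elements, and the remaining elements are attached afterwards.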
