Document clustering with committees
- 11 August 2002
- proceedings article
- Published by Association for Computing Machinery (ACM)
- p. 199-206
- https://doi.org/10.1145/564376.564412
Abstract
Document clustering is useful in many information retrieval tasks: document browsing, organization and viewing of retrieval results, generation of Yahoo-like hierarchies of documents, etc. The general goal of clustering is to group data elements such that the intra-group similarities are high and the inter-group similarities are low. We present a clustering algorithm called CBC (Clustering By Committee) that is shown to produce higher quality clusters in document clustering tasks as compared to several well known clustering algorithms. It initially discovers a set of tight clusters (high intra-group similarity), called committees, that are well scattered in the similarity space (low inter-group similarity). The union of the committees is but a subset of all elements. The algorithm proceeds by assigning elements to their most similar committee. Evaluating cluster quality has always been a difficult task. We present a new evaluation methodology that is based on the editing distance between output clusters and manually constructed classes (the answer key). This evaluation measure is more intuitive and easier to interpret than previous evaluation measures.
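The assignment step described in the abstract (each element goes to its most similar committee) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that documents are dense term-weight vectors, that a committee is represented by the centroid of its members, and that similarity is cosine similarity; the abstract specifies none of these. The function names and the toy data are hypothetical, and the committee discovery phase is not shown: the committees are simply given as input.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def assign_to_committees(docs, committees):
    """Map each document id to the index of its most similar committee.

    docs:       dict of document id -> vector (hypothetical representation)
    committees: list of lists of document ids; their union may be only
                a subset of all documents, as the abstract notes
    """
    centroids = [centroid([docs[d] for d in members]) for members in committees]
    assignment = {}
    for doc_id, vec in docs.items():
        sims = [cosine(vec, c) for c in centroids]
        assignment[doc_id] = max(range(len(centroids)), key=sims.__getitem__)
    return assignment


# Toy data: two committees of two documents each, plus two
# documents (d5, d6) that belong to no committee.
docs = {
    "d1": [1.0, 0.0, 0.0],
    "d2": [0.9, 0.1, 0.0],
    "d3": [0.0, 1.0, 0.0],
    "d4": [0.0, 0.9, 0.1],
    "d5": [0.7, 0.2, 0.0],
    "d6": [0.1, 0.8, 0.3],
}
committees = [["d1", "d2"], ["d3", "d4"]]
assignment = assign_to_committees(docs, committees)
```

In this toy run `d5` lands in the first committee and `d6` in the second, even though neither was a committee member. That is the point of the two-phase design: a small, tight core defines each cluster, and the remaining elements are attached afterwards.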