Document clustering with cluster refinement and model selection capabilities
- 11 August 2002
- proceedings article
- Published by Association for Computing Machinery (ACM)
- p. 191-198
- https://doi.org/10.1145/564376.564411
Abstract
In this paper, we propose a document clustering method that strives to achieve: (1) a high accuracy of document clustering, and (2) the capability of estimating the number of clusters in the document corpus (i.e., the model selection capability). To accurately cluster the given document corpus, we employ a richer feature set to represent each document, and use the Gaussian Mixture Model (GMM) together with the Expectation-Maximization (EM) algorithm to conduct an initial document clustering. From this initial result, we identify a set of discriminative features for each cluster, and refine the initially obtained document clusters by voting on the cluster label of each document using this discriminative feature set. This self-refinement process of discriminative feature identification and cluster label voting is applied iteratively until the document clusters converge. The model selection capability, in turn, is achieved by introducing randomness in the cluster initialization stage, and then discovering a value C for the number of clusters N at which running the document clustering process a fixed number of times yields sufficiently similar results. Performance evaluations exhibit clear superiority of the proposed method in both document clustering accuracy and model selection accuracy. The evaluations also demonstrate how each feature, as well as the cluster refinement process, contributes to the document clustering accuracy.
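The model selection idea in the abstract (randomized initialization, repeated runs, and picking the cluster count whose runs agree with each other) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the paper's procedure: plain k-means stands in for GMM with EM, the Rand index stands in for the paper's unspecified similarity criterion, the candidate with the highest average agreement is chosen instead of applying a "sufficiently similar" threshold, and the names `kmeans`, `pair_agreement`, `stability`, and `estimate_num_clusters` are hypothetical.

```python
import random
from itertools import combinations


def kmeans(points, c, rng, iters=20):
    """Stand-in clusterer (NOT the paper's GMM+EM): k-means with random initial centers."""
    centers = rng.sample(points, c)  # randomness enters at initialization
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(c), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Move each center to the mean of its members; empty clusters keep their center.
        for j in range(c):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def pair_agreement(a, b):
    """Rand index: fraction of point pairs that both labelings treat the same way."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def stability(points, c, runs=5, seed=0):
    """Average pairwise agreement over `runs` randomized clusterings with c clusters."""
    rng = random.Random(seed)
    results = [kmeans(points, c, rng) for _ in range(runs)]
    scores = [pair_agreement(x, y) for x, y in combinations(results, 2)]
    return sum(scores) / len(scores)


def estimate_num_clusters(points, candidates, runs=5, seed=0):
    """Return the candidate cluster count whose repeated runs agree the most."""
    return max(candidates, key=lambda c: stability(points, c, runs, seed))


if __name__ == "__main__":
    # Toy 2-D data: three tight groups (illustrative only, not document features).
    gen = random.Random(1)
    data = [(cx + gen.gauss(0, 0.1), cy + gen.gauss(0, 0.1))
            for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(15)]
    print(estimate_num_clusters(data, candidates=[2, 3, 4, 5]))
```

Candidates start at 2 because a single cluster is trivially stable under this measure. The Rand index is also uncorrected for chance, so a chance-adjusted agreement score would likely be a better fit for real corpora.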