Information-theoretic co-clustering
- 24 August 2003
- proceedings article
- Published by Association for Computing Machinery (ACM)
Abstract
Two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, web-log and market-basket data analysis. A basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views the contingency table as an empirical joint probability distribution of two discrete random variables and poses the co-clustering problem as an optimization problem in information theory: the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters. We present an innovative co-clustering algorithm that monotonically increases the preserved mutual information by intertwining both the row and column clusterings at all stages. Using the practical example of simultaneous word-document clustering, we demonstrate that our algorithm works well in practice, especially in the presence of sparsity and high-dimensionality.
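The objective described in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: it only evaluates the quantity being optimized for a given pair of row and column cluster assignments. The function names (mutual_information, coclustered_table), the toy table, and the cluster labels are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mutual_information(table):
    """Mutual information (in nats) of a contingency table,
    treated as an empirical joint distribution."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()                      # normalize counts to probabilities
    px = p.sum(axis=1, keepdims=True)    # row marginal, shape (m, 1)
    py = p.sum(axis=0, keepdims=True)    # column marginal, shape (1, n)
    mask = p > 0                         # zero cells contribute nothing
    return float((p[mask] * np.log(p[mask] / (px @ py)[mask])).sum())

def coclustered_table(table, row_labels, col_labels, k, l):
    """Sum the joint distribution over each (row cluster, column cluster) block."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    R = np.eye(k)[row_labels]            # one-hot row assignments, shape (m, k)
    C = np.eye(l)[col_labels]            # one-hot column assignments, shape (n, l)
    return R.T @ p @ C                   # shape (k, l)

# Hypothetical toy table with two clear blocks (e.g. words x documents).
counts = np.array([[5, 5, 0, 0],
                   [5, 5, 0, 0],
                   [0, 0, 5, 5],
                   [0, 0, 5, 5]])

full = mutual_information(counts)
kept = mutual_information(coclustered_table(counts, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2))
poor = mutual_information(coclustered_table(counts, [0, 1, 0, 1], [0, 1, 0, 1], 2, 2))
print(full, kept, poor)
```

Clustering can never increase mutual information, so the preserved value is at most the value for the original table. In this toy case the block-aligned assignment preserves all of it, while the assignment that mixes the two blocks preserves none. A co-clustering algorithm in the sense of the abstract would search over such assignments for the one that preserves the most.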