Information-theoretic co-clustering

24 August 2003

proceedings article
Published by Association for Computing Machinery (ACM)

p. 89-98
https://doi.org/10.1145/956750.956764

Abstract

Two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, web-log and market-basket data analysis. A basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views the contingency table as an empirical joint probability distribution of two discrete random variables and poses the co-clustering problem as an optimization problem in information theory: the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters. We present an innovative co-clustering algorithm that monotonically increases the preserved mutual information by intertwining the row and column clusterings at all stages. Using the practical example of simultaneous word-document clustering, we demonstrate that our algorithm works well in practice, especially in the presence of sparsity and high dimensionality.
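
The objective described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it only evaluates how much mutual information a given co-clustering preserves, and it does not search for the best one. The toy count table, the function names (`normalize`, `mutual_information`, `coclustered_joint`) and the cluster labels are all hypothetical, chosen for illustration.

```python
from math import log2


def normalize(table):
    """Turn a table of co-occurrence counts into an empirical joint distribution."""
    total = sum(sum(row) for row in table)
    return [[value / total for value in row] for row in table]


def mutual_information(joint):
    """Mutual information, in bits, of a joint distribution given as a list of rows."""
    row_marginals = [sum(row) for row in joint]
    col_marginals = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # zero cells contribute nothing
                mi += p * log2(p / (row_marginals[i] * col_marginals[j]))
    return mi


def coclustered_joint(joint, row_labels, col_labels, k, l):
    """Sum probability mass into a k-by-l table, one cell per (row cluster, column cluster)."""
    aggregated = [[0.0] * l for _ in range(k)]
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            aggregated[row_labels[i]][col_labels[j]] += p
    return aggregated


# Hypothetical word-document counts with two obvious blocks (assumption, not paper data).
counts = [
    [5, 5, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 5, 5],
    [0, 0, 5, 5],
]
joint = normalize(counts)

# A co-clustering that follows the block structure, and one that cuts across it.
aligned = coclustered_joint(joint, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
crossed = coclustered_joint(joint, [0, 1, 0, 1], [0, 1, 0, 1], 2, 2)

mi_full = mutual_information(joint)
mi_aligned = mutual_information(aligned)
mi_crossed = mutual_information(crossed)

print(mi_full, mi_aligned, mi_crossed)
```

On this toy table, the co-clustering aligned with the blocks preserves all of the original mutual information, while the one that cuts across the blocks preserves none. Under the formulation in the abstract, the first is therefore the better co-clustering for two row clusters and two column clusters.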
