A general model for clustering binary data
- 21 August 2005
- proceedings article
- Published by Association for Computing Machinery (ACM)
- p. 188-197
- https://doi.org/10.1145/1081870.1081894
Abstract
Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data. Such data arise in market basket datasets, where transactions contain items, and in document datasets, where each document is represented as a "bag of words". The contribution of the paper is three-fold. First, a general binary data clustering model is presented. The model treats the data and features equally, based on their symmetric association relations, and explicitly describes the data assignments as well as the feature assignments. We characterize several variations of the general model with different optimization procedures. Second, we establish the connections between our clustering model and other existing clustering methods. Third, we discuss the problem of determining the number of clusters for binary clustering. Experimental results show the effectiveness of the proposed clustering model.
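To make the idea of explicit data assignments and feature assignments concrete, the following is a minimal sketch, not the paper's algorithm. It assumes one simple reading of the model: each row (data point) and each column (feature) of a binary matrix `W` receives exactly one of `k` cluster labels, and the two label sets are updated in alternation to reduce the number of entries where `W` disagrees with the block pattern the labels imply. The names `mismatch` and `co_cluster`, the hard one-label-per-item assumption, and the example matrix are all hypothetical.

```python
def mismatch(vec, labels, c):
    """Count positions where the binary vector `vec` disagrees with
    the 0/1 indicator of cluster `c` given by `labels`."""
    return sum(v != (1 if lab == c else 0) for v, lab in zip(vec, labels))


def co_cluster(W, k, row_labels, max_iter=20):
    """Alternate between feature assignment and data assignment.

    W          : list of equal-length lists of 0/1 values (rows = data points)
    k          : number of clusters (assumed known here)
    row_labels : initial cluster label for each row
    Returns (row_labels, col_labels).
    """
    n, m = len(W), len(W[0])
    cols = [[W[i][j] for i in range(n)] for j in range(m)]
    row_labels = list(row_labels)
    col_labels = [0] * m
    for _ in range(max_iter):
        # Feature step: give each column the cluster whose row
        # indicator it matches best.
        new_cols = [
            min(range(k), key=lambda c: mismatch(col, row_labels, c))
            for col in cols
        ]
        # Data step: give each row the cluster whose column
        # indicator it matches best.
        new_rows = [
            min(range(k), key=lambda c: mismatch(row, new_cols, c))
            for row in W
        ]
        if new_rows == row_labels and new_cols == col_labels:
            break  # assignments are stable
        row_labels, col_labels = new_rows, new_cols
    return row_labels, col_labels


# Toy "market basket" matrix: rows are transactions, columns are items.
W = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
rows, cols = co_cluster(W, k=2, row_labels=[0, 0, 1, 1, 1])
```

The sketch treats rows and columns symmetrically: both steps call the same `mismatch` function, only with the roles of data and features swapped. Choosing `k` is left outside the sketch.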