A general model for clustering binary data

21 August 2005

proceedings article
Published by Association for Computing Machinery (ACM)

p. 188-197
https://doi.org/10.1145/1081870.1081894

Abstract

Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data. Such data arise in market basket datasets, where the transactions contain items, and in document datasets, where each document contains a "bag of words". The contribution of the paper is three-fold. First, a general binary data clustering model is presented. The model treats the data and features equally, based on their symmetric association relations, and explicitly describes the data assignments as well as the feature assignments. We characterize several variations of the general model with different optimization procedures. Second, we establish the connections between our clustering model and other existing clustering methods. Third, we discuss the problem of determining the number of clusters for binary clustering. Experimental results show the effectiveness of the proposed clustering model.
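The abstract does not spell out the model or its optimization procedures, so the following is only an illustrative sketch of the general idea of treating data and features symmetrically, not the paper's algorithm. It assumes a simple alternating scheme: features are assigned to the data cluster whose members contain them most often, then data points are assigned to the feature cluster they cover most densely. The function names (`mean_in_group`, `assign`, `cocluster_binary`), the caller-supplied initial labels, and the toy matrix are all hypothetical.

```python
def mean_in_group(values, labels, k):
    """Mean of the values whose label equals k (0.0 for an empty group)."""
    picked = [v for v, lab in zip(values, labels) if lab == k]
    return sum(picked) / len(picked) if picked else 0.0


def assign(vectors, other_labels, n_clusters):
    """Give each vector the cluster it is most strongly associated with.

    Ties are broken in favor of the lowest cluster index.
    """
    return [
        max(range(n_clusters), key=lambda k: mean_in_group(vec, other_labels, k))
        for vec in vectors
    ]


def cocluster_binary(rows, init_row_labels, n_clusters, max_iter=20):
    """Alternate feature and data assignments until the data labels stop changing.

    Assumption: this heuristic stands in for the unspecified optimization
    procedures; it is not taken from the paper.
    """
    cols = [list(c) for c in zip(*rows)]  # same matrix, viewed feature-wise
    row_labels = list(init_row_labels)
    col_labels = []
    for _ in range(max_iter):
        col_labels = assign(cols, row_labels, n_clusters)      # feature step
        new_row_labels = assign(rows, col_labels, n_clusters)  # data step
        if new_row_labels == row_labels:
            break
        row_labels = new_row_labels
    return row_labels, col_labels


# Toy "market basket" matrix: 6 transactions (rows) by 6 items (columns).
baskets = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
# Deliberately imperfect starting labels for the transactions.
row_labels, col_labels = cocluster_binary(baskets, [0, 0, 1, 0, 1, 1], n_clusters=2)
print(row_labels, col_labels)  # [0, 0, 0, 1, 1, 1] [0, 0, 0, 1, 1, 1]
```

In this sketch the transactions and the items are handled by the same `assign` routine, which is one simple way to make the symmetry between data and features concrete. The result depends on the initial labels, and the number of clusters must be supplied by the caller.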
