How Many Clusters? An Information-Theoretic Perspective

1 December 2004

journal article
Published by MIT Press in Neural Computation

Vol. 16 (12), 2483-2506
https://doi.org/10.1162/0899766042321751

Abstract

Clustering provides a common means of identifying structure in complex data, and there is renewed interest in clustering as a tool for the analysis of large data sets in many fields. A natural question is how many clusters are appropriate for the description of a given system. Traditional approaches to this problem are based either on a framework in which clusters of a particular shape are assumed as a model of the system or on a two-step procedure in which a clustering criterion determines the optimal assignments for a given number of clusters and a separate criterion measures the goodness of the classification to determine the number of clusters. In a statistical mechanics approach, clustering can be seen as a trade-off between energy- and entropy-like terms, with lower temperature driving the proliferation of clusters to provide a more detailed description of the data. For finite data sets, we expect that there is a limit to the meaningful structure that can be resolved and therefore a minimum temperature below which we will capture sampling noise. This suggests that correcting the clustering criterion for the bias that arises due to sampling errors will allow us to find a clustering solution at a temperature that is optimal in the sense that we capture maximal meaningful structure, without having to define an external criterion for the goodness or stability of the clustering. We show that in a general information-theoretic framework, the finite size of a data set determines an optimal temperature, and we introduce a method for finding the maximal number of clusters that can be resolved from the data in the hard clustering limit.
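
The energy/entropy trade-off mentioned in the abstract can be illustrated with a generic soft-assignment sketch. This is not the article's method, and it omits the finite-sample bias correction the abstract describes. It assumes, for illustration only, one-dimensional points, fixed cluster centers, squared distance as the "energy", and Boltzmann-style assignment probabilities proportional to exp(-energy / temperature). The names `soft_assignments` and `mean_assignment_entropy` and the toy data are hypothetical.

```python
import math

def soft_assignments(points, centers, temperature):
    """Return, for each point, probabilities p(c|x) ~ exp(-(x - c)^2 / T)."""
    result = []
    for x in points:
        energies = [(x - c) ** 2 for c in centers]
        e_min = min(energies)  # subtract the minimum for numerical stability
        weights = [math.exp(-(e - e_min) / temperature) for e in energies]
        total = sum(weights)
        result.append([w / total for w in weights])
    return result

def mean_assignment_entropy(assignments):
    """Average entropy (in nats) of the per-point assignment distributions."""
    total = 0.0
    for p in assignments:
        total -= sum(q * math.log(q) for q in p if q > 0.0)
    return total / len(assignments)

# Toy data (an assumption for illustration): two well-separated groups.
points = [0.0, 0.2, 0.4, 3.0, 3.2, 3.4]
centers = [0.2, 3.2]

for temperature in (100.0, 1.0, 0.01):
    h = mean_assignment_entropy(soft_assignments(points, centers, temperature))
    print(f"T={temperature:g}  mean assignment entropy={h:.4f}")
```

At high temperature the entropy-like term dominates and every point is shared almost evenly between the centers; at low temperature the energy-like term dominates and the assignments approach the hard clustering limit. In this sketch the number of centers is fixed, so it shows only how assignments sharpen as the temperature drops, not how the number of clusters grows.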
