How many clusters? An information theoretic perspective
Abstract
Clustering provides a common means of identifying structure in complex data, and there is renewed interest in clustering as a tool for the analysis of modern genomic data. A natural question is how many clusters are appropriate for the description of a given system. Traditional approaches to this problem are based either on heuristic cross-validation methods or on a framework in which clusters of a particular shape are assumed as a model of the system. In a statistical mechanics approach, clustering can be seen as a trade-off between energy- and entropy-like terms, with lower temperature driving the proliferation of clusters to provide a more detailed description of the data. We show that, in a general information theoretic framework, the finite size of a data set determines an optimal temperature and, in the hard clustering limit, an optimal number of clusters.
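The temperature trade-off described above can be illustrated with a minimal soft-clustering sketch (this is an illustrative toy, not the paper's actual algorithm): points are assigned to cluster centers with Boltzmann weights at temperature T, and centers are re-estimated from the soft assignments. At low T the assignments become hard and the centers split onto distinct groups; at high T the entropy term dominates and the centers collapse together, i.e. the data supports fewer effective clusters.

```python
import math

def soft_assign(points, centers, T):
    """Boltzmann assignment: p(c|x) proportional to exp(-||x - c||^2 / T)."""
    probs = []
    for x in points:
        w = [math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / T)
             for c in centers]
        z = sum(w)
        probs.append([wi / z for wi in w])
    return probs

def update_centers(points, probs, k, dim):
    """Re-estimate each center as the assignment-weighted mean of the data."""
    centers = []
    for c in range(k):
        wsum = sum(p[c] for p in probs)
        centers.append(tuple(
            sum(p[c] * x[d] for p, x in zip(probs, points)) / wsum
            for d in range(dim)))
    return centers

def anneal(points, k, T, steps=50):
    """Iterate soft assignment and center updates at fixed temperature T."""
    # Deterministic initialization: spread initial centers over the data.
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(steps):
        probs = soft_assign(points, centers, T)
        centers = update_centers(points, probs, k, len(points[0]))
    return centers

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Two well-separated groups of points in the plane.
points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (10, 0), (11, 0), (10, 1), (11, 1)]

# Low temperature: the two centers split onto the two groups.
lo = anneal(points, k=2, T=1.0)
# High temperature: both centers collapse toward the global mean,
# so the effective number of clusters drops to one.
hi = anneal(points, k=2, T=10000.0)
```

Sweeping T from high to low and watching where the centers split apart gives a hands-on picture of the phase-transition-like proliferation of clusters that the abstract refers to.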