COOLCAT
- 4 November 2002
- proceedings article
- Published by Association for Computing Machinery (ACM)
Abstract
In this paper we explore the connection between clustering categorical data and entropy: clusters of similar points have lower entropy than those of dissimilar ones. We use this connection to design an incremental heuristic algorithm, COOLCAT, which is capable of efficiently clustering large data sets of records with categorical attributes, and data streams. In contrast with other categorical clustering algorithms published in the past, COOLCAT's clustering results are very stable for different sample sizes and parameter settings. Also, the criterion for clustering is a very intuitive one, since it is deeply rooted in the well-known notion of entropy. Most importantly, COOLCAT is well equipped to deal with clustering of data streams (continuously arriving streams of data points), since it is an incremental algorithm capable of clustering new points without having to look at every point that has been clustered so far. We demonstrate the efficiency and scalability of COOLCAT by a series of experiments on real and synthetic data sets.
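The entropy criterion and the incremental placement described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not give the exact formulas, so the code assumes that a cluster's entropy is the sum of its per-attribute Shannon entropies (that is, attributes are treated as independent) and that each new point is placed greedily in the cluster that yields the lowest size-weighted entropy. The names `ClusterSummary` and `assign` are hypothetical.

```python
import math
from collections import Counter


class ClusterSummary:
    """Per-attribute value counts for one cluster; raw points are not stored."""

    def __init__(self, num_attributes):
        self.size = 0
        self.counts = [Counter() for _ in range(num_attributes)]

    def add(self, point):
        self.size += 1
        for counter, value in zip(self.counts, point):
            counter[value] += 1

    def entropy(self, extra=None):
        """Sum of per-attribute entropies in bits (assumes independent
        attributes), optionally computed as if `extra` had been added."""
        n = self.size + (1 if extra is not None else 0)
        if n == 0:
            return 0.0
        total = 0.0
        for j, counter in enumerate(self.counts):
            values = dict(counter)
            if extra is not None:
                values[extra[j]] = values.get(extra[j], 0) + 1
            for count in values.values():
                p = count / n
                total -= p * math.log2(p)
        return total


def assign(clusters, point):
    """Greedily add `point` to the cluster that minimizes the size-weighted
    entropy over all clusters; return the index of the chosen cluster."""
    total = sum(c.size for c in clusters) + 1
    best_index, best_score = 0, math.inf
    for i in range(len(clusters)):
        score = 0.0
        for k, c in enumerate(clusters):
            if k == i:
                score += (c.size + 1) / total * c.entropy(extra=point)
            else:
                score += c.size / total * c.entropy()
        if score < best_score:
            best_index, best_score = i, score
    clusters[best_index].add(point)
    return best_index


# Two seed clusters, then three points arriving one at a time.
clusters = [ClusterSummary(2), ClusterSummary(2)]
clusters[0].add(("red", "small"))
clusters[1].add(("blue", "large"))
stream = [("red", "small"), ("blue", "large"), ("red", "medium")]
assigned = [assign(clusters, p) for p in stream]
print(assigned)  # [0, 1, 0]
```

Because each cluster is represented only by its value counts, placing a new point never revisits previously clustered points, which is the property the abstract highlights for data streams. How the initial clusters are seeded is left out here, since the abstract does not describe it.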