A probabilistic framework for semi-supervised clustering
Top Cited Papers
- 22 August 2004
- proceedings article
- Published by Association for Computing Machinery (ACM)
Abstract
Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. Such methods use the constraints to either modify the objective function, or to learn the distance measure. We propose a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and I-divergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semi-supervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework.Keywords
This publication has 16 references indexed in Scilit:
- Active Semi-Supervision for Pairwise Constrained ClusteringPublished by Society for Industrial & Applied Mathematics (SIAM) ,2004
- Clustering with Bregman DivergencesPublished by Society for Industrial & Applied Mathematics (SIAM) ,2004
- Adaptive duplicate detection using learnable string similarity measuresPublished by Association for Computing Machinery (ACM) ,2003
- Generative model-based clustering of directional dataPublished by Association for Computing Machinery (ACM) ,2003
- Discovering molecular pathways from protein interaction and gene expression dataBioinformatics, 2003
- Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithmIEEE Transactions on Medical Imaging, 2001
- Localizing proteins in the cell from their phylogenetic profilesProceedings of the National Academy of Sciences, 2000
- Combining labeled and unlabeled data with co-trainingPublished by Association for Computing Machinery (ACM) ,1998
- Distributional clustering of English wordsPublished by Association for Computational Linguistics (ACL) ,1993
- A Best Possible Heuristic for the k-Center ProblemMathematics of Operations Research, 1985