Correlation clustering
- 1 January 2002
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- No. 02725428,p. 238-247
- https://doi.org/10.1109/sfcs.2002.1181947
Abstract
We consider the following clustering problem: we have a complete graph on n vertices (items), where each edge (u, /spl upsi/) is labeled either + or - depending on whether a and /spl upsi/ have been deemed to be similar or different. The goal is to produce a partition of the vertices (a clustering) that agrees as much as possible with the edge labels. That is, we want a clustering that maximizes the number of + edges within clusters, plus the number of - edges between clusters (equivalently, minimizes the number of disagreements: the number of - edges inside clusters plus the number of + edges between clusters). This formulation is motivated from a document clustering problem in which one has a pairwise similarity function f learned from past data, and the goal is to partition the current set of documents in a way that correlates with f as much as possible; it can also be viewed as a kind of "agnostic learning" problem. An interesting feature of this clustering formulation is that one does not need to specify the number of clusters k as a separate parameter, as in measures such as k-median or min-sum or min-max clustering. Instead, in our formulation, the optimal number of clusters could be any value between 1 and n, depending on the edge labels. We look at approximation algorithms for both minimizing disagreements and for maximizing agreements. For minimizing disagreements, we give a constant factor approximation. For maximizing agreements we give a PTAS. We also show how to extend some of these results to graphs with edge labels in [-1, +1], and give some results for the case of random noise.Keywords
This publication has 17 references indexed in Scilit:
- Efficient testing of large graphsPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2003
- Improved combinatorial algorithms for the facility location and k-median problemsPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2003
- A new rounding procedure for the assignment problem with applications to dense graph arrangement problemsPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2002
- Agnostic BoostingPublished by Springer Nature ,2001
- Clustering for edge-cost minimization (extended abstract)Published by Association for Computing Machinery (ACM) ,2000
- Testing the Diameter of GraphsPublished by Springer Nature ,1999
- Property testing and its connection to learning and approximationJournal of the ACM, 1998
- Toward efficient agnostic learningMachine Learning, 1994
- Efficient noise-tolerant learning from statistical queriesPublished by Association for Computing Machinery (ACM) ,1993
- A unified approach to approximation algorithms for bottleneck problemsJournal of the ACM, 1986