Correlation clustering

1 January 2002

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

No. 02725428,p. 238-247
https://doi.org/10.1109/sfcs.2002.1181947

Abstract

We consider the following clustering problem: we have a complete graph on n vertices (items), where each edge (u, /spl upsi/) is labeled either + or - depending on whether a and /spl upsi/ have been deemed to be similar or different. The goal is to produce a partition of the vertices (a clustering) that agrees as much as possible with the edge labels. That is, we want a clustering that maximizes the number of + edges within clusters, plus the number of - edges between clusters (equivalently, minimizes the number of disagreements: the number of - edges inside clusters plus the number of + edges between clusters). This formulation is motivated from a document clustering problem in which one has a pairwise similarity function f learned from past data, and the goal is to partition the current set of documents in a way that correlates with f as much as possible; it can also be viewed as a kind of "agnostic learning" problem. An interesting feature of this clustering formulation is that one does not need to specify the number of clusters k as a separate parameter, as in measures such as k-median or min-sum or min-max clustering. Instead, in our formulation, the optimal number of clusters could be any value between 1 and n, depending on the edge labels. We look at approximation algorithms for both minimizing disagreements and for maximizing agreements. For minimizing disagreements, we give a constant factor approximation. For maximizing agreements we give a PTAS. We also show how to extend some of these results to graphs with edge labels in [-1, +1], and give some results for the case of random noise.

Keywords

This publication has 17 references indexed in Scilit:

Efficient testing of large graphs
Published by Institute of Electrical and Electronics Engineers (IEEE) ,2003
Improved combinatorial algorithms for the facility location and k-median problems
Published by Institute of Electrical and Electronics Engineers (IEEE) ,2003
A new rounding procedure for the assignment problem with applications to dense graph arrangement problems
Published by Institute of Electrical and Electronics Engineers (IEEE) ,2002
Agnostic Boosting
Published by Springer Nature ,2001
Clustering for edge-cost minimization (extended abstract)
Published by Association for Computing Machinery (ACM) ,2000
Testing the Diameter of Graphs
Published by Springer Nature ,1999
Property testing and its connection to learning and approximation
Journal of the ACM, 1998
Toward efficient agnostic learning
Machine Learning, 1994
Efficient noise-tolerant learning from statistical queries
Published by Association for Computing Machinery (ACM) ,1993
A unified approach to approximation algorithms for bottleneck problems
Journal of the ACM, 1986