Correlation clustering

Clustering is the problem of partitioning data points into groups based on their similarity.[1]

In machine learning, correlation clustering or cluster editing operates in a scenario where the relationships between the objects are known instead of the actual representations of the objects.

Given a weighted graph in which the edge weight indicates whether two nodes are similar (positive edge weight) or different (negative edge weight), the task is to find a clustering that either maximizes agreements (the sum of positive edge weights within clusters plus the absolute value of the sum of negative edge weights between clusters) or minimizes disagreements (the absolute value of the sum of negative edge weights within clusters plus the sum of positive edge weights across clusters).
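The two objectives can be evaluated for a candidate clustering as in the following minimal sketch. The function name `agreement_and_disagreement`, the dictionary-based graph representation, and the example weights are assumptions made for illustration only.

```python
def agreement_and_disagreement(weights, labels):
    """Return (agreements, disagreements) for a signed-weight graph.

    weights: dict mapping an edge (u, v) to a signed weight
             (positive = similar, negative = different).
    labels:  dict mapping each node to its cluster id.
    """
    agree = 0.0
    disagree = 0.0
    for (u, v), w in weights.items():
        same_cluster = labels[u] == labels[v]
        # An edge agrees when a positive edge stays inside a cluster
        # or a negative edge runs between two clusters.
        if (w > 0) == same_cluster:
            agree += abs(w)
        else:
            disagree += abs(w)
    return agree, disagree


# Hypothetical example: a and b share a cluster, c is on its own.
weights = {("a", "b"): 2.0, ("a", "c"): 1.0, ("b", "c"): -3.0}
labels = {"a": 0, "b": 0, "c": 1}
agree, disagree = agreement_and_disagreement(weights, labels)
```

Since every edge either agrees or disagrees with a given clustering, the two totals always add up to the sum of the absolute edge weights.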

The number of clusters does not have to be chosen in advance because the objective, to minimize the sum of weights of the cut edges, is independent of the number of clusters.

If the graph admits a perfect clustering, that is, one in which every positive edge lies within a cluster and every negative edge runs between clusters, then deleting all the negative edges and finding the connected components of the remaining graph returns the required clusters.
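The procedure of deleting negative edges and taking connected components can be sketched as follows. The function name `perfect_clustering` and the edge-list inputs are assumptions for illustration; the sketch returns `None` when a negative edge ends up inside a component, in which case no perfect clustering exists.

```python
from collections import defaultdict


def perfect_clustering(nodes, positive_edges, negative_edges):
    """Cluster by connected components of the positive-edge graph.

    Returns a list of clusters (sets of nodes), or None when some negative
    edge lies inside a component, i.e. no perfect clustering exists.
    """
    adjacency = defaultdict(set)
    for u, v in positive_edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    component_of = {}
    clusters = []
    for start in nodes:
        if start in component_of:
            continue
        # Depth-first search that follows positive edges only.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        for node in component:
            component_of[node] = len(clusters)
        clusters.append(component)

    # A negative edge inside a component contradicts the clustering.
    for u, v in negative_edges:
        if component_of[u] == component_of[v]:
            return None
    return clusters


clusters = perfect_clustering(
    ["a", "b", "c", "d"],
    positive_edges=[("a", "b"), ("c", "d")],
    negative_edges=[("a", "c"), ("b", "d")],
)
```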

Davis found a necessary and sufficient condition for this to occur: no cycle may contain exactly one negative edge.

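For example, given nodes a, b, and c such that the pairs (a, b) and (a, c) are similar while the pair (b, c) is dissimilar, a perfect clustering is not possible: the two positive edges force all three nodes into one cluster, which then contains the negative edge.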
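The three-node example can be checked by brute force. The sketch below, whose names are illustrative assumptions, enumerates every clustering of the three nodes and counts the edges that contradict it; unit weights are assumed.

```python
from itertools import product

nodes = ["a", "b", "c"]
# +1 marks a similar pair, -1 a dissimilar pair.
sign = {("a", "b"): +1, ("a", "c"): +1, ("b", "c"): -1}


def disagreements(labels):
    """Count edges whose sign contradicts the clustering given by labels."""
    count = 0
    for (u, v), s in sign.items():
        same_cluster = labels[u] == labels[v]
        if (s > 0) != same_cluster:
            count += 1
    return count


# Assigning each node one of three cluster ids covers every partition
# of the three nodes (some partitions appear more than once).
best = min(
    disagreements(dict(zip(nodes, assignment)))
    for assignment in product(range(len(nodes)), repeat=len(nodes))
)
```

Here `best` is 1 rather than 0: every clustering of this triangle leaves at least one edge in disagreement.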

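Let $G = (V, E)$ be a graph with nodes $V$ and edges $E$. A clustering of $G$ is a partition $\Pi$ of its node set, and $\delta(\Pi)$ denotes the subset of edges of $G$ whose endpoints are in different subsets of the clustering $\Pi$.

Now, let $w \colon E \to \mathbb{R}_{\geq 0}$ be a function that assigns a non-negative weight to each edge of the graph and let $E = E^{+} \cup E^{-}$ be a partition of the edges into attractive ($E^{+}$) and repulsive ($E^{-}$) edges.

The minimum disagreement correlation clustering problem is the following optimization problem:

$$\min_{\Pi} \sum_{e \in E^{+} \cap \delta(\Pi)} w_e + \sum_{e \in E^{-} \setminus \delta(\Pi)} w_e$$

Here, the set $E^{+} \cap \delta(\Pi)$ contains the attractive edges whose endpoints are in different components with respect to the clustering $\Pi$, and the set $E^{-} \setminus \delta(\Pi)$ contains the repulsive edges whose endpoints are in the same component with respect to the clustering $\Pi$.

Similarly to the minimum disagreement correlation clustering problem, the maximum agreement correlation clustering problem is defined as

$$\max_{\Pi} \sum_{e \in E^{+} \setminus \delta(\Pi)} w_e + \sum_{e \in E^{-} \cap \delta(\Pi)} w_e$$

Here, the set $E^{+} \setminus \delta(\Pi)$ contains the attractive edges whose endpoints are in the same component with respect to the clustering $\Pi$, and the set $E^{-} \cap \delta(\Pi)$ contains the repulsive edges whose endpoints are in different components with respect to the clustering $\Pi$.

Instead of non-negative weights and an explicit partition of the edges, the problem can also be formulated in terms of signed edge costs, with $c_e = w_e$ for $e \in E^{+}$ and $c_e = -w_e$ for $e \in E^{-}$. The minimum cost multicut problem is the problem of finding a clustering $\Pi$ of $G$ such that the sum of the costs of the edges whose endpoints are in different clusters is minimal:

$$\min_{\Pi} \sum_{e \in \delta(\Pi)} c_e$$

Similar to the minimum cost multicut problem, coalition structure generation in weighted graph games[5] is the problem of finding a clustering such that the sum of the costs of the edges that are not cut is maximal:

$$\max_{\Pi} \sum_{e \in E \setminus \delta(\Pi)} c_e$$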
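A short derivation, given here as a sketch, indicates why these formulations describe the same optimization problem. It writes $\Pi$ for a clustering, $\delta(\Pi)$ for the set of edges whose endpoints lie in different clusters, $w_e \geq 0$ for the edge weights, $E^{+}$ and $E^{-}$ for the attractive and repulsive edges, and $c_e = w_e$ for $e \in E^{+}$, $c_e = -w_e$ for $e \in E^{-}$. The names $D(\Pi)$ and $A(\Pi)$ for the total disagreement and agreement weight are introduced only for this derivation.

```latex
\begin{align*}
D(\Pi) &= \sum_{e \in E^{+} \cap \delta(\Pi)} w_e + \sum_{e \in E^{-} \setminus \delta(\Pi)} w_e \\
       &= \sum_{e \in E^{+} \cap \delta(\Pi)} w_e - \sum_{e \in E^{-} \cap \delta(\Pi)} w_e + \sum_{e \in E^{-}} w_e \\
       &= \sum_{e \in \delta(\Pi)} c_e + \sum_{e \in E^{-}} w_e, \\
A(\Pi) &= \sum_{e \in E} w_e - D(\Pi), \\
\sum_{e \in E \setminus \delta(\Pi)} c_e &= \sum_{e \in E} c_e - \sum_{e \in \delta(\Pi)} c_e .
\end{align*}
```

The sums over $E^{-}$ and over $E$ do not depend on $\Pi$, so a clustering that minimizes the disagreement also minimizes the multicut cost, maximizes the agreement, and maximizes the total cost of the uncut edges.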

Bansal et al.[7] discuss the NP-completeness proof and also present both a constant-factor approximation algorithm and a polynomial-time approximation scheme to find the clusters in this setting.

Ailon et al.[8] propose a randomized 3-approximation algorithm for the same problem.
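A randomized pivot scheme of the kind analysed in this line of work can be sketched as follows: pick a random unassigned node, group it with all of its unassigned positive neighbours, and repeat. The function name `pivot_clustering`, the `seed` parameter, and the convention that every unlisted pair is a negative edge are assumptions for illustration; the code does not itself verify any approximation guarantee.

```python
import random


def pivot_clustering(nodes, positive_edges, seed=None):
    """Randomized pivot clustering on a complete signed graph.

    Any pair not listed in positive_edges is treated as a negative edge.
    """
    rng = random.Random(seed)
    similar = {node: set() for node in nodes}
    for u, v in positive_edges:
        similar[u].add(v)
        similar[v].add(u)

    order = list(nodes)
    rng.shuffle(order)  # scanning a random order picks uniformly random pivots
    unassigned = set(nodes)
    clusters = []
    for pivot in order:
        if pivot not in unassigned:
            continue
        # The pivot takes all of its still-unassigned positive neighbours.
        cluster = {pivot} | (similar[pivot] & unassigned)
        unassigned -= cluster
        clusters.append(cluster)
    return clusters


clusters = pivot_clustering(
    ["a", "b", "c", "d"], positive_edges=[("a", "b"), ("c", "d")], seed=0
)
```

On this input the positive edges form two disjoint pairs, so every pivot order yields the clusters {a, b} and {c, d}; on inputs without a perfect clustering the result depends on the random order.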

Chawla, Makarychev, Schramm, and Yaroslavtsev improved the approximation ratio to about 2.06 with a polynomial-time algorithm that rounds a linear program.[9]

Karpinski and Schudy[10] proved the existence of a polynomial-time approximation scheme (PTAS) for the problem on complete graphs with a fixed number of clusters.

In 2011, Bagon and Galun[11] showed that the optimization of the correlation clustering functional is closely related to well-known discrete optimization methods.

In their work they proposed a probabilistic analysis of the underlying implicit model that allows the correlation clustering functional to estimate the underlying number of clusters.

This analysis suggests the functional assumes a uniform prior over all possible partitions regardless of their number of clusters.

Because the number of possible partitions differs from one cluster count to another, a non-uniform prior over the number of clusters emerges.
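The counting argument behind this observation can be made concrete with a small computation that is an illustration only and is not taken from the cited work. The number of partitions of n elements into exactly k clusters is the Stirling number of the second kind S(n, k), so a uniform prior over partitions puts probability S(n, k) divided by the total number of partitions on the cluster count k. The function name `stirling2_row` and the choice n = 5 are assumptions.

```python
def stirling2_row(n):
    """Return [S(n, 0), ..., S(n, n)], Stirling numbers of the second kind."""
    row = [1]  # S(0, 0) = 1
    for m in range(1, n + 1):
        new_row = [0] * (m + 1)
        for k in range(1, m + 1):
            # Recurrence: S(m, k) = k * S(m - 1, k) + S(m - 1, k - 1)
            above = row[k] if k < len(row) else 0
            new_row[k] = k * above + row[k - 1]
        row = new_row
    return row


n = 5
counts = stirling2_row(n)   # number of partitions of n elements, by cluster count
total = sum(counts)         # Bell number: all partitions of n elements
prior_over_k = [c / total for c in counts]
```

For n = 5 the counts are 1, 15, 25, 10, and 1 for one to five clusters, so under a uniform prior over the 52 partitions a cluster count of three is 25 times as likely as a count of one.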

Several discrete optimization algorithms proposed in this work scale gracefully with the number of elements (experiments show results with more than 100,000 variables).

The work of Bagon and Galun also evaluated the effectiveness of the recovery of the underlying number of clusters in several applications.

Correlation clustering also relates to a different task, in which correlations among attributes of feature vectors in a high-dimensional space are assumed to exist and to guide the clustering process.

Correlations among subsets of attributes result in different spatial shapes of clusters.

Hence, the similarity between cluster objects is defined by taking into account the local correlation patterns.

Correlation clustering (according to this definition) can be shown to be closely related to biclustering.

As in biclustering, the goal is to identify groups of objects that share a correlation in some of their attributes, where the correlation is usually typical of the individual clusters.