Consensus clustering is thus the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm.
When cast as an optimization problem, consensus clustering is known as median partition, and has been shown to be NP-complete,[3] even when the number of input clusterings is three.
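As a sketch of that optimization formulation (this is the standard statement; the choice of partition distance d, e.g. the Mirkin/pairwise-disagreement distance, is an assumption not fixed by the text above), the median partition seeks the partition closest on average to the m input clusterings:

$$P^{*} \;=\; \operatorname*{arg\,min}_{P} \sum_{i=1}^{m} d\!\left(P, P_{i}\right),$$

where P_1, ..., P_m are the input clusterings and d is a distance between partitions.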
All existing clustering techniques have potential shortcomings. These can make the results difficult to interpret, especially when the number of clusters is unknown.
Clustering methods are also very sensitive to the initial clustering settings, which can cause non-significant data to be amplified in non-reiterative methods.
Lacking an external objective criterion (the equivalent of a known class label in supervised analysis), validation of clustering results becomes somewhat elusive.
The method can also be used to represent the consensus over multiple runs of a clustering algorithm with random restart (such as K-means, model-based Bayesian clustering, SOM, etc.), so as to account for its sensitivity to the initial conditions.
It can provide data for a visualization tool to inspect cluster number, membership, and boundaries.
Given a dataset of N points, the Monti consensus clustering algorithm works by resampling and clustering the data; for each candidate number of clusters K, an N × N consensus matrix is calculated, where each element represents the fraction of times two samples clustered together. A perfectly stable matrix would consist entirely of zeros and ones, representing all sample pairs always clustering together or not together over all resampling iterations. The relative stability of the consensus matrices can be used to infer the optimal K.

More formally, let D^(1), D^(2), ..., D^(H) be the H perturbed (resampled) datasets drawn from the original dataset, and let M^(h) denote the N × N connectivity matrix resulting from applying a clustering algorithm to the dataset D^(h):

$$M^{(h)}(i,j) = \begin{cases} 1 & \text{if points } i \text{ and } j \text{ belong to the same cluster in } D^{(h)},\\ 0 & \text{otherwise.} \end{cases}$$

Let I^(h) be the N × N indicator matrix whose (i, j)-th entry is 1 if points i and j were both drawn in D^(h), and 0 otherwise. The indicator matrix is used to keep track of which samples were selected during each resampling iteration for the normalisation step. The consensus matrix M is defined as the normalised sum of the connectivity matrices over all the perturbed datasets, and a different one is calculated for every K:

$$\mathcal{M}(i,j) = \frac{\sum_{h} M^{(h)}(i,j)}{\sum_{h} I^{(h)}(i,j)}.$$

That is, the entry (i, j) in the consensus matrix is the number of times points i and j were clustered together divided by the total number of times they were selected together. The matrix is symmetric, and each element lies within the range [0, 1]. One way of determining the robustness of the K-th consensus matrix is examining its CDF curve (see below).
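A minimal sketch of this resampling scheme, assuming scikit-learn's KMeans as the base clusterer and 80% subsampling without replacement (both are illustrative choices; the algorithm itself does not prescribe them):

```python
import numpy as np
from sklearn.cluster import KMeans

def monti_consensus_matrix(X, k, n_iter=100, frac=0.8, seed=0):
    """Consensus matrix for a fixed K: co-clustering counts normalised
    by co-selection counts over resampled datasets."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    together = np.zeros((n, n))   # sum_h M^(h): times i, j clustered together
    selected = np.zeros((n, n))   # sum_h I^(h): times i, j sampled together
    for _ in range(n_iter):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
        same = (labels[:, None] == labels[None, :]).astype(float)
        together[np.ix_(idx, idx)] += same
        selected[np.ix_(idx, idx)] += 1.0
    # Avoid division by zero for pairs never co-sampled (left at zero).
    return np.divide(together, selected,
                     out=np.zeros_like(together), where=selected > 0)
```

With a sufficient number of iterations every pair is typically co-sampled at least once, so the zero fallback rarely comes into play.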
Monti consensus clustering can be a powerful tool for identifying clusters, but it needs to be applied with caution, as shown by Şenbabaoğlu et al.[6] The Monti algorithm is able to claim apparent stability for chance partitionings of null datasets drawn from a unimodal distribution, and thus has the potential to lead to over-interpretation of cluster stability in a real study.
Identifying false positive clusters is a common problem throughout cluster research,[8] and has been addressed by methods such as SigClust[8] and the GAP-statistic.
[9] However, these methods rely on certain assumptions for the null model that may not always be appropriate.
Şenbabaoğlu et al.[6] demonstrated that the original delta K metric for deciding K in the Monti algorithm performed poorly, and proposed a superior metric for measuring the stability of consensus matrices using their CDF curves.
In the CDF curve of a consensus matrix, the lower left portion represents sample pairs rarely clustered together, the upper right portion represents those almost always clustered together, and the middle segment represents those with ambiguous assignments across different clustering runs.
The proportion of ambiguous clustering (PAC) score quantifies this middle segment: it is defined as the fraction of sample pairs with consensus indices falling in the sub-interval (u1, u2) ⊂ [0, 1], where u1 is a value close to 0 and u2 is a value close to 1 (for instance, u1 = 0.1 and u2 = 0.9).
A low value of PAC indicates a flat middle segment and a low rate of discordant assignments across permuted clustering runs; one can therefore infer the optimal number of clusters as the K value with the lowest PAC.[6][7]
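A short sketch of the PAC computation, assuming a consensus matrix such as the one produced in the earlier sketch and the example thresholds u1 = 0.1, u2 = 0.9 (note that PAC is simply the empirical CDF of the consensus indices evaluated at u2 minus its value at u1):

```python
import numpy as np

def pac_score(consensus, u1=0.1, u2=0.9):
    """Fraction of sample pairs with ambiguous consensus index in (u1, u2)."""
    n = consensus.shape[0]
    iu = np.triu_indices(n, k=1)   # each unordered pair counted once
    vals = consensus[iu]
    return np.mean((vals > u1) & (vals < u2))
```

Scanning K over a candidate range and picking the value with the smallest pac_score then implements the selection rule just described.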
The approach by Strehl and Ghosh introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings.
They discuss three approaches to solving this problem to obtain high-quality consensus functions. Their techniques have low computational cost, which makes it feasible to evaluate each of the techniques discussed below and arrive at the best solution by comparing the results against the objective function.
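The objective function here is the average normalized mutual information (ANMI) between the candidate combined clustering and the input partitionings; a minimal sketch using scikit-learn's NMI implementation (the helper name anmi is ours):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def anmi(candidate, partitionings):
    """Average NMI between a candidate consensus labeling and the
    input partitionings (higher is better)."""
    return np.mean([normalized_mutual_info_score(candidate, p)
                    for p in partitionings])

# Pick the best of several candidate consensus labelings:
# best = max(candidates, key=lambda c: anmi(c, partitionings))
```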
Each instance in a soft ensemble is represented by a concatenation of r posterior membership probability distributions obtained from the constituent clustering algorithms.
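As a sketch of that representation, assuming r Gaussian mixture models with different component counts as the constituent soft clusterers (an illustrative choice; any algorithm producing posterior membership probabilities would do):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def soft_ensemble_features(X, ks=(2, 3, 4), seed=0):
    """Represent each instance by concatenating the posterior membership
    probabilities from r constituent soft clusterings (here r = len(ks))."""
    blocks = [GaussianMixture(n_components=k, random_state=seed)
                  .fit(X).predict_proba(X)
              for k in ks]
    return np.hstack(blocks)   # shape: (n_samples, sum(ks))
```

The concatenated matrix can then be fed to any standard clustering algorithm to produce the consolidated partition.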