k-means clustering

The problem is computationally difficult (NP-hard); however, efficient heuristic algorithms converge quickly to a local optimum.

Given a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, ..., Sk} so as to minimize the within-cluster sum of squares (WCSS) (i.e. variance).
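
In standard notation the WCSS objective can be written as follows, where μi denotes the mean (centroid) of the points in Si (shown here as a LaTeX snippet for clarity; amsmath assumed):

```latex
% WCSS objective: choose the partition S_1, ..., S_k that minimizes the total
% squared distance of each point to its cluster mean \mu_i (requires amsmath).
\underset{\mathbf{S}}{\operatorname{arg\,min}}
  \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_i} \lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^{2}
```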

[3] The standard algorithm was first proposed by Stuart Lloyd of Bell Labs in 1957 as a technique for pulse-code modulation, although it was not published as a journal article until 1982.

[6] Given an initial set of k means m1(1), ..., mk(1) (see below), the algorithm proceeds by alternating between two steps:[7] an assignment step, in which each observation is assigned to the cluster whose mean is nearest in squared Euclidean distance, and an update step, in which each mean is recalculated as the centroid of the observations assigned to its cluster. The objective function in k-means is the WCSS (within-cluster sum of squares).
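
A minimal sketch of these two alternating steps in Python (NumPy assumed; the function name and the convergence check are illustrative, not taken from any particular library):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate the assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen observations (the Forgy method described below).
    means = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: each observation goes to the nearest mean
        # (squared Euclidean distance).
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each mean becomes the centroid of its cluster.
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        if np.allclose(new_means, means):
            break  # means stopped moving, so assignments can no longer change
        means = new_means
    return means, labels
```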

Commonly used initialization methods are Forgy and Random Partition. The Forgy method tends to spread the initial means out, while Random Partition places all of them close to the center of the data set.
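
The two initialization schemes can be sketched as follows (NumPy assumed; function names are illustrative, and the Random Partition sketch assumes every cluster receives at least one point):

```python
import numpy as np

def forgy_init(X, k, rng):
    # Forgy: use k randomly chosen observations as the initial means.
    return X[rng.choice(len(X), size=k, replace=False)]

def random_partition_init(X, k, rng):
    # Random Partition: assign every observation to a random cluster,
    # then use each cluster's centroid as its initial mean.
    labels = rng.integers(0, k, size=len(X))
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
print(forgy_init(X, 3, rng))
print(random_partition_init(X, 3, rng))
```

Because each Random Partition mean averages a random subset of all observations, the initial means land near the overall centroid of the data, which is the behavior described above.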

According to Hamerly et al.,[10] the Random Partition method is generally preferable for algorithms such as the k-harmonic means and fuzzy k-means.

[13] These worst-case point sets do not seem to arise in practice; this is corroborated by the fact that the smoothed running time of k-means is polynomial.

Lloyd's algorithm is therefore often considered to be of "linear" complexity in practice, although it is in the worst case superpolynomial when performed until convergence.

However, it spends a lot of processing time computing the distances between each of the k cluster centers and the n data points.

Since points usually stay in the same clusters after a few iterations, much of this work is unnecessary, making the naïve implementation very inefficient.
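
The claim can be illustrated with a small experiment (a sketch on synthetic data; NumPy assumed): counting how many observations change cluster per Lloyd iteration typically shows the count collapsing to near zero after the first few passes, which is exactly the redundancy that accelerated variants exploit.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs of 2000 points each.
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(2000, 2))
               for c in ((0.0, 0.0), (8.0, 0.0), (4.0, 7.0))])
k = 3
means = X[rng.choice(len(X), size=k, replace=False)]
labels = np.full(len(X), -1)

for it in range(20):
    dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    new_labels = dists.argmin(axis=1)
    changed = int((new_labels != labels).sum())
    print(f"iteration {it}: {changed} observations changed cluster")
    if changed == 0:
        break
    labels = new_labels
    means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                      else means[j] for j in range(k)])
```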

The method is a local search that iteratively attempts to relocate a sample into a different cluster as long as this process improves the objective function.

When no sample can be relocated into a different cluster with an improvement of the objective, the method stops (in a local minimum).

As with the classical k-means algorithm, the approach remains a heuristic, since it does not guarantee that the final solution is globally optimal.
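
A hedged sketch of such a relocation-based local search (names are illustrative; the improvement test uses the standard size-weighted change in WCSS for removing a point from one cluster and adding it to another):

```python
import numpy as np

def relocation_search(X, labels, k):
    """Greedy local search: relocate single points while the WCSS decreases."""
    labels = labels.copy()
    means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    sizes = np.bincount(labels, minlength=k)
    improved = True
    while improved:
        improved = False
        for i, x in enumerate(X):
            a = labels[i]
            if sizes[a] <= 1:
                continue  # never empty a cluster
            # WCSS decrease from removing x from its current cluster a.
            gain = sizes[a] / (sizes[a] - 1) * np.sum((x - means[a]) ** 2)
            for b in range(k):
                if b == a:
                    continue
                # WCSS increase from adding x to candidate cluster b.
                cost = sizes[b] / (sizes[b] + 1) * np.sum((x - means[b]) ** 2)
                if cost < gain:
                    # Relocating x lowers the objective: update means and sizes.
                    means[a] = (means[a] * sizes[a] - x) / (sizes[a] - 1)
                    means[b] = (means[b] * sizes[b] + x) / (sizes[b] + 1)
                    sizes[a] -= 1
                    sizes[b] += 1
                    labels[i] = b
                    improved = True
                    break
    return labels, means

# Example: refine a random initial partition of synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
labels, means = relocation_search(X, rng.integers(0, 3, size=300), 3)
```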

The classical k-means algorithm and its variations are known to only converge to local minima of the minimum-sum-of-squares clustering problem, i.e. the problem of minimizing the WCSS objective defined above.

Many studies have attempted to improve the convergence behavior of the algorithm and maximize the chances of attaining the global optimum (or at least, local minima of better quality).

Optimal solutions for small- and medium-scale instances still remain valuable as a benchmark tool for evaluating the quality of other heuristics.

On the other hand, EM requires the optimization of a larger number of free parameters and poses some methodological issues due to vanishing clusters or badly conditioned covariance matrices.

[53] k-means clustering is rather easy to apply to even large data sets, particularly when using heuristics such as Lloyd's algorithm.

By applying k-means clustering with k set to a smaller number, the image can be represented using a more limited color palette, resulting in a compressed version that consumes less storage space and bandwidth.
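
A minimal sketch of this kind of color quantization, assuming Pillow and scikit-learn are available (the file names are placeholders):

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Load the image and flatten it to one RGB vector per pixel.
img = np.asarray(Image.open("photo.png").convert("RGB"), dtype=np.float64)
pixels = img.reshape(-1, 3)

# Cluster the pixel colors into a small palette of k colors.
k = 16
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)

# Replace every pixel with its cluster's centroid color and save the result.
palette = np.round(kmeans.cluster_centers_).astype(np.uint8)
quantized = palette[kmeans.labels_].reshape(img.shape)
Image.fromarray(quantized).save("photo_quantized.png")
```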

Other uses of vector quantization include non-random sampling, as k-means can easily be used to choose k different but prototypical objects from a large data set for further analysis.
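
One way to realize such prototype selection (a sketch; scikit-learn assumed, with synthetic data standing in for the large data set) is to fit k-means and then keep, for each cluster, the actual observation closest to its centroid:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

# Synthetic stand-in for a large data set (one observation per row).
X = np.random.default_rng(0).normal(size=(10000, 8))

# Cluster, then keep the real observation closest to each cluster center
# as a prototypical, non-randomly chosen sample.
k = 10
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
prototype_idx = pairwise_distances_argmin(km.cluster_centers_, X)
prototypes = X[prototype_idx]
```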

[56] Alternatively, transforming the sample-cluster distances through a Gaussian RBF yields the hidden layer of a radial basis function network.
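
A sketch of that transformation, assuming cluster centers have already been computed (for example with k-means as above); the bandwidth sigma is a free parameter:

```python
import numpy as np

def rbf_features(X, centers, sigma=1.0):
    """Map each sample to Gaussian RBF activations over the cluster centers."""
    # Squared Euclidean distance from every sample to every center.
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # One hidden-layer activation per center.
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```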

[57] This use of k-means has been successfully combined with simple, linear classifiers for semi-supervised learning in NLP (specifically for named-entity recognition)[58] and in computer vision.

On an object recognition task, it was found to exhibit comparable performance with more sophisticated feature learning approaches such as autoencoders and restricted Boltzmann machines.

[54] Example: In natural language processing (NLP), k-means clustering has been integrated with simple linear classifiers for semi-supervised learning tasks such as named-entity recognition (NER).

By first clustering unlabeled text data using k-means, meaningful features can be extracted to improve the performance of NER models.

This approach has been shown to achieve comparable performance with more complex feature learning techniques such as autoencoders and restricted Boltzmann machines, albeit with a greater requirement for labeled data.

Additionally, researchers have explored the integration of k-means clustering with deep learning methods, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), to enhance the performance of various tasks in computer vision, natural language processing, and other domains.

[64] It is straightforward to produce counterexamples to the statement that the cluster centroid subspace is spanned by the principal directions.

Under sparsity assumptions and when input data is pre-processed with the whitening transformation, k-means produces the solution to the linear independent component analysis (ICA) task.

However, the bilateral filter restricts the calculation of the (kernel weighted) mean to include only points that are close in the ordering of the input data.

Convergence of k-means
A typical example of the convergence of k-means to a local minimum. In this example, the result of k-means clustering (the right figure) contradicts the obvious cluster structure of the data set. The small circles are the data points, the four-ray stars are the centroids (means). The initial configuration is shown in the left figure. The algorithm converges after five iterations, presented in the figures from left to right. The illustration was prepared with the Mirkes Java applet.[52]
k-means clustering result for the Iris flower data set and actual species visualized using ELKI. Cluster means are marked using larger, semi-transparent symbols.
k-means clustering vs. EM clustering on an artificial dataset ("mouse"). The tendency of k-means to produce equal-sized clusters leads to bad results here, while EM benefits from the Gaussian distributions with different radii present in the data set.
Example image with only the red and green channels (for illustration purposes)
Vector quantization of the colors present in the image above into Voronoi cells using k-means