k-means++

The k-means++ seeding method is similar to the first of three seeding methods proposed, in independent work, in 2006[3] by Rafail Ostrovsky, Yuval Rabani, Leonard Schulman and Chaitanya Swamy.

To illustrate the potential of the k-means algorithm to perform arbitrarily poorly with respect to the objective function of minimizing the sum of squared distances of cluster points to the centroid of their assigned clusters, consider the example of four points in ℝ² that form an axis-aligned rectangle whose width is greater than its height.

If k = 2 and the two initial cluster centers lie at the midpoints of the top and bottom line segments of the rectangle formed by the four data points, the k-means algorithm converges immediately, without moving these cluster centers. Consequently, the two top data points are clustered together and the two bottom data points are clustered together, a suboptimal clustering because the width of the rectangle is greater than its height.

Now consider stretching the rectangle horizontally to an arbitrary width: the standard k-means algorithm will continue to cluster the points suboptimally, and by increasing the horizontal distance between the two data points in each cluster, we can make the algorithm perform arbitrarily poorly with respect to the k-means objective function.
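As a concrete check of this claim, the following Python sketch (illustrative only; the coordinates and the helper kmeans_cost are assumptions made for this example) compares the cost of the stuck clustering with the optimal one for a rectangle of width w and height 1. The ratio between the two costs grows without bound as w increases.

```python
import numpy as np

def kmeans_cost(points, centers, assignment):
    """Sum of squared distances from each point to its assigned center."""
    return sum(np.sum((p - centers[a]) ** 2) for p, a in zip(points, assignment))

w = 100.0  # rectangle width; the height is fixed at 1
pts = np.array([[0, 0], [w, 0], [0, 1], [w, 1]])

# Stuck clustering: centers at the midpoints of the bottom and top edges.
# Each point is already nearest to the center on its own edge, and each
# center is already the centroid of its two points, so k-means stops here.
bad_centers = np.array([[w / 2, 0], [w / 2, 1]])
bad_assign = [0, 0, 1, 1]

# Optimal clustering: group the left pair and the right pair of points.
opt_centers = np.array([[0, 0.5], [w, 0.5]])
opt_assign = [0, 1, 0, 1]

print(kmeans_cost(pts, bad_centers, bad_assign))  # w**2 / 2 per cluster, w**2 in total
print(kmeans_cost(pts, opt_centers, opt_assign))  # 4 * 0.25 = 1, independent of w
```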

The exact algorithm is as follows:

1. Choose one center uniformly at random among the data points.
2. For each data point x not chosen yet, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)².
4. Repeat steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using standard k-means clustering.

This seeding method yields considerable improvement in the final error of k-means.
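As an illustration of these steps, here is a minimal Python sketch of the seeding procedure (the function name kmeans_pp_init, the NumPy-based implementation, and the seed parameter are assumptions made for this example, not part of the original description):

```python
import numpy as np

def kmeans_pp_init(points, k, seed=None):
    """Pick k initial centers from `points` (an (n, d) array) via k-means++ seeding."""
    rng = np.random.default_rng(seed)
    n = len(points)

    # Step 1: choose the first center uniformly at random among the data points.
    centers = [points[rng.integers(n)]]

    while len(centers) < k:
        # Step 2: for every point, compute D(x)^2, the squared distance to the
        # nearest center chosen so far.
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)

        # Step 3: draw the next center with probability proportional to D(x)^2,
        # so points far from all existing centers are more likely to be picked.
        centers.append(points[rng.choice(n, p=d2 / d2.sum())])

    # Steps 4-5: the loop above repeats steps 2 and 3 until k centers exist;
    # standard k-means (e.g., Lloyd's algorithm) would then run from these centers.
    return np.array(centers)
```

On the rectangle example above, for instance, once the first center is picked the second is very likely to be drawn from the far side of the rectangle, since those points have the largest D(x)², and k-means then converges to the optimal left/right clustering.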

With this seeding, the expected cost of the resulting clustering is within a factor of O(log k) of the optimal k-means cost. This is in contrast to vanilla k-means, which can generate clusterings arbitrarily worse than the optimum.

Lee et al.[9] report an application of k-means++ to create geographical clusters of photographs based on the latitude and longitude information attached to the photos.

Figure: Bad clustering of a rectangle. The points A and D are in the red cluster with centroid E, and the points B and C are in the blue cluster with centroid F; this clustering is bad because the intra-cluster distance is not minimal.
Figure: Optimal clustering for the problem.