DBSCAN

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996.[3]

As of July 2020, the follow-up paper "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN"[4] appears in the list of the 8 most downloaded articles of the prestigious ACM Transactions on Database Systems (TODS) journal.[5]

Another follow-up, HDBSCAN*, was initially published by Ricardo J. G. Campello, David Moulavi, and Jörg Sander in 2013,[6] then expanded upon with Arthur Zimek in 2015.[7]

It revises some of the original decisions, such as the treatment of border points, and produces a hierarchical rather than a flat result.

In 1972, Robert F. Ling published a closely related algorithm in "The Theory and Construction of k-Clusters"[8] in The Computer Journal with an estimated runtime complexity of O(n³).

Reachability is not a symmetric relation, since no point is reachable from a non-core point regardless of distance. Therefore, a further notion of connectedness is needed to formally define the extent of the clusters found by DBSCAN: two points p and q are density-connected if there is a point o from which both are reachable, and a cluster consists of the points that are mutually density-connected.

Once the current density-connected cluster has been completely found, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or of noise.

The algorithm can be expressed in pseudocode,[4] where RangeQuery can be implemented using a database index for better performance, or using a slow linear scan. The DBSCAN algorithm can be abstracted into the following steps:[4]

1. Find the points in the ε neighborhood of every point, and identify the core points with more than minPts neighbors.
2. Find the connected components of core points on the neighbor graph, ignoring all non-core points.
3. Assign each non-core point to a nearby cluster if the cluster is an ε neighbor, otherwise assign it to noise.

A naive implementation of this requires storing the neighborhoods in step 1, thus requiring substantial memory.

The original DBSCAN algorithm does not require this by performing these steps for one point at a time.
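
The following is a minimal Python sketch of this one-point-at-a-time formulation (function and variable names are illustrative, not from the original paper); range_query stands in for RangeQuery and is written as the slow linear scan, which an index structure could replace:

    from math import dist   # Euclidean distance between two point tuples

    def range_query(db, p, eps):
        # Linear-scan RangeQuery; a spatial index would answer this faster.
        return [i for i in range(len(db)) if dist(db[p], db[i]) <= eps]

    def dbscan(db, eps, min_pts):
        UNDEFINED, NOISE = None, -1
        labels = [UNDEFINED] * len(db)
        c = 0                                    # cluster counter
        for p in range(len(db)):
            if labels[p] is not UNDEFINED:       # already processed
                continue
            neighbors = range_query(db, p, eps)
            if len(neighbors) < min_pts:         # not a core point
                labels[p] = NOISE                # may become a border point later
                continue
            c += 1                               # start a new cluster
            labels[p] = c
            seeds = [q for q in neighbors if q != p]
            for q in seeds:                      # seeds grows while iterating
                if labels[q] == NOISE:           # border point: attach to cluster
                    labels[q] = c
                if labels[q] is not UNDEFINED:
                    continue
                labels[q] = c
                q_neighbors = range_query(db, q, eps)
                if len(q_neighbors) >= min_pts:  # q is also a core point: expand
                    seeds.extend(q_neighbors)
        return labels

Called as, e.g., dbscan(points, eps=0.2, min_pts=4), it returns one label per point with -1 marking noise; only the labels and the current seed set are kept in memory.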

DBSCAN visits each point of the database, possibly multiple times (e.g., as candidates to different clusters).

For practical considerations, however, the time complexity is mostly governed by the number of RangeQuery invocations. DBSCAN executes exactly one such query for each point, and if an indexing structure is used that answers a neighborhood query in O(log n), an overall average runtime complexity of O(n log n) is obtained (provided ε is chosen meaningfully, i.e. such that on average only O(log n) points are returned).

Without the use of an accelerating index structure, or on degenerate data (e.g. all points within a distance less than ε), the worst-case runtime complexity remains O(n²).

The (n² − n)/2-sized upper triangle of the distance matrix can be materialized to avoid distance recomputations, but this needs O(n²) memory, whereas a non-matrix-based implementation of DBSCAN only needs O(n) memory.
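
As a sketch of what an accelerating index structure buys (assuming Euclidean distance and SciPy's k-d tree; an illustration, not the original implementation):

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    points = rng.random((10_000, 2))      # toy two-dimensional data set

    tree = cKDTree(points)                # build the index once
    eps, min_pts = 0.02, 5

    # One range query per point, answered from the k-d tree instead of a
    # linear scan. (Batching like this stores all neighborhoods at once;
    # the original algorithm instead queries one point at a time.)
    neighborhoods = tree.query_ball_point(points, r=eps)
    is_core = [len(nb) >= min_pts for nb in neighborhoods]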

Ideally, the value of ε is given by the problem to solve (e.g. a physical distance), and minPts is then the desired minimum cluster size.[a]
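
When ε is not dictated by the problem, one widely used heuristic (a sketch of the usual k-distance-graph approach, with illustrative parameter values) sorts each point's distance to its (minPts − 1) nearest other points and looks for an elbow in the plot:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_blobs
    from sklearn.neighbors import NearestNeighbors

    X, _ = make_blobs(n_samples=500, centers=3, random_state=0)  # toy data

    min_pts = 5   # e.g. the desired minimum cluster size
    # Each point is its own nearest neighbor here, so requesting min_pts
    # neighbors yields distances to the (min_pts - 1) nearest other points.
    dists, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
    k_dist = np.sort(dists[:, -1])

    plt.plot(k_dist)   # the "elbow" of this curve is a candidate for eps
    plt.ylabel(f"distance to {min_pts - 1}th nearest neighbor")
    plt.show()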

OPTICS can be seen as a generalization of DBSCAN that replaces the ε parameter with a maximum value that mostly affects performance.
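
For illustration, a usage sketch with scikit-learn's OPTICS, whose max_eps parameter plays exactly this performance-bounding role:

    from sklearn.cluster import OPTICS
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

    # max_eps only bounds the neighborhood search (smaller values run faster);
    # unlike DBSCAN's eps, it does not by itself determine the clustering.
    labels = OPTICS(min_samples=5, max_eps=2.0).fit_predict(X)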

DBSCAN can also be seen as a special, efficient case of spectral clustering, but computing eigenvectors makes the spectral formulation far more expensive; for performance reasons, the original DBSCAN algorithm remains preferable to its spectral implementation.

Generalized DBSCAN removes the ε and minPts parameters from the original algorithm and moves them to predicates: an arbitrary neighborhood predicate takes the place of the ε range query, and a core-point predicate takes the place of the minPts cardinality test.
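
Continuing the Python sketch above, the generalization only changes the interface: the two parameter-dependent tests become caller-supplied predicates (the names here are illustrative, not taken from the GDBSCAN paper):

    def generalized_dbscan(db, neighbors_of, is_core):
        # Same control flow as the dbscan sketch above, with
        #   range_query(db, p, eps)    replaced by neighbors_of(p)
        #   len(neighbors) >= min_pts  replaced by is_core(neighbors)
        UNDEFINED, NOISE = None, -1
        labels = [UNDEFINED] * len(db)
        c = 0
        for p in range(len(db)):
            if labels[p] is not UNDEFINED:
                continue
            neighbors = neighbors_of(p)
            if not is_core(neighbors):
                labels[p] = NOISE
                continue
            c += 1
            labels[p] = c
            seeds = [q for q in neighbors if q != p]
            for q in seeds:
                if labels[q] == NOISE:
                    labels[q] = c
                if labels[q] is not UNDEFINED:
                    continue
                labels[q] = c
                q_neighbors = neighbors_of(q)
                if is_core(q_neighbors):
                    seeds.extend(q_neighbors)
        return labels

    # Classic DBSCAN is then one instantiation of the two predicates:
    from math import dist

    def classic_dbscan(db, eps, min_pts):
        return generalized_dbscan(
            db,
            neighbors_of=lambda p: [i for i in range(len(db))
                                    if dist(db[p], db[i]) <= eps],
            is_core=lambda nb: len(nb) >= min_pts,
        )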

Various extensions to the DBSCAN algorithm have been proposed, including methods for parallelization, parameter estimation, and support for uncertain data.

HDBSCAN*[6][7] is a hierarchical version of DBSCAN which is also faster than OPTICS, and from which a flat partition consisting of the most prominent clusters can be extracted.[14]
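
As a usage sketch (assuming scikit-learn ≥ 1.3, which ships an HDBSCAN implementation; the parameter name below is scikit-learn's):

    from sklearn.cluster import HDBSCAN
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

    # No eps to choose: the hierarchy spans all density levels, and a flat
    # partition of the most stable clusters is extracted automatically.
    labels = HDBSCAN(min_cluster_size=10).fit_predict(X)   # -1 marks noise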

Different implementations of the same algorithm were found to exhibit enormous performance differences, with the fastest on a test data set finishing in 1.4 seconds and the slowest taking 13803 seconds.

In this diagram, minPts = 4. Point A and the other red points are core points, because the area surrounding each of them in an ε radius contains at least 4 points (including the point itself). Because they are all reachable from one another, they form a single cluster. Points B and C are not core points, but are reachable from A (via other core points) and thus belong to the cluster as well. Point N is a noise point that is neither a core point nor directly reachable.
DBSCAN can find non-linearly separable clusters. This dataset cannot be adequately clustered with k-means or Gaussian Mixture EM clustering.
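
A sketch of this difference on scikit-learn's two-moons generator (parameter values are illustrative):

    from sklearn.datasets import make_moons
    from sklearn.cluster import DBSCAN, KMeans
    from sklearn.metrics import adjusted_rand_score

    # Two interleaved half-circles: dense, but not linearly separable.
    X, y = make_moons(n_samples=500, noise=0.05, random_state=0)

    db = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # DBSCAN typically recovers both moons; k-means cuts straight across them.
    print("DBSCAN  ARI:", adjusted_rand_score(y, db))
    print("k-means ARI:", adjusted_rand_score(y, km))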