Nearest neighbor search

Donald Knuth in vol. 3 of The Art of Computer Programming (1973) called it the post-office problem, referring to an application of assigning to a residence the nearest post office.

A direct generalization of this problem is a k-NN search, where we need to find the k closest points.

One example of a dissimilarity function that is not a metric is the asymmetric Bregman divergence, for which the triangle inequality does not hold.

The informal observation usually referred to as the curse of dimensionality states that there is no general-purpose exact solution for NNS in high-dimensional Euclidean space using polynomial preprocessing and polylogarithmic search time.

This algorithm, sometimes referred to as the naive approach, has a running time of O(dN), where N is the cardinality of S and d is the dimensionality of S. There are no search data structures to maintain, so the linear search has no space complexity beyond the storage of the database.
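A minimal sketch of this linear scan in Python, assuming the database is a NumPy array of points under the Euclidean metric (array names and shapes here are illustrative):

```python
import numpy as np

def linear_search(query, points):
    """Brute-force nearest neighbor: compute the distance from the
    query to every database point and return the closest one.
    Runs in O(dN) time for N points of dimension d."""
    diffs = points - query                       # shape (N, d)
    dists = np.einsum('ij,ij->i', diffs, diffs)  # squared norms, shape (N,)
    best = int(np.argmin(dists))
    return best, float(np.sqrt(dists[best]))

# Example usage with random data.
rng = np.random.default_rng(0)
S = rng.random((1000, 16))   # N = 1000 points in d = 16 dimensions
q = rng.random(16)
idx, dist = linear_search(q, S)
```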

For constant dimension, the query time of a k-d tree has average complexity O(log N)[6] in the case of randomly distributed points, and worst-case complexity O(kN^(1-1/k)).[7] Alternatively, the R-tree data structure was designed to support nearest neighbor search in a dynamic context, as it has efficient algorithms for insertions and deletions, such as the R* tree.
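As an illustration, SciPy's k-d tree supports exactly these queries; this is one common implementation, not the only one:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
S = rng.random((10_000, 3))              # randomly distributed points in 3D

tree = cKDTree(S)                        # O(N log N) construction
dist, idx = tree.query([0.5, 0.5, 0.5])  # nearest neighbor of one query
dists, idxs = tree.query(S[:5], k=3)     # 3 nearest neighbors of 5 queries
```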

An approximate nearest neighbor search algorithm is allowed to return points whose distance from the query is at most c times the distance from the query to its nearest points.

The appeal of this approach is that, in many cases, an approximate nearest neighbor is almost as good as the exact one.
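Concretely, with approximation factor c = 1 + ε, a returned point p is acceptable whenever dist(q, p) ≤ c · dist(q, p*), where p* is the exact nearest neighbor. A minimal check of this guarantee (the helper below is hypothetical, for illustration only):

```python
import numpy as np

def is_c_approximate(query, answer, true_nearest, c):
    """Check the c-approximate NN guarantee: the returned point's
    distance to the query is at most c times the true NN distance."""
    d = lambda a, b: np.linalg.norm(np.asarray(a) - np.asarray(b))
    return d(query, answer) <= c * d(query, true_nearest)
```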

These methods are based on greedy traversal of proximity neighborhood graphs.
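A minimal sketch of such a greedy traversal, assuming the graph is given as an adjacency list over point indices (this shows the basic greedy routing step underlying methods of this family, not any particular library's implementation):

```python
import numpy as np

def greedy_search(graph, points, query, start):
    """Greedy traversal of a proximity graph: repeatedly move to the
    neighbor closest to the query; stop at a local minimum."""
    current = start
    current_dist = np.linalg.norm(points[current] - query)
    while True:
        # Find the neighbor of the current vertex closest to the query.
        neighbors = graph[current]  # graph: dict index -> neighbor indices
        dists = [np.linalg.norm(points[n] - query) for n in neighbors]
        best = int(np.argmin(dists))
        if dists[best] >= current_dist:
            return current, current_dist   # local minimum reached
        current, current_dist = neighbors[best], dists[best]
```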

These works were preceded by a pioneering paper by Toussaint, in which he introduced the concept of a relative neighborhood graph.

Locality-sensitive hashing (LSH) groups points into buckets such that points that are close to each other under the chosen metric are mapped to the same bucket with high probability.
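As an illustration, one classic LSH family for cosine similarity hashes each point by its signs against random hyperplanes; nearby points agree on most signs and thus often share a bucket. A sketch, not a production index (dimensions and bit count are illustrative):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d, n_bits = 32, 8
planes = rng.standard_normal((n_bits, d))  # random hyperplane normals

def lsh_key(x):
    """Hash a vector to an 8-bit signature of hyperplane signs."""
    return tuple((planes @ x > 0).astype(int))

points = rng.standard_normal((1000, d))
buckets = defaultdict(list)
for i, p in enumerate(points):
    buckets[lsh_key(p)].append(i)

# Query: only points in the same bucket are candidate neighbors.
candidates = buckets[lsh_key(points[0])]
```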

The cover tree has a theoretical bound that is based on the dataset's doubling constant.

The bound on search time is O(c^12 log n), where c is the expansion constant of the dataset.

In the special case where the data is a dense 3D map of geometric points, the projection geometry of the sensing technique can be used to dramatically simplify the search problem.

This approach requires that the 3D data be organized by a projection to a two-dimensional grid, and assumes that the data is spatially smooth across neighboring grid cells, with the exception of object boundaries.

In practice this technique has an average search time of O(1) or O(K) for the k-nearest neighbor problem when applied to real-world stereo vision data.
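A sketch under these assumptions: the point cloud is stored as an H×W×3 array produced by the sensor projection, and the nearest neighbor of the point seen at pixel (i, j) is found by scanning a small window of neighboring grid cells (the window radius and array layout are illustrative assumptions):

```python
import numpy as np

def grid_nearest(cloud, i, j, radius=2):
    """Search a (2*radius+1)^2 window of cells around pixel (i, j) in an
    organized H x W x 3 point cloud; spatial smoothness means the true
    nearest neighbor almost always lies inside this window."""
    H, W, _ = cloud.shape
    q = cloud[i, j]
    best, best_d = None, np.inf
    for r in range(max(0, i - radius), min(H, i + radius + 1)):
        for c in range(max(0, j - radius), min(W, j + radius + 1)):
            if (r, c) == (i, j):
                continue
            d = np.linalg.norm(cloud[r, c] - q)
            if d < best_d:
                best, best_d = (r, c), d
    return best, best_d
```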

In high-dimensional spaces, tree indexing structures become useless because an increasing percentage of the nodes need to be examined anyway.

To speed up linear search, a compressed version of the feature vectors stored in RAM is used to prefilter the dataset in a first run.

The final candidates are determined in a second stage using the uncompressed data from the disk for distance calculation.

The optimal compression technique in multidimensional spaces is Vector Quantization (VQ), implemented through clustering.

Huge gains over VA-File, tree-based indexes and sequential scan have been observed.
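A toy sketch of the two-stage idea described above, using uniform 8-bit scalar quantization as the in-RAM compression (the quantizer, candidate count, and variable names are illustrative assumptions; VA-file and VQ schemes differ in detail):

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.random((100_000, 64)).astype(np.float32)    # "on disk": full precision

# Offline: compress vectors to 8-bit codes kept in RAM.
lo, hi = S.min(), S.max()
codes = np.round((S - lo) / (hi - lo) * 255).astype(np.uint8)

def search(query, n_candidates=100):
    """Two-stage search: prefilter with approximate distances on the
    compressed codes, then rerank the candidates with exact distances."""
    q_code = (query - lo) / (hi - lo) * 255
    approx = np.linalg.norm(codes.astype(np.float32) - q_code, axis=1)
    cand = np.argpartition(approx, n_candidates)[:n_candidates]  # stage 1
    exact = np.linalg.norm(S[cand] - query, axis=1)              # stage 2
    return cand[np.argmin(exact)]
```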

k-nearest neighbor search is commonly used in predictive analytics to estimate or classify a point based on the consensus of its neighbors.
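A minimal majority-vote classifier built on a k-NN search (a sketch using a brute-force scan for simplicity; any of the indexes above could supply the neighbors instead):

```python
import numpy as np
from collections import Counter

def knn_classify(query, points, labels, k=5):
    """Classify `query` by the majority label among its k nearest
    neighbors in `points`."""
    labels = np.asarray(labels)
    dists = np.linalg.norm(points - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(labels[nearest]).most_common(1)[0][0]
```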

In some applications it may be acceptable to retrieve a "good guess" of the nearest neighbor.

Often such an algorithm will find the nearest neighbor in a majority of cases, but this depends strongly on the dataset being queried.

For some applications (e.g. entropy estimation), we may have N data-points and wish to know which is the nearest neighbor for every one of those N points.
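With a k-d tree, this all-nearest-neighbors computation needs only one batched query: ask for the two closest points to each data point and discard the first, since each point's nearest neighbor in the dataset is itself at distance zero. A sketch using SciPy:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
S = rng.random((10_000, 8))

tree = cKDTree(S)
dists, idxs = tree.query(S, k=2)   # k=2: self plus the true nearest neighbor
nn_index = idxs[:, 1]              # nearest neighbor of every point
nn_dist = dists[:, 1]              # e.g. the distances used in entropy estimation
```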