(The number of buckets is much smaller than the universe of possible input items.)[1] Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search.
Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.
Hashing-based approximate nearest-neighbor search algorithms generally use one of two main categories of hashing methods: either data-independent methods, such as locality-sensitive hashing (LSH), or data-dependent methods, such as locality-preserving hashing (LPH).[2][3]
Locality-preserving hashing was initially devised as a way to facilitate data pipelining in implementations of massively parallel algorithms that use randomized routing and universal hashing to reduce memory contention and network congestion.
Alternatively,[8] it is possible to define an LSH family on a universe of items U endowed with a similarity function $\phi\colon U \times U \to [0,1]$. In this setting, an LSH scheme is a family of hash functions H coupled with a probability distribution D over H such that a function $h \in H$ chosen according to D satisfies $\Pr[h(a) = h(b)] = \phi(a,b)$ for each $a, b \in U$.
One of the easiest ways to construct an LSH family is by bit sampling.[7] This approach works for the Hamming distance over d-dimensional vectors $\{0,1\}^d$. Here, the family F of hash functions is simply the family of all the projections of points on one of the $d$ coordinates, i.e., $F = \{h\colon \{0,1\}^d \to \{0,1\} \mid h(x) = x_i \text{ for some } i \in \{1, \ldots, d\}\}$, where $x_i$ is the $i$-th coordinate of $x$. A random function $h$ from $F$ simply selects a random bit from the input point.
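For illustration, here is a minimal Python sketch of this family (the helper name sample_bit_hash is hypothetical): each generated function projects a binary vector onto one uniformly random coordinate, so two vectors at Hamming distance $\ell$ collide with probability $1 - \ell/d$.

```python
import random

def sample_bit_hash(d, rng=random):
    """Return a hash function h(x) = x[i] for a uniformly random coordinate i.

    Each such projection is one member of the bit-sampling LSH family
    for Hamming distance over {0,1}^d.
    """
    i = rng.randrange(d)
    return lambda x: x[i]

# Two vectors at small Hamming distance collide with high probability:
# Pr[h(x) = h(y)] = 1 - dist(x, y) / d.
x = (1, 0, 1, 1, 0, 1, 0, 0)
y = (1, 0, 1, 0, 0, 1, 0, 0)   # Hamming distance 1
h = sample_bit_hash(len(x))
print(h(x), h(y))              # equal with probability 7/8
```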
Suppose U is composed of subsets of some ground set of enumerable items S and the similarity function of interest is the Jaccard index J. If $\pi$ is a random permutation on the indexes of S, hashing a set $A \subseteq S$ to $h(A) = \min_{a \in A} \pi(a)$ gives $\Pr[h(A) = h(B)] = J(A,B)$, since $h(A) = h(B)$ exactly when the minimizer of $\pi$ over $A \cup B$ lies in $A \cap B$; this "MinHash" family is therefore an LSH scheme for the Jaccard index.
Because the symmetric group on n elements has size n!, choosing a truly random permutation from the full symmetric group is infeasible for even moderately sized n. Because of this fact, there has been significant work on finding a family of permutations that is "min-wise independent" — a permutation family for which each element of the domain has equal probability of being the minimum under a randomly chosen π.
It has been established that a min-wise independent family of permutations is at least of size $\operatorname{lcm}\{1, 2, \ldots, n\} \ge e^{n - o(n)}$.[22]
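Because exact min-wise independent families are impractically large, implementations typically substitute ordinary salted hash functions for random permutations. The sketch below (make_minhash and estimate_jaccard are hypothetical names, and Python's built-in hash stands in for an approximately min-wise independent family) estimates the Jaccard index from the fraction of matching signature slots.

```python
import random

def make_minhash(num_perm=128, seed=0):
    """Approximate MinHash signatures: salted hash functions stand in for
    truly random permutations, which are infeasible to sample exactly."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_perm)]

    def signature(items):
        # For each "permutation", keep the minimum hash value over the set.
        return tuple(min(hash((salt, it)) for it in items) for salt in salts)

    return signature

def estimate_jaccard(sig_a, sig_b):
    # The fraction of matching signature slots estimates J(A, B).
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sig = make_minhash()
a = sig({"apple", "banana", "cherry", "date"})
b = sig({"apple", "banana", "cherry", "fig"})
print(estimate_jaccard(a, b))  # close to the true Jaccard index 3/5
```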
Nilsimsa is a locality-sensitive hashing algorithm used in anti-spam efforts.
The paper suggests that the Nilsimsa satisfies three requirements:

1. The digest identifying each message should not vary significantly for changes that can be produced automatically.
2. The encoding must be robust against intentional attacks.
3. The encoding should support an extremely low risk of false positives.

Testing performed in the paper on a range of file types identified the Nilsimsa hash as having a significantly higher false positive rate when compared to other similarity digest schemes such as TLSH, Ssdeep and Sdhash.[24]
TLSH is a locality-sensitive hashing algorithm designed for a range of security and digital forensic applications.[17] The goal of TLSH is to generate hash digests for messages such that low distances between digests indicate that their corresponding messages are likely to be similar.[25]
The random projection method of LSH due to Moses Charikar,[8] called SimHash (also sometimes called arccos[26]), uses an approximation of the cosine distance between vectors. The basic idea of this technique is to choose a random hyperplane (defined by a normal unit vector r) at the outset and use the hyperplane to hash input vectors.[8] Given an input vector $v$ and a hyperplane defined by $r$, we let $h(v) = \operatorname{sgn}(v \cdot r)$; that is, $h(v) = \pm 1$ depending on which side of the hyperplane $v$ lies. This way, each possible choice of a random hyperplane $r$ can be interpreted as a hash function $h(v)$. For two vectors $u$ and $v$ with angle $\theta(u,v)$ between them, $\Pr[h(u) \ne h(v)] = \theta(u,v)/\pi$,[8][27] so the probability of two vectors being on different sides of the random hyperplane is approximately proportional to the cosine distance between them.
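A minimal Python sketch of this construction follows (the name random_hyperplane_hash is illustrative, not from any particular library): drawing the normals from a standard Gaussian makes their directions uniform on the sphere, and the fraction of matching sign bits across many hyperplanes estimates $1 - \theta(u,v)/\pi$.

```python
import numpy as np

def random_hyperplane_hash(dim, num_bits=64, seed=0):
    """SimHash-style family: each bit records the side of a random hyperplane.

    The hyperplane normals are drawn from a spherically symmetric (standard
    Gaussian) distribution, so Pr[bit_i(u) = bit_i(v)] = 1 - theta(u, v)/pi.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((num_bits, dim))  # one normal vector per bit

    def signature(v):
        return (planes @ v) >= 0  # boolean array: sign of each projection

    return signature

sig = random_hyperplane_hash(dim=5, num_bits=1024)
u = np.array([1.0, 2.0, 0.0, -1.0, 0.5])
v = np.array([1.1, 1.9, 0.1, -0.9, 0.4])
# Fraction of matching bits approximates 1 - theta(u, v)/pi.
print(np.mean(sig(u) == sig(v)))
```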
Each hash function in the family is indexed by a choice of random $\mathbf{a}$ and $b$, where $\mathbf{a}$ is a d-dimensional vector with entries chosen independently from a stable distribution and $b$ is a real number chosen uniformly from the range $[0, r]$. For fixed $\mathbf{a}$ and $b$, the hash function $h_{\mathbf{a},b}$ is given by $h_{\mathbf{a},b}(v) = \left\lfloor (\mathbf{a} \cdot v + b)/r \right\rfloor$.
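As an illustrative sketch (the name stable_lsh and the parameter values are assumptions, not a reference implementation): the Gaussian distribution is 2-stable, so using Gaussian entries for $\mathbf{a}$ yields a family in which nearby points under Euclidean distance tend to hash to the same integer.

```python
import numpy as np

def stable_lsh(dim, r=4.0, seed=0):
    """One hash function from the stable-distribution family:
    h(v) = floor((a . v + b) / r), with a ~ N(0, I) (the 2-stable case,
    for Euclidean distance) and b ~ Uniform[0, r]."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(dim)   # Gaussian entries: 2-stable
    b = rng.uniform(0.0, r)
    return lambda v: int(np.floor((a @ v + b) / r))

h = stable_lsh(dim=3)
# Nearby points usually fall in the same slot of width r along a random line.
print(h(np.array([1.0, 0.0, 2.0])), h(np.array([1.1, 0.1, 1.9])))
```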
Other construction methods for hash functions have been proposed to better fit the data.
Semantic hashing is a technique that attempts to map input items to addresses such that closer inputs have higher semantic similarity.[30] The hashcodes are found via training of an artificial neural network or graphical model.[citation needed]
One of the main applications of LSH is to provide a method for efficient approximate nearest neighbor search algorithms.
Consider an LSH family F. The algorithm has two main parameters: the width parameter k and the number of hash tables L. In the first step, a new family G of hash functions g is defined, where each g is obtained by concatenating k randomly chosen functions $h_1, \ldots, h_k$ from F, i.e., $g(p) = (h_1(p), \ldots, h_k(p))$. The algorithm constructs L hash tables, each corresponding to a different randomly chosen hash function g, and in a preprocessing step hashes all n data points into each of the L tables. Given a query point q, the algorithm then iterates over the L hash functions, retrieving for each the data points hashed into the same bucket as q. The process is stopped as soon as a point within distance cR from q is found.
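The following Python sketch (the class name LSHIndex and the reuse of the bit-sampling family are illustrative assumptions, not a canonical implementation) shows the k-fold concatenation and the L tables described above, with a query that stops at the first point within distance cR.

```python
import random
from collections import defaultdict

def bit_sampling_family(d):
    """Factory for the bit-sampling LSH family over {0,1}^d."""
    def make():
        i = random.randrange(d)          # pick one coordinate per function
        return lambda x: x[i]
    return make

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

class LSHIndex:
    """L hash tables, each keyed by g(p) = (h_1(p), ..., h_k(p)) with the
    h_i drawn at random from the supplied family."""

    def __init__(self, family, k, L):
        self.gs = [[family() for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, g, p):
        return tuple(h(p) for h in g)

    def insert(self, p):
        for g, table in zip(self.gs, self.tables):
            table[self._key(g, p)].append(p)

    def query(self, q, dist, c_R):
        # Scan q's bucket in each table; stop at the first point within cR.
        for g, table in zip(self.gs, self.tables):
            for p in table[self._key(g, q)]:
                if dist(p, q) <= c_R:
                    return p
        return None

index = LSHIndex(bit_sampling_family(8), k=4, L=10)
index.insert((1, 0, 1, 1, 0, 1, 0, 0))
index.insert((0, 1, 0, 0, 1, 0, 1, 1))
# Very likely returns the first stored point (Hamming distance 1 from q).
print(index.query((1, 0, 1, 0, 0, 1, 0, 0), hamming, c_R=2))
```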
Given the parameters k and L, the algorithm has the following performance guarantees:

- preprocessing time: $O(nLkt)$, where t is the time to evaluate a function $h \in F$ on an input point;
- space: $O(nL)$, plus the space for storing the data points;
- query time: $O(L(kt + dnP_2^k))$;
- the algorithm succeeds in finding a point within distance cR from q (if a point within distance R exists) with probability at least $1 - (1 - P_1^k)^L$.

For a fixed approximation ratio $c = 1 + \epsilon$ and collision probabilities $P_1$ and $P_2$ (the probabilities of collision at distances R and cR, respectively), one can set $k = \lceil \log n / \log(1/P_2) \rceil$ and $L = \lceil n^\rho \rceil$, where $\rho = \log(1/P_1)/\log(1/P_2)$. Then one obtains the following performance guarantees:

- preprocessing time: $O(n^{1+\rho} kt)$;
- space: $O(n^{1+\rho})$, plus the space for storing the data points;
- query time: $O(n^\rho (kt + d))$.

When t is large, it is possible to reduce the time spent on hashing, and thereby the query time (not including hashing costs) and similarly the space usage.
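As a worked illustration of this parameter setting (the helper name lsh_parameters is hypothetical), the following computes $k$, $L$, and $\rho$ from $n$, $P_1$, and $P_2$ using the formulas above.

```python
import math

def lsh_parameters(n, p1, p2):
    """Set k and L as in the text: k = ceil(log n / log(1/P2)),
    L = ceil(n ** rho), with rho = log(1/P1) / log(1/P2)."""
    rho = math.log(1 / p1) / math.log(1 / p2)
    k = math.ceil(math.log(n) / math.log(1 / p2))
    L = math.ceil(n ** rho)
    return k, L, rho

# Example: one million points, a family with P1 = 0.8 and P2 = 0.4
# gives k = 16, L = 29, and rho of roughly 0.24.
print(lsh_parameters(10**6, 0.8, 0.4))
```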