In computer science and data mining, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are.
Let U be a set and let A and B be subsets of U. The Jaccard index is defined as the ratio of the number of elements of their intersection to the number of elements of their union:

J(A, B) = |A ∩ B| / |A ∪ B|.

This value is 0 when the two sets are disjoint, 1 when they are equal, and strictly between 0 and 1 otherwise.
Two sets are more similar (i.e. have relatively more members in common) when their Jaccard index is closer to 1.
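For concreteness, the exact index is straightforward to compute when both sets are fully available (a minimal sketch; the function name jaccard is ours, and the convention for two empty sets is an assumption):

    def jaccard(a, b):
        """Exact Jaccard index J(A, B) = |A ∩ B| / |A ∪ B|."""
        if not a and not b:
            return 1.0  # convention for two empty sets (assumption, not universal)
        return len(a & b) / len(a | b)

    # {2, 3} is the intersection and {1, 2, 3, 4} the union, so J = 2/4 = 0.5
    print(jaccard({1, 2, 3}, {2, 3, 4}))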
The goal of MinHash is to estimate J(A,B) quickly, without explicitly computing the intersection and union.
Let h be a hash function that maps the members of U to distinct integers, and let perm be a random permutation of the elements of U. For any subset S of U, define h_min(S) to be the member x of S with the minimum value of h(perm(x)). (In cases where the hash function used is assumed to have pseudo-random properties, the random permutation would not be used.) Applying h_min to both A and B, and assuming no hash collisions, h_min(A) = h_min(B) holds exactly when the element of A ∪ B with the smallest (permuted) hash value lies in A ∩ B. Since each element of the union is equally likely to attain that minimum, the probability of this event is J(A, B); the 0-1 indicator of the event is therefore an unbiased estimator of J(A, B), but a single indicator has too high a variance to be useful on its own.
The idea of the MinHash scheme is to reduce this variance by averaging together several variables constructed in the same way.
In the simplest version of the scheme, k different hash functions are used, and each set S is represented by a signature consisting of its k minimum hash values, one per function; J(A, B) is then estimated as the fraction of the k functions for which the two sets' minima agree. Each comparison is one of the 0-1 indicator variables above. Therefore, their average is also an unbiased estimator, and by standard Chernoff bounds for sums of 0-1 random variables, its expected error is O(1/√k).
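A sketch of this many-hash-function variant (illustrative only; salted blake2b hashes stand in for the k independent hash functions, and all function names are our own):

    import hashlib

    def h(x, salt):
        """One of k hash functions, simulated by salting a stable hash."""
        d = hashlib.blake2b(f"{salt}:{x}".encode(), digest_size=8).digest()
        return int.from_bytes(d, "big")

    def signature(s, k):
        """MinHash signature: the minimum hash value of s under each of k functions."""
        return [min(h(x, salt) for x in s) for salt in range(k)]

    def estimate_jaccard(sig_a, sig_b):
        """Average of k 0-1 indicators: the fraction of coordinates that agree."""
        return sum(u == v for u, v in zip(sig_a, sig_b)) / len(sig_a)

    a = set(range(100))
    b = set(range(50, 150))  # true Jaccard index: 50/150 ≈ 0.33
    print(estimate_jaccard(signature(a, 256), signature(b, 256)))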
It may be computationally expensive to compute multiple hash functions, but a related version of the MinHash scheme avoids this penalty by using only a single hash function and using it to select multiple values from each set rather than selecting only a single minimum value per hash function. Let h be a hash function and let k be a fixed integer; the signature of a set S is h(k)(S), the subset of the k members of S with the smallest hash values. Because the k smallest hash values of A ∪ B all appear in one of the two signatures, the set X = h(k)(A ∪ B) can be computed from the signatures alone, and it forms a simple random sample of A ∪ B. The subset Y of X consisting of elements that belong to both A and B is then a sample of the intersection, so |Y|/k is an unbiased estimator of J(A, B).
By standard Chernoff bounds for sampling without replacement, this estimator has expected error O(1/√k), matching the performance of the multiple-hash-function scheme.
The estimator |Y|/k can be computed in time O(k) from the two signatures of the given sets, in either variant of the scheme.
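The single-hash variant might be sketched as follows (again illustrative; signatures are stored as sets of hash values, and hash collisions are assumed away):

    import hashlib, heapq

    def h(x):
        """A single stable hash function to pseudo-random integers."""
        d = hashlib.blake2b(str(x).encode(), digest_size=8).digest()
        return int.from_bytes(d, "big")

    def bottom_k(s, k):
        """Signature h(k)(S): the k smallest hash values over the members of S."""
        return set(heapq.nsmallest(k, {h(x) for x in s}))

    def estimate_jaccard(sig_a, sig_b, k):
        # X = h(k)(A ∪ B): the k smallest values of the union appear in the
        # two signatures, so X is computable from them alone.
        x = set(heapq.nsmallest(k, sig_a | sig_b))
        # Y: members of X present in both signatures -- assuming no hash
        # collisions, exactly the sampled elements of A ∩ B.
        y = x & sig_a & sig_b
        return len(y) / k

    k = 200
    a = set(range(1000))
    b = set(range(500, 1500))  # true Jaccard index: 500/1500 ≈ 0.33
    print(estimate_jaccard(bottom_k(a, k), bottom_k(b, k), k))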
Computing the signatures dominates the running time: for sets of size n, the many-hash variant takes O(nk) time, while the single-hash variant is generally faster, requiring O(n) time to maintain the queue of the k smallest hash values, assuming n ≫ k.[1]

A variety of techniques to introduce weights into the computation of MinHashes have been developed.
One family of approaches, for elements with real-valued weights, is based on exponential random variables: a uniformly random hash u between 0 and 1 can be converted to follow an exponential distribution with rate w by CDF inversion, taking -ln(u)/w. This method exploits the many beautiful properties of the minimum of a set of exponential variables; in particular, the minimum of independent exponentials with rates w_x falls on element x with probability proportional to w_x, so keeping the minimizing element as the sketch selects each element with probability equal to its normalized weight. This yields as its collision probability the probability Jaccard index.[7]
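A sketch of this weighted construction as described above (the function names are our own; the per-element uniform draws are derived from a shared hash so that the same element receives the same draw in both sets):

    import hashlib, math

    def uniform(x, seed):
        """Pseudo-random uniform in (0, 1) derived from a hash of x; the same
        element always receives the same draw for a given seed."""
        d = hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest()
        return (int.from_bytes(d, "big") + 1) / (2**64 + 1)

    def weighted_minhash(weights, seed):
        """Return the element minimizing -ln(u_x)/w_x over a dict {element: weight}.
        CDF inversion turns the uniform u_x into an exponential with rate w_x,
        and the minimum of independent exponentials with rates w_x falls on
        element x with probability w_x / (sum of all weights)."""
        return min(weights, key=lambda x: -math.log(uniform(x, seed)) / weights[x])

    # The collision rate over many independent sketches estimates the
    # probability Jaccard index of the two weighted sets.
    a = {"cat": 2.0, "dog": 1.0}
    b = {"cat": 1.0, "dog": 1.0, "fox": 1.0}
    k = 1000
    print(sum(weighted_minhash(a, s) == weighted_minhash(b, s) for s in range(k)) / k)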
In order to implement the MinHash scheme as described above, one needs the hash function h to define a random permutation on n elements, where n is the total number of distinct elements in the union of all of the sets to be compared.
But because there are n! different permutations, it would require Ω(n log n) bits just to specify a truly random permutation, an infeasibly large number for even moderate values of n. Because of this fact, by analogy to the theory of universal hashing, there has been significant work on finding a family of permutations that is "min-wise independent", meaning that for any subset of the domain, any element is equally likely to be the minimum.
This guarantee is, among other things, sufficient to give the Jaccard bound required by the MinHash algorithm.
Exactly min-wise independent families are too large to be practical, so approximately min-wise independent families, in which each element's probability of being the minimum is only required to be close to uniform, are used instead. Because a hash function drawn from such a family can be specified using far fewer bits, this approach is much more practical than using completely min-wise independent permutations.[12]
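In practice (a common concrete choice by analogy to universal hashing, not one mandated by the sources above), hash functions drawn from a small universal family serve as the cheap stand-ins for permutations:

    import random

    P = (1 << 61) - 1  # a Mersenne prime larger than the universe of elements

    def draw_hash(rng):
        """h(x) = (a*x + b) mod P, drawn from a universal family. Storing
        (a, b) takes O(log P) bits, versus Ω(n log n) bits for a truly
        random permutation of an n-element universe."""
        a, b = rng.randrange(1, P), rng.randrange(P)
        return lambda x: (a * x + b) % P

    rng = random.Random(7)
    hash_funcs = [draw_hash(rng) for _ in range(128)]  # 128 cheap "permutations"
    print(min(hash_funcs[0](x) for x in {3, 1, 4, 5}))  # one min-hash value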
In data mining, Cohen et al. (2001) use MinHash as a tool for association rule learning.
Given a database in which each entry has multiple attributes (viewed as a 0–1 matrix with a row per database entry and a column per attribute), they use MinHash-based approximations to the Jaccard index to identify candidate pairs of attributes that frequently co-occur, and then compute the exact value of the index for only those pairs to determine the ones whose frequencies of co-occurrence are below a given strict threshold.[16]
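The screen-then-verify pipeline could look like the following sketch (hypothetical helper names; signature and estimate_jaccard are meant in the sense of the many-hash sketch above, and the final threshold test is left to the caller):

    from itertools import combinations

    def exact_index_for_candidates(columns, k, threshold, signature, estimate_jaccard):
        """columns maps each attribute to the set of rows where it is 1.
        MinHash estimates cheaply screen pairs of attributes; the exact
        Jaccard index is computed only for the surviving candidates."""
        sigs = {c: signature(rows, k) for c, rows in columns.items()}
        exact = {}
        for c1, c2 in combinations(columns, 2):
            if estimate_jaccard(sigs[c1], sigs[c2]) >= threshold:  # cheap screen
                s1, s2 = columns[c1], columns[c2]
                exact[c1, c2] = len(s1 & s2) / len(s1 | s2)        # exact value
        return exact  # apply the final threshold test to these exact values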
In bioinformatics, accurate average nucleotide identity (ANI) values can be generated very efficiently with MinHash-based algorithms.
Other locality sensitive hashing techniques exist for Hamming distance between sets and cosine distance between vectors; locality sensitive hashing has important applications in nearest-neighbor search algorithms.[18]
For large distributed systems, and in particular MapReduce, there exist modified versions of MinHash to help compute similarities with no dependence on the point dimension.[19] A large-scale evaluation was conducted by Google in 2006[20] to compare the performance of MinHash and SimHash[21] algorithms.