[1][2] It works by applying a hash function to the features and using their hash values as indices directly (after a modulo operation), rather than looking the indices up in an associative array.
In addition to its use for encoding non-numeric values, feature hashing can also be used for dimensionality reduction.[2]
This trick is often attributed to Weinberger et al. (2009),[2] but a much earlier description of the method was published by John Moody in 1989.
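As a minimal illustration of the hash-then-modulo indexing described above (the CRC-32 hash and the table size below are arbitrary choices, not prescribed by the method):

    import zlib

    N = 20  # number of vector entries; an arbitrary illustrative size
    feature = "cat"
    index = zlib.crc32(feature.encode("utf-8")) % N   # hash value reduced modulo N
    # 'index' is used directly as this feature's position in the vector,
    # with no dictionary lookup involved.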
In a typical document classification task, the input to the machine learning algorithm (both during learning and classification) is free text. From this, a bag of words (BOW) representation is constructed: the individual tokens are extracted and counted, and each distinct token in the training set defines a feature (independent variable) of each of the documents in both the training and test sets.
Machine learning algorithms, however, are typically defined in terms of numerical vectors.
The bags of words for a set of documents are therefore regarded as a term-document matrix in which each row is a single document and each column is a single feature/word; the entry (i, j) of such a matrix captures the frequency (or weight) of the j-th term of the vocabulary in document i. (An alternative convention swaps the rows and columns of the matrix, but this difference is immaterial.)
The common approach is to construct, at learning time or prior to that, a dictionary representation of the vocabulary of the training set, and use that to map words to indices.
Hash tables and tries are common candidates for dictionary implementation.
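For instance, a dictionary-based bag-of-words vectorizer might look like the following sketch (the function names, whitespace tokenization, and the plain dict data structure are illustrative choices, not a prescribed implementation):

    def build_vocabulary(training_documents):
        # Assign each distinct training-set token its own column index.
        vocabulary = {}
        for doc in training_documents:
            for token in doc.split():
                if token not in vocabulary:
                    vocabulary[token] = len(vocabulary)
        return vocabulary

    def vectorize(doc, vocabulary):
        # Count occurrences of known tokens; tokens not in the vocabulary are dropped.
        x = [0] * len(vocabulary)
        for token in doc.split():
            if token in vocabulary:
                x[vocabulary[token]] += 1
        return x

The vocabulary dictionary here is precisely the per-token storage whose growth is discussed next.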
The problem with this process is that such dictionaries take up a large amount of storage space and grow in size as the training set grows.[3] On the contrary, if the vocabulary is kept fixed and not increased with a growing training set, an adversary may try to invent new words or misspellings that are not in the stored vocabulary so as to circumvent a machine learned filter.
For this reason, Yahoo! Research attempted to use feature hashing for their spam filters.[4]

Note that the hashing trick isn't limited to text classification and similar tasks at the document level, but can be applied to any problem that involves large (perhaps unbounded) numbers of features.
Mathematically, a token is an element t of a finite (or countably infinite) set of tokens T. If we only need to process a finite corpus, we can put every token appearing in the corpus into T, so that T is finite. However, suppose we want to process all possible words made of the English letters; then T is countably infinite.
Most neural networks can only operate on real vector inputs, so we must construct a "dictionary" function φ : T → R^n.
When T is finite, of size |T| = m ≤ n, we can use one-hot encoding: arbitrarily enumerate T = {t_1, ..., t_m} and define φ(t_i) = e_i, the i-th standard basis vector. One-hot encoding is easy to interpret, but it requires one to maintain the arbitrary enumeration of T: to compute φ(t) for a token t, one must look up its index i, so an efficient implementation needs a fast-to-compute bijection h : T → {1, ..., m}, after which φ(t) = e_h(t) (the basis vector whose index is h(t)).
In fact, we can relax the requirement slightly: it suffices to have a fast-to-compute injection h : T → {1, 2, ...}, and then use φ(t) = e_h(t).
In practice, there is no simple way to construct an efficient injection h : T → {1, 2, ...}. However, a strict injection is not needed: an approximate injection suffices, meaning that whenever t ≠ t′, we should have h(t) ≠ h(t′) with high probability. This is exactly the requirement that h be a hash function, which leads to the idea of feature hashing; a minimal sketch of the contrast appears below.
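The two options can be made concrete as follows (the CRC-32 hash and the dimension N are arbitrary illustrative choices):

    import zlib

    N = 2**18  # hashed dimension; an illustrative size

    # Exact indexing: a bijection maintained as a dictionary that grows with the vocabulary.
    vocabulary = {}
    def exact_index(token):
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)
        return vocabulary[token]

    # Approximate injection: a hash of the token, reduced modulo N.
    # Distinct tokens get distinct indices with high probability, but collisions can occur.
    def hashed_index(token):
        return zlib.crc32(token.encode("utf-8")) % N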
The basic feature hashing algorithm presented in (Weinberger et al. 2009)[2] is defined as follows. First, one specifies two hash functions: the kernel hash h : T → {1, 2, ..., N} and the sign hash ξ : T → {+1, −1}. Next, one defines the feature hashing function φ : T → R^N by φ(t) = ξ(t) e_h(t). Finally, extend this feature hashing function to strings of tokens by φ(t_1, ..., t_k) = φ(t_1) + ... + φ(t_k).
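A direct transcription of this definition might look like the sketch below; deriving the two hash functions from salted MD5 digests is an illustrative choice, and indices are 0-based for programming convenience:

    import hashlib

    def _digest(token, salt):
        # Deterministic integer digest of a token; salted MD5 is an illustrative choice.
        return int.from_bytes(hashlib.md5((salt + token).encode("utf-8")).digest(), "big")

    def kernel_hash(token, N):
        # h : T -> {0, ..., N-1}
        return _digest(token, "index:") % N

    def sign_hash(token):
        # xi : T -> {+1, -1}
        return 1 if _digest(token, "sign:") % 2 == 0 else -1

    def phi(tokens, N):
        # phi(t_1, ..., t_k) = sum over j of xi(t_j) * e_{h(t_j)}
        x = [0.0] * N
        for t in tokens:
            x[kernel_hash(t, N)] += sign_hash(t)
        return x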
On its own, T is just a set of tokens with no geometric structure. To speak of geometry, we lift it to the free inner product space generated by T, in which the vectors e_t for t in T form an orthonormal basis. Taking its completion would get us to a Hilbert space, which allows well-behaved infinite sums.
Now we have an inner product space, with enough structure to describe the geometry of the feature hashing function, extended by linearity from tokens to the whole lifted space. In particular, when the sign hash ξ is modeled as random, feature hashing preserves inner products in expectation: E[⟨φ(x), φ(x′)⟩] = ⟨x, x′⟩ (a small numerical check is sketched below). The above statement and proof interpret the binary hash function ξ not as a deterministic function, but as a binary random variable.
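A minimal Monte Carlo check of this property, drawing fresh random kernel and sign hashes in every trial (all data, the dimension N, and the trial count are illustrative):

    import random

    def random_feature_hashing(tokens, N, rng):
        # Draw a random kernel hash h and sign hash xi over the given tokens,
        # and return phi as a function producing dense vectors of length N.
        h = {t: rng.randrange(N) for t in tokens}
        xi = {t: rng.choice((-1, 1)) for t in tokens}
        def phi(x):                       # x: dict mapping token -> weight
            v = [0.0] * N
            for t, w in x.items():
                v[h[t]] += xi[t] * w
            return v
        return phi

    # Two "documents" as sparse token-weight maps (illustrative data).
    x  = {"cat": 2.0, "dog": 1.0, "fish": 1.0}
    xp = {"cat": 1.0, "bird": 3.0}
    true_ip = sum(w * xp.get(t, 0.0) for t, w in x.items())   # <x, x'> = 2.0

    rng = random.Random(0)
    N, trials = 16, 20000
    est = 0.0
    for _ in range(trials):
        phi = random_feature_hashing(set(x) | set(xp), N, rng)
        u, v = phi(x), phi(xp)
        est += sum(a * b for a, b in zip(u, v))
    print(true_ip, est / trials)   # the average should be close to 2.0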
For a rigorous statement and proof, see Weinberger et al. (2009).[2]

Instead of maintaining a dictionary, a feature vectorizer that uses the hashing trick can build a vector of a pre-defined length by applying a hash function h to the features (e.g., words), then using the hash values directly as feature indices and updating the resulting vector at those indices; a minimal sketch of such a vectorizer follows.
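In the sketch below, CRC-32 stands in for the hash function h, and each occurrence of a feature contributes a count of 1; both are illustrative assumptions:

    import zlib

    def hashing_vectorizer(features, N):
        # Fixed-length output vector; no dictionary is consulted or stored.
        x = [0.0] * N
        for f in features:
            idx = zlib.crc32(f.encode("utf-8")) % N   # hash value used directly as the index
            x[idx] += 1.0
        return x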
It has been suggested that a second, single-bit output hash function ξ be used to determine the sign of the update value, to counter the effect of hash collisions.[2] If such a hash function is used, the algorithm becomes the signed variant sketched below.
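A sketch of the signed variant; using Adler-32 as the second, single-bit hash is purely illustrative:

    import zlib

    def signed_hashing_vectorizer(features, N):
        # As above, but a second single-bit hash xi decides whether to add or subtract 1,
        # so that colliding features tend to cancel rather than accumulate.
        x = [0.0] * N
        for f in features:
            data = f.encode("utf-8")
            idx = zlib.crc32(data) % N
            sign = 1.0 if zlib.adler32(data) % 2 == 0 else -1.0   # illustrative second hash
            x[idx] += sign
        return x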
The vectorizers above convert each sample into a full vector of length N. An optimized implementation would instead generate only a stream of (h(f), ξ(f)) pairs and let the learning and prediction algorithms consume such streams; a linear model can then be implemented as a single hash table representing the coefficient vector, as in the sketch below.
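One possible shape of such an optimized implementation, with a perceptron standing in for the linear learner (the choice of learner and of hash functions is illustrative):

    import zlib

    def hashed_pairs(features, N):
        # Stream of (index, sign) pairs; no per-sample vector is materialized.
        for f in features:
            data = f.encode("utf-8")
            yield zlib.crc32(data) % N, (1.0 if zlib.adler32(data) % 2 == 0 else -1.0)

    def predict(weights, features, N):
        # Linear model stored as a single hash table (dict) from hashed index to coefficient.
        return sum(sign * weights.get(idx, 0.0) for idx, sign in hashed_pairs(features, N))

    def perceptron_update(weights, features, label, N):
        # One online learning step; label is +1 or -1.
        if label * predict(weights, features, N) <= 0:
            for idx, sign in hashed_pairs(features, N):
                weights[idx] = weights.get(idx, 0.0) + label * sign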
Hash collisions mean that two distinct tokens can be mapped to the same index. A machine learning model trained on feature-hashed words would then have difficulty distinguishing the colliding tokens from one another. To handle this, one can train supervised hashing functions that avoid mapping common tokens to the same feature vectors.[5]

Ganchev and Dredze showed that in text classification applications with random hash functions and several tens of thousands of columns in the output vectors, feature hashing need not have an adverse effect on classification performance, even without the signed hash function.[2]
Chen et al. (2015) combined the ideas of feature hashing and sparse matrices to construct "virtual matrices": large matrices with small storage requirements.
With virtual matrices, they constructed HashedNets, which are large neural networks taking only small amounts of storage.
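As a rough sketch of the weight-sharing idea behind such virtual matrices (the hash functions, initialization, and class layout below are illustrative, not the construction used by Chen et al.):

    import zlib

    class HashedLinear:
        # A "virtual" weight matrix: entry (i, j) is read from a small shared parameter
        # vector via a hash of the position, with a sign hash applied on top.
        def __init__(self, n_in, n_out, n_params):
            self.n_in, self.n_out = n_in, n_out
            self.n_params = n_params
            self.w = [0.01] * n_params        # the only stored parameters

        def _entry(self, i, j):
            key = f"{i},{j}".encode("utf-8")
            idx = zlib.crc32(key) % self.n_params
            sign = 1.0 if zlib.adler32(key) % 2 == 0 else -1.0
            return sign * self.w[idx]

        def forward(self, x):                 # x: list of length n_in
            return [sum(self._entry(i, j) * x[i] for i in range(self.n_in))
                    for j in range(self.n_out)]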