GloVe, coined from Global Vectors, is a model for distributed word representation.
The model is an unsupervised learning algorithm for obtaining vector representations for words.
This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity.[1] Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
As a log-bilinear regression model for unsupervised learning of word representations, it combines the features of two model families, namely global matrix factorization and local context window methods.
It is developed as an open-source project at Stanford[2] and was launched in 2014.
It was designed as a competitor to word2vec, and the original paper noted multiple improvements of GloVe over word2vec.
As of 2022, both approaches are outdated, and Transformer-based models, such as BERT, which add multiple neural-network attention layers on top of a word embedding model similar to Word2vec, have come to be regarded as the state of the art in NLP.
GloVe aims to learn, for each word {\displaystyle i}, two vectors {\displaystyle w_{i}} and {\displaystyle {\tilde {w}}_{i}}, such that the relative positions of the vectors capture part of the statistical regularities of the word {\displaystyle i}. The statistical regularity is defined as the co-occurrence probabilities: words that resemble each other in meaning should also resemble each other in co-occurrence probabilities.
Let the vocabulary be {\displaystyle V}, the set of all possible words (also called "tokens"). Punctuation is either ignored or treated as part of the vocabulary, and similarly for capitalization and other typographical details.
If two words occur within a fixed number of positions of each other (the context size), they are said to occur in each other's context. Let {\displaystyle X_{ij}} be the number of times the word {\displaystyle j} appears in the context of the word {\displaystyle i} over the entire corpus. For example, if the corpus is "I don't think that that is a problem.", then {\displaystyle X_{{\text{that}},{\text{that}}}=2}, since the first "that" appears in the second one's context, and vice versa.
Let {\displaystyle X_{i}:=\sum _{j}X_{ij}} be the number of words in the context of word {\displaystyle i}. By counting, {\displaystyle X_{i}=2\times ({\text{context size}})\times \#({\text{occurrences of word }}i)} (except for words occurring right at the start and end of the corpus).
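A minimal sketch of this counting step, assuming a plain Python list of already-tokenized words and a symmetric context window (the function name and `window` parameter are illustrative; the reference implementation additionally down-weights distant context words, which is omitted here):

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=3):
    """Count X[i][j]: how often word j appears within `window`
    positions of word i over the whole token list."""
    X = defaultdict(lambda: defaultdict(float))
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos == pos:
                continue  # a word is not in its own context
            X[word][tokens[ctx_pos]] += 1.0
    return X

tokens = "i don't think that that is a problem".split()
X = cooccurrence_counts(tokens, window=3)
print(X["that"]["that"])  # 2.0: each "that" lies in the other's context
```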
Let {\displaystyle P_{ik}:=P(k\mid i):={\frac {X_{ik}}{X_{i}}}} be the co-occurrence probability. That is, if one samples a random occurrence of the word {\displaystyle i} in the corpus and then a random word within its context, that word is {\displaystyle k} with probability {\displaystyle P_{ik}}. In general, {\displaystyle P_{ik}\neq P_{ki}}. For example, in a typical modern English corpus, {\displaystyle P({\text{much}}\mid {\text{ado}})} is much larger than {\displaystyle P({\text{ado}}\mid {\text{much}})}.
This is because the word "ado" is almost only used in the context of the archaic phrase "much ado about", but the word "much" occurs in all kinds of contexts.
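Continuing the sketch above (the counts `X` and the helper name are illustrative), the co-occurrence probabilities are simply normalized counts:

```python
def cooccurrence_prob(X, i, k):
    """P(k | i) = X[i][k] / X_i: the probability that a random word drawn
    from the contexts of word i is the word k."""
    X_i = sum(X[i].values())
    return X[i][k] / X_i if X_i > 0 else 0.0

# In a large corpus, cooccurrence_prob(X, "ado", "much") would be comparatively
# large, while cooccurrence_prob(X, "much", "ado") would be close to zero.
```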
For example, in a 6 billion token corpus, the co-occurrence probabilities of "ice" and "steam" with various probe words show that the two are indistinguishable along "water" (which often co-occurs with both) and "fashion" (which rarely co-occurs with either), but distinguishable along "solid" (which co-occurs more with "ice") and "gas" (which co-occurs more with "steam").
The vectors {\displaystyle w_{i},{\tilde {w}}_{i}} are fitted together with bias terms {\displaystyle b_{i},{\tilde {b}}_{i}}, such that we have a multinomial logistic regression: {\displaystyle w_{i}^{T}{\tilde {w}}_{j}+b_{i}+{\tilde {b}}_{j}\approx \ln P_{ij}} for all word pairs {\displaystyle i,j}.
Naively, this regression can be fitted by minimizing the squared loss: {\displaystyle L=\sum _{i,j}\left(w_{i}^{T}{\tilde {w}}_{j}+b_{i}+{\tilde {b}}_{j}-\ln P_{ij}\right)^{2}}
However, this would be noisy for rare co-occurrences.
To fix the issue, the squared loss is weighted so that it is slowly ramped up as the absolute number of co-occurrences {\displaystyle X_{ij}} increases: {\displaystyle L=\sum _{i,j}f(X_{ij})\left(w_{i}^{T}{\tilde {w}}_{j}+b_{i}+{\tilde {b}}_{j}-\ln P_{ij}\right)^{2}} where {\displaystyle f(x)=\min \left(1,(x/x_{\max })^{\alpha }\right)} and {\displaystyle x_{\max }}, {\displaystyle \alpha } are hyperparameters. Since {\displaystyle f(0)=0}, word pairs that never co-occur (for which {\displaystyle \ln P_{ij}} is undefined) drop out of the sum. In the original paper, the authors found that {\displaystyle x_{\max }=100} and {\displaystyle \alpha =3/4} worked well in practice.
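A compact NumPy sketch of this weighted objective, with plain full-batch gradient descent standing in for the AdaGrad updates used by the original implementation (array names, sizes, and the learning rate are illustrative assumptions):

```python
import numpy as np

def glove_loss_and_grads(W, W_t, b, b_t, X, x_max=100.0, alpha=0.75):
    """Weighted squared loss over all pairs with X[i, j] > 0, plus gradients.
    X is a dense (n_vocab, n_vocab) co-occurrence count matrix."""
    X_row = X.sum(axis=1)                       # X_i, total context counts per word
    i_idx, j_idx = np.nonzero(X)                # only observed co-occurrences
    log_p = np.log(X[i_idx, j_idx] / X_row[i_idx])
    f = np.minimum(1.0, (X[i_idx, j_idx] / x_max) ** alpha)   # loss weighting
    err = (W[i_idx] * W_t[j_idx]).sum(axis=1) + b[i_idx] + b_t[j_idx] - log_p
    loss = np.sum(f * err ** 2)
    g = 2.0 * f * err                           # d(loss)/d(err) per pair
    gW, gWt = np.zeros_like(W), np.zeros_like(W_t)
    gb, gbt = np.zeros_like(b), np.zeros_like(b_t)
    np.add.at(gW, i_idx, g[:, None] * W_t[j_idx])
    np.add.at(gWt, j_idx, g[:, None] * W[i_idx])
    np.add.at(gb, i_idx, g)
    np.add.at(gbt, j_idx, g)
    return loss, (gW, gWt, gb, gbt)

# Illustrative sizes; in practice the embedding dimension is typically 50-300.
n_vocab, dim, lr = 1000, 50, 0.01
rng = np.random.default_rng(0)
W, W_t = 0.1 * rng.normal(size=(n_vocab, dim)), 0.1 * rng.normal(size=(n_vocab, dim))
b, b_t = np.zeros(n_vocab), np.zeros(n_vocab)
# Given a co-occurrence matrix X, one gradient-descent step would be:
#   loss, (gW, gWt, gb, gbt) = glove_loss_and_grads(W, W_t, b, b_t, X)
#   W -= lr * gW; W_t -= lr * gWt; b -= lr * gb; b_t -= lr * gbt
```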
Once the model is trained, there are four trained parameters for each word: {\displaystyle w_{i},{\tilde {w}}_{i},b_{i},{\tilde {b}}_{i}}. The biases are discarded, and the authors recommended using {\displaystyle w_{i}+{\tilde {w}}_{i}} as the final representation vector for word {\displaystyle i}, because empirically it worked better than {\displaystyle w_{i}} or {\displaystyle {\tilde {w}}_{i}} alone.
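In terms of the (hypothetical) matrices from the training sketch above, this amounts to a single sum:

```python
# Final word vectors: sum of the word and context embeddings.
# The biases b and b_t are discarded.
final_vectors = W + W_t      # shape (n_vocab, dim)
```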
GloVe can be used to find relations between words, such as synonyms, company-product relations, and zip codes and their cities.
However, the unsupervised learning algorithm is not effective in identifying homographs, i.e., words with the same spelling and different meanings.
This is because the unsupervised learning algorithm calculates a single set of vectors for words with the same morphological structure.[5] The algorithm is also used by the spaCy library to build semantic word-embedding features and to compute the words most similar to a given word under distance measures such as cosine similarity and Euclidean distance.[6] GloVe was also used as the word representation framework in online and offline systems designed to detect psychological distress in patient interviews.
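As an illustration of such distance-based lookups, the following sketch loads vectors stored in the plain-text format of the pre-trained GloVe releases (one word followed by its vector components per line) and ranks the vocabulary by cosine similarity; the file path and query word are assumptions:

```python
import numpy as np

def load_glove(path):
    """Load vectors from a whitespace-separated text file:
    each line is a word followed by its vector components."""
    words, vecs = [], []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vecs.append(np.asarray(parts[1:], dtype=np.float32))
    return words, np.vstack(vecs)

def most_similar(query, words, vecs, topn=5):
    """Rank the vocabulary by cosine similarity to `query`."""
    q = vecs[words.index(query)]
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    best = np.argsort(-sims)
    return [(words[i], float(sims[i])) for i in best[:topn + 1] if words[i] != query][:topn]

# words, vecs = load_glove("glove.6B.50d.txt")   # hypothetical local path
# print(most_similar("ice", words, vecs))
```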