Word2vec

Word2vec is a technique in natural language processing (NLP) for obtaining vector representations of words.

The closeness of vectors (as measured by cosine similarity) indicates the level of semantic similarity between the words: for example, the vectors for "walk" and "ran" are nearby, as are those for "but" and "however", and "Berlin" and "Germany".
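
Nearness here is usually measured with cosine similarity between the word vectors. A minimal sketch of that comparison, using NumPy and small made-up 3-dimensional vectors purely for illustration (real word2vec vectors typically have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means the same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, invented for illustration only -- not real word2vec output.
vec = {
    "walk":   np.array([0.9, 0.1, 0.0]),
    "ran":    np.array([0.8, 0.2, 0.1]),
    "berlin": np.array([0.1, 0.9, 0.3]),
}

print(cosine_similarity(vec["walk"], vec["ran"]))     # high: related words
print(cosine_similarity(vec["walk"], vec["berlin"]))  # lower: unrelated words
```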

These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words.

Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.

In both of its architectures, the continuous bag-of-words (CBOW) and skip-gram models, word2vec considers both individual words and a sliding context window as it iterates over the corpus.

According to the authors' note,[3] CBOW is faster while skip-gram does a better job for infrequent words.

Both CBOW and skip-gram are methods to learn one vector per word appearing in the corpus.
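
As an illustration of the two training modes, the sketch below uses the gensim library (version 4.x assumed; the corpus and parameter values are arbitrary placeholders), where sg=0 selects CBOW and sg=1 selects skip-gram:

```python
from gensim.models import Word2Vec

# A toy corpus: in practice word2vec is trained on millions of sentences.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["berlin", "is", "the", "capital", "of", "germany"],
]

# sg=0 -> CBOW (faster); sg=1 -> skip-gram (better for rare words).
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(cbow.wv["king"].shape)                     # one 100-dimensional vector per word
print(skipgram.wv.similarity("king", "queen"))   # cosine similarity between two words
```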

The word vectors within a context window are summed, and the dot-product-softmax of this sum with every other vector sum (a step similar to the attention mechanism in Transformers) is then taken to obtain a probability distribution.
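
Written out, that probability typically takes a softmax form. As a sketch (using the common formulation in which the summed context vector is scored against every candidate word vector; this is an assumption about the intended formula rather than a quotation of it), with $v_w$ the vector of word $w$, $V$ the vocabulary, and $\bar v$ the sum of the vectors in the context window:

$$\Pr(w \mid \text{context}) = \frac{\exp(v_w \cdot \bar v)}{\sum_{w' \in V} \exp(v_{w'} \cdot \bar v)}, \qquad \bar v = \sum_{j \in \text{window}} v_{w_j}.$$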

In 2010, Tomáš Mikolov (then at Brno University of Technology) with co-authors applied a simple recurrent neural network with a single hidden layer to language modelling.

[6] Word2vec was created, patented,[7] and published in 2013 by a team of researchers led by Mikolov at Google, across two papers.

Deep contextual models, such as ELMo (based on bidirectional LSTMs) and the Transformer-based BERT, which add multiple neural-network layers on top of a word-embedding layer similar to Word2vec's, have come to be regarded as the state of the art in NLP.

To approximate the conditional log-likelihood that a model seeks to maximize, the hierarchical softmax method uses a Huffman tree to reduce computation: a sum over the whole vocabulary is replaced by a sequence of binary decisions whose length grows only logarithmically with the vocabulary size.
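
A minimal sketch of this idea (not gensim's or Google's actual implementation; the vocabulary, frequencies, dimensions, and vectors below are invented for illustration) builds a Huffman tree over the vocabulary and expresses the probability of a word as a short product of sigmoid-scored binary decisions along its path:

```python
import heapq
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def huffman_paths(freqs):
    """Build a Huffman tree over the vocabulary and return, for each word,
    its path from the root as a list of (internal_node_id, direction) pairs."""
    tiebreak = itertools.count()
    heap = [(f, next(tiebreak), w) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    root = heap[0][2]

    paths, node_id = {}, itertools.count()
    def walk(tree, path):
        if isinstance(tree, tuple):            # internal node
            nid = next(node_id)
            walk(tree[0], path + [(nid, +1)])  # left child
            walk(tree[1], path + [(nid, -1)])  # right child
        else:                                  # leaf = word
            paths[tree] = path
    walk(root, [])
    return paths

# Invented word frequencies and vectors, purely for illustration.
freqs = {"the": 100, "rules": 12, "king": 10, "queen": 8, "kingdom": 5}
paths = huffman_paths(freqs)

rng = np.random.default_rng(0)
dim = 10
node_vecs = {nid: rng.normal(size=dim)
             for nid in {nid for p in paths.values() for nid, _ in p}}
context = rng.normal(size=dim)   # stand-in for the hidden/context vector

def probability(word):
    """P(word | context): one sigmoid per internal node on the Huffman path."""
    return float(np.prod([sigmoid(d * (node_vecs[nid] @ context))
                          for nid, d in paths[word]]))

print({w: round(probability(w), 3) for w in freqs})
print(sum(probability(w) for w in freqs))   # sums to ~1.0 over the vocabulary
```

Because each internal node splits probability between its two children via sigmoid(x) and sigmoid(-x), the product over any leaf's path yields a valid distribution while touching only about log2(|V|) nodes per word.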

An extension of word2vec, doc2vec, generates distributed representations of variable-length pieces of text, such as sentences, paragraphs, or entire documents.

The first, Distributed Memory Model of Paragraph Vectors (PV-DM), is identical to CBOW except that it also provides a unique document identifier as a piece of additional context.
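
A brief sketch of this, assuming gensim 4.x (the corpus, tags, and parameter values are placeholders), where dm=1 selects the PV-DM architecture and each training document carries a unique tag:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document gets a unique identifier via its tag.
docs = [
    TaggedDocument(words=["the", "king", "rules", "the", "kingdom"], tags=["doc_0"]),
    TaggedDocument(words=["berlin", "is", "the", "capital", "of", "germany"], tags=["doc_1"]),
]

# dm=1 -> Distributed Memory (PV-DM); dm=0 would select PV-DBOW instead.
model = Doc2Vec(docs, vector_size=50, window=3, min_count=1, dm=1, epochs=40)

print(model.dv["doc_0"].shape)                        # one vector per document
print(model.infer_vector(["the", "queen", "rules"]))  # embed an unseen document
```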

[14] doc2vec can also capture the semantic ‘meanings’ of additional pieces of ‘context’ around words; for example, it can estimate semantic embeddings for speakers or speaker attributes, groups, and periods of time.

[17] Another extension of word2vec is top2vec, which leverages both document and word embeddings to estimate distributed representations of topics.

[18][19] top2vec takes document embeddings learned from a doc2vec model and reduces them to a lower-dimensional space (typically using UMAP).
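
The general shape of that pipeline can be sketched as follows, assuming the umap-learn and hdbscan packages and a matrix of doc2vec document vectors (the data here is random stand-in input); this illustrates the idea rather than top2vec's exact code:

```python
import numpy as np
import umap       # from the umap-learn package
import hdbscan

# Stand-in for document embeddings from a trained doc2vec model (1000 docs, 300-dim).
doc_vectors = np.random.default_rng(0).normal(size=(1000, 300))

# Reduce the high-dimensional document vectors to a low-dimensional space.
reduced = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine").fit_transform(doc_vectors)

# Dense clusters of documents in the reduced space are treated as topics.
labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(reduced)

# One topic vector per cluster: the centroid of its documents in the original space.
topic_vectors = [doc_vectors[labels == c].mean(axis=0) for c in set(labels) if c != -1]
```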

An extension of word vectors to n-grams in biological sequences (e.g. DNA, RNA, and proteins) for bioinformatics applications, known as BioVectors, has been proposed by Asgari and Mofrad.

The results suggest that BioVectors can characterize biological sequences in terms of biochemical and biophysical interpretations of the underlying patterns.

If the Word2vec model has not encountered a particular word before, it will be forced to use a random vector, which is generally far from its ideal representation.
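
In practice this appears as a vocabulary lookup failure. A small sketch with gensim (4.x assumed; the corpus, the embed helper, and the random fallback are hypothetical, not part of the library) of guarding against out-of-vocabulary words:

```python
import numpy as np
from gensim.models import Word2Vec

# Tiny placeholder training corpus.
model = Word2Vec([["the", "king", "rules"]], vector_size=100, min_count=1)

def embed(word, model):
    """Return the learned vector if the word was seen in training, else a fallback."""
    if word in model.wv.key_to_index:   # vocabulary check
        return model.wv[word]
    # Hypothetical fallback: a random vector, generally far from the word's ideal representation.
    return np.random.default_rng(hash(word) % (2**32)).normal(size=model.vector_size)

print(embed("king", model)[:3])     # in vocabulary: learned vector
print(embed("senate", model)[:3])   # out of vocabulary: random fallback
```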

IWE combines Word2vec with a semantic dictionary mapping technique to tackle the major challenges of information extraction from clinical texts, which include the ambiguity of the free-text narrative style, lexical variations, use of ungrammatical and telegraphic phrases, arbitrary ordering of words, and frequent appearance of abbreviations and acronyms.

The reasons for successful word embedding learning in the word2vec framework are poorly understood.

[4] Levy et al. (2015)[24] show that much of the superior performance of word2vec or similar embeddings in downstream tasks is not a result of the models per se, but of the choice of specific hyperparameters.

Transferring these hyperparameters to more 'traditional' approaches yields similar performances in downstream tasks.

Arora et al. (2016)[25] explain word2vec and related algorithms as performing inference for a simple generative model of text, which involves a random-walk generation process based upon a log-linear topic model.
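
As a rough sketch of that model (an interpretation of the standard presentation of Arora et al.'s random-walk model, not a quotation of it), a latent ‘discourse’ vector $c_t$ drifts slowly over time, and the probability of emitting word $w$ at step $t$ is log-linear in its word vector $v_w$:

$$\Pr[\,w \text{ emitted at step } t \mid c_t\,] \propto \exp(\langle c_t, v_w \rangle).$$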

Mikolov et al. (2013)[26] found that semantic and syntactic patterns can be reproduced using vector arithmetic.
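
The canonical example is the analogy "king − man + woman ≈ queen". With gensim (4.x assumed) and a sufficiently large pretrained model, that arithmetic can be sketched as below; the GoogleNews vector file is assumed to have been downloaded locally:

```python
from gensim.models import KeyedVectors

# Path is a placeholder for wherever the pretrained binary is stored.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# vector("king") - vector("man") + vector("woman") is closest to vector("queen").
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```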

[27] Mikolov et al. (2013)[1] developed an approach to assessing the quality of a word2vec model which draws on the semantic and syntactic patterns discussed above.

Altszyler and coauthors (2017) studied Word2vec performance in two semantic tests for different corpus sizes.

[29] They found that Word2vec has a steep learning curve, outperforming another word-embedding technique, latent semantic analysis (LSA), when trained on medium to large corpora (more than 10 million words).

Additionally, they show that the best parameter setting depends on the task and the training corpus.

Figures: the Continuous Bag of Words (CBOW) and skip-gram model architectures; a visual illustration of word embeddings.