Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model.
[1]: 293 NMT systems tend to produce fairly literal translations.
The source and target tokens (which in the simple case are words) are embedded into vectors so they can be processed mathematically.
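In standard notation (illustrative here rather than quoted from a cited source), a model with parameters θ assigns a probability to a target token sequence y given a source sequence x by factorizing it over the target tokens:

```latex
P_\theta(y \mid x) = \prod_{i=1}^{|y|} P_\theta(y_i \mid y_1, \ldots, y_{i-1}, x)
```

Each factor is the probability of the next target token given the tokens produced so far and the source sentence.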
In 1987, Robert B. Allen demonstrated the use of feed-forward neural networks for translating auto-generated English sentences with a limited vocabulary of 31 words into Spanish.
In this experiment, the size of the network's input and output layers was chosen to be just large enough for the longest sentences in the source and target language, respectively, because the network did not have any mechanism to encode sequences of arbitrary length into a fixed-size representation.
In his summary, Allen already hinted at the possibility of using auto-associative models, one for encoding the source and one for decoding the target.
[8] Lonnie Chrisman built upon Allen's work in 1991 by training separate recursive auto-associative memory (RAAM) networks (developed by Jordan B. Pollack[9]) for the source and the target language.
[10] Forcada and Ñeco simplified this procedure in 1997 to directly train a source encoder and a target decoder in what they called a recursive hetero-associative memory.
[11] Also in 1997, Castaño and Casacuberta employed an Elman recurrent neural network in another machine translation task with very limited vocabulary and complexity.
[12][13] Even though these early approaches were already similar to modern NMT, the computing resources of the time were not sufficient to process datasets large enough for the computational complexity of the machine translation problem on real-world texts.
[1]: 39 [14]: 2 Instead, other methods like statistical machine translation rose to become the state of the art of the 1990s and 2000s.
[1]: 39 [2]: 1 For example, in various works together with other researchers, Holger Schwenk replaced the usual n-gram language model with a neural one[15][16] and estimated phrase translation probabilities using a feed-forward network.
[19][20] All three used an RNN conditioned on a fixed encoding of the source as their decoder to produce the translation.
[21]: 107 [1]: 39 [2]: 7 Compressing the whole source into a single fixed-size vector proved to be a bottleneck, especially for long sentences. This problem was addressed when Bahdanau et al. introduced attention to their encoder-decoder architecture: at each decoding step, the decoder's state is used to calculate a source representation that focuses on different parts of the source, and that representation is then used in calculating the probabilities for the next token.
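As a concrete illustration, the following is a minimal NumPy sketch of one such attention step in the additive (Bahdanau-style) form; the array names, dimensions, and parameter shapes are illustrative assumptions, not details taken from the cited paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(dec_state, enc_states, W_dec, W_enc, v):
    """One attention step: build a context vector over the encoder states.

    dec_state:  (d_dec,)        current decoder hidden state
    enc_states: (T_src, d_enc)  one hidden state per source token
    W_dec, W_enc, v:            learned projection parameters
    """
    # Additive alignment score for each source position.
    scores = np.tanh(enc_states @ W_enc.T + dec_state @ W_dec.T) @ v  # (T_src,)
    weights = softmax(scores)       # attention distribution over the source
    context = weights @ enc_states  # weighted source representation, (d_enc,)
    return context, weights

# Toy example: 5 source tokens, encoder width 8, decoder width 6, attention width 4.
rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))
dec = rng.normal(size=(6,))
context, weights = additive_attention(
    dec, enc,
    W_dec=rng.normal(size=(4, 6)),
    W_enc=rng.normal(size=(4, 8)),
    v=rng.normal(size=(4,)),
)
```

The context vector is recomputed at every decoding step, so different steps can attend to different parts of the source.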
[22] Building on these RNN-based architectures, Baidu launched the "first large-scale NMT system"[23]: 144 in 2015, followed by Google Neural Machine Translation in 2016.
[27] DeepL Translator, which was at the time based on a CNN encoder, was released in 2017 and was judged by several news outlets to outperform its competitors.
[28][29][30] OpenAI's GPT-3, released in 2020, has also been shown to be able to function as a neural machine translation system.
Another network architecture that lends itself to parallelization is the transformer, which was introduced by Vaswani et al., also in 2017.
[31] Like previous models, the transformer still uses the attention mechanism for weighting encoder output for the decoding steps.
However, the transformer's encoder and decoder networks themselves are also based on attention instead of recurrence or convolution: Each layer weighs and transforms the previous layer's output in a process called self-attention.
[2]: 15 [6]: 7 Since both the transformer's encoder and decoder are free from recurrent elements, they can both be parallelized during training.
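A minimal single-head sketch of this self-attention computation (the scaled dot-product form used by the transformer) follows; the dimensions are illustrative, and real transformer layers add multiple heads, residual connections, and feed-forward sublayers.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (T, d), one row per token; W_q, W_k, W_v: (d, d_k) learned projections.
    Every output row is a weighted combination of all input rows, so all
    positions are computed at once, with no recurrence.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (T, T) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (T, d_k)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))  # 5 tokens, model width 16
out = self_attention(X, *(rng.normal(size=(16, 8)) for _ in range(3)))
```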
[32]: 35–40 [33]: 28–31 Usually, NMT models’ weights are initialized randomly and then learned by training on parallel datasets.
However, since using large language models (LLMs) such as BERT pre-trained on large amounts of monolingual data as a starting point for learning other tasks has proven very successful in wider NLP, this paradigm is also becoming more prevalent in NMT.
[4]: 689–690 An example of this is the mBART model, which first trains one transformer on a multilingual dataset to recover masked tokens in sentences, and then fine-tunes the resulting autoencoder on the translation task.
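The following sketch illustrates the pretraining objective only; mBART actually corrupts whole spans of text (and permutes sentences), so the single-token masking, the mask rate, and all names here are simplifying assumptions.

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, mask_prob=0.35, seed=0):
    """Corrupt a sentence by replacing some tokens with a mask symbol."""
    rng = random.Random(seed)
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]

sentence = "die Katze sitzt auf der Matte".split()
corrupted = mask_tokens(sentence)
# Pretraining pair: reconstruct `sentence` from `corrupted` (a denoising
# autoencoder objective). Fine-tuning pair: (source sentence, translation),
# reusing the same encoder-decoder weights learned during pretraining.
```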
[33]: 16–17 This is plausible considering that GPT models are trained mainly on English text.
[36] NMT has overcome several challenges that were present in statistical machine translation (SMT). NMT models are usually trained to maximize the likelihood of observing the training data, which is equivalent to minimizing its negative log-likelihood. In practice, this minimization is done iteratively on small subsets (mini-batches) of the training set using stochastic gradient descent.
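A minimal PyTorch sketch of one such update follows, with a single linear layer standing in for a full NMT model; the batch size, learning rate, and vocabulary size are arbitrary illustrative choices.

```python
import torch

vocab_size, d_model = 1000, 32
model = torch.nn.Linear(d_model, vocab_size)  # stand-in for a full NMT model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One mini-batch: 16 decoder states and the ground-truth next tokens.
states = torch.randn(16, d_model)
targets = torch.randint(0, vocab_size, (16,))

logits = model(states)  # unnormalized scores over the vocabulary
# Cross-entropy equals the negative log-likelihood of the observed tokens,
# so minimizing it maximizes the likelihood of the training data.
loss = torch.nn.functional.cross_entropy(logits, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()  # one stochastic gradient descent update
```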
If the model had to condition on its own previous predictions during training, as it does during inference, it would pick the wrong token almost always at the beginning of the training phase; subsequent steps would then have to work with those wrong input tokens, which would slow down training considerably.
Instead, teacher forcing is used during the training phase: the model (the “student” in the teacher forcing metaphor) is always fed the previous ground-truth tokens as the input for predicting the next token, regardless of what it predicted in the previous step.
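A small sketch of how teacher-forced decoder inputs are typically constructed; the token strings and the start-of-sequence symbol are illustrative.

```python
BOS = "<s>"  # illustrative start-of-sequence symbol

reference = ["the", "cat", "sat", "on", "the", "mat"]
# Teacher forcing: the input at step i is ground-truth token i-1,
# regardless of what the model actually predicted at step i-1.
decoder_inputs = [BOS] + reference[:-1]  # reference shifted right by one
decoder_targets = reference              # token to predict at each step

for step, (inp, tgt) in enumerate(zip(decoder_inputs, decoder_targets)):
    print(f"step {step}: input {inp!r} -> predict {tgt!r}")
```

At inference time the ground-truth tokens are unavailable, so the model's own previous outputs are fed back in instead.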
Generative LLMs differ from an encoder-decoder NMT system in a number of ways:[35]: 1 they can be prompted in a zero-shot fashion by just asking the model to translate a text into another language without giving any further examples in the prompt.
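For illustration, a zero-shot translation prompt might look like the following; `generate` is a hypothetical stand-in for whatever text-completion interface a given LLM exposes, not a real API.

```python
def generate(prompt: str) -> str:
    """Hypothetical stand-in for a generative LLM's text-completion call."""
    raise NotImplementedError  # would return the model's continuation

# Zero-shot: the instruction alone, with no example translations in the prompt.
prompt = (
    "Translate the following English sentence into German.\n"
    "English: The weather is nice today.\n"
    "German:"
)
translation = generate(prompt)
```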