"Attention Is All You Need"[1] is a 2017 landmark[2][3] research paper in machine learning authored by eight scientists working at Google.
The paper introduced a new deep learning architecture known as the transformer, based on the attention mechanism proposed in 2014 by Bahdanau et al.[4] It is considered a foundational[5] paper in modern artificial intelligence, as the transformer approach has become the main architecture of large language models like those based on GPT.
The name "Transformer" was picked because Jakob Uszkoreit, one of the paper's authors, liked the sound of that word.[8]
These early results convinced the team that the Transformer was a general-purpose language model, not just one suited to translation.
The authors of the paper are Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin.
The Wired article highlights the group's diversity:[8] "Six of the eight authors were born outside the United States; the other two are children of two green-card-carrying Germans who were temporarily in California and a first-generation American whose family had fled persecution, respectively."
Scaled dot-product attention & self-attention

The use of scaled dot-product attention and the self-attention mechanism, instead of a recurrent neural network or long short-term memory (which rely on recurrence), allows for better performance, as described in the following paragraph.
Since the model relies on query (Q), key (K), and value (V) matrices that all come from the same source (i.e. the input sequence / context window), it eliminates the need for RNNs entirely, ensuring that the architecture is parallelizable.
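As a rough illustration (a minimal NumPy sketch, not the paper's actual implementation; the variable names and sizes are chosen only for the example), scaled dot-product self-attention computes softmax(QKᵀ/√d_k)V, with Q, K and V all projected from the same input sequence:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # pairwise query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # weighted sum of values

# Self-attention: Q, K and V are projections of the same input X (the context window).
rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
X = rng.normal(size=(seq_len, d_model))                   # 5 token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
output = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(output.shape)                                       # (5, 16): one updated vector per token
```

Because the attention weights are produced by matrix products over the whole sequence, all positions are processed at once rather than step by step, which is the source of the parallelizability mentioned above.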
Multi-head attention

In the self-attention mechanism, queries (Q), keys (K), and values (V) are dynamically generated for each input sequence (typically limited by the size of the context window), allowing the model to focus on different parts of the input sequence at different steps.
By running several such attention heads in parallel, each with its own learned projections, multi-head attention ensures that the input embeddings are updated from a more diverse set of perspectives.
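A similarly hedged sketch (illustrative NumPy only; the head count, dimensions, and weight names are assumptions, not the paper's code) of how independent heads view the same input through different projections and are then concatenated:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Illustrative multi-head self-attention over a single sequence X."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    heads = []
    for h in range(n_heads):
        # Slice out this head's projection columns (one way to realise independent heads).
        sl = slice(h * d_head, (h + 1) * d_head)
        Q, K, V = X @ W_q[:, sl], X @ W_k[:, sl], X @ W_v[:, sl]
        weights = softmax(Q @ K.T / np.sqrt(d_head))      # each head forms its own attention pattern
        heads.append(weights @ V)
    return np.concatenate(heads, axis=-1) @ W_o           # concatenate heads, project back to d_model

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 32, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads).shape)   # (6, 32)
```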
The resulting positional encoding is then added to the embedding of the word at the corresponding position within the current context window.
The paper explains why this method was chosen: "We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training."[1]
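The sinusoidal encoding is defined in the paper as PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The following NumPy sketch (illustrative only; the sequence length and d_model of 512 are example values) computes it and adds it to the token embeddings:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000**(2i/d_model)); PE(pos, 2i+1) = cos(...)."""
    positions = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                      # (1, d_model/2)
    angles = positions / np.power(10000.0, 2 * i / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# The encoding is simply added to the token embeddings at the matching positions.
embeddings = np.random.default_rng(0).normal(size=(10, 512))  # 10 tokens, d_model = 512
inputs = embeddings + sinusoidal_positional_encoding(10, 512)
print(inputs.shape)                                           # (10, 512)
```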
"[1] For many years, sequence modelling and generation was done by using plain recurrent neural networks (RNNs).
In theory, the information from one token can propagate arbitrarily far down the sequence, but in practice the vanishing-gradient problem leaves the model's state at the end of a long sentence without precise, extractable information about preceding tokens.
A key breakthrough was the LSTM (1995),[note 1] an RNN which used various innovations to overcome the vanishing gradient problem, allowing efficient learning of long-sequence modelling.
Modern Transformers overcome this problem, but unlike RNNs, they require computation time that is quadratic in the size of the context window.
These early seq2seq models had no attention mechanism, and the state vector was accessible only after the last word of the source text had been processed.[22][23]
Seq2seq models with attention (including self-attention) still suffered from the same issue as other recurrent networks: they are hard to parallelize, which prevented them from being accelerated on GPUs.[27]
In 2016, decomposable attention applied a self-attention mechanism to feedforward networks, which are easy to parallelize, and achieved state-of-the-art results in textual entailment with an order of magnitude fewer parameters than LSTMs.
That hypothesis went against conventional wisdom at the time, and even Jakob Uszkoreit's father, Hans Uszkoreit, a well-known computational linguist, was skeptical.[29]
In the same year, self-attention (called intra-attention or intra-sentence attention) was proposed for LSTMs.
This led to the introduction of a multi-head attention model that was easier to parallelize due to the use of independent heads and the lack of recurrence.[1]
Already in spring 2017, even before the "Attention is all you need" preprint was published, one of the co-authors applied the "decoder-only" variation of the architecture to generate fictitious Wikipedia articles.
Starting in 2018, the OpenAI GPT series of decoder-only Transformers became state of the art in natural language generation.[35]
In 2022, a chatbot based on GPT-3, ChatGPT, became unexpectedly popular,[36] triggering a boom around large language models.
Image and video generators like DALL-E (2021), Stable Diffusion 3 (2024),[44] and Sora (2024) are based on the Transformer architecture.[43]
While the primary focus of the paper at the time was improving machine translation, it also discussed applying the architecture to English constituency parsing, with both limited and large training datasets; the model achieved high scores without task-specific tuning, indicating its promise for a wide variety of general-purpose seq2seq tasks.
A separate translation model was trained on the much larger WMT 2014 English-French dataset, consisting of 36 million sentences.
Both the base and big models outperform the 2017 state of the art in both English-German and English-French translation, while incurring a comparatively low training cost.
Dropout was applied to the output of each sub-layer before it was added to the sub-layer input and normalized, as well as to the sums of the embeddings and the positional encodings.
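As a sketch of where this dropout sits in the residual sub-layer pattern (illustrative NumPy only, not the authors' code; the 0.1 dropout rate matches the base model's reported setting, while the layer sizes and helper names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate=0.1):
    """Inverted dropout: randomly zero activations during training and rescale the rest."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    # Dropout on the sub-layer output, then the residual connection, then normalization:
    # LayerNorm(x + Dropout(Sublayer(x))).
    return layer_norm(x + dropout(sublayer(x)))

# Example: a simple position-wise linear map standing in for an attention or feedforward sub-layer.
x = rng.normal(size=(5, 16))                        # 5 tokens, d_model = 16
W = rng.normal(size=(16, 16))
print(residual_sublayer(x, lambda h: h @ W).shape)  # (5, 16)
```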