Transformer (deep learning architecture)

A key breakthrough was LSTM (1995),[note 1] an RNN that used various innovations to overcome the vanishing gradient problem, allowing the efficient modelling of long sequences.

Modern Transformers overcome this problem, but unlike RNNs, they require computation time that is quadratic in the size of the context window.
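The quadratic cost is visible directly in the attention computation: the score matrix compares every token with every other token, so doubling the context length quadruples its size. A minimal NumPy sketch (illustrative, not the original implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention.

    Q, K, V: arrays of shape (n, d) for a context window of n tokens.
    The score matrix S has shape (n, n), so time and memory grow
    quadratically with the context length n.
    """
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                       # (n, n) pairwise scores
    P = np.exp(S - S.max(axis=-1, keepdims=True))  # numerically stable softmax
    P = P / P.sum(axis=-1, keepdims=True)
    return P @ V                                   # (n, d) attended values

n, d = 8, 4  # toy sizes; doubling n quadruples the (n, n) score matrix
rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.normal(size=(n, d)),
                                   rng.normal(size=(n, d)),
                                   rng.normal(size=(n, d)))
print(out.shape)  # (8, 4)
```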

These early seq2seq models had no attention mechanism,[24][25] and the state vector was accessible only after the last word of the source text had been processed.

Seq2seq models with attention (including self-attention)[29] still suffered from the same issue as other recurrent networks: they are hard to parallelize, which prevented them from being accelerated on GPUs.

In 2016, decomposable attention applied a self-attention mechanism to feedforward networks, which are easy to parallelize, and achieved state-of-the-art results in textual entailment with an order of magnitude fewer parameters than LSTMs.

That hypothesis[31] ran against conventional wisdom at the time, and even Jakob Uszkoreit's father, Hans Uszkoreit, a well-known computational linguist, was skeptical.

This led to the introduction of a multi-head attention model[1] that was easier to parallelize due to the use of independent heads and the lack of recurrence.
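As an illustration, the following NumPy sketch (weight shapes and sizes chosen for brevity, not taken from the reference implementation) shows how the heads operate independently and can therefore be computed in parallel:

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Sketch of multi-head attention: each head attends independently,
    so all heads can be computed in parallel (no recurrence).

    X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model).
    """
    n, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):  # project, then split features into independent heads
        return (X @ M).reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(W_q), split(W_k), split(W_v)
    S = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, n, n)
    P = np.exp(S - S.max(-1, keepdims=True))
    P /= P.sum(-1, keepdims=True)
    heads = P @ V                                   # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o                             # (n, d_model)

rng = np.random.default_rng(0)
d_model, n_heads, n = 16, 4, 6
W = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
print(multi_head_attention(rng.normal(size=(n, d_model)), *W, n_heads).shape)  # (6, 16)
```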

Already in spring 2017, even before the "Attention is all you need" preprint was published, one of the co-authors applied the "decoder-only" variation of the architecture to generate fictitious Wikipedia articles.

Starting in 2018, the OpenAI GPT series of decoder-only Transformers[37] became state of the art in natural language generation.

In 2022, ChatGPT, a chatbot based on GPT-3.5, became unexpectedly popular,[38] triggering a boom around large language models.

Image and video generators like DALL-E (2021), Stable Diffusion 3 (2024),[45] and Sora (2024) are based on the Transformer architecture.[44]

In an autoregressive task,[50] the entire sequence is masked at first, and the model produces a probability distribution for the first token. Then the first token is revealed, and the model predicts the second token, and so on.
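A minimal sketch of this decoding loop, assuming a hypothetical `logits_fn` standing in for a trained model (inside a real decoder, causal masking ensures position t only attends to positions up to t):

```python
import numpy as np

def greedy_autoregressive_decode(logits_fn, bos_id, eos_id, max_len=20):
    """Minimal autoregressive decoding loop.

    logits_fn(tokens) is a stand-in for the model: it returns unnormalized
    scores over the vocabulary for the next token.
    """
    tokens = [bos_id]
    for _ in range(max_len):
        probs = np.exp(logits_fn(tokens))
        probs /= probs.sum()          # distribution over the next token
        nxt = int(np.argmax(probs))   # greedy choice; sampling also possible
        tokens.append(nxt)
        if nxt == eos_id:
            break
    return tokens

# Toy "model": always prefers token (last_token + 1) mod vocab, ending at 3.
vocab = 5
toy = lambda toks: np.eye(vocab)[(toks[-1] + 1) % vocab] * 5.0
print(greedy_autoregressive_decode(toy, bos_id=0, eos_id=3))  # [0, 1, 2, 3]
```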

In the authors' words, "we hypothesized it would allow the model to easily learn to attend by relative position."
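The sinusoidal encoding is fixed rather than learned and can be computed directly; the following sketch implements the formulas from the original paper (with its default base N = 10000):

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model, N=10000.0):
    """Sinusoidal positional encoding from the original Transformer:
    PE[pos, 2i]   = sin(pos / N**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / N**(2i / d_model))
    """
    pos = np.arange(n_positions)[:, None]      # (n, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d/2)
    angles = pos / N ** (2 * i / d_model)      # (n, d/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(sinusoidal_positional_encoding(4, 8).round(2))
```

Because each frequency's sine/cosine pair rotates at a fixed rate, the encoding at position pos + k is a fixed linear transformation of the encoding at position pos, which is what motivates the relative-position hypothesis quoted above.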

The value projection, in turn, determines how the attended tokens influence what information is passed to subsequent layers and ultimately the output logits.

BERT, another language model,[58] only makes use of an encoder, and is trained to predict a randomly masked token in a sequence.
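A sketch of the masking step of this training objective (simplified: BERT additionally leaves some selected tokens unchanged or replaces them with random tokens rather than always using [MASK]):

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """BERT-style masking sketch: hide a random subset of tokens; the
    model is trained to predict the originals at the masked positions."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # training target at this position
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

sentence = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(sentence)
print(masked)
print(targets)  # positions the model must reconstruct
```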

The final points of detail are the residual connections and layer normalization (LayerNorm, or LN), which, while not conceptually necessary, are needed in practice for numerical stability and convergence.

The original post-LN convention was difficult to train: it required careful hyperparameter tuning and a learning-rate "warm-up", in which the rate starts small and is gradually increased.
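For concreteness, the original paper's schedule combines a linear warm-up with inverse-square-root decay; a direct transcription:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning-rate schedule from the original Transformer paper:
    linear warm-up for `warmup_steps` steps, then inverse-square-root decay.
    lr = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)
    """
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 1000, 4000, 40000):
    print(s, round(transformer_lr(s), 6))  # rises until step 4000, then decays
```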

The pre-LN convention, proposed several times in 2018,[59] was found to be easier to train, requiring no warm-up and leading to faster convergence.

Thus, each decoder layer in a decoder-only Transformer is composed of just two sublayers: the causally masked self-attention and the feedforward network.
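A PyTorch sketch of one such pre-LN decoder-only layer (hyperparameters are illustrative, not those of any particular model):

```python
import torch
import torch.nn as nn

class PreLNDecoderLayer(nn.Module):
    """Sketch of one decoder-only layer with the pre-LN convention:
    LayerNorm is applied *before* each sublayer, so the residual path
    is an identity, which stabilizes training (no warm-up needed)."""

    def __init__(self, d_model=64, n_heads=4, d_ffn=256):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d_model))

    def forward(self, x):
        n = x.shape[1]
        # True above the diagonal = future positions are not attendable.
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=causal)  # masked self-attention
        x = x + a                                    # residual connection
        x = x + self.ffn(self.ln2(x))                # residual connection
        return x

layer = PreLNDecoderLayer()
print(layer(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```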

The original Transformer paper reported trying a learned positional encoding,[70] but found it no better than the sinusoidal one.[69]

Transformers is a library produced by Hugging Face that supplies transformer-based architectures and pretrained models.
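For example, the library's pipeline API loads a pretrained model and runs inference in a few lines (the checkpoints named here, gpt2 and bert-base-uncased, are illustrative choices):

```python
from transformers import pipeline

# A pretrained decoder-only model for text generation.
generator = pipeline("text-generation", model="gpt2")
print(generator("The Transformer architecture", max_new_tokens=20)[0]["generated_text"])

# The same library exposes encoder models, e.g. for masked-token filling.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The Transformer is a deep learning [MASK].")[0]["token_str"])
```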

The KV caching method saves the computed key and value vectors at each attention block, so that they are not recomputed at each new token.
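A minimal sketch of the idea, using single-head, unbatched attention for clarity:

```python
import numpy as np

class KVCache:
    """Minimal KV-cache sketch: keys and values for past tokens are stored,
    so each decoding step only computes projections for the newest token."""

    def __init__(self, d):
        self.K = np.zeros((0, d))
        self.V = np.zeros((0, d))

    def step(self, q_new, k_new, v_new):
        # Append this token's key/value instead of recomputing all past ones.
        self.K = np.vstack([self.K, k_new])
        self.V = np.vstack([self.V, v_new])
        s = self.K @ q_new / np.sqrt(len(q_new))  # scores vs. all cached keys
        p = np.exp(s - s.max())
        p /= p.sum()
        return p @ self.V                          # attention output for the new token

d = 4
cache = KVCache(d)
rng = np.random.default_rng(0)
for t in range(3):                                 # one token per decoding step
    q, k, v = rng.normal(size=(3, d))
    print(t, cache.step(q, k, v).round(2))
```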

An improved version, FlashAttention-2,[80][81][82] was developed to cater to the rising demand for language models capable of handling longer context lengths.

It offers enhancements in work partitioning and parallelism, enabling it to achieve up to 230 TFLOPs/s on A100 GPUs (FP16/BF16), a 2x speed increase over the original FlashAttention.

Key advancements in FlashAttention-2 include the reduction of non-matmul FLOPs, improved parallelism over the sequence length dimension, better work partitioning between GPU warps, and added support for head dimensions up to 256, as well as for multi-query attention (MQA) and grouped-query attention (GQA).
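A sketch of grouped-query attention, in which several query heads share each key/value head, shrinking the KV cache accordingly (shapes are illustrative):

```python
import numpy as np

def grouped_query_attention(Q, K, V, n_q_heads, n_kv_heads):
    """Sketch of grouped-query attention (GQA): n_q_heads query heads share
    n_kv_heads key/value heads (MQA is the special case n_kv_heads == 1),
    shrinking the KV cache by a factor of n_q_heads / n_kv_heads.

    Q: (n_q_heads, n, d); K, V: (n_kv_heads, n, d).
    """
    group = n_q_heads // n_kv_heads
    K = np.repeat(K, group, axis=0)   # each KV head serves `group` query heads
    V = np.repeat(V, group, axis=0)
    d = Q.shape[-1]
    S = Q @ K.transpose(0, 2, 1) / np.sqrt(d)
    P = np.exp(S - S.max(-1, keepdims=True))
    P /= P.sum(-1, keepdims=True)
    return P @ V                      # (n_q_heads, n, d)

rng = np.random.default_rng(0)
n, d = 5, 8
out = grouped_query_attention(rng.normal(size=(8, n, d)),
                              rng.normal(size=(2, n, d)),
                              rng.normal(size=(2, n, d)),
                              n_q_heads=8, n_kv_heads=2)
print(out.shape)  # (8, 5, 8)
```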

In Multi-Token Prediction,[87][89] a single forward pass creates a final embedding vector, which is then un-embedded into a token probability distribution.

In the image domain, Swin Transformer is an efficient architecture that performs attention inside shifted windows.

A 2022 study found that Transformers pretrained only on natural language can be fine-tuned on only 0.03% of their parameters and become competitive with LSTMs on a variety of logical and visual tasks, demonstrating transfer learning.

Parti is an encoder-decoder Transformer,[112] where the encoder processes a text prompt and the decoder generates a token representation of an image.

Many large language models such as GPT-2, GPT-3, GPT-4, Claude, BERT, XLNet, RoBERTa and ChatGPT demonstrate the ability of transformers to perform a wide variety of NLP-related subtasks and power their related real-world applications. Beyond traditional NLP, the transformer architecture has also had success in other applications.

A standard Transformer architecture, showing an encoder on the left and a decoder on the right. Note: it uses the pre-LN convention, which differs from the post-LN convention used in the original 2017 Transformer.
A diagram of a sinusoidal positional encoding.
One encoder-decoder block.
A Transformer is composed of stacked encoder layers and decoder layers.
The feedforward network module: a two-layered network that maps vectors of the model dimension back to the model dimension through a wider intermediate layer.
Scaled dot-product attention, block diagram.
Exact dimension counts within an attention head module.
Multiheaded attention, block diagram.
Exact dimension counts within a multiheaded attention module.
One encoder layer.
One decoder layer.
(a) One encoder layer and one decoder layer. (b) Two encoder layers and two decoder layers. The sublayers are labelled as well.
Transformer encoder with norm-first and norm-last conventions.
Transformer decoder with norm-first and norm-last conventions.
Block diagram for the full Transformer architecture.
Schematic object hierarchy for the full Transformer architecture, in object-oriented programming style.
Comparison between several different forms of attention mechanism and the amount of KV caching necessary for each.
The architecture of DeepSeek V2, showing both multi-head latent attention (MLA) and a variant of mixture of experts.[86]: Figure 2
Multi-Token Prediction.