A key breakthrough was the LSTM (1995),[note 1] an RNN which used various innovations to overcome the vanishing gradient problem, allowing efficient modelling of long sequences.
Modern Transformers overcome this problem, but unlike RNNs, they require computation time that is quadratic in the size of the context window.
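The quadratic cost comes from scoring every token against every other token. A minimal NumPy sketch (sizes and weight names here are illustrative, not taken from any particular paper) shows the n × n score matrix that grows quadratically with the context length n:

```python
import numpy as np

def attention_scores(X, W_q, W_k):
    """Pairwise query-key scores; the (n, n) matrix is the quadratic cost."""
    Q = X @ W_q                                # (n, d_k) queries
    K = X @ W_k                                # (n, d_k) keys
    return Q @ K.T / np.sqrt(K.shape[-1])      # (n, n): grows quadratically with n

n, d_model, d_k = 512, 64, 64                  # illustrative sizes
X = np.random.randn(n, d_model)                # embeddings for a context of n tokens
W_q = np.random.randn(d_model, d_k)
W_k = np.random.randn(d_model, d_k)
print(attention_scores(X, W_q, W_k).shape)     # (512, 512)
```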
[24][25] These early seq2seq models had no attention mechanism, and the state vector was accessible only after the last word of the source text had been processed.
[29] Seq2seq models with attention (including self-attention) still suffered from the same issue as other recurrent networks: they are hard to parallelize, which prevented them from being accelerated on GPUs.
In 2016, decomposable attention applied a self-attention mechanism to feedforward networks, which are easy to parallelize, and achieved state-of-the-art results in textual entailment with an order of magnitude fewer parameters than LSTMs.
[31] That hypothesis ran counter to the conventional wisdom of the time, and even his father, Hans Uszkoreit, a well-known computational linguist, was skeptical.
[1] This led to the introduction of a multi-head attention model that was easier to parallelize due to the use of independent heads and the lack of recurrence.
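A hedged NumPy sketch (shapes and names are illustrative) of why multi-head attention parallelizes well: each head applies its own projections to the same input, and because the heads do not depend on one another, they can all be computed at the same time before being concatenated and mixed by an output projection.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_head(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head(X, heads, W_o):
    # Each head is independent of the others, so this loop could run in parallel;
    # the head outputs are concatenated and mixed by the output projection W_o.
    outputs = [single_head(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
    return np.concatenate(outputs, axis=-1) @ W_o

n, d_model, n_heads = 8, 32, 4
d_head = d_model // n_heads
X = np.random.randn(n, d_model)
heads = [tuple(np.random.randn(d_model, d_head) for _ in range(3)) for _ in range(n_heads)]
W_o = np.random.randn(n_heads * d_head, d_model)
print(multi_head(X, heads, W_o).shape)   # (8, 32)
```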
Already in spring 2017, even before the "Attention is all you need" preprint was published, one of the co-authors applied the "decoder-only" variation of the architecture to generate fictitious Wikipedia articles.
[37] Starting in 2018, the OpenAI GPT series of decoder-only Transformers became state of the art in natural language generation.
In 2022, ChatGPT, a chatbot based on GPT-3.5, became unexpectedly popular,[38] triggering a boom around large language models.
[44] Image and video generators like DALL-E (2021), Stable Diffusion 3 (2024),[45] and Sora (2024) are based on the Transformer architecture.
In an autoregressive task,[50] the entire sequence is masked at first, and the model produces a probability distribution for the first token; the first token is then revealed, the model predicts the second token, and so on.
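A small sketch of the masking idea, assuming a toy score matrix rather than a real model: positions that may not be attended to are set to negative infinity before the softmax, so each position can only use the tokens revealed so far.

```python
import numpy as np

def causal_mask(n):
    """Entry (i, j) is 0 if token i may attend to token j, -inf otherwise."""
    return np.where(np.tril(np.ones((n, n), dtype=bool)), 0.0, -np.inf)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n = 4
scores = np.random.randn(n, n)                 # stand-in for Q K^T / sqrt(d_k)
weights = softmax(scores + causal_mask(n))     # row i only attends to positions <= i
print(np.round(weights, 2))                    # upper triangle is exactly zero
```

During training the mask lets every position be predicted in a single pass, while at inference tokens are revealed one at a time and the same mask keeps each prediction from seeing the future.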
In the authors' words, "we hypothesized it would allow the model to easily learn to attend by relative position."
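A sketch of the sinusoidal encoding the quote refers to (the sequence length and model dimension below are illustrative): each position is mapped to sines and cosines at geometrically spaced frequencies, and because the sine and cosine of a shifted position are linear combinations of the originals, relative offsets are easy to express.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same)."""
    positions = np.arange(num_positions)[:, None]                 # (pos, 1)
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)   # one frequency per pair
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

print(sinusoidal_encoding(num_positions=50, d_model=16).shape)    # (50, 16)
```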
The value projection determines how the attended tokens influence what information is passed to subsequent layers and ultimately the output logits.
[58] BERT, another language model, uses only an encoder and is trained to predict randomly masked tokens in a sequence.
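A toy sketch of the masked-token objective (the token ids, the fixed mask positions, and the -100 "ignore" label are illustrative conventions, not BERT's exact recipe): some input positions are replaced with a [MASK] id, and the model is trained to recover the originals at exactly those positions.

```python
import numpy as np

MASK_ID = 103                                                  # illustrative [MASK] token id
tokens  = np.array([101, 7592, 1010, 2088, 999, 102])          # toy token ids
mask    = np.array([False, True, False, False, True, False])   # in practice ~15% chosen at random

inputs = np.where(mask, MASK_ID, tokens)   # corrupted input fed to the encoder
labels = np.where(mask, tokens, -100)      # -100 marks positions ignored by the loss
print(inputs)                              # [ 101  103 1010 2088  103  102]
print(labels)                              # [-100 7592 -100 -100  999 -100]
```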
The final points of detail are the residual connections and layer normalization (LayerNorm, or LN), which, while conceptually unnecessary, are needed in practice for numerical stability and convergence.
The original post-LN arrangement was difficult to train: it required careful hyperparameter tuning and a learning-rate "warm-up", in which the rate starts small and is gradually increased.
The pre-LN convention, proposed several times in 2018,[59] was found to be easier to train: it requires no warm-up and converges faster.
Thus, each decoder layer in a decoder-only Transformer is composed of just two sublayers: the causally masked self-attention and the feedforward network.
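A schematic NumPy sketch of one such layer under the pre-LN convention discussed above: LayerNorm is applied before each sublayer, and each sublayer's output is added back through a residual connection. The `causal_self_attention` and `ffn` arguments are stand-in callables so the sketch runs, not real implementations, and the learnable gain and bias of LayerNorm are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_ln_decoder_layer(x, causal_self_attention, ffn):
    # Sublayer 1: causally masked self-attention, with pre-LN and a residual connection.
    x = x + causal_self_attention(layer_norm(x))
    # Sublayer 2: position-wise feedforward network, again pre-LN plus residual.
    # (The original post-LN variant would instead compute layer_norm(x + sublayer(x)).)
    x = x + ffn(layer_norm(x))
    return x

x = np.random.randn(8, 32)   # 8 tokens, model width 32 (illustrative)
out = pre_ln_decoder_layer(x, causal_self_attention=lambda h: h * 0.5, ffn=np.tanh)
print(out.shape)             # (8, 32)
```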
[69] The original Transformer paper reported experimenting with a learned positional encoding,[70] but found it not superior to the sinusoidal one.
Transformers is a library produced by Hugging Face that supplies transformer-based architectures and pretrained models.
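For example, a pretrained decoder-only model can be loaded in a few lines with the library's pipeline API (the model name and prompt below are illustrative, and downloading the weights requires network access):

```python
from transformers import pipeline

# Load a small pretrained decoder-only Transformer (GPT-2) for text generation.
generator = pipeline("text-generation", model="gpt2")
result = generator("The transformer architecture", max_new_tokens=20)
print(result[0]["generated_text"])
```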
The KV caching method saves the key and value vectors computed at each attention block, so that they are not recomputed when each new token is generated.
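A simplified sketch of the idea (the shapes and the bare projection step are illustrative): at each decoding step, only the newest token's key and value are projected and appended to a cache, and attention then reads the full cached sequence instead of re-projecting the whole prefix.

```python
import numpy as np

d_model, d_k = 32, 16
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_k)

k_cache, v_cache = [], []          # keys/values kept from previous steps

def decode_step(new_token_embedding):
    # Project only the newest token; earlier keys/values are reused from the cache.
    k_cache.append(new_token_embedding @ W_k)
    v_cache.append(new_token_embedding @ W_v)
    K = np.stack(k_cache)          # (t, d_k) after t steps
    V = np.stack(v_cache)
    return K, V                    # attention then uses the full cached K and V

for _ in range(5):
    K, V = decode_step(np.random.randn(d_model))
print(K.shape, V.shape)            # (5, 16) (5, 16)
```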
An improved version, FlashAttention-2,[80][81][82] was developed to cater to the rising demand for language models capable of handling longer context lengths.
It offers enhancements in work partitioning and parallelism, enabling it to achieve up to 230 TFLOPs/s on A100 GPUs (FP16/BF16), a 2x speed increase over the original FlashAttention.
Key advancements in FlashAttention-2 include the reduction of non-matmul FLOPs, improved parallelism over the sequence-length dimension, better work partitioning between GPU warps, and added support for head dimensions up to 256 as well as for multi-query attention (MQA) and grouped-query attention (GQA).
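A hedged NumPy sketch of grouped-query attention (head counts and shapes are illustrative): several query heads share a single key/value head, so far fewer key and value vectors need to be computed and cached; multi-query attention is the special case with a single shared key/value head.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n_query_heads, n_kv_heads = 8, 2          # illustrative: 4 query heads share each KV head
group_size = n_query_heads // n_kv_heads
seq_len, d_head = 16, 32

Q = np.random.randn(n_query_heads, seq_len, d_head)
K = np.random.randn(n_kv_heads, seq_len, d_head)   # far fewer K/V heads to compute and cache
V = np.random.randn(n_kv_heads, seq_len, d_head)

outputs = []
for h in range(n_query_heads):
    kv = h // group_size                            # map each query head to its shared KV head
    w = softmax(Q[h] @ K[kv].T / np.sqrt(d_head))
    outputs.append(w @ V[kv])
print(np.stack(outputs).shape)                      # (8, 16, 32)
```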
[87][89] In multi-token prediction, a single forward pass creates a final embedding vector, which is then un-embedded into a probability distribution over tokens.
In the image domain, Swin Transformer is an efficient architecture that performs attention inside shifted windows.
A 2022 study found that Transformers pretrained only on natural language can be fine-tuned on only 0.03% of their parameters and become competitive with LSTMs on a variety of logical and visual tasks, demonstrating transfer learning.
[112] Parti is an encoder-decoder Transformer, where the encoder processes a text prompt, and the decoder generates a token representation of an image.
Many large language models, such as GPT-2, GPT-3, GPT-4, Claude, BERT, XLNet, RoBERTa and ChatGPT, demonstrate the ability of transformers to perform a wide variety of NLP-related subtasks and to power related real-world applications. Beyond traditional NLP, the transformer architecture has had success in other applications as well.