Vision transformer

ViTs were designed as alternatives to convolutional neural networks (CNNs) in computer vision applications.[2]

Compared to CNNs, ViTs are less data efficient, but have higher capacity.

Some of the largest modern computer vision models are ViTs, such as one with 22B parameters.[3][4]

Subsequent to its publication, many variants were proposed, including hybrid architectures that combine features of both ViTs and CNNs.[5][6]

Transformers were introduced in Attention Is All You Need (2017),[7] and have found widespread use in natural language processing.

In 2020, an encoder-only Transformer was adapted for computer vision, yielding the ViT, which reached state of the art in image classification, overcoming the previous dominance of CNNs.[1]

The masked autoencoder (2022) extended ViT to work with unsupervised training.

The vision transformer and the masked autoencoder, in turn, stimulated new developments in convolutional neural networks.

Two studies[11][12] improved the efficiency and robustness of ViT by adding a CNN as a preprocessor.

The Swin Transformer[13] achieved state-of-the-art results on some object detection datasets such as COCO by using convolution-like sliding windows of attention and the pyramid process of classical computer vision.
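
The following is a minimal PyTorch sketch of the window partitioning behind such sliding-window attention; names and sizes are illustrative, and the attention mask that Swin applies at the rolled window borders is omitted:

import torch

def window_partition(x, window_size, shift=0):
    """Split a feature map into non-overlapping windows for local attention.

    x: (batch, height, width, dim) feature map, with height and width divisible
       by window_size. A nonzero shift cyclically rolls the map first, which is
       how alternating blocks let information cross window borders.
    Returns: (num_windows * batch, window_size * window_size, dim) token groups,
    each of which is fed to standard self-attention independently.
    """
    if shift:
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    b, h, w, d = x.shape
    x = x.view(b, h // window_size, window_size, w // window_size, window_size, d)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, d)

# Example: a 56x56 feature map with 96 channels and 7x7 attention windows.
feat = torch.randn(1, 56, 56, 96)
print(window_partition(feat, window_size=7).shape)           # (64, 49, 96)
print(window_partition(feat, window_size=7, shift=3).shape)  # (64, 49, 96)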

For example, to use a ViT for classification, one can add a shallow MLP on top of it that outputs a probability distribution over classes.

The special token is an architectural hack to allow the model to compress all information relevant for predicting the image label into one vector.

Transformers found their initial applications in natural language processing tasks, as demonstrated by language models such as BERT and GPT-3.

By contrast, the typical image processing system uses a convolutional neural network (CNN).

Well-known projects include Xception, ResNet, EfficientNet,[14] DenseNet,[15] and Inception.[16]

Transformers measure the relationships between pairs of input tokens (words in the case of text strings), termed attention.
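
As a minimal illustration (PyTorch, single head, illustrative names), attention scores every pair of tokens against each other and uses the resulting weights to mix their value vectors:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Minimal single-head attention: each token attends to every other token.

    q, k, v: tensors of shape (batch, num_tokens, dim).
    Returns a tensor of shape (batch, num_tokens, dim).
    """
    d = q.shape[-1]
    # Pairwise similarity between every query token and every key token.
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (batch, tokens, tokens)
    weights = F.softmax(scores, dim=-1)           # attention weights per token pair
    return weights @ v                            # weighted sum of value vectors

# Example: 9 patch tokens plus one [CLS] token, embedding dimension 64.
x = torch.randn(1, 10, 64)
out = scaled_dot_product_attention(x, x, x)       # self-attention: q = k = v = x
print(out.shape)                                  # torch.Size([1, 10, 64])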

The original ViT and Masked Autoencoder used a dummy [CLS] token, in emulation of the BERT language model.

The output at [CLS] is the classification token, which is then processed by a LayerNorm-feedforward-softmax module into a probability distribution.
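
A minimal sketch of such a head in PyTorch, following the LayerNorm-feedforward-softmax description above; dimensions and names are illustrative:

import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Maps the [CLS] token's output vector to class probabilities."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, encoder_output: torch.Tensor) -> torch.Tensor:
        # encoder_output: (batch, num_tokens, dim); token 0 is [CLS].
        cls_vector = encoder_output[:, 0]          # (batch, dim)
        logits = self.fc(self.norm(cls_vector))    # (batch, num_classes)
        return logits.softmax(dim=-1)              # probability distribution over classes

head = ClassificationHead(dim=768, num_classes=1000)
probs = head(torch.randn(2, 197, 768))            # e.g. 196 patch tokens + [CLS]
print(probs.shape, probs.sum(dim=-1))             # (2, 1000), each row sums to ~1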

The second ViT (called the "decoder", even though it is still an encoder-only Transformer) takes in vectors with positional encoding and outputs image patches again.[24]
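
The following is a minimal sketch of a masked-autoencoder forward pass, assuming generic encoder and decoder modules; the names, the 75% mask ratio default, and the placeholder Transformer layers in the example are illustrative rather than the exact published configuration:

import torch
import torch.nn as nn

def mae_forward(patch_tokens, encoder, decoder, mask_token, mask_ratio=0.75):
    """Sketch of one masked-autoencoder forward pass.

    patch_tokens: (batch, num_patches, dim) patch embeddings with positional encoding.
    encoder/decoder: any modules mapping (batch, n, dim) -> (batch, n, dim).
    mask_token: learned (1, 1, dim) placeholder for masked positions.
    """
    b, n, d = patch_tokens.shape
    num_keep = int(n * (1 - mask_ratio))

    # Randomly choose which patches stay visible.
    perm = torch.rand(b, n).argsort(dim=1)
    keep_idx = perm[:, :num_keep]                                   # (batch, num_keep)
    visible = torch.gather(patch_tokens, 1,
                           keep_idx.unsqueeze(-1).expand(-1, -1, d))

    # The encoder sees only the visible patches (the main compute saving of MAE).
    encoded = encoder(visible)

    # The decoder sees the encoded visible tokens plus mask tokens for the hidden
    # positions, scattered back to their original order, and predicts the patches.
    full = mask_token.expand(b, n, d).clone()
    full.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, d), encoded)
    return decoder(full)

# Example with placeholder modules (a real MAE uses stacks of ViT blocks for both).
enc = nn.TransformerEncoder(nn.TransformerEncoderLayer(768, 8, batch_first=True), 2)
dec = nn.TransformerEncoder(nn.TransformerEncoderLayer(768, 8, batch_first=True), 1)
mask_tok = nn.Parameter(torch.zeros(1, 1, 768))
print(mae_forward(torch.randn(2, 196, 768), enc, dec, mask_tok).shape)  # (2, 196, 768)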

Like the Masked Autoencoder, the DINO (self-distillation with no labels) method is a way to train a ViT by self-supervision.

The method is similar to previous works like momentum contrast[26] and bootstrap your own latent (BYOL).

To prevent the student and teacher networks from collapsing to a trivial constant output, DINO employs two strategies: centering and sharpening of the teacher's outputs.

In January 2024, Meta AI Research released an updated version called DINOv2[28] with improvements in architecture, loss function, and optimization technique.
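
A minimal sketch of the centering-and-sharpening objective described above (PyTorch, with illustrative names and temperatures): the teacher's logits are centered by a running mean and sharpened by a low temperature before the cross-entropy with the student is taken.

import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between a sharpened, centered teacher and the student.

    Centering (subtracting a running mean) discourages any single dimension
    from dominating; sharpening (a low teacher temperature) discourages the
    uniform output. Together they prevent collapse.
    """
    teacher_probs = F.softmax((teacher_logits - center) / teacher_temp, dim=-1)
    student_log_probs = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

def update_center(center, teacher_logits, momentum=0.9):
    """Exponential moving average of teacher outputs, used for centering."""
    batch_mean = teacher_logits.mean(dim=0, keepdim=True)
    return momentum * center + (1 - momentum) * batch_mean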

The Swin Transformer ("Shifted windows")[13] took inspiration from standard CNNs: It is improved by Swin Transformer V2,[29] which modifies upon the ViT by a different attention mechanism[13]: Figure 1 : The TimeSformer[30] was designed for video understanding tasks, and it applied a factorized self-attention, similar to the factorized convolution kernels found in the Inception CNN architecture.

The idea is essentially the same as vector quantized variational autoencoder (VQVAE) plus generative adversarial network (GAN).

Further, one can take a list of caption-image pairs, convert the images into strings of symbols, and train a standard GPT-style transformer.[33]

Other examples include the visual transformer,[34] CoAtNet,[35] CvT,[36] the data-efficient ViT (DeiT),[37] etc.[38]

Typically, ViT uses patch sizes larger than standard CNN kernels (3×3 to 7×7).

Preprocessing with a layer of smaller-size, overlapping (stride < size) convolutional filters helps with performance and stability.[2]
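
The contrast can be sketched in PyTorch as follows; channel counts and layer depths are illustrative. The standard patch embedding is a non-overlapping convolution whose kernel size equals its stride (the patch size), while the convolutional stem uses small overlapping filters before the Transformer.

import torch
import torch.nn as nn

# Standard ViT patch embedding: a 16x16 convolution with stride 16 splits the
# image into non-overlapping patches and linearly projects each one.
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)

# Convolutional stem: a stack of small, overlapping (stride < kernel size)
# filters, the kind of preprocessing described above.
conv_stem = nn.Sequential(
    nn.Conv2d(3, 48, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(48, 96, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(96, 192, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(192, 384, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(384, 768, kernel_size=1),  # project to the Transformer token width
)

img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img).flatten(2).transpose(1, 2)      # (1, 196, 768) patch tokens
stem_tokens = conv_stem(img).flatten(2).transpose(1, 2)   # also (1, 196, 768)
print(tokens.shape, stem_tokens.shape)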

ViT applies self-attention, allowing it to easily capture long-range relationships between patches.

ViT also appears more robust to input image distortions such as adversarial patches or permutations.[39]

ViTs have been used in many computer vision tasks with excellent results, in some cases even achieving state-of-the-art performance.

The architecture of the vision transformer. An input image is divided into patches, each of which is linearly mapped through a patch embedding layer, before entering a standard Transformer encoder.
Vision Transformer architecture, showing the encoder-only Transformer blocks inside.
Animation of ViT. The 0th token is the special <CLS> token. The other 9 patches are projected by a linear layer before being fed into the Transformer encoder as input tokens 1 to 9.
Masked Autoencoder architecture.