Contrastive Language-Image Pre-training

Contrastive Language-Image Pre-training (CLIP) is a technique for training a pair of neural network models, one for image understanding and one for text understanding, using a contrastive objective.[1]

This method has enabled broad applications across multiple domains, including cross-modal retrieval,[2] text-to-image generation,[3] aesthetic ranking,[4] and image captioning.

CLIP was first announced on OpenAI's blog on January 5, 2021, accompanied by a technical report.[9] The report (with some details removed, and its appendix cut out to a "Supplementary PDF") was published in Proceedings of the 38th International Conference on Machine Learning, PMLR,[1] which had a submission deadline of February 2021. Concurrent with CLIP was ALIGN, published at the same conference.[11]

The CLIP method trains a pair of models contrastively.[1]

One model takes in a piece of text as input and outputs a single vector representing its semantic content.

The other model takes in an image and similarly outputs a single vector representing its visual content.

Two vectors are considered "similar" if their dot product is large.

Let the image and text encoders map the i-th image-caption pair in a batch to vectors w_i and v_i, respectively. In essence, the contrastive loss encourages the dot product between matching image and text vectors (w_i · v_i) to be high, while discouraging high dot products between non-matching pairs.
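
For illustration, the batch-wise loss can be sketched in a few lines of PyTorch; the function name, the fixed temperature value, and the explicit L2 normalization are illustrative assumptions (the released CLIP models learn the temperature during training):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric cross-entropy loss over an N x N similarity matrix.

    image_embeds, text_embeds: (N, d) tensors; row i of each comes from
    the same image-caption pair.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # logits[i, j] = similarity between image i and caption j, scaled by 1/T.
    logits = image_embeds @ text_embeds.T / temperature

    # The matching pair sits on the diagonal, so the "correct class"
    # for row/column i is i itself.
    targets = torch.arange(logits.shape[0], device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)    # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```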

For example, Sigmoid CLIP (SigLIP)[13] proposes the following loss function:

L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} f\!\left( (2\delta_{i,j} - 1)\left( e^{t}\, w_i \cdot v_j + b \right) \right)

where f(x) = \ln(1 + e^{-x}) is the negative log sigmoid loss, \delta_{i,j} is the Kronecker delta (equal to 1 when i = j and 0 otherwise), and t and b are learned scalar parameters.
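
A minimal PyTorch sketch of this pairwise sigmoid loss, assuming the embeddings are already normalized and that t and b are learnable scalar tensors (names are illustrative):

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_embeds, text_embeds, t, b):
    """Pairwise sigmoid loss: every (image i, text j) pair is treated as an
    independent binary classification problem (match vs. non-match).

    t, b: learnable scalar tensors (log temperature and bias).
    """
    n = image_embeds.shape[0]
    logits = torch.exp(t) * (image_embeds @ text_embeds.T) + b   # (N, N)
    labels = 2 * torch.eye(n, device=logits.device) - 1          # +1 on diagonal, -1 off
    # softplus(-x) = log(1 + exp(-x)) is the negative log-sigmoid loss f(x).
    return F.softplus(-labels * logits).sum() / n
```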

The image encoding models used in CLIP are typically vision transformers (ViT).

The naming convention for these models often reflects the specific ViT architecture used.

For instance, "ViT-L/14" means a "vision transformer large" (compared to other models in the same series) with a patch size of 14, meaning that the image is divided into 14-by-14 pixel patches before being processed by the transformer.

The size indicator ranges over B, L, H, and G, standing for base, large, huge, and giant, in increasing order of model size.
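
As a back-of-the-envelope check of this naming scheme, the following sketch (assuming a 224×224 input, which the released ViT-L/14 models use, and an illustrative embedding width) shows how the patch grid and the patch-embedding step follow from the patch size:

```python
import torch

image_size, patch_size, channels, embed_dim = 224, 14, 3, 1024  # ViT-L/14-like values

num_patches = (image_size // patch_size) ** 2   # 16 x 16 = 256 patches
patch_dim = channels * patch_size * patch_size  # 3 * 14 * 14 = 588 values per patch

# Patchify a batch of images and apply the linear patch embedding.
images = torch.randn(1, channels, image_size, image_size)
patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, patch_dim)
tokens = torch.nn.Linear(patch_dim, embed_dim)(patches)  # shape (1, 256, 1024)
```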

Besides ViT, the image encoder can also be a convolutional neural network, such as ResNet (in the original series by OpenAI) or ConvNeXt[14] (in the OpenCLIP model series by LAION[15]).

OpenAI's implementation of ResNet was the same as the original one,[18] with three modifications: a ResNet-D-style stem, antialiased rect-2 blur pooling, and a final attention-pooling layer in place of global average pooling. ALIGN[11] instead used EfficientNet[22] of various sizes, a kind of convolutional neural network.

The text encoding models used in CLIP are typically Transformers.

The text encoder takes the activations of the highest layer of the transformer at the [EOS] token, applies LayerNorm, and then a final linear map to produce the text vector.
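
Schematically, this pooling step can be sketched as follows; the tensor names and helper signature are illustrative rather than OpenAI's actual code:

```python
import torch

def pool_text_features(hidden_states, token_ids, eos_token_id, ln, proj):
    """hidden_states: (batch, seq_len, width) activations from the top
    transformer layer; ln: LayerNorm; proj: linear map into the joint space."""
    # Position of the [EOS] token in each sequence (first occurrence).
    eos_positions = (token_ids == eos_token_id).int().argmax(dim=-1)
    batch_index = torch.arange(hidden_states.shape[0], device=hidden_states.device)
    eos_states = hidden_states[batch_index, eos_positions]   # (batch, width)
    return proj(ln(eos_states))                               # (batch, embed_dim)
```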

The CLIP models released by OpenAI were trained on a dataset called "WebImageText" (WIT) containing 400 million pairs of images and their corresponding captions scraped from the internet.[1]

The dataset contains 500,000 text-queries, with up to 20,000 (image, text) pairs per query.

The text-queries were generated by starting with all words occurring at least 100 times in English Wikipedia, then extended by bigrams with high mutual information, names of all Wikipedia articles above a certain search volume, and WordNet synsets.

CLIP preprocessing normalizes each image channel with mean [0.48145466, 0.4578275, 0.40821073] and standard deviation [0.26862954, 0.26130258, 0.27577711]. These numbers slightly differ from the standard preprocessing for ImageNet, which uses [0.485, 0.456, 0.406] and [0.229, 0.224, 0.225].
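
With these constants, a preprocessing pipeline along the lines of the released models can be sketched with torchvision (the bicubic resize to 224 pixels and center crop match the common ViT variants; treat the exact resizing strategy as an assumption):

```python
from torchvision import transforms

clip_preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
                         std=[0.26862954, 0.26130258, 0.27577711]),
])
```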

ALIGN[11] used over one billion image-text pairs, obtained by extracting images and their alt-text from web crawls.

The largest ResNet model took 18 days to train on 592 V100 GPUs.

CLIP's cross-modal retrieval enables the alignment of visual and textual data in a shared latent space, allowing users to retrieve images based on text descriptions and vice versa, without the need for explicit image annotations.

In image-to-text retrieval, images are used to find related text content.

CLIP’s ability to connect visual and textual data has found applications in multimedia search, content discovery, and recommendation systems.[31][32]
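
Concretely, text-to-image retrieval over a gallery of precomputed image embeddings reduces to a nearest-neighbour search by dot product, as in this sketch (array names are placeholders):

```python
import numpy as np

def retrieve_images(text_embedding, image_embeddings, k=5):
    """Return the indices of the k gallery images most similar to the query text.

    text_embedding: (d,) L2-normalized query vector.
    image_embeddings: (num_images, d) L2-normalized gallery matrix.
    """
    scores = image_embeddings @ text_embedding  # cosine similarities
    return np.argsort(-scores)[:k]              # indices of top-k matches
```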

CLIP can perform zero-shot image classification tasks. This is achieved by comparing the embedding of an image with the embeddings of prompt texts such as "A photo of a {class}.", and the {class} that results in the highest dot product is outputted.

For example, during the training of Google DeepMind's Flamingo (2022),[33] the authors trained a CLIP pair, with BERT as the text encoder and NormalizerFree ResNet F6[34] as the image encoder.

Architecture overview of CLIP.
Vision Transformer architecture. The output representation of the <CLS> token is used as the image encoding for CLIP.
One decoder layer. The Transformer used in the CLIP text encoder was made by removing the cross-attention module, then stacking the resulting module 12 times.