Latent diffusion model

The Latent Diffusion Model (LDM)[1] is a diffusion model architecture developed by the CompVis (Computer Vision & Learning)[2] group at LMU Munich.

The LDM improves on the standard diffusion model (DM) by performing diffusion in a compressed latent space rather than in pixel space, and by allowing conditioning through self-attention and cross-attention.

For instance, Stable Diffusion versions 1.1 to 2.1 were based on the LDM architecture.

A 2019 paper proposed the noise conditional score network (NCSN), or score-matching with Langevin dynamics (SMLD).[7]

The paper was accompanied by a software package, written in PyTorch, released on GitHub.[8]

A 2020 paper[9] proposed the Denoising Diffusion Probabilistic Model (DDPM), which improves upon the previous method using variational inference.

The paper was accompanied by a software package, written in TensorFlow, released on GitHub.

Substantial information concerning Stable Diffusion v1 was only added to GitHub on August 10, 2022.[16]

All of Stable Diffusion (SD) versions 1.1 to XL were particular instantiations of the LDM architecture.

SD 1.2 was finetuned to 1.3, 1.4 and 1.5, with 10% of text-conditioning dropped, to improve classifier-free guidance.[18]
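As a rough sketch of how that dropout enables classifier-free guidance at sampling time (the names unet, text_emb, and null_emb are illustrative, not the repository's API):

    def cfg_noise_prediction(unet, z_t, t, text_emb, null_emb, guidance_scale=7.5):
        # Noise prediction conditioned on the prompt and on the "empty" prompt.
        eps_cond = unet(z_t, t, context=text_emb)
        eps_uncond = unet(z_t, t, context=null_emb)
        # Classifier-free guidance extrapolates away from the unconditional prediction.
        return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # During training, the text embedding is replaced by the "empty prompt" embedding
    # roughly 10% of the time, so the same network learns both predictions.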

While the LDM can generate arbitrary data conditioned on arbitrary data, for concreteness we describe its operation for conditional text-to-image generation.

LDM consists of a variational autoencoder (VAE), a modified U-Net, and a text encoder.

The VAE encoder compresses the image from pixel space to a lower-dimensional latent space, capturing a more fundamental semantic meaning of the image.

Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion.
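A minimal sketch of that forward (noising) step in latent space, using the standard closed-form DDPM formulation; the beta schedule below is illustrative:

    import torch

    # Illustrative linear noise schedule with 1000 steps.
    betas = torch.linspace(1e-4, 0.02, 1000)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    def forward_diffuse(z0, t, alphas_cumprod):
        # Sample z_t ~ N(sqrt(a_bar_t) * z0, (1 - a_bar_t) * I) in one shot.
        noise = torch.randn_like(z0)
        a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)       # broadcast over (C, H, W)
        zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise
        return zt, noise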

The U-Net block, composed of a ResNet backbone, iteratively denoises the output of forward diffusion, running the process backwards to recover a clean latent representation.[4]

The denoising step can be conditioned on a string of text, an image, or another modality.

The encoded conditioning data is exposed to the denoising U-Net via a cross-attention mechanism.
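A minimal sketch of such a cross-attention layer: queries come from the U-Net's flattened spatial features, while keys and values come from the encoded conditioning, e.g. text-token embeddings. The module and dimension names are illustrative, not the repository's exact API.

    import torch
    import torch.nn.functional as F
    from torch import nn

    class CrossAttention(nn.Module):
        def __init__(self, dim, context_dim, heads=8):
            super().__init__()
            self.heads = heads
            self.to_q = nn.Linear(dim, dim, bias=False)          # queries from image features
            self.to_k = nn.Linear(context_dim, dim, bias=False)  # keys from conditioning
            self.to_v = nn.Linear(context_dim, dim, bias=False)  # values from conditioning
            self.to_out = nn.Linear(dim, dim)

        def forward(self, x, context):
            # x: (B, N, dim) flattened spatial features; context: (B, M, context_dim) tokens.
            B, N, _ = x.shape
            q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
            split = lambda y: y.view(B, -1, self.heads, y.shape[-1] // self.heads).transpose(1, 2)
            out = F.scaled_dot_product_attention(split(q), split(k), split(v))
            out = out.transpose(1, 2).reshape(B, N, -1)
            return self.to_out(out)

With self-attention, the context is simply the feature sequence x itself.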

To encode an RGB image, its three channels are divided by the maximum value, resulting in a tensor of values in [0, 1]; for a 512×512 image, this is a tensor of shape (3, 512, 512).[19][20]

In the implemented version,[3]: ldm/models/autoencoder.py the encoder is a convolutional neural network (CNN) with a single self-attention mechanism near the end.

The encoder outputs a tensor of shape (8, 64, 64), being the concatenation of the predicted mean and variance of the latent vector, each of shape (4, 64, 64).
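A sketch of how this output can be turned into a sampled latent via the reparameterization trick; the encoder here is a stand-in, and implementations commonly store the log-variance rather than the variance itself:

    import torch

    def encode_to_latent(encoder, image):
        # image: (B, 3, 512, 512) with values in [0, 1]; encoder output: (B, 8, 64, 64).
        moments = encoder(image)
        mean, logvar = torch.chunk(moments, 2, dim=1)   # each (B, 4, 64, 64)
        std = torch.exp(0.5 * logvar)
        return mean + std * torch.randn_like(mean)      # reparameterization trick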

However, the U-Net backbone has additional modules that allow it to handle the conditioning embedding.

As an illustration, we describe a single down-scaling layer in the backbone, sketched in pseudocode below; the detailed architecture may be found in the reference implementation.
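A minimal sketch of such a layer, under simplifying assumptions: a residual block that injects the timestep embedding, cross-attention over the flattened feature map standing in for the SpatialTransformer (which in the real implementation also contains self-attention, normalization, and a feed-forward sublayer), and a strided convolution that halves the spatial resolution. The ResBlock and DownBlock names are hypothetical.

    import torch
    import torch.nn.functional as F
    from torch import nn

    class ResBlock(nn.Module):
        # Simplified residual block; the timestep embedding is injected as a per-channel bias.
        def __init__(self, channels, emb_dim):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
            self.emb_proj = nn.Linear(emb_dim, channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

        def forward(self, x, t_emb):
            h = F.silu(self.conv1(x))
            h = h + self.emb_proj(t_emb)[:, :, None, None]
            return x + self.conv2(F.silu(h))

    class DownBlock(nn.Module):
        # One down-scaling layer: residual block, cross-attention, stride-2 convolution.
        def __init__(self, channels, emb_dim, context_dim, heads=8):
            super().__init__()
            self.res = ResBlock(channels, emb_dim)
            self.attn = nn.MultiheadAttention(channels, heads, kdim=context_dim,
                                              vdim=context_dim, batch_first=True)
            self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)

        def forward(self, x, t_emb, context):
            x = self.res(x, t_emb)
            b, c, h, w = x.shape
            flat = x.flatten(2).transpose(1, 2)              # (B, H*W, C)
            attn_out, _ = self.attn(flat, context, context)  # queries from image, keys/values from text
            x = (flat + attn_out).transpose(1, 2).reshape(b, c, h, w)
            return self.down(x)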

The U-Net is trained to predict the noise that was added to the latent at each diffusion step; this is typically done using a mean squared error (MSE) loss function.
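A sketch of one training step under these assumptions, reusing encode_to_latent and forward_diffuse from the sketches above; unet and text_emb are stand-ins for the denoiser and the text encoder's output:

    import torch

    def training_step(unet, encoder, image, text_emb, alphas_cumprod):
        z0 = encode_to_latent(encoder, image)                       # clean latent
        t = torch.randint(0, len(alphas_cumprod), (z0.shape[0],))   # random timestep per sample
        zt, noise = forward_diffuse(z0, t, alphas_cumprod)          # noised latent and target noise
        pred = unet(zt, t, context=text_emb)                        # predict the added noise
        return torch.nn.functional.mse_loss(pred, noise)            # MSE between predicted and true noise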

Once the model is trained, it can be used to generate new images by running the reverse diffusion process starting from a random noise sample in latent space, then decoding the result back to pixel space with the VAE decoder.
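A sketch of that reverse process using the plain DDPM update rule; in practice faster samplers (e.g. DDIM) and the guidance step shown earlier are used, and unet and decoder are stand-ins for the trained networks:

    import torch

    @torch.no_grad()
    def sample(unet, decoder, text_emb, betas, shape=(1, 4, 64, 64)):
        alphas = 1.0 - betas
        alphas_cumprod = torch.cumprod(alphas, dim=0)
        z = torch.randn(shape)                                  # start from pure Gaussian noise
        for t in reversed(range(len(betas))):
            t_batch = torch.full((shape[0],), t, dtype=torch.long)
            eps = unet(z, t_batch, context=text_emb)            # predicted noise
            # DDPM posterior mean for z_{t-1}.
            z = (z - betas[t] / (1.0 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
            if t > 0:
                z = z + betas[t].sqrt() * torch.randn_like(z)   # add fresh noise except at the last step
        return decoder(z)                                       # VAE decoder maps the latent back to pixels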

Figure: A single cross-attention mechanism as it appears in a standard Transformer language model.
Figure: Block diagram of the full Transformer architecture. The stack on the right is a standard pre-LN Transformer decoder, which is essentially the same as the SpatialTransformer.