The Latent Diffusion Model (LDM)[1] is a diffusion model architecture developed by the CompVis (Computer Vision & Learning)[2] group at LMU Munich.
The LDM improves on the standard diffusion model (DM) by performing the diffusion process in a learned latent space rather than in pixel space, and by allowing conditioning via self-attention and cross-attention.
For instance, Stable Diffusion versions 1.1 to 2.1 were based on the LDM architecture.[6]
A 2019 paper proposed the noise conditional score network (NCSN), or score matching with Langevin dynamics (SMLD).[7]
The paper was accompanied by a software package written in PyTorch and released on GitHub.[8]
A 2020 paper[9] proposed the Denoising Diffusion Probabilistic Model (DDPM), which improves upon the previous method by using variational inference.
The paper was accompanied by a software package written in TensorFlow and released on GitHub.
Substantial information concerning Stable Diffusion v1 was only added to GitHub on August 10, 2022.[16]
All of Stable Diffusion (SD) versions 1.1 to XL were particular instantiations of the LDM architecture.
SD 1.2 was fine-tuned into 1.3, 1.4 and 1.5, with 10% of the text-conditioning dropped during training to improve classifier-free guidance.[18]
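Dropping the conditioning for a fraction of the training examples lets a single network learn both conditional and unconditional noise prediction, which can then be combined at sampling time. The following is a minimal sketch of this idea, assuming a generic unet(z_t, t, cond) callable and an empty_emb embedding of the empty prompt; the names and the guidance scale are illustrative assumptions, not the Stable Diffusion training code.

    import torch

    def train_forward_with_dropout(unet, z_t, t, text_emb, empty_emb, p_drop=0.1):
        """Replace the text embedding with the empty-prompt embedding for ~10% of
        the batch, so the U-Net also learns unconditional noise prediction."""
        drop = torch.rand(z_t.shape[0], device=z_t.device) < p_drop
        cond = torch.where(drop[:, None, None], empty_emb.expand_as(text_emb), text_emb)
        return unet(z_t, t, cond)

    def classifier_free_guidance(unet, z_t, t, text_emb, empty_emb, scale=7.5):
        """At inference, extrapolate from the unconditional prediction toward the
        conditional one; scale > 1 strengthens adherence to the prompt."""
        eps_cond = unet(z_t, t, text_emb)
        eps_uncond = unet(z_t, t, empty_emb.expand(z_t.shape[0], -1, -1))
        return eps_uncond + scale * (eps_cond - eps_uncond)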
While the LDM can work for generating arbitrary data conditioned on arbitrary data, for concreteness we describe its operation in conditional text-to-image generation.
An LDM consists of a variational autoencoder (VAE), a modified U-Net, and a text encoder.
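As a concrete illustration, these three components are what a packaged Stable Diffusion implementation exposes. The snippet below is a usage sketch based on the Hugging Face diffusers library; the checkpoint name and prompt are only examples and are not part of the LDM reference code.

    import torch
    from diffusers import StableDiffusionPipeline

    # Load a Stable Diffusion checkpoint, an instantiation of the LDM architecture.
    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    ).to("cuda")

    # The three LDM components are exposed as attributes of the pipeline.
    vae, unet, text_encoder = pipe.vae, pipe.unet, pipe.text_encoder

    # Text-to-image: encode the prompt, denoise in latent space, decode with the VAE.
    image = pipe("a photograph of an astronaut riding a horse").images[0]
    image.save("astronaut.png")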
The VAE encoder compresses the image from pixel space into a lower-dimensional latent space, capturing a more fundamental semantic meaning of the image.
Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion.
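The forward process has a closed form, so a noised latent at any timestep can be sampled directly. Below is a minimal sketch assuming the standard linear DDPM noise schedule; variable names are illustrative.

    import torch

    # Linear beta schedule as in DDPM; 1000 steps is the usual choice.
    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    def add_noise(z0, t):
        """Sample z_t ~ q(z_t | z_0) = N(sqrt(abar_t) * z_0, (1 - abar_t) * I)."""
        noise = torch.randn_like(z0)
        abar = alphas_cumprod[t].view(-1, 1, 1, 1)   # broadcast over (B, C, H, W)
        return abar.sqrt() * z0 + (1.0 - abar).sqrt() * noise, noise

    # Example: noise a batch of latents of shape (B, 4, 64, 64) at random timesteps.
    z0 = torch.randn(2, 4, 64, 64)
    t = torch.randint(0, T, (2,))
    z_t, eps = add_noise(z0, t)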
The U-Net block, composed of a ResNet backbone, iteratively denoises the output of forward diffusion to recover a clean latent representation.[4]
The denoising step can be conditioned on a string of text, an image, or another modality.
The encoded conditioning data is exposed to the denoising U-Net via a cross-attention mechanism.
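A minimal single-head cross-attention sketch is shown below: queries come from the U-Net's spatial features and keys/values from the text-encoder output, so every spatial position can attend to every prompt token. The dimensions and class name are illustrative, not the exact Stable Diffusion implementation.

    import math
    import torch
    import torch.nn as nn

    class CrossAttention(nn.Module):
        """Queries from U-Net features; keys/values from conditioning embeddings."""
        def __init__(self, dim, cond_dim):
            super().__init__()
            self.to_q = nn.Linear(dim, dim, bias=False)
            self.to_k = nn.Linear(cond_dim, dim, bias=False)
            self.to_v = nn.Linear(cond_dim, dim, bias=False)
            self.to_out = nn.Linear(dim, dim)

        def forward(self, x, cond):
            # x:    (B, H*W, dim)       flattened U-Net feature map
            # cond: (B, tokens, c_dim)  text-encoder output
            q, k, v = self.to_q(x), self.to_k(cond), self.to_v(cond)
            attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(q.shape[-1]), dim=-1)
            return self.to_out(attn @ v)

    # Example: a 64x64 feature map with 320 channels attending to 77 text tokens.
    layer = CrossAttention(dim=320, cond_dim=768)
    out = layer(torch.randn(1, 64 * 64, 320), torch.randn(1, 77, 768))  # (1, 4096, 320)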
To encode an RGB image, its three channels are divided by the maximum value, resulting in a tensor of shape (3, 512, 512) with all entries in the range [0, 1].[19][20]
In the implemented version,[3]: ldm/models/autoencoder.py the encoder is a convolutional neural network (CNN) with a single self-attention mechanism near the end.
It outputs a tensor of shape (8, 64, 64), being the concatenation of the predicted mean and variance of the latent vector, each of shape (4, 64, 64).
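Implementations typically predict the log-variance rather than the variance itself and sample the latent with the reparameterization trick. The following sketch assumes that convention and uses the shapes given above; the function name is illustrative.

    import torch

    def sample_latent(encoder_out):
        """encoder_out: (B, 8, 64, 64); first 4 channels are the mean, last 4 the
        log-variance of the latent distribution."""
        mean, logvar = encoder_out.chunk(2, dim=1)
        std = torch.exp(0.5 * logvar)
        return mean + std * torch.randn_like(mean)   # reparameterization trick

    # Example with a dummy encoder output for a single 512x512 image.
    z = sample_latent(torch.randn(1, 8, 64, 64))     # latent of shape (1, 4, 64, 64)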
The U-Net backbone is similar to the one used in standard DDPMs; however, it has additional modules that allow it to handle the conditioning embedding.
As an illustration, consider a single down-scaling layer in the backbone: it combines a residual block (ResBlock), which also consumes the timestep embedding, with a SpatialTransformer block that cross-attends to the conditioning embedding, followed by a down-sampling step; a pseudocode sketch is given below. The detailed architecture may be found in the reference implementation.[3]
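Below is a simplified sketch of one such stage, assuming a residual block that adds a projected timestep embedding, followed by cross-attention with the conditioning and a strided-convolution downsample; the module internals are placeholders rather than the exact CompVis code, and CrossAttention is the sketch from above.

    import torch
    import torch.nn as nn

    class DownBlock(nn.Module):
        """One down-scaling stage: ResBlock (uses the timestep embedding) ->
        SpatialTransformer-style cross-attention -> strided-conv downsample."""
        def __init__(self, channels, cond_dim, t_dim):
            super().__init__()
            self.res = nn.Sequential(nn.GroupNorm(32, channels), nn.SiLU(),
                                     nn.Conv2d(channels, channels, 3, padding=1))
            self.t_proj = nn.Linear(t_dim, channels)
            self.attn = CrossAttention(channels, cond_dim)   # as sketched above
            self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)

        def forward(self, x, t_emb, cond):
            # Residual block with the projected timestep embedding added in.
            h = x + self.res(x) + self.t_proj(t_emb)[:, :, None, None]
            # Cross-attention: flatten to tokens, attend to the conditioning, reshape back.
            b, c, hh, ww = h.shape
            tokens = h.flatten(2).transpose(1, 2)            # (B, H*W, C)
            tokens = tokens + self.attn(tokens, cond)
            h = tokens.transpose(1, 2).reshape(b, c, hh, ww)
            return self.down(h)                              # halve the spatial size

    # Example: a 64x64 latent feature map conditioned on 77 text tokens.
    block = DownBlock(channels=320, cond_dim=768, t_dim=1280)
    y = block(torch.randn(1, 320, 64, 64), torch.randn(1, 1280), torch.randn(1, 77, 768))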
The U-Net is trained to predict the noise that was added to the latent at each diffusion step; this is typically done using a mean squared error (MSE) loss between the predicted and the actual noise.
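A minimal training-step sketch under this noise-prediction objective is given below; vae_encode, text_encode, and unet are placeholders for the corresponding components, and add_noise is the forward-diffusion sketch from above.

    import torch
    import torch.nn.functional as F

    def training_step(unet, vae_encode, text_encode, images, prompts):
        z0 = vae_encode(images)                   # (B, 4, 64, 64) latents
        cond = text_encode(prompts)               # (B, tokens, dim) text embeddings
        t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
        z_t, noise = add_noise(z0, t)             # forward diffusion in latent space
        noise_pred = unet(z_t, t, cond)           # U-Net predicts the added noise
        return F.mse_loss(noise_pred, noise)      # MSE between predicted and true noise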
Once the model is trained, it can be used to generate new images by running the reverse diffusion process starting from a random noise sample in the latent space, then converting the resulting latent back into pixel space with the VAE decoder.
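The following is a sketch of DDPM-style ancestral sampling in the latent space followed by decoding; it reuses the schedule from the forward-diffusion sketch, unet and vae_decode are placeholders, and practical implementations usually substitute faster samplers such as DDIM.

    import torch

    @torch.no_grad()
    def sample(unet, vae_decode, cond, shape=(1, 4, 64, 64)):
        alphas = 1.0 - betas                      # schedule defined earlier
        z = torch.randn(shape)                    # start from pure Gaussian noise
        for t in reversed(range(T)):
            eps = unet(z, torch.full((shape[0],), t), cond)
            abar = alphas_cumprod[t]
            # Mean of p(z_{t-1} | z_t) computed from the predicted noise.
            z = (z - (1.0 - alphas[t]) / (1.0 - abar).sqrt() * eps) / alphas[t].sqrt()
            if t > 0:
                z = z + betas[t].sqrt() * torch.randn_like(z)
        return vae_decode(z)                      # convert the latent back to pixel space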