There are various equivalent formalisms, including Markov chains, denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations.
These typically involve training a neural network to sequentially denoise images blurred with Gaussian noise.
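The training pairs come from the forward (noising) process. A minimal sketch, assuming a linear variance schedule; names such as `alpha_bar` and `noise_image` are illustrative, not from a specific library:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # variance schedule beta_t
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)   # cumulative product ᾱ_t

def noise_image(x0, t):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(ᾱ_t) x_0, (1 - ᾱ_t) I)."""
    eps = torch.randn_like(x0)
    xt = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps
    return xt, eps  # eps is the regression target for the denoiser
```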
Diffusion-based image generators, such as Stable Diffusion and DALL-E, have seen widespread commercial interest.
The randomness is necessary: if the particles were to undergo only gradient descent, they would all fall to the origin, collapsing the distribution.
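The role of the noise term can be seen in a single Langevin update. A minimal sketch with a toy score function (not from any particular library):

```python
import torch

def langevin_step(x, score, step_size=1e-2):
    """One step of Langevin dynamics: gradient ascent on the log-density
    plus injected Gaussian noise. Dropping the noise term reduces this
    to gradient descent, which collapses all particles onto the modes."""
    noise = torch.randn_like(x)
    return x + 0.5 * step_size * score(x) + (step_size ** 0.5) * noise

# Toy example: the score of a standard Gaussian is -x, so without the
# noise term every particle would fall to the origin.
x = torch.randn(1000, 2) * 5
for _ in range(500):
    x = langevin_step(x, score=lambda x: -x)
```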
The 2020 paper proposed the Denoising Diffusion Probabilistic Model (DDPM), which improves upon the previous method via variational inference.
The DDPM paper suggested not learning it (since it resulted in "unstable training and poorer sample quality"), but fixing it at some value $\sigma_t^2 I$, where either $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde\beta_t$ yielded similar results.
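With the variance fixed, one reverse step takes a simple form. A minimal sketch, assuming a trained noise-prediction network `eps_model` and the `betas`/`alpha_bar` schedule from above:

```python
import torch

def ddpm_step(eps_model, xt, t, betas, alpha_bar):
    """One reverse (denoising) step with the variance fixed at
    σ_t² = β_t rather than learned. `eps_model` is an assumed
    noise-prediction network taking (x_t, t)."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    eps = eps_model(xt, t)
    mean = (xt - beta_t / (1 - alpha_bar[t]).sqrt() * eps) / alpha_t.sqrt()
    if t == 0:
        return mean  # no noise is added at the final step
    return mean + beta_t.sqrt() * torch.randn_like(xt)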
They are also called noise conditional score networks (NCSN), or score-matching with Langevin dynamics (SMLD).
This gives a loss function, also known as the Hyvärinen scoring rule, that can be minimized by stochastic gradient descent.
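In practice the objective is usually implemented in its denoising form, where the score of the Gaussian perturbation kernel is known in closed form. A minimal sketch, assuming a score network `score_model` taking `(x, sigma)`:

```python
import torch

def denoising_score_matching_loss(score_model, x0, sigma=0.1):
    """Denoising variant of score matching: perturb the data with
    Gaussian noise and regress the network onto the known score of the
    perturbation kernel, -(x̃ - x)/σ² = -ε/σ."""
    eps = torch.randn_like(x0)
    x_noisy = x0 + sigma * eps
    target = -eps / sigma
    return ((score_model(x_noisy, sigma) - target) ** 2).mean()
```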
In the continuous-time limit, we obtain a continuous diffusion process in the form of a stochastic differential equation: $dx_t = -\tfrac{1}{2}\beta(t)\,x_t\,dt + \sqrt{\beta(t)}\,dW_t$, where $W_t$ is a standard Wiener process.
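This SDE can be simulated by the Euler–Maruyama method. A minimal sketch, with `beta` standing in for any callable variance schedule:

```python
import torch

def euler_maruyama(x0, beta, n_steps=1000, t_max=1.0):
    """Simulate dx = -½ β(t) x dt + sqrt(β(t)) dW with Euler–Maruyama:
    a deterministic drift step plus a Gaussian increment of variance
    β(t) dt at each time step."""
    dt = t_max / n_steps
    x = x0.clone()
    for i in range(n_steps):
        t = i * dt
        drift = -0.5 * beta(t) * x
        x = x + drift * dt + (beta(t) * dt) ** 0.5 * torch.randn_like(x)
    return x
```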
The name "noise conditional score network" is explained thus: DDPM and score-based generative models are equivalent.
The term inside becomes a least-squares regression, so if the network actually reaches the global minimum of the loss, then the optimal noise predictor recovers the score function: $\epsilon_\theta(x_t, t) = -\sqrt{1-\bar\alpha_t}\,\nabla_{x_t}\ln q(x_t)$.
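The resulting simplified objective is an ordinary least-squares regression onto the noise actually added. A minimal sketch, assuming a noise-prediction network `eps_model` taking `(x_t, t)`:

```python
import torch

def ddpm_loss(eps_model, x0, alpha_bar):
    """Simplified DDPM objective: sample a random timestep and noise,
    form x_t, and regress the network onto the noise."""
    t = torch.randint(0, len(alpha_bar), (x0.shape[0],))
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast ᾱ_t
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    return ((eps_model(xt, t) - eps) ** 2).mean()
```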
The original DDPM method for generating images is slow, since the forward diffusion process usually takes on the order of $T \sim 1000$ steps, and reverse sampling requires one network evaluation per step.
DDIM[25] is a method to take any model trained on the DDPM loss and use it to sample with some steps skipped, sacrificing an adjustable amount of quality.
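A minimal sketch of one deterministic DDIM update between two timesteps that need not be adjacent, again assuming a trained `eps_model`:

```python
import torch

def ddim_step(eps_model, xt, t, t_prev, alpha_bar):
    """One deterministic DDIM update from timestep t to an earlier
    timestep t_prev; because t_prev need not be t - 1, steps can be
    skipped during sampling."""
    eps = eps_model(xt, t)
    # Clean image implied by the current noise estimate.
    x0_pred = (xt - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
    # Re-noise the prediction down to the noise level of t_prev.
    return alpha_bar[t_prev].sqrt() * x0_pred + (1 - alpha_bar[t_prev]).sqrt() * eps
```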
This parameterization was found to improve performance, as the model can be trained to reach total noise (i.e. a terminal state of pure Gaussian noise) exactly.[28]
The original publication used CLIP text encoders to improve text-conditional image generation.
Taking the perspective of the noisy channel model, we can understand the process as follows: to generate an image $x$ conditional on a description $y$, we imagine that the requester had an image $x$ in mind, but that it passed through a noisy channel and came out garbled as the text $y$; image generation then amounts to inferring which $x$ most plausibly produced $y$.
Therefore, classifier guidance works for denoising diffusion as well, using the modified noise prediction $\epsilon_\theta^{\text{guided}}(x_t, t) = \epsilon_\theta(x_t, t) - \gamma\sqrt{1-\bar\alpha_t}\,\nabla_{x_t}\ln p(y \mid x_t)$, where $\gamma$ is the guidance scale.[30]
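A minimal sketch of this guided prediction, assuming a noise predictor `eps_model` and an auxiliary `classifier` trained on noisy images (both hypothetical networks, not a specific library's API):

```python
import torch

def guided_eps(eps_model, classifier, xt, t, y, alpha_bar, scale=1.0):
    """Classifier-guided noise prediction: shift the predicted noise
    against the gradient of log p(y | x_t) computed by autograd through
    the classifier."""
    xt = xt.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(xt, t), dim=-1)
    selected = log_probs[torch.arange(len(y)), y]   # log p(y | x_t)
    grad = torch.autograd.grad(selected.sum(), xt)[0]
    return eps_model(xt, t) - scale * (1 - alpha_bar[t]).sqrt() * grad
```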
The idea of optimal transport flow[46] is to construct a probability path minimizing the Wasserstein metric.
Rectified flow injects the strong prior that intermediate trajectories are straight; this makes it both relevant to optimal transport theory and computationally efficient, since ODEs with straight paths can be simulated exactly without time discretization.
is "projected" into a space of causally simulatable ODEs, by minimizing the least squares loss with respect to the direction
This rectifying process is also known as Flow Matching,[49] Stochastic Interpolation,[50] and alpha-(de)blending.[51]
A distinctive aspect of rectified flow is its capability for "reflow", which straightens the trajectory of ODE paths.
This "reflow" process not only reduces transport costs but also straightens the paths of rectified flows, making
This framework encompasses DDIM and probability flow ODEs as special cases, with particular choices of
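A minimal sketch of sampling from such a learned ODE with plain Euler integration, assuming the `velocity_model` from above:

```python
import torch

def sample_ode(velocity_model, x0, n_steps=4):
    """Integrate dx/dt = v(x, t) with Euler steps from noise x0 at t = 0
    to data at t = 1. If the learned paths are nearly straight, very few
    steps (in the limit, one) give accurate samples."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0],), i * dt).view(-1, *([1] * (x.dim() - 1)))
        x = x + velocity_model(x, t) * dt
    return x
```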
Each human motion trajectory is a sequence of poses, represented by either joint rotations or positions.
Instead, it uses a Transformer architecture that autoregressively generates a sequence of tokens, which is then converted to an image by the decoder of a discrete VAE.
The denoising network is a U-Net with cross-attention blocks to allow for conditional image generation.
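A minimal sketch of how cross-attention injects conditioning into one U-Net stage: image features provide the queries, text-encoder outputs the keys and values. The class and its dimensions are illustrative, not any library's actual module:

```python
import torch
from torch import nn

class CrossAttentionBlock(nn.Module):
    """Sketch of conditional cross-attention inside a diffusion U-Net.
    `channels` must be divisible by `heads`."""
    def __init__(self, channels, cond_dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads,
                                          kdim=cond_dim, vdim=cond_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x, cond):
        # x: (batch, h*w, channels) flattened image feature map
        # cond: (batch, tokens, cond_dim) text embeddings
        attended, _ = self.attn(self.norm(x), cond, cond)
        return x + attended  # residual connection
```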
Imagen (2022)[68][69] uses a T5-XXL language model to encode the input text into an embedding vector.
The first step denoises white noise into a 64×64 image, conditioned on the embedding vector of the text.
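The overall cascade can be summarized in pseudostructure. All names here (`text_encoder`, `base_model.sample`, `sr_model.sample`) are hypothetical placeholders for illustration, not Imagen's actual API:

```python
import torch

def cascaded_generate(text_encoder, base_model, sr_model, prompt):
    """Sketch of a cascaded text-to-image pipeline in the style of
    Imagen: a base diffusion model produces a 64×64 image from the text
    embedding, then a super-resolution diffusion model upsamples it."""
    emb = text_encoder(prompt)              # frozen T5-style text encoder
    x = torch.randn(1, 3, 64, 64)           # start from white noise
    img64 = base_model.sample(x, cond=emb)  # denoise to a 64×64 image
    img_hi = sr_model.sample(img64, cond=emb)  # diffusion super-resolution
    return img_hi
```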
Movie Gen (2024) is a series of Diffusion Transformers that operate in latent space and are trained by flow matching.[78]