Flow-based generative model

A flow-based generative model is a generative model used in machine learning that explicitly models a probability distribution by leveraging normalizing flow,[1][2][3] which is a statistical method using the change-of-variable law of probabilities to transform a simple distribution into a complex one.

The direct modeling of likelihood provides many advantages: for example, the negative log-likelihood can be directly computed and minimized as the loss function.

In contrast, many alternative generative modeling methods such as variational autoencoder (VAE) and generative adversarial network do not explicitly represent the likelihood function.

Let $z_0$ be a (possibly multivariate) random variable with distribution $p_0(z_0)$, and for $i = 1, \dots, K$ let $z_i = f_i(z_{i-1})$ be a sequence of random variables obtained by applying invertible functions $f_1, \dots, f_K$, so that the final output $z_K$ models the target distribution. The log likelihood of $z_K$ is (see derivation):

$$\log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log \left| \det \frac{d f_i(z_{i-1})}{d z_{i-1}} \right|$$

To efficiently compute the log likelihood, the functions $f_1, \dots, f_K$ should be easily invertible, and the determinants of their Jacobians should be simple to compute. In practice, the functions $f_1, \dots, f_K$ are modeled using deep neural networks, and are trained to minimize the negative log-likelihood of data samples from the target distribution.

These architectures are usually designed such that only the forward pass of the neural network is required in both the inverse and the Jacobian determinant calculations.

Examples of such architectures include NICE,[4] RealNVP,[5] and Glow.
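As an illustration of such a design, the sketch below (a minimal, hypothetical PyTorch implementation, not the reference code of any of these papers) shows a Real NVP-style affine coupling layer: half of the input passes through unchanged, the other half is scaled and shifted by functions of the first half, so both the inverse and the log-determinant of the Jacobian come almost for free.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Real NVP-style affine coupling layer (illustrative sketch)."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d = dim // 2
        # Conditioner network: computes log-scale and shift from the first half.
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        log_s, t = self.net(x1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)              # bound the scales for stability
        y2 = x2 * torch.exp(log_s) + t
        log_det = log_s.sum(dim=1)             # triangular Jacobian: log|det J| = sum of log-scales
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        log_s, t = self.net(y1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        x2 = (y2 - t) * torch.exp(-log_s)
        return torch.cat([y1, x2], dim=1)
```

Composing several such layers (interleaved with permutations of the coordinates) gives a flow whose total log-determinant is the sum of the per-layer terms, as in the formula above.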

To derive the log likelihood, consider $z_1 = f_1(z_0)$, so that $z_0 = f_1^{-1}(z_1)$. By the change-of-variable formula, the distribution of $z_1$ is:

$$p_1(z_1) = p_0(z_0) \left| \det \frac{d f_1^{-1}(z_1)}{d z_1} \right|$$

where $\det \frac{d f_1^{-1}(z_1)}{d z_1}$ is the determinant of the Jacobian matrix of $f_1^{-1}$. By the inverse function theorem:

$$p_1(z_1) = p_0(z_0) \left| \det \left( \frac{d f_1(z_0)}{d z_0} \right)^{-1} \right|$$

By the identity $\det(A^{-1}) = \det(A)^{-1}$ (where $A$ is an invertible matrix), we have:

$$p_1(z_1) = p_0(z_0) \left| \det \frac{d f_1(z_0)}{d z_0} \right|^{-1}$$

The log likelihood is thus:

$$\log p_1(z_1) = \log p_0(z_0) - \log \left| \det \frac{d f_1(z_0)}{d z_0} \right|$$

In general, the above applies to any $z_i$ and $z_{i-1}$. Since $\log p_i(z_i)$ is equal to $\log p_{i-1}(z_{i-1})$ subtracted by a non-recursive term, we can infer by induction that:

$$\log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log \left| \det \frac{d f_i(z_{i-1})}{d z_{i-1}} \right|$$

As is generally done when training a deep learning model, the goal with normalizing flows is to minimize the Kullback–Leibler divergence between the model's likelihood and the target distribution to be estimated.

Denoting $p_\theta$ the model's likelihood and $p^{*}$ the target distribution to learn, the (forward) KL-divergence is:

$$D_{\mathrm{KL}}\left[p^{*}(x) \,\|\, p_{\theta}(x)\right] = -\mathbb{E}_{x \sim p^{*}(x)}\left[\log p_{\theta}(x)\right] + \mathbb{E}_{x \sim p^{*}(x)}\left[\log p^{*}(x)\right]$$

The second term on the right-hand side of the equation corresponds to the entropy of the target distribution and is independent of the parameter $\theta$ we want the model to learn, which only leaves the expectation of the negative log-likelihood to minimize under the target distribution.

This intractable term can be approximated with a Monte-Carlo method by importance sampling.

Indeed, if we have a dataset $\{x_i\}_{i=1}^{N}$ of samples each independently drawn from the target distribution $p^{*}(x)$, then this term can be estimated as:

$$-\hat{\mathbb{E}}_{x \sim p^{*}(x)}\left[\log p_{\theta}(x)\right] = -\frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}(x_i)$$

Therefore, the learning objective

$$\arg\min_{\theta} D_{\mathrm{KL}}\left[p^{*}(x) \,\|\, p_{\theta}(x)\right]$$

is replaced by

$$\arg\max_{\theta} \sum_{i=1}^{N} \log p_{\theta}(x_i)$$

In other words, minimizing the Kullback–Leibler divergence between the model's likelihood and the target distribution is equivalent to maximizing the model likelihood under observed samples of the target distribution.[7]

A pseudocode for training normalizing flows is as follows:[8]

INPUT. dataset $x_{1:n}$, normalizing flow model $f_\theta(\cdot)$ with base distribution $p_0$.
SOLVE. by gradient descent $\arg\max_\theta \sum_j \ln p_\theta(x_j)$
RETURN. $\hat\theta$
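A minimal PyTorch sketch of this training loop is given below. It assumes a hypothetical flow module that maps a data batch x to latent codes z and returns the log-determinant of that map (as the coupling-layer sketch above does), with a standard normal base distribution; this is an illustration of the objective, not the pseudocode cited above.

```python
import math
import torch

def train_flow(flow, data_loader, epochs=10, lr=1e-3):
    """Fit a normalizing flow by gradient descent on the negative log-likelihood."""
    opt = torch.optim.Adam(flow.parameters(), lr=lr)
    for _ in range(epochs):
        for x in data_loader:                       # x: tensor of shape (batch, dim)
            z, log_det = flow(x)                    # z = f^{-1}(x), log|det dz/dx|
            # log p_theta(x) = log N(z; 0, I) + log|det dz/dx|
            log_pz = -0.5 * (z ** 2).sum(dim=1) - 0.5 * z.shape[1] * math.log(2 * math.pi)
            loss = -(log_pz + log_det).mean()       # negative log-likelihood
            opt.zero_grad()
            loss.backward()
            opt.step()
    return flow
```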

The earliest example of a normalizing flow architecture is the planar flow.

In the generative flow model (Glow),[6] each layer has 3 parts: a channel-wise affine transform (activation normalization), an invertible 1x1 convolution, and an affine coupling layer as in Real NVP. The idea of using the invertible 1x1 convolution is to permute all the channels in general, instead of merely permuting the first and second half, as in Real NVP.
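A sketch of the invertible 1x1 convolution idea (an illustrative implementation assuming image-shaped inputs, not Glow's actual code, which also offers an LU parametrization of the weight for efficiency): every pixel's channel vector is multiplied by the same learned invertible matrix W, so the log-determinant of the whole transform is height times width times log|det W|.

```python
import torch
import torch.nn as nn

class Invertible1x1Conv(nn.Module):
    """Channel-mixing invertible 1x1 convolution (illustrative sketch)."""

    def __init__(self, channels):
        super().__init__()
        # Start from a random orthogonal matrix so the weight is invertible at init.
        w, _ = torch.linalg.qr(torch.randn(channels, channels))
        self.W = nn.Parameter(w)

    def forward(self, x):                  # x: (batch, channels, height, width)
        b, c, h, w = x.shape
        y = torch.einsum('ij,bjhw->bihw', self.W, x)
        # The same matrix acts at every spatial position.
        log_det = h * w * torch.slogdet(self.W)[1]
        return y, log_det

    def inverse(self, y):
        return torch.einsum('ij,bjhw->bihw', torch.inverse(self.W), y)
```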

An autoregressive model of a distribution on $\mathbb{R}^n$ can also be turned into a flow. Such a model generates each coordinate conditioned on the previous ones: $x_1 \sim N(\mu_1, \sigma_1^2)$ and $x_i \sim N(\mu_i(x_{1:i-1}), \sigma_i(x_{1:i-1})^2)$, where $\mu_i$ and $\sigma_i > 0$ are learned functions. By the reparameterization trick, the autoregressive model is generalized to a normalizing flow:

$$x_1 = \mu_1 + \sigma_1 z_1, \qquad x_i = \mu_i(x_{1:i-1}) + \sigma_i(x_{1:i-1})\, z_i, \qquad z \sim N(0, I_n)$$

The Jacobian of this map is triangular, so its log-determinant is simply $\sum_i \log \sigma_i$; this construction is used in the masked autoregressive flow (MAF).
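The sketch below illustrates this construction with NumPy, using simple hand-written stand-ins for the learned functions $\mu_i$ and $\sigma_i$ (these stand-ins are assumptions for illustration only): sampling proceeds coordinate by coordinate, while the density of a given x can be evaluated directly because the Jacobian is triangular.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the learned conditioners mu_i(x_{<i}) and sigma_i(x_{<i}).
def mu(prev):    return 0.5 * prev.sum()
def sigma(prev): return np.exp(0.1 * prev.sum())      # strictly positive

def sample(n_dim):
    """Sequential sampling: x_i = mu(x_{<i}) + sigma(x_{<i}) * z_i with z ~ N(0, I)."""
    z = rng.standard_normal(n_dim)
    x = np.zeros(n_dim)
    for i in range(n_dim):
        x[i] = mu(x[:i]) + sigma(x[:i]) * z[i]
    return x

def log_density(x):
    """log p(x) = log N(z; 0, I) - sum_i log sigma_i, since det(dx/dz) = prod_i sigma_i."""
    n = len(x)
    z = np.array([(x[i] - mu(x[:i])) / sigma(x[:i]) for i in range(n)])
    log_det = sum(np.log(sigma(x[:i])) for i in range(n))
    log_pz = -0.5 * (z ** 2).sum() - 0.5 * n * np.log(2 * np.pi)
    return log_pz - log_det

print(log_density(sample(5)))
```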

A continuous normalizing flow (CNF) constructs the flow as a continuous-time dynamic instead of a composition of discrete functions. Let $z_0$ be the latent variable with distribution $p(z_0)$. Map this latent variable to data space with the following flow function:

$$x = F(z_0) = z_T = z_0 + \int_0^T f(z_t, t)\, dt$$

where $f$ is the flow field, an arbitrary function that can be modeled with, e.g., neural networks. The inverse is obtained by integrating backwards, $z_0 = x - \int_0^T f(z_t, t)\, dt$, and the log-likelihood of $x$ is:

$$\log p(x) = \log p(z_0) - \int_0^T \operatorname{Tr}\!\left[\frac{\partial f}{\partial z_t}\right] dt$$
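A schematic integration of these two equations with a plain forward-Euler scheme and an exact Jacobian trace is shown below (a sketch feasible only in low dimensions; practical continuous flows use adaptive ODE solvers and stochastic trace estimators such as Hutchinson's).

```python
import torch

def cnf_push_forward(f, z0, T=1.0, steps=100):
    """Integrate dz/dt = f(z, t) and d(log p)/dt = -Tr(df/dz) with forward Euler."""
    dt = T / steps
    z = z0.clone()
    delta_logp = torch.zeros(())
    for k in range(steps):
        t = torch.tensor(k * dt)
        # Exact Jacobian trace: only feasible for small dimensions.
        J = torch.autograd.functional.jacobian(lambda u: f(u, t), z)
        delta_logp = delta_logp - dt * torch.diagonal(J).sum()
        z = z + dt * f(z, t)
    return z, delta_logp          # log p(x) = log p(z0) + delta_logp

# Example with a simple contracting flow field (hypothetical, for illustration).
f = lambda z, t: -0.5 * z
x, dlogp = cnf_push_forward(f, torch.randn(3))
```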

There are two main deficiencies of CNF. One is that a continuous flow must be a homeomorphism, and thus preserve orientation and ambient isotopy (for example, it is impossible to flip a left hand into a right hand by continuously deforming space, to turn a sphere inside out, or to undo a knot). The other is that the learned flow $f$ might be ill-behaved, due to degeneracy (that is, there is an infinite number of possible $f$ that all solve the same problem).

By adding extra dimensions, the CNF gains enough freedom to reverse orientation and go beyond ambient isotopy (just as one can pick a polygon up off a desk and flip it over in 3-space, or untie a knot in 4-space), yielding the "augmented neural ODE".

To regularize the flow $f$, the paper [15] proposed the following regularization loss based on optimal transport theory:

$$\int_0^T \left( \lambda_K \left\| f(z_t, t) \right\|^2 + \lambda_J \left\| \nabla_z f(z_t, t) \right\|_F^2 \right) dt$$

where $\lambda_K, \lambda_J > 0$ are hyperparameters and $\|\cdot\|_F$ is the Frobenius norm.

The first term punishes the model for oscillating the flow field over time, and the second term punishes it for oscillating the flow field over space.

Both terms together guide the model into a flow that is smooth (not "bumpy") over space and time.
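The sketch below shows how such penalties could be accumulated along an Euler-discretized trajectory (the norms and the hyperparameters lam_k, lam_j follow the formula above; this is an illustration, not the cited paper's implementation).

```python
import torch

def transport_regularizer(f, z0, T=1.0, steps=100, lam_k=0.01, lam_j=0.01):
    """Accumulate the kinetic-energy and Jacobian-norm penalties along a trajectory."""
    dt = T / steps
    z = z0.clone()
    reg = torch.zeros(())
    for k in range(steps):
        t = torch.tensor(k * dt)
        v = f(z, t)                                              # flow field at (z_t, t)
        J = torch.autograd.functional.jacobian(lambda u: f(u, t), z)
        reg = reg + dt * (lam_k * (v ** 2).sum()                 # ||f||^2: penalizes motion over time
                          + lam_j * (J ** 2).sum())              # ||grad_z f||_F^2: penalizes spatial variation
        z = z + dt * v
    return reg
```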

Despite the success of normalizing flows in estimating high-dimensional densities, some downsides still exist in their designs.

First, the latent space onto which input data is projected is not a lower-dimensional space; therefore, flow-based models do not allow for compression of data by default and require a lot of computation.

Flow-based models are also notorious for failing to estimate the likelihood of out-of-distribution samples (that is, samples that were not drawn from the same distribution as the training set).[21] Some hypotheses have been formulated to explain this phenomenon, among which are the typical set hypothesis,[22] estimation issues when training models,[23] and fundamental issues due to the entropy of the data distributions.[24]

One of the most interesting properties of normalizing flows is the invertibility of their learned bijective map.

The integrity of the inverse is important in order to ensure the applicability of the change-of-variable theorem, the computation of the Jacobian of the map, and sampling with the model.

However, in practice this invertibility is violated and the inverse map explodes because of numerical imprecision.
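A quick way to observe this numerically is to push data through a flow and back again and measure the reconstruction error, as in the sketch below (it reuses the hypothetical AffineCoupling layer from the earlier sketch; deep stacks and large inputs make the floating-point drift more visible).

```python
import torch

def reconstruction_error(layers, x):
    """Max deviation between x and inverse(forward(x)) caused by numerical imprecision."""
    z = x
    for layer in layers:
        z, _ = layer(z)
    x_rec = z
    for layer in reversed(layers):
        x_rec = layer.inverse(x_rec)
    return (x - x_rec).abs().max()

layers = [AffineCoupling(dim=6) for _ in range(32)]
x = 100 * torch.randn(128, 6)
print(reconstruction_error(layers, x))
```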

[Figure: Scheme for normalizing flows]