The Wasserstein Generative Adversarial Network (WGAN) is a variant of generative adversarial network (GAN) proposed in 2017 that aims to "improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches".
[1][2] Compared with the original GAN discriminator, the Wasserstein GAN discriminator provides a better learning signal to the generator.
This allows the training to be more stable when the generator is learning distributions in very high-dimensional spaces.
The original GAN method is based on the GAN game, a zero-sum game with 2 players: generator and discriminator.
The game is defined over a probability space $(\Omega, \mathcal{B}, \mu_{\text{ref}})$. The generator's strategy set is the set of all probability measures $\mu_G$ on $(\Omega, \mathcal{B})$, and the discriminator's strategy set is the set of measurable functions $D : \Omega \to [0, 1]$. The objective of the game is
$$L(\mu_G, D) := \mathbb{E}_{x \sim \mu_{\text{ref}}}[\ln D(x)] + \mathbb{E}_{x \sim \mu_G}[\ln(1 - D(x))],$$
which the generator aims to minimize and the discriminator aims to maximize.
A basic theorem of the GAN game states:

Theorem (the optimal discriminator computes the Jensen–Shannon divergence) — For any fixed generator strategy $\mu_G$, let the optimal reply be $D^* = \arg\max_D L(\mu_G, D)$. Then
$$D^*(x) = \frac{d\mu_{\text{ref}}}{d(\mu_{\text{ref}} + \mu_G)}(x), \qquad L(\mu_G, D^*) = 2\, D_{JS}(\mu_{\text{ref}} \,\|\, \mu_G) - 2\ln 2,$$
where $D_{JS}$ is the Jensen–Shannon divergence.
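When $\mu_{\text{ref}}$ and $\mu_G$ have densities $p_{\text{ref}}$ and $p_G$, the optimal reply can be derived pointwise: for each $x$, the discriminator chooses $d = D(x) \in [0, 1]$ to maximize
$$p_{\text{ref}}(x) \ln d + p_G(x) \ln(1 - d),$$
which gives $D^*(x) = \frac{p_{\text{ref}}(x)}{p_{\text{ref}}(x) + p_G(x)}$, the density form of the Radon–Nikodym derivative above.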
As the generator's strategy $\mu_G$ changes, the discriminator must adapt by approaching the ideal $D^*(x) = \frac{d\mu_{\text{ref}}}{d(\mu_{\text{ref}} + \mu_G)}(x)$.
[note 1] Concretely, in the GAN game, let us fix a generator $\mu_G$ and train the discriminator to (near-)optimality; the value of the game then approaches $2\, D_{JS}(\mu_{\text{ref}} \,\|\, \mu_G) - 2\ln 2$, so the only way for the generator to improve further is to decrease the Jensen–Shannon divergence.
Thus, we see that the point of the discriminator is mainly to act as a critic providing feedback to the generator, about "how far it is from perfection", where "far" is measured by the Jensen–Shannon divergence.
Naturally, this raises the possibility of using a different criterion of farness.
There are many possible divergences to choose from, such as the f-divergence family, which would give the f-GAN.
[3] The Wasserstein GAN is obtained by using the Wasserstein metric, which satisfies a "dual representation theorem" that renders it highly efficient to compute:

Theorem (Kantorovich–Rubinstein duality) — When the probability space $\Omega$ is a metric space, then for any fixed $K > 0$,
$$W_1(\mu, \nu) = \frac{1}{K} \sup_{\|f\|_L \le K} \left( \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{y \sim \nu}[f(y)] \right),$$
where $\|f\|_L$ denotes the Lipschitz norm of $f$.
A proof can be found in the main page on Wasserstein metric.
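For intuition, consider the simplest case: two point masses $\mu = \delta_a$ and $\nu = \delta_b$ on the real line, with $K = 1$. Every 1-Lipschitz $f$ satisfies $f(a) - f(b) \le |a - b|$, and $f(x) = |x - b|$ attains the bound, so
$$W_1(\delta_a, \delta_b) = \sup_{\|f\|_L \le 1} \big( f(a) - f(b) \big) = |a - b|,$$
exactly the cost of transporting one unit of mass from $a$ to $b$.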
By the Kantorovich–Rubinstein duality, the definition of the Wasserstein GAN is clear:

A Wasserstein GAN game is defined by a probability space $(\Omega, \mathcal{B}, \mu_{\text{ref}})$, where $\Omega$ is a metric space, and a constant $K > 0$.
There are 2 players: generator and discriminator (also called "critic"). The generator's strategy set is the set of all probability measures $\mu_G$ on $(\Omega, \mathcal{B})$, and the discriminator's strategy set is the set of measurable functions $D : \Omega \to \mathbb{R}$ with bounded Lipschitz norm $\|D\|_L \le K$. The objective of the game is
$$L_{\text{WGAN}}(\mu_G, D) := \mathbb{E}_{x \sim \mu_{\text{ref}}}[D(x)] - \mathbb{E}_{x \sim \mu_G}[D(x)],$$
which the generator aims to minimize and the discriminator aims to maximize. By the duality theorem, when the discriminator plays optimally the value of the game is $K \cdot W_1(\mu_{\text{ref}}, \mu_G)$. Consequently, if the discriminator is good, the generator would be constantly pushed to minimize $W_1(\mu_{\text{ref}}, \mu_G)$, and the optimal strategy for the generator is just $\mu_G = \mu_{\text{ref}}$.
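This objective translates directly into the losses used in practice. The following is a minimal PyTorch-style sketch, assuming a critic module that maps a batch of samples to real-valued scores; the function names are illustrative, not from the original paper:

```python
import torch

def critic_loss(critic, real_batch, fake_batch):
    # The critic maximizes E[D(real)] - E[D(fake)], so we minimize its negation.
    return -(critic(real_batch).mean() - critic(fake_batch).mean())

def generator_loss(critic, fake_batch):
    # The generator minimizes -E[D(fake)], i.e. it pushes its samples toward
    # regions that the critic scores highly.
    return -critic(fake_batch).mean()
```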
In the Wasserstein GAN game, the discriminator provides a better gradient to the generator than in the GAN game. Consider for example a game on the real line where both $\mu_G$ and $\mu_{\text{ref}}$ are Gaussian; the optimal Wasserstein GAN discriminator and the optimal GAN discriminator can then be computed and plotted explicitly. For a fixed discriminator, the generator needs to minimize the following objectives: for GAN, $\mathbb{E}_{x \sim \mu_G}[\ln(1 - D(x))]$; for Wasserstein GAN, $-\mathbb{E}_{x \sim \mu_G}[D(x)]$.[note 2] Plotting both objectives as $\mu_G$ slides along the real line, the generator in GAN is motivated to let its $\mu_G$ slide down the slope of its objective toward $\mu_{\text{ref}}$; similarly for the generator in Wasserstein GAN. However, the GAN objective flattens out once $\mu_G$ is far from $\mu_{\text{ref}}$, so the gradient it supplies to the generator vanishes, whereas the Wasserstein GAN objective keeps a roughly constant slope. This vanishing-gradient problem is much more severe in actual machine learning situations.
Consider training a GAN to generate images from ImageNet, a collection of photos of size 256-by-256. The space of all such photos is $\mathbb{R}^{256 \times 256 \times 3}$, but the distribution of ImageNet photos, $\mu_{\text{ref}}$, concentrates on a manifold of much lower dimension in it. Consequently, the generator's distribution $\mu_G$ will almost surely be essentially disjoint from $\mu_{\text{ref}}$; in that situation the Jensen–Shannon divergence is stuck at its maximal value and provides no gradient, while the Wasserstein distance still decreases smoothly as the two distributions move closer.
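A simple one-dimensional special case illustrates the point. Suppose $\mu_{\text{ref}} = \delta_0$ and $\mu_G = \delta_\theta$ are point masses with $\theta \neq 0$, so their supports are disjoint. Then
$$D_{JS}(\mu_{\text{ref}} \,\|\, \mu_G) = \ln 2 \quad \text{for all } \theta \neq 0, \qquad W_1(\mu_{\text{ref}}, \mu_G) = |\theta|,$$
so the Jensen–Shannon divergence gives the generator no gradient with respect to $\theta$, while the Wasserstein distance still tells it which way to move and by how much.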
[4] Training the generator in Wasserstein GAN is just gradient descent, the same as in GAN (or most deep learning methods), but training the discriminator is different, as the discriminator is now restricted to have bounded Lipschitz norm.
One way is to clamp every entry of each weight matrix of the discriminator to an interval $[-c, c]$ after each gradient update, which bounds the Lipschitz norm of each layer and hence of the whole network. This is the weight clipping method, proposed by the original paper.
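A minimal sketch of weight clipping in PyTorch style (the function name is illustrative; $c = 0.01$ is the clipping constant used in the original paper's experiments):

```python
import torch

def clip_critic_weights(critic: torch.nn.Module, c: float = 0.01) -> None:
    # Clamp every parameter of the critic into [-c, c] after each update,
    # which crudely bounds the Lipschitz norm of the network.
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)
```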
Another approach controls the spectral norm of each weight matrix. For a discriminator implemented as a composition of linear maps $x \mapsto W_i x$ and 1-Lipschitz activation functions, $\|D\|_L \le \prod_i \|W_i\|_s$, where $\|W_i\|_s$ is the spectral norm (largest singular value) of $W_i$, so it suffices to control each $\|W_i\|_s$. The spectral norm can be efficiently computed by the following power-iteration algorithm: INPUT matrix $W$ and an initial guess $x$; iterate $x \mapsto \frac{W^T W x}{\|W^T W x\|_2}$ until it converges to $x^*$; then $\|W\|_s \approx \|W x^*\|_2$.
By reassigning $W_i \leftarrow \frac{W_i}{\|W_i\|_s}$ after each update of the discriminator, we can upper bound each $\|W_i\|_s$ by $1$, and therefore $\|D\|_L \le 1$. This is the spectral normalization method.
The algorithm can be further accelerated by memoization: at step $t$, store the converged vector $x_i^*(t)$ for each weight matrix $W_i$; at step $t + 1$, use $x_i^*(t)$ as the initial guess. Since $W_i(t+1)$ changes only slightly from $W_i(t)$, a single power-iteration step per update usually suffices.
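A minimal NumPy-style sketch of this power iteration with a memoized vector (names are illustrative; library implementations such as PyTorch's `torch.nn.utils.spectral_norm` follow the same idea):

```python
import numpy as np

def spectral_norm_estimate(W, x=None, n_iters=1):
    # Estimate the spectral norm (largest singular value) of W by power iteration.
    # Passing the vector returned by the previous call as `x` is the memoization
    # described above: one iteration per training step is then usually enough.
    if x is None:
        x = np.random.randn(W.shape[1])
    for _ in range(n_iters):
        x = W.T @ (W @ x)
        x = x / np.linalg.norm(x)
    sigma = np.linalg.norm(W @ x)  # approximates the largest singular value
    return sigma, x

# Illustrative use inside a training loop:
# sigma, x_star = spectral_norm_estimate(W, x_star)  # reuse x_star across steps
# W = W / sigma                                      # spectral normalization
```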
A third option is the gradient penalty method: instead of strictly bounding the Lipschitz norm of the discriminator, we can simply add a "gradient penalty" term for the discriminator, of the form
$$\mathbb{E}_{x \sim \hat{\mu}}\left[ \left( \|\nabla_x D(x)\|_2 - a \right)^2 \right],$$
where $\hat{\mu}$ is a fixed distribution used to estimate how much the discriminator has violated the Lipschitz norm requirement, and $a$ is the target gradient norm (typically $a = 1$).
The discriminator, in attempting to minimize the new loss function, would naturally bring $\|\nabla_x D(x)\|_2$ close to $a$ on samples from $\hat{\mu}$, thereby keeping its Lipschitz norm close to $a$ in the regions that matter.
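A sketch of the penalty computation in PyTorch style, assuming $\hat{\mu}$ is taken as interpolations between real and generated samples (the common WGAN-GP recipe); function and variable names are illustrative:

```python
import torch

def gradient_penalty(critic, real_batch, fake_batch, a=1.0):
    # Sample x ~ mu_hat by interpolating between real and generated samples.
    eps = torch.rand(real_batch.size(0), *([1] * (real_batch.dim() - 1)),
                     device=real_batch.device)
    x_hat = (eps * real_batch.detach()
             + (1 - eps) * fake_batch.detach()).requires_grad_(True)

    scores = critic(x_hat)
    grads = torch.autograd.grad(scores.sum(), x_hat, create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)

    # Penalize deviation of the per-sample gradient norm from the target a.
    return ((grad_norm - a) ** 2).mean()

# Critic loss with penalty (lambda_gp is an illustrative weighting coefficient):
# loss_D = -(critic(real).mean() - critic(fake).mean()) \
#          + lambda_gp * gradient_penalty(critic, real, fake)
```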