Generative Adversarial Networks

Generative - You can sample novel input samples; e.g., you can literally "create" images that never existed.
Adversarial - Our generative model $G$ learns adversarially, by trying to fool a discriminative oracle model $D$.
Network - Typically implemented as a (deep) neural network, making it easy to incorporate new modules and to learn via backpropagation.

Architecture

The GAN comprises two neural networks

Generator network $x=G\left(z ; \theta_{G}\right)$

Discriminator network $y=D\left(x ; \theta_{D}\right)=\begin{cases}1, & \text{if } x \text{ is predicted 'real'} \\ 0, & \text{if } x \text{ is predicted 'fake'}\end{cases}$

gan-arch

Note: there is no 'encoder'. We cannot learn a representation for an image $x$. We cannot compute a likelihood of a specific data point. At test time we can only generate new data points.

Generator network

$$ x=G\left(z ; \theta_{G}\right) $$
  • Can be any differentiable neural network
  • No invertibility requirement allowing more flexible modelling
  • Trainable for any size of $z$
  • Various density functions for the noise variable $z$

Discriminator network

$$ \boldsymbol{y}=D\left(\boldsymbol{x} ; \boldsymbol{\theta}_{\mathrm{D}}\right) $$
  • Can be any differentiable neural network
  • Receives as inputs either real images from the training set or generated images from the generator, usually a mix of both in mini-batches
  • The discriminator must recognize the real from the fake inputs
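As a sketch, both networks can be as simple as one-hidden-layer MLPs. The layer sizes, initialization, and noise distribution below are illustrative assumptions, not taken from any particular GAN paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, theta_G):
    """One-hidden-layer MLP G(z; theta_G) mapping noise z to a fake sample x."""
    W1, b1, W2, b2 = theta_G
    h = np.tanh(z @ W1 + b1)
    return np.tanh(h @ W2 + b2)          # fake sample with entries in [-1, 1]

def discriminator(x, theta_D):
    """One-hidden-layer MLP D(x; theta_D) returning P(x is real)."""
    W1, b1, W2, b2 = theta_D
    h = np.tanh(x @ W1 + b1)
    logit = h @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid -> probability in (0, 1)

# Illustrative sizes: 8-dim noise, 16-dim data, 32 hidden units.
z_dim, x_dim, hidden = 8, 16, 32
theta_G = (rng.normal(0, 0.1, (z_dim, hidden)), np.zeros(hidden),
           rng.normal(0, 0.1, (hidden, x_dim)), np.zeros(x_dim))
theta_D = (rng.normal(0, 0.1, (x_dim, hidden)), np.zeros(hidden),
           rng.normal(0, 0.1, (hidden, 1)), np.zeros(1))

z = rng.normal(size=(4, z_dim))          # z ~ p_z (here a Gaussian)
x_fake = generator(z, theta_G)           # mini-batch of 4 generated samples
p_real = discriminator(x_fake, theta_D)  # D's belief that each sample is real
```

In practice a mini-batch mixes `x_fake` with real training samples before it is fed to the discriminator.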

Pipeline

gan-pipeline

Learning objectives

  • Not obvious how to use Maximum Likelihood Estimation
  • If we take the output of the generator, how to train the discriminator?
  • Even then, how do we know if a completely new $x$ is likely or not? Remember, we have no encoder, so no target to compare against.
| Symbol | Meaning | Notes |
| --- | --- | --- |
| $p_z$ | Data distribution over noise input $z$ | Usually, just uniform. |
| $p_g$ | The generator's distribution over data $x$ | |
| $p_r$ | Data distribution over real sample $x$ | |

Minimax Game

For the simple case of a zero-sum game

$$ J_{G}=-J_{D} $$

  • The lower the generator loss, the higher the discriminator loss
  • Symmetric definitions

Our learning objective then becomes

$$ V=-J_{D}\left(\boldsymbol{\theta}_{\mathrm{D}}, \boldsymbol{\theta}_{\mathrm{G}}\right) $$

$D(x)=1$ -> The discriminator believes that $x$ is a true image
$D(G(z))=1$ -> The discriminator believes that $G(z)$ is a true image

Learning stops after a while. As training iterations increase, the discriminator improves and $\frac{d J_{D}}{d \theta_{\mathrm{D}}} \rightarrow 0$. The gradients reaching the generator, which sits before the discriminator in the computation graph, then vanish.

  • Equilibrium is a saddle point of the discriminator loss
  • Final loss resembles Jensen–Shannon divergence
  • This allows for easier theoretical analysis
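The vanishing-gradient problem can be checked numerically. Writing $D(G(z))=\sigma(\ell)$ for logit $\ell$, the minimax generator objective $\log(1-\sigma(\ell))$ has derivative $-\sigma(\ell)$ w.r.t. $\ell$, which goes to zero exactly when the discriminator confidently rejects the sample; the non-saturating objective below keeps a large gradient there. A small illustrative sketch:

```python
import numpy as np

def sigmoid(l):
    return 1.0 / (1.0 + np.exp(-l))

# Logit l behind D(G(z)); very negative l = D confidently says "fake".
logits = np.array([-10.0, -5.0, -1.0, 0.0])

# Saturating (minimax) generator objective: minimize log(1 - sigmoid(l)).
# d/dl log(1 - sigmoid(l)) = -sigmoid(l) -> 0 for very negative l.
grad_saturating = -sigmoid(logits)

# Non-saturating objective: maximize log(sigmoid(l)).
# d/dl log(sigmoid(l)) = 1 - sigmoid(l) -> 1 for very negative l.
grad_nonsaturating = 1.0 - sigmoid(logits)
```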

Heuristic non-saturating game

This is the most widely used objective.
Discriminator loss

$$ J_{D}=-\frac{1}{2} \mathbb{E}_{x \sim p_{\text {data}}} \log D(x)-\frac{1}{2} \mathbb{E}_{z \sim p_{z}} \log (1-D(G(z))) $$

Generator loss

$$ J_{G}=-\frac{1}{2} \mathbb{E}_{z \sim p_{z}} \log D(G(z)) $$

The equilibrium is no longer describable by a single loss

  • The discriminator maximizes the log-likelihood of correctly recognizing real samples, $\log D(x)$, and fake samples, $\log (1-D(G(z)))$
  • The generator maximizes the log-likelihood $\log D(G(z))$ of the discriminator being wrong; it does not care whether $D$ gets confused on real samples

Heuristically motivated: the generator can still learn even when the discriminator successfully rejects all generator samples.
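A minimal sketch of the two non-saturating losses on a toy mini-batch; the probability values below are made up for illustration:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """J_D = -1/2 E[log D(x)] - 1/2 E[log(1 - D(G(z)))]."""
    return -0.5 * np.mean(np.log(d_real)) - 0.5 * np.mean(np.log(1.0 - d_fake))

def generator_loss_nonsaturating(d_fake):
    """J_G = -1/2 E[log D(G(z))]: push D's belief on fakes toward 1."""
    return -0.5 * np.mean(np.log(d_fake))

# Hypothetical discriminator outputs on one mini-batch.
d_real = np.array([0.9, 0.8, 0.95])   # D(x) on real samples
d_fake = np.array([0.1, 0.2, 0.05])   # D(G(z)) on generated samples

jd = discriminator_loss(d_real, d_fake)        # low: D is doing well
jg = generator_loss_nonsaturating(d_fake)      # high: G has work to do
```

Note that an undecided discriminator, $D \equiv 1/2$ everywhere, gives $J_D = \log 2$.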

$$ \min _{G} \max _{D} V(D, G)=\mathbb{E}_{x \sim p_{\text {data }}(x)}[\log D(x)]+\mathbb{E}_{z \sim p_{z}(z)}[\log (1-D(G(z)))] $$

There are two terms in the above GAN training objective. The first term maximizes the log-probability that the discriminator labels real data as real. The second term maximizes the log-probability that the discriminator labels generated data as fake.

The generator, on the other hand, minimizes the log-probability of the discriminator being correct.

gan-schematic


Maximum likelihood cost

We can modify the objective for maximum likelihood by keeping the discriminator loss the same as above and passing the discriminator's output through the inverse sigmoid (logit) inside the generator loss.

$$ J_{G}=-\frac{1}{2} \mathbb{E}_{z} \exp \left(\sigma^{-1}(D(G(z)))\right) $$

In this case, when discriminator is optimal $\frac{d J_{D}}{d \theta_{D}} \rightarrow 0$, the generator gradient matches that of maximum likelihood.
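One way to see why the inverse-sigmoid trick helps: the odds of the optimal discriminator recover the density ratio, $\exp(\sigma^{-1}(D^{*}(x))) = \frac{D^{*}(x)}{1-D^{*}(x)} = \frac{p_r(x)}{p_g(x)}$, which is exactly the quantity a maximum-likelihood gradient needs. A small numeric check with toy density values (the numbers are illustrative):

```python
import numpy as np

def inv_sigmoid(d):
    """sigma^{-1}(d) = log(d / (1 - d)): the logit behind probability d."""
    return np.log(d) - np.log(1.0 - d)

# Toy densities of the real and generator distributions at a few points x.
p_r = np.array([0.4, 0.1, 0.5])
p_g = np.array([0.2, 0.3, 0.5])

d_star = p_r / (p_r + p_g)            # optimal discriminator output D*(x)
ratio = np.exp(inv_sigmoid(d_star))   # exp(logit of D*) = D*/(1 - D*)

# ratio equals the density ratio p_r / p_g pointwise.
```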

Comparison of generator losses

generator-losses

Optimal discriminator

Optimal $D(x)$ for any $p_{\text {data}}(x)$ and $p_{\text {model}}(x)$ is always

$$ D(x)=\frac{p_{\text {data}}(\boldsymbol{x})}{p_{\text {data}}(\boldsymbol{x})+p_{\text {model}}(\boldsymbol{x})} $$

Estimating this ratio with supervised learning (discriminator) is the key.

How is this the optimal discriminator?

$L(D, G)=\int_{x} p_{r}(x) \log D(x)+p_{g}(x) \log (1-D(x)) d x$

  • Maximize $\mathcal{L}(D, G)$ w.r.t. $D \rightarrow \frac{d \mathcal{L}}{d D}=0$ pointwise, ignoring the integral (sample over $x$)
  • The function $x \rightarrow a \log x+b \log (1-x)$ attains its maximum on $[0,1]$ at $x=\frac{a}{a+b}$
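The second point can be verified by brute force: maximizing $a \log d + b \log(1-d)$ over a fine grid of $d$ recovers the closed form $\frac{a}{a+b}$ (the values of $a$ and $b$ below are arbitrary):

```python
import numpy as np

a, b = 0.7, 0.3                            # a = p_r(x), b = p_g(x) at some point x

# Grid search over d in (0, 1), avoiding log(0) at the endpoints.
d = np.linspace(1e-4, 1.0 - 1e-4, 100001)
value = a * np.log(d) + b * np.log(1.0 - d)
d_star = d[np.argmax(value)]               # empirical maximizer

closed_form = a / (a + b)                  # analytical maximizer a/(a+b) = 0.7
```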

The optimal discriminator

$$ D^{*}(x)=\frac{p_{r}(x)}{p_{r}(x)+p_{g}(x)} $$

And at optimality $p_{g}(\boldsymbol{x}) \rightarrow p_{r}(\boldsymbol{x})$, thus

$$ \begin{aligned} & D^{*}(\boldsymbol{x})=\frac{1}{2} \\ & L\left(G^{*}, D^{*}\right)=-2 \log 2 \end{aligned} $$

GAN and Jensen-Shannon divergence

Expanding the Jensen–Shannon Divergence for the optimal discriminator $D^{*}(\boldsymbol{x})=\frac{p_{r}(\boldsymbol{x})}{p_{r}(\boldsymbol{x})+p_{g}(\boldsymbol{x})}$,

$$ \begin{array}{c} D_{J S}\left(p_{r} \| p_{g}\right)=\frac{1}{2} D_{K L}\left(p_{r} \| \frac{p_{r}+p_{g}}{2}\right)+\frac{1}{2} D_{K L}\left(p_{g} \| \frac{p_{r}+p_{g}}{2}\right) \\ =\frac{1}{2}\left(\log 2+\int_{\chi} p_{r}(x) \log \frac{p_{r}(x)}{p_{r}(x)+p_{g}(x)} d x+\log 2+\int_{x} p_{g}(x) \log \frac{p_{g}(x)}{p_{r}(x)+p_{g}(x)} d x\right) \\ =\frac{1}{2}\left(\log 4+L\left(G, D^{*}\right)\right) \end{array} $$

So, it is interesting to see that $L\left(G, D^{*}\right)=2 D_{J S}\left(p_{r} \| p_{g}\right)-2 \log 2$; at the optimum, $L\left(G^{*}, D^{*}\right)=-2\log 2$ implies $D_{J S}\left(p_{r} \| p_{g}\right)=0$.

So GANs are optimizing a rescaled version of the JS divergence.
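This relation is easy to verify numerically on discrete toy distributions (the probability vectors below are arbitrary):

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D_KL(p || q)."""
    return np.sum(p * np.log(p / q))

p_r = np.array([0.5, 0.3, 0.2])       # toy real-data distribution
p_g = np.array([0.2, 0.3, 0.5])       # toy generator distribution
m = 0.5 * (p_r + p_g)                 # mixture (p_r + p_g) / 2

# Jensen-Shannon divergence as the symmetrized KL to the mixture.
js = 0.5 * kl(p_r, m) + 0.5 * kl(p_g, m)

# Value of L(G, D*) with the optimal discriminator plugged in pointwise.
d_star = p_r / (p_r + p_g)
L = np.sum(p_r * np.log(d_star) + p_g * np.log(1.0 - d_star))

# Check: L(G, D*) = 2 * D_JS(p_r || p_g) - 2 log 2.
```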

Some believe (Huszar, 2015) that one reason behind GANs’ big success is switching the loss function from asymmetric KL Divergence in traditional maximum-likelihood approach to symmetric Jensen–Shannon Divergence. How?

$D_{K L}\left(p(x) \| q^{*}(x)\right)$ -> high probability everywhere that the data occurs
$D_{K L}\left(q^{*}(x) \| p(x)\right)$ -> low probability wherever the data does not occur

Backward (reverse) KL is 'zero forcing': it makes the learned model "conservative", avoiding regions where $p(x)=0$.
kl-backward-forward
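A discrete toy example of this asymmetry, assuming a bimodal target $p$ and two made-up candidate models $q$: forward KL prefers the mass-covering model, while reverse KL prefers the mode-seeking (zero-forcing) one:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D_KL(p || q)."""
    return np.sum(p * np.log(p / q))

eps = 1e-12                                    # stand-in for exact zeros
# Bimodal "data" distribution p over 4 states; p(x) ~ 0 on the middle two.
p = np.array([0.5, eps, eps, 0.5])

# Two candidate models q (values made up for illustration):
q_cover = np.array([0.25, 0.25, 0.25, 0.25])   # spreads mass everywhere
q_mode  = np.array([0.99, eps, eps, 0.01])     # commits to one mode

# Forward KL D_KL(p || q) punishes q for missing data regions -> prefers q_cover.
# Reverse KL D_KL(q || p) punishes q for putting mass where p = 0 -> prefers q_mode.
```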

Other GAN cost functions

gan_cost functions

References

  1. NeurIPS GAN Workshop, 2014
  2. Lecture 10.2, UvA DL course 2020
  3. Lilian Weng's post on GANs https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html#what-is-the-global-optimal
  4. Why is it so hard to train GANs by Jonathan Hui https://jonathan-hui.medium.com/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b
  5. Ways to improve GAN performance by Jonathan Hui https://towardsdatascience.com/gan-ways-to-improve-gan-performance-acf37f9f59b