Generative Adversarial Networks
Generative - You can draw novel samples from the model. E.g., you can literally "create" images that never existed
Adversarial - Our generative model $G$ learns adversarially, by fooling a discriminative oracle model $D$.
Network - Typically implemented as a (deep) neural network, which makes it easy to incorporate new modules and to learn via backpropagation.
Architecture
The GAN comprises two neural networks
Generator network $x=G\left(z ; \theta_{G}\right)$
Discriminator network $y=D\left(x ; \theta_{D}\right)=\begin{cases}+1, & \text{if } x \text{ is predicted 'real'} \\ 0, & \text{if } x \text{ is predicted 'fake'}\end{cases}$
Note: there is no 'encoder'. We cannot learn a representation for an image $x$, nor compute the likelihood of a specific data point. At test time we can only generate new data points.
Generator network
- Can be any differentiable neural network
- No invertibility requirement, allowing more flexible modelling
- Trainable for any size of $z$
- Various density functions for the noise variable $z$
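A minimal PyTorch sketch of such a generator; the MLP layer sizes, image size, and the Gaussian noise prior are illustrative assumptions, not prescribed by the text:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """x = G(z; theta_G): maps a noise vector z to a sample x."""
    def __init__(self, z_dim=100, img_dim=28 * 28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256),
            nn.ReLU(),
            nn.Linear(256, img_dim),
            nn.Tanh(),  # pixels scaled to [-1, 1]
        )

    def forward(self, z):
        return self.net(z)

# z can follow any convenient density; a standard Gaussian is common.
G = Generator()
x_fake = G(torch.randn(64, 100))  # 64 novel samples, no encoder involved
```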
Discriminator network
- Can be any differentiable neural network
- Receives as inputs either real images from the training set or generated images from the generator, usually a mix of both in mini-batches
- The discriminator must recognize the real from the fake inputs
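A matching PyTorch sketch of the discriminator, under the same illustrative assumptions as the generator above:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """y = D(x; theta_D) in (0, 1): the probability that x is real."""
    def __init__(self, img_dim=28 * 28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

# A typical mini-batch mixes real and generated images:
D = Discriminator()
x_real = torch.rand(32, 28 * 28)         # stand-in for a real training batch
x_fake = torch.rand(32, 28 * 28)         # stand-in for generator output
scores = D(torch.cat([x_real, x_fake]))  # targets: 1 for real, 0 for fake
```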
Pipeline
Learning objectives
- Not obvious how to use Maximum Likelihood Estimation
- If we take the output of the generator, how do we train the discriminator?
- Even then, how do we know if a completely new $x$ is likely or not? Remember, we have no encoder, so no target to compare against.
Minimax Game
For the simple case of a zero-sum game, the two losses are symmetric: $J^{(G)}=-J^{(D)}$. The lower the generator loss, the higher the discriminator loss. With these symmetric definitions, our learning objective then becomes the minimax game
$$\min_{\theta_{G}} \max_{\theta_{D}} V\left(\theta_{D}, \theta_{G}\right)=\mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)]+\mathbb{E}_{z \sim p_{z}}[\log (1-D(G(z)))]$$
$D(x)=1$ -> The discriminator believes that $x$ is a true image
$D(G(z))=1$ -> The discriminator believes that $G(z)$ is a true image
Learning can stall after a while: as training iterations increase, the discriminator improves and its gradient saturates, $\frac{d J_{D}}{d \theta_{D}} \rightarrow 0$. The generator, which sits before the discriminator in the computation graph, then receives vanishing gradients (see the training-step sketch after the list below).
- Equilibrium is a saddle point of the discriminator loss
- Final loss resembles the Jensen–Shannon divergence
- This allows for easier theoretical analysis
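A minimal PyTorch sketch of one zero-sum training step, assuming the Generator and Discriminator modules sketched earlier (optimizers and batch supplied by the caller); it makes the saturation problem visible in code:

```python
import torch

def minimax_step(G, D, x_real, opt_G, opt_D, z_dim=100, eps=1e-8):
    z = torch.randn(x_real.size(0), z_dim)

    # Discriminator: ascend V = E[log D(x)] + E[log(1 - D(G(z)))].
    opt_D.zero_grad()
    d_loss = -(torch.log(D(x_real) + eps).mean()
               + torch.log(1 - D(G(z).detach()) + eps).mean())
    d_loss.backward()
    opt_D.step()

    # Generator: descend the same V, i.e. minimize E[log(1 - D(G(z)))].
    # When D confidently rejects fakes, this log saturates and
    # dJ/dtheta_G -> 0: the vanishing-gradient problem described above.
    opt_G.zero_grad()
    g_loss = torch.log(1 - D(G(z)) + eps).mean()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```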
Heuristic non-saturating game
This is the most widely used objective.
Discriminator loss
$$J^{(D)}=-\frac{1}{2} \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)]-\frac{1}{2} \mathbb{E}_{z}[\log (1-D(G(z)))]$$
Generator loss
$$J^{(G)}=-\frac{1}{2} \mathbb{E}_{z}[\log D(G(z))]$$
The equilibrium is no longer describable by a single loss
- The discriminator maximizes the log-likelihood of correctly classifying real samples, $\log D(x)$, and fake samples, $\log (1-D(G(z)))$
- The generator maximizes the log-likelihood of the discriminator being wrong about its samples, $\log D(G(z))$. It doesn't care whether $D$ gets confused on real samples.
Heuristically motivated: the generator can still learn even when the discriminator successfully rejects all generator samples.
There are two terms in the above GAN training objective. The first term maximizes the log-probability of the discriminator classifying real-world data as real. The second term maximizes the log-probability of the discriminator classifying generated data as fake.
The generator, on the other hand, minimizes the log-probability of the discriminator being correct.
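In code, the only change relative to the minimax step above is the generator's loss; a minimal sketch under the same assumptions:

```python
import torch

def g_loss_non_saturating(D, G, z, eps=1e-8):
    # Maximize log D(G(z)), i.e. minimize -log D(G(z)).
    # Equivalent to binary cross-entropy against "real" labels, and it
    # gives strong gradients exactly when D rejects the fakes (D(G(z)) ~ 0).
    return -torch.log(D(G(z)) + eps).mean()
```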
Maximum likelihood cost
We can modify the objective for maximum likelihood by keeping the discriminator loss the same as above and running the generator's discriminator output through an inverse sigmoid: $J^{(G)}=-\frac{1}{2} \mathbb{E}_{z}\left[\exp \left(\sigma^{-1}(D(G(z)))\right)\right]$.
In this case, when the discriminator is optimal, $\frac{d J_{D}}{d \theta_{D}} \rightarrow 0$, the generator gradient matches that of maximum likelihood.
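A sketch of this cost, assuming a discriminator variant that exposes its pre-sigmoid logits on generated samples (the tensor name `logits_fake` is a placeholder; $\sigma^{-1}(D(G(z)))$ is exactly that logit):

```python
import torch

def g_loss_maximum_likelihood(logits_fake):
    # logits_fake: the discriminator's pre-sigmoid output on G(z),
    # i.e. sigma^{-1}(D(G(z))).
    # J_G = -E_z[ exp(sigma^{-1}(D(G(z)))) ]
    return -torch.exp(logits_fake).mean()
```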
Comparison of generator losses
Optimal discriminator
Optimal $D(x)$ for any $p_{\text{data}}(x)$ and $p_{\text{model}}(x)$ (denoted $p_{r}$ and $p_{g}$ below) is always
$$D^{*}(x)=\frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_{\text{model}}(x)}$$
Estimating this ratio via supervised learning (the discriminator) is the key idea.
Why is this the optimal discriminator?
$L(D, G)=\int_{x} p_{r}(x) \log D(x)+p_{g}(x) \log (1-D(x)) d x$
- Maximize $L(D, G)$ w.r.t. $D$: set $\frac{d L}{d D}=0$ and ignore the integral (optimize pointwise in $x$)
- The function $y \mapsto a \log y+b \log (1-y)$ attains its maximum on $[0,1]$ at $y=\frac{a}{a+b}$ (derivation below)
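Spelling out that maximization step:
$$\frac{d}{d y}\left(a \log y+b \log (1-y)\right)=\frac{a}{y}-\frac{b}{1-y}=0 \;\Rightarrow\; a(1-y)=b y \;\Rightarrow\; y=\frac{a}{a+b},$$
with $a=p_{r}(\boldsymbol{x})$ and $b=p_{g}(\boldsymbol{x})$ here.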
The optimal discriminator is therefore
$$D^{*}(\boldsymbol{x})=\frac{p_{r}(\boldsymbol{x})}{p_{r}(\boldsymbol{x})+p_{g}(\boldsymbol{x})}$$
And at optimality $p_{g}(\boldsymbol{x}) \rightarrow p_{r}(\boldsymbol{x})$, thus $D^{*}(\boldsymbol{x})=\frac{1}{2}$.
GAN and Jensen-Shannon divergence
Expanding the Jensen–Shannon divergence
$$D_{J S}\left(p_{r} \| p_{g}\right)=\frac{1}{2} D_{K L}\left(p_{r} \,\middle\|\, \frac{p_{r}+p_{g}}{2}\right)+\frac{1}{2} D_{K L}\left(p_{g} \,\middle\|\, \frac{p_{r}+p_{g}}{2}\right)$$
for the optimal discriminator $D^{*}(\boldsymbol{x})=\frac{p_{r}(\boldsymbol{x})}{p_{r}(\boldsymbol{x})+p_{g}(\boldsymbol{x})}$,
it is interesting to see that $L\left(G, D^{*}\right)=2 D_{J S}\left(p_{r} \| p_{g}\right)-2 \log 2$; at the global optimum $G^{*}$ we have $D_{J S}\left(p_{r} \| p_{g}\right)=0$ and hence $L\left(G^{*}, D^{*}\right)=-2 \log 2$.
So GANs are optimizing a rescaled version of the JS divergence.
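A quick numerical check of this identity on toy discrete distributions (the two distributions below are arbitrary choices):

```python
import numpy as np

p_r = np.array([0.5, 0.3, 0.2])  # toy "real" distribution
p_g = np.array([0.2, 0.2, 0.6])  # toy "model" distribution

kl = lambda p, q: np.sum(p * np.log(p / q))
m = 0.5 * (p_r + p_g)
d_js = 0.5 * kl(p_r, m) + 0.5 * kl(p_g, m)

d_star = p_r / (p_r + p_g)  # optimal discriminator, per bin
L = np.sum(p_r * np.log(d_star) + p_g * np.log(1.0 - d_star))

print(np.isclose(L, 2.0 * d_js - 2.0 * np.log(2.0)))  # True
```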
Some believe (Huszar, 2015) that one reason behind GANs' big success is switching the loss function from the asymmetric KL divergence of the traditional maximum-likelihood approach to the symmetric Jensen–Shannon divergence. How?
$D_{K L}\left(p(x) \| q^{*}(x)\right)$ -> forces the model to put high probability everywhere the data occurs
$D_{K L}\left(q^{*}(x) \| p(x)\right)$ -> forces the model to put low probability wherever the data does not occur
The backward KL is 'zero forcing': it makes the learned model "conservative", avoiding areas where $p(x)=0$.
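A toy numerical illustration of the asymmetry (the distributions are made up): a model that collapses onto a single mode is punished heavily by the forward KL but barely by the backward KL.

```python
import numpy as np

eps = 1e-12  # smooth the logs; mass placed on a zero of the other blows up
kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))

p = np.array([0.5, 0.5, 0.0])  # data occupies the first two regions
q = np.array([1.0, 0.0, 0.0])  # conservative model locks onto one mode only

print(kl(p, q))  # forward KL ~ 13.1: q must put mass everywhere data occurs
print(kl(q, p))  # backward KL ~ 0.69: q is only punished where p(x) = 0
```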

Other GAN cost functions
References
- NeurIPS GAN Workshop, 2014
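- Ferenc Huszár, "How (not) to train your generative model: Scheduled sampling, likelihood, adversary?", arXiv:1511.05101, 2015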
- Lecture 10.2, UvA Deep Learning course, 2020
- Lilian Weng, "From GAN to WGAN": https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html#what-is-the-global-optimal
- Jonathan Hui, "Why is it so hard to train GANs": https://jonathan-hui.medium.com/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b
- Jonathan Hui, "Ways to improve GAN performance": https://towardsdatascience.com/gan-ways-to-improve-gan-performance-acf37f9f59b