Generative Adversarial Networks
Generative - You can draw novel samples from the model. E.g., you can literally "create" images that never existed
Adversarial - Our generative model $G$ learns adversarially, by fooling a discriminative oracle model $D$.
Network - Typically implemented as a (deep) neural network, which makes it easy to incorporate new modules and to learn via backpropagation.
Architecture
The GAN comprises two neural networks
Generator network $x=G\left(z ; \theta_{G}\right)$
Discriminator network $y=D\left(x ; \theta_{D}\right)=\begin{cases}+1, & \text{if } x \text{ is predicted 'real'} \\ 0, & \text{if } x \text{ is predicted 'fake'}\end{cases}$
Note: there is no 'encoder'. We cannot learn a representation for an image $x$, nor compute the likelihood of a specific data point. At test time we can only generate new data points.
Generator network
- Can be any differentiable neural network
- No invertibility requirement, allowing more flexible modelling
- Trainable for any size of $z$
- Various density functions for the noise variable $z$
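A minimal PyTorch sketch of such a generator; the MLP layer sizes, image size, and the Gaussian noise prior are illustrative assumptions, not prescribed by the text:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """x = G(z; theta_G): maps a noise vector z to a sample x."""
    def __init__(self, z_dim=100, img_dim=28 * 28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256),
            nn.ReLU(),
            nn.Linear(256, img_dim),
            nn.Tanh(),  # pixels scaled to [-1, 1]
        )

    def forward(self, z):
        return self.net(z)

# z can follow any convenient density; a standard Gaussian is common.
G = Generator()
x_fake = G(torch.randn(64, 100))  # 64 novel samples, no encoder involved
```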
Discriminator network
- Can be any differentiable neural network
- Receives as inputs either real images from the training set or generated images from the generator, usually a mix of both in mini-batches
- The discriminator must recognize the real from the fake inputs
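A matching PyTorch sketch of the discriminator, under the same illustrative assumptions as the generator above:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """y = D(x; theta_D) in (0, 1): the probability that x is real."""
    def __init__(self, img_dim=28 * 28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

# A typical mini-batch mixes real and generated images:
D = Discriminator()
x_real = torch.rand(32, 28 * 28)         # stand-in for a real training batch
x_fake = torch.rand(32, 28 * 28)         # stand-in for generator output
scores = D(torch.cat([x_real, x_fake]))  # targets: 1 for real, 0 for fake
```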
Pipeline
Learning objectives
- Not obvious how to use Maximum Likelihood Estimation
- If we take the output of the generator, how do we train the discriminator?
- Even then, how do we know if a completely new $x$ is likely or not? Remember, we have no encoder, so no target to compare against.
Minimax Game
For the simple case of a zero-sum game, the two losses are symmetric: $J^{(G)}=-J^{(D)}$. The lower the generator loss, the higher the discriminator loss. With these symmetric definitions, our learning objective then becomes the minimax game
$$\min_{\theta_{G}} \max_{\theta_{D}} V\left(\theta_{D}, \theta_{G}\right)=\mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)]+\mathbb{E}_{z \sim p_{z}}[\log (1-D(G(z)))]$$
$D(x)=1$ -> The discriminator believes that $x$ is a true image
$D(G(z))=1$ -> The discriminator believes that $G(z)$ is a true image
Learning can stall after a while: as training iterations increase, the discriminator improves and its gradient saturates, $\frac{d J_{D}}{d \theta_{D}} \rightarrow 0$. The generator, which sits before the discriminator in the computation graph, then receives vanishing gradients (see the training-step sketch after the list below).
- Equilibrium is a saddle point of the discriminator loss
- Final loss resembles the Jensen–Shannon divergence
- This allows for easier theoretical analysis
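A minimal PyTorch sketch of one zero-sum training step, assuming the Generator and Discriminator modules sketched earlier (optimizers and batch supplied by the caller); it makes the saturation problem visible in code:

```python
import torch

def minimax_step(G, D, x_real, opt_G, opt_D, z_dim=100, eps=1e-8):
    z = torch.randn(x_real.size(0), z_dim)

    # Discriminator: ascend V = E[log D(x)] + E[log(1 - D(G(z)))].
    opt_D.zero_grad()
    d_loss = -(torch.log(D(x_real) + eps).mean()
               + torch.log(1 - D(G(z).detach()) + eps).mean())
    d_loss.backward()
    opt_D.step()

    # Generator: descend the same V, i.e. minimize E[log(1 - D(G(z)))].
    # When D confidently rejects fakes, this log saturates and
    # dJ/dtheta_G -> 0: the vanishing-gradient problem described above.
    opt_G.zero_grad()
    g_loss = torch.log(1 - D(G(z)) + eps).mean()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```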
Heuristic non-saturating game
This is the most widely used objective.
Discriminator loss
$$J^{(D)}=-\frac{1}{2} \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)]-\frac{1}{2} \mathbb{E}_{z}[\log (1-D(G(z)))]$$
Generator loss
$$J^{(G)}=-\frac{1}{2} \mathbb{E}_{z}[\log D(G(z))]$$
The equilibrium is no longer describable by a single loss
- The discriminator maximizes the log-likelihood of correctly classifying real samples, $\log D(x)$, and fake samples, $\log (1-D(G(z)))$
- The generator maximizes the log-likelihood of the discriminator being wrong about its samples, $\log D(G(z))$. It doesn't care whether $D$ gets confused on real samples.
Heuristically motivated: the generator can still learn even when the discriminator successfully rejects all generator samples.
There are two terms in the above GAN training objective. The first term maximizes the log-probability of the discriminator classifying real-world data as real. The second term maximizes the log-probability of the discriminator classifying generated data as fake.
The generator, on the other hand, minimizes the log-probability of the discriminator being correct.
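In code, the only change relative to the minimax step above is the generator's loss; a minimal sketch under the same assumptions:

```python
import torch

def g_loss_non_saturating(D, G, z, eps=1e-8):
    # Maximize log D(G(z)), i.e. minimize -log D(G(z)).
    # Equivalent to binary cross-entropy against "real" labels, and it
    # gives strong gradients exactly when D rejects the fakes (D(G(z)) ~ 0).
    return -torch.log(D(G(z)) + eps).mean()
```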
Maximum likelihood cost
We can modify the objective for maximum likelihood by keeping the discriminator loss the same as above and running the generator's discriminator output through an inverse sigmoid: $J^{(G)}=-\frac{1}{2} \mathbb{E}_{z}\left[\exp \left(\sigma^{-1}(D(G(z)))\right)\right]$.
In this case, when the discriminator is optimal, $\frac{d J_{D}}{d \theta_{D}} \rightarrow 0$, the generator gradient matches that of maximum likelihood.
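A sketch of this cost, assuming a discriminator variant that exposes its pre-sigmoid logits on generated samples (the tensor name `logits_fake` is a placeholder; $\sigma^{-1}(D(G(z)))$ is exactly that logit):

```python
import torch

def g_loss_maximum_likelihood(logits_fake):
    # logits_fake: the discriminator's pre-sigmoid output on G(z),
    # i.e. sigma^{-1}(D(G(z))).
    # J_G = -E_z[ exp(sigma^{-1}(D(G(z)))) ]
    return -torch.exp(logits_fake).mean()
```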
Comparison of generator losses
Optimal discriminator
Optimal $D(x)$ for any $p_{\text{data}}(x)$ and $p_{\text{model}}(x)$ (denoted $p_{r}$ and $p_{g}$ below) is always
$$D^{*}(x)=\frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_{\text{model}}(x)}$$
Estimating this ratio via supervised learning (the discriminator) is the key idea.
Why is this the optimal discriminator?
$L(D, G)=\int_{x} p_{r}(x) \log D(x)+p_{g}(x) \log (1-D(x)) d x$
- Maximize $L(D, G)$ w.r.t. $D$: set $\frac{d L}{d D}=0$ and ignore the integral (optimize pointwise in $x$)
- The function $y \mapsto a \log y+b \log (1-y)$ attains its maximum on $[0,1]$ at $y=\frac{a}{a+b}$ (derivation below)
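Spelling out that maximization step:
$$\frac{d}{d y}\left(a \log y+b \log (1-y)\right)=\frac{a}{y}-\frac{b}{1-y}=0 \;\Rightarrow\; a(1-y)=b y \;\Rightarrow\; y=\frac{a}{a+b},$$
with $a=p_{r}(\boldsymbol{x})$ and $b=p_{g}(\boldsymbol{x})$ here.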
The optimal discriminator is therefore
$$D^{*}(\boldsymbol{x})=\frac{p_{r}(\boldsymbol{x})}{p_{r}(\boldsymbol{x})+p_{g}(\boldsymbol{x})}$$
And at optimality $p_{g}(\boldsymbol{x}) \rightarrow p_{r}(\boldsymbol{x})$, thus $D^{*}(\boldsymbol{x})=\frac{1}{2}$.
GAN and Jensen-Shannon divergence
Expanding the Jensen–Shannon divergence
$$D_{J S}\left(p_{r} \| p_{g}\right)=\frac{1}{2} D_{K L}\left(p_{r} \,\middle\|\, \frac{p_{r}+p_{g}}{2}\right)+\frac{1}{2} D_{K L}\left(p_{g} \,\middle\|\, \frac{p_{r}+p_{g}}{2}\right)$$
for the optimal discriminator $D^{*}(\boldsymbol{x})=\frac{p_{r}(\boldsymbol{x})}{p_{r}(\boldsymbol{x})+p_{g}(\boldsymbol{x})}$,
it is interesting to see that $L\left(G, D^{*}\right)=2 D_{J S}\left(p_{r} \| p_{g}\right)-2 \log 2$; at the global optimum $G^{*}$ we have $D_{J S}\left(p_{r} \| p_{g}\right)=0$ and hence $L\left(G^{*}, D^{*}\right)=-2 \log 2$.
So GANs are optimizing a rescaled version of the JS divergence.
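A quick numerical check of this identity on toy discrete distributions (the two distributions below are arbitrary choices):

```python
import numpy as np

p_r = np.array([0.5, 0.3, 0.2])  # toy "real" distribution
p_g = np.array([0.2, 0.2, 0.6])  # toy "model" distribution

kl = lambda p, q: np.sum(p * np.log(p / q))
m = 0.5 * (p_r + p_g)
d_js = 0.5 * kl(p_r, m) + 0.5 * kl(p_g, m)

d_star = p_r / (p_r + p_g)  # optimal discriminator, per bin
L = np.sum(p_r * np.log(d_star) + p_g * np.log(1.0 - d_star))

print(np.isclose(L, 2.0 * d_js - 2.0 * np.log(2.0)))  # True
```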
Some believe (Huszar, 2015) that one reason behind GANs' big success is switching the loss function from the asymmetric KL divergence of the traditional maximum-likelihood approach to the symmetric Jensen–Shannon divergence. How?
$D_{K L}\left(p(x) \| q^{*}(x)\right)$ -> forces the model to put high probability everywhere the data occurs
$D_{K L}\left(q^{*}(x) \| p(x)\right)$ -> forces the model to put low probability wherever the data does not occur
The backward KL is 'zero forcing': it makes the learned model "conservative", avoiding areas where $p(x)=0$.
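A toy numerical illustration of the asymmetry (the distributions are made up): a model that collapses onto a single mode is punished heavily by the forward KL but barely by the backward KL.

```python
import numpy as np

eps = 1e-12  # smooth the logs; mass placed on a zero of the other blows up
kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))

p = np.array([0.5, 0.5, 0.0])  # data occupies the first two regions
q = np.array([1.0, 0.0, 0.0])  # conservative model locks onto one mode only

print(kl(p, q))  # forward KL ~ 13.1: q must put mass everywhere data occurs
print(kl(q, p))  # backward KL ~ 0.69: q is only punished where p(x) = 0
```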

Other GAN cost functions
References
- NeurIPS GAN Workshop, 2014
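- Ferenc Huszár, "How (not) to train your generative model: Scheduled sampling, likelihood, adversary?", arXiv:1511.05101, 2015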
- Lecture 10.2, UvA Deep Learning course, 2020
- Lilian Weng, "From GAN to WGAN": https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html#what-is-the-global-optimal
- Jonathan Hui, "Why is it so hard to train GANs": https://jonathan-hui.medium.com/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b
- Jonathan Hui, "Ways to improve GAN performance": https://towardsdatascience.com/gan-ways-to-improve-gan-performance-acf37f9f59b