Why Generative Models

Discriminative models frame the problem as: given an individual input $x$, predict

  • the correct label (classification)
  • the correct score (regression)

They are optimized by maximizing the probability of individual targets, $p_{\theta}(y \mid x)$.

In probabilistic generative models, we model the data jointly, i.e., we want to know the distribution of the data itself. For instance, we want to know how likely $x_a$ is, or whether it is more likely than $x_b$.
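In their standard maximum-likelihood form, the two objectives differ only in what they put a distribution over:

$$ \text{discriminative: } \max_{\theta} \sum_{i} \log p_{\theta}\left(y_{i} \mid x_{i}\right) \quad \text{vs.} \quad \text{generative: } \max_{\theta} \sum_{i} \log p_{\theta}\left(x_{i}\right) $$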

[figure: data-distribution]

Why/when to learn a distribution?

  • Density estimation: estimate the probability of $x$ (see the sketch after this list)
  • Sampling: generate new plausible $x$, e.g., in model-based reinforcement learning
  • Structure/representation learning: learn good features of $x$ without supervision
  • Generative models are widely used to pretrain for downstream tasks
  • Generative models help generalization, e.g., model-based reinforcement learning, semi-supervised learning, simulations
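As a toy sketch of density estimation and sampling, assuming a hypothetical 1-D dataset and a Gaussian model family (both illustrative choices, not the course's setup):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)  # hypothetical 1-D dataset

# Density estimation: the Gaussian MLE is the sample mean and std.
mu, sigma = data.mean(), data.std()

# Is x_a more likely than x_b under the learned distribution?
x_a, x_b = 2.1, 4.0
print(norm.pdf(x_a, loc=mu, scale=sigma) > norm.pdf(x_b, loc=mu, scale=sigma))  # True

# Sampling: draw new plausible x from the learned distribution.
new_x = norm.rvs(loc=mu, scale=sigma, size=5, random_state=0)
```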

The world as a distribution

[figure: world-distribution]

Challenges

  1. We restrict ourselves to parametric models $p_{\theta}$ from a family of models $\mathcal{M}$.
  2. How do we pick the right family of models $\mathcal{M}$?
  3. How do we know which $\theta$ from $\mathcal{M}$ is a good one?
  4. How do we learn/optimize our models from family $\mathcal{M}$?

Properties for modelling distributions

We want to learn distributions $p_{\theta}(x)$.

Our model must therefore have the following properties:

  • Non-negativity: $p_{\theta}(x) \geq 0 \ \forall x$
  • Probabilities of all events must sum up to $1$: $\int_{x} p_{\theta}(x)\, dx = 1$

Summing up to 1 (normalization) makes sure predictions can only improve relative to each other:

  • the model cannot trivially get better scores by predicting higher numbers everywhere
  • the pie remains the same → the model is forced to make non-trivial improvements
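A tiny numpy illustration of why inflating all scores buys nothing once we normalize:

```python
import numpy as np

# Non-negative unnormalized scores for three events.
g = np.array([3.0, 1.0, 4.0])

# Normalize by the total "volume" (here just a sum).
print(g / g.sum())                # [0.375 0.125 0.5  ]

# Predicting uniformly higher numbers changes nothing: the pie is fixed.
print((10 * g) / (10 * g).sum())  # identical probabilities
```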

Easy to obtain non-negativity:

  • consider $g_{\theta}(x) = f_{\theta}^{2}(x)$, where $f_{\theta}$ is a neural network
  • or $g_{\theta}(x) = \exp\left(f_{\theta}(x)\right)$

But these do not sum up to 1. What can we do?

Normalize by the total volume of the function

$$ p_{\theta}(x)=\frac{1}{\text { volume }\left(g_{\theta}\right)} g_{\theta}(x)=\frac{1}{\int_{x} g_{\theta}(x) d x} g_{\theta}(x) $$

In simple words, this is equivalent to normalizing $(3,1,4)$ as $\frac{1}{3+1+4}(3,1,4)$.
Examples:

$g_{\theta=(\mu, \sigma)}(x)=\exp \left(-(x-\mu)^{2} / 2 \sigma^{2}\right) \Rightarrow$ Volume $\left(g_{\theta}\right)=\sqrt{2 \pi \sigma^{2}} \Rightarrow$ Gaussian

$g_{\theta=\lambda}(x)=\exp (-\lambda x)$ for $x \geq 0 \Rightarrow$ Volume $\left(g_{\theta}\right)=\frac{1}{\lambda} \Rightarrow$ Exponential
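The exponential volume, for instance, is a one-line integral over the support $x \geq 0$:

$$ \text{Volume}\left(g_{\lambda}\right)=\int_{0}^{\infty} e^{-\lambda x}\, dx=\left[-\frac{1}{\lambda} e^{-\lambda x}\right]_{0}^{\infty}=\frac{1}{\lambda} $$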

We must find a convenient $g_{\theta}$ for which the integral can be computed analytically. Otherwise we cannot guarantee valid probabilities.
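In low dimensions the volume can also be approximated numerically. A minimal numpy sketch, where the hypothetical f_theta stands in for a neural network:

```python
import numpy as np

# Hypothetical stand-in for a neural network f_theta(x).
def f_theta(x):
    return -0.5 * x**2 + np.sin(3 * x)

# Non-negative unnormalized model: g_theta(x) = exp(f_theta(x)) >= 0.
xs = np.linspace(-10.0, 10.0, 100_001)  # 1-D grid covering the support
dx = xs[1] - xs[0]
g = np.exp(f_theta(xs))

# Approximate volume(g_theta) with a Riemann sum -- feasible only in low dim.
Z = g.sum() * dx

# Normalized density: integrates to ~1 on the grid.
p = g / Z
print(p.sum() * dx)  # ~1.0
```

The grid makes this trivial in one dimension, but the number of grid points grows exponentially with the input dimension, which is exactly the problem discussed next.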

Why is learning a distribution hard?

The integrals mean that learning distributions becomes harder with scale

Think of $300 \times 400$ color images with 256 possible values per color channel.

  • The number of possible images $x$ is $256^{3 \cdot 300 \cdot 400}$ (worked out below)
  • In principle we must assign a probability to all of them
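To get a feel for that number:

$$ 256^{3 \cdot 300 \cdot 400}=\left(2^{8}\right)^{360000}=2^{2880000} \approx 10^{867000} $$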

While it is easy to define a family of models, we are stuck with a $\int_{x} g_{\theta}(x)\, dx$:

  • not always easy to sample (needed for evaluation)
  • not always easy to optimize (needed for training)
  • not always data efficient (long training times)
  • not always sample efficient (many samples needed for accuracy)

Why/when not to learn a distribution?

"One should solve the [classification] problem directly and never solve a more general [and harder] problem as an intermediate step." ~ V. Vapnik, father of SVMs.

Generative models are to be preferred

  • when probabilities are important
  • when you have no human annotations and want to learn features
  • when you want to generalize to (many) downstream tasks
  • when the answer to your question is not: "more data"

If you have a very specific classification task and lots of data

  • no need to make things complicated

Map of generative models

[figure: map of generative models]

References

  1. Lecture 8, UvA DL course 2020