Why Generative Models

Discriminative models frame the problem as: given an individual input $x$, predict

  • the correct label (classification)
  • the correct score (regression)

They are optimized by maximizing the probability of individual targets, $p_{\theta}(y \mid x)$.

In probabilistic generative models, we model the data jointly, i.e., we want to know the distribution of the data itself. For instance, we want to know how likely $x_a$ is, or whether it is more likely than $x_b$.
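In their standard maximum-likelihood form, the two objectives differ only in what they put a distribution over:

$$ \text{discriminative: } \max_{\theta} \sum_{i} \log p_{\theta}\left(y_{i} \mid x_{i}\right) \quad \text{vs.} \quad \text{generative: } \max_{\theta} \sum_{i} \log p_{\theta}\left(x_{i}\right) $$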

[figure: data-distribution]

Why/when to learn a distribution?

  • Density estimation: estimate the probability of $x$ (see the sketch after this list)
  • Sampling: generate new plausible $x$, e.g., in model-based reinforcement learning
  • Structure/representation learning: learn good features of $x$ without supervision
  • Generative models are widely used to pretrain for downstream tasks
  • Generative models help generalization, e.g., model-based reinforcement learning, semi-supervised learning, simulations
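As a toy sketch of density estimation and sampling, assuming a hypothetical 1-D dataset and a Gaussian model family (both illustrative choices, not the course's setup):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)  # hypothetical 1-D dataset

# Density estimation: the Gaussian MLE is the sample mean and std.
mu, sigma = data.mean(), data.std()

# Is x_a more likely than x_b under the learned distribution?
x_a, x_b = 2.1, 4.0
print(norm.pdf(x_a, loc=mu, scale=sigma) > norm.pdf(x_b, loc=mu, scale=sigma))  # True

# Sampling: draw new plausible x from the learned distribution.
new_x = norm.rvs(loc=mu, scale=sigma, size=5, random_state=0)
```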

The world as a distribution

[figure: world-distribution]

Challenges

  1. We restrict ourselves to parametric models $p_{\theta}$ from a family of models $\mathcal{M}$.
  2. How do we pick the right family of models $\mathcal{M}$?
  3. How do we know which $\theta$ from $\mathcal{M}$ is a good one?
  4. How do we learn/optimize our models from family $\mathcal{M}$?

Properties for modelling distributions

We want to learn distributions $p_{\theta}(x)$.

Our model must therefore have the following properties:

  • Non-negativity: $p_{\theta}(x) \geq 0 \ \forall x$
  • Probabilities of all events must sum up to $1$: $\int_{x} p_{\theta}(x)\, dx = 1$

Summing up to 1 (normalization) makes sure predictions can only improve relative to each other:

  • the model cannot trivially get better scores by predicting higher numbers everywhere
  • the pie remains the same → the model is forced to make non-trivial improvements
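A tiny numpy illustration of why inflating all scores buys nothing once we normalize:

```python
import numpy as np

# Non-negative unnormalized scores for three events.
g = np.array([3.0, 1.0, 4.0])

# Normalize by the total "volume" (here just a sum).
print(g / g.sum())                # [0.375 0.125 0.5  ]

# Predicting uniformly higher numbers changes nothing: the pie is fixed.
print((10 * g) / (10 * g).sum())  # identical probabilities
```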

Easy to obtain non-negativity:

  • consider $g_{\theta}(x) = f_{\theta}^{2}(x)$, where $f_{\theta}$ is a neural network
  • or $g_{\theta}(x) = \exp\left(f_{\theta}(x)\right)$

But these do not sum up to 1. What can we do?

Normalize by the total volume of the function

$$ p_{\theta}(x)=\frac{1}{\text { volume }\left(g_{\theta}\right)} g_{\theta}(x)=\frac{1}{\int_{x} g_{\theta}(x) d x} g_{\theta}(x) $$

In simple words, this is equivalent to normalizing $(3,1,4)$ as $\frac{1}{3+1+4}(3,1,4)$.
Examples:

$g_{\theta=(\mu, \sigma)}(x)=\exp \left(-(x-\mu)^{2} / 2 \sigma^{2}\right) \Rightarrow$ Volume $\left(g_{\theta}\right)=\sqrt{2 \pi \sigma^{2}} \Rightarrow$ Gaussian

$g_{\theta=\lambda}(x)=\exp (-\lambda x)$ for $x \geq 0 \Rightarrow$ Volume $\left(g_{\theta}\right)=\frac{1}{\lambda} \Rightarrow$ Exponential
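The exponential volume, for instance, is a one-line integral over the support $x \geq 0$:

$$ \text{Volume}\left(g_{\lambda}\right)=\int_{0}^{\infty} e^{-\lambda x}\, dx=\left[-\frac{1}{\lambda} e^{-\lambda x}\right]_{0}^{\infty}=\frac{1}{\lambda} $$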

We must find a convenient $g_{\theta}$ for which the integral can be computed analytically. Otherwise we cannot guarantee valid probabilities.
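In low dimensions the volume can also be approximated numerically. A minimal numpy sketch, where the hypothetical f_theta stands in for a neural network:

```python
import numpy as np

# Hypothetical stand-in for a neural network f_theta(x).
def f_theta(x):
    return -0.5 * x**2 + np.sin(3 * x)

# Non-negative unnormalized model: g_theta(x) = exp(f_theta(x)) >= 0.
xs = np.linspace(-10.0, 10.0, 100_001)  # 1-D grid covering the support
dx = xs[1] - xs[0]
g = np.exp(f_theta(xs))

# Approximate volume(g_theta) with a Riemann sum -- feasible only in low dim.
Z = g.sum() * dx

# Normalized density: integrates to ~1 on the grid.
p = g / Z
print(p.sum() * dx)  # ~1.0
```

The grid makes this trivial in one dimension, but the number of grid points grows exponentially with the input dimension, which is exactly the problem discussed next.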

Why is learning a distribution hard?

The integrals mean that learning distributions becomes harder with scale

Think of $300 \times 400$ color images with 256 possible values per color channel.

  • The number of possible images $x$ is $256^{3 \cdot 300 \cdot 400}$ (worked out below)
  • In principle we must assign a probability to all of them
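To get a feel for that number:

$$ 256^{3 \cdot 300 \cdot 400}=\left(2^{8}\right)^{360000}=2^{2880000} \approx 10^{867000} $$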

While it is easy to define a family of models, we are stuck with a $\int_{x} g_{\theta}(x)\, dx$:

  • not always easy to sample (needed for evaluation)
  • not always easy to optimize (needed for training)
  • not always data efficient (long training times)
  • not always sample efficient (many samples needed for accuracy)

Why/when not to learn a distribution?

"One should solve the [classification] problem directly and never solve a more general [and harder] problem as an intermediate step." ~ V. Vapnik, father of SVMs.

Generative models are to be preferred

  • when probabilities are important
  • when you have no human annotations and want to learn features
  • when you want to generalize to (many) downstream tasks
  • when the answer to your question is not: "more data"

If you have a very specific classification task and lots of data

  • no need to make things complicated

Map of generative models

[figure: map of generative models]

References

  1. Lecture 8, UvA DL course 2020