KL Divergence
KL divergence (also called relative entropy) is a measure of how one probability distribution differs from a reference probability distribution.
KL divergence lives in [0, ∞).
Properties
- It is asymmetric, i.e. $\mathrm{KL}(p \| q) \neq \mathrm{KL}(q \| p)$ in general, and thus cannot be used as a distance metric (see the numeric sketch after this list).
- A KL divergence of 0 indicates that the two distributions are identical (almost everywhere); larger values indicate increasingly different behavior, and the divergence is unbounded above.
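A minimal numeric sketch of the asymmetry (plain numpy; the two discrete distributions are made-up examples):

```python
# KL divergence between two discrete distributions; illustrates that
# KL(p || q) != KL(q || p) in general.
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i), treating 0 * log(0) as 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.80, 0.15, 0.05])
q = np.array([0.60, 0.30, 0.10])

print(kl_divergence(p, q))  # ~0.092
print(kl_divergence(q, p))  # ~0.105 -- not equal, hence not a distance metric
```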
Forward and backward KL
Assume $p$ is the true distribution and we want to approximate it with $q$.
Forward KL: $\mathrm{KL}(p \| q)=\int p \log \frac{p}{q} \, dz$
In this case, the approximation will try to avoid placing zero probability mass anywhere that $p>0$, since $q \approx 0$ where $p>0$ makes the KL blow up. It is therefore safer to place non-zero $q$ everywhere that $p$ could plausibly be positive, which leads to overestimating the spread of $p$. This behavior is called zero-avoiding (or mass-covering).
Reverse KL: $\mathrm{KL}(q \| p)=\int q \log \frac{q}{p} \, dz$
In this case, the approximation will try to avoid situations where $q>0$ but $p \approx 0$, so it is safer to lock onto a single mode and underestimate the variance rather than overestimate it and spill mass into regions where $p$ is near zero. This behavior is called zero-forcing (or mode-seeking).
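A minimal numerical sketch of the two behaviors (assuming numpy and scipy are available; the bimodal target, grid, and starting points are arbitrary choices): fit a single Gaussian $q$ to a bimodal $p$ by minimizing each divergence on a grid. Forward KL tends to return a wide Gaussian covering both modes; reverse KL locks onto a single mode with small variance.

```python
# Fit a single Gaussian q to a bimodal mixture p on a grid, minimizing
# forward KL(p||q) vs reverse KL(q||p), to show mass-covering vs mode-seeking.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

z = np.linspace(-10, 10, 2001)
dz = z[1] - z[0]

# Bimodal "true" distribution p: mixture of two well-separated Gaussians.
p = 0.5 * norm.pdf(z, -3, 1) + 0.5 * norm.pdf(z, 3, 1)
p /= p.sum() * dz  # renormalize on the grid

def q_pdf(params):
    mu, log_sigma = params
    return norm.pdf(z, mu, np.exp(log_sigma))

def forward_kl(params):  # KL(p || q) = integral of p * log(p / q)
    q = q_pdf(params) + 1e-300
    return np.sum(p * (np.log(p + 1e-300) - np.log(q))) * dz

def reverse_kl(params):  # KL(q || p) = integral of q * log(q / p)
    q = q_pdf(params) + 1e-300
    return np.sum(q * (np.log(q) - np.log(p + 1e-300))) * dz

fwd = minimize(forward_kl, x0=[1.0, 1.0], method="Nelder-Mead").x
rev = minimize(reverse_kl, x0=[2.0, 0.5], method="Nelder-Mead").x  # start near one mode

print("forward KL fit: mu=%.2f sigma=%.2f" % (fwd[0], np.exp(fwd[1])))  # wide, covers both modes
print("reverse KL fit: mu=%.2f sigma=%.2f" % (rev[0], np.exp(rev[1])))  # narrow, one mode
```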
These two behaviors are visualized in this interactive demo: https://observablehq.com/@stwind/forward-and-reverse-kl-divergences
KL divergence with a unit Gaussian
With a zero-mean, unit-variance Gaussian as the reference distribution, $p=\mathcal{N}(0,1)$, as in the prior of a Variational Autoencoder, we can actually find a closed-form solution for the KL divergence:
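For a Gaussian approximation $q=\mathcal{N}(\mu, \sigma^{2})$ (e.g. the encoder output in a VAE), the standard per-dimension result is

$$\mathrm{KL}\big(\mathcal{N}(\mu, \sigma^{2}) \,\|\, \mathcal{N}(0,1)\big)=\frac{1}{2}\left(\mu^{2}+\sigma^{2}-\log \sigma^{2}-1\right)$$

summed over dimensions for a diagonal Gaussian. A minimal numpy sketch of this term, parameterized by the log-variance as is common in VAE implementations (function name is illustrative):

```python
# Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over dimensions.
import numpy as np

def kl_to_unit_gaussian(mu, log_var):
    """0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2), with log_var = log(sigma^2)."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

print(kl_to_unit_gaussian(np.zeros(4), np.zeros(4)))  # 0.0: q is already N(0, I)
```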
Relationship with MLE and Cross Entropy
Why is KL divergence referenced so much in machine learning? One reason is that it can be shown that maximum likelihood estimation of data under a model is the same as minimizing the KL divergence between the data distribution and the model distribution, i.e. $D_{\mathrm{KL}}(p_{\text{data}} \| p_{\theta})$.
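A sketch of why, writing the expectation under the data distribution:

$$D_{\mathrm{KL}}(p_{\text{data}} \| p_{\theta})=\mathbb{E}_{x \sim p_{\text{data}}}[\log p_{\text{data}}(x)]-\mathbb{E}_{x \sim p_{\text{data}}}[\log p_{\theta}(x)]$$

The first term does not depend on $\theta$, so minimizing the divergence over $\theta$ is equivalent to maximizing $\mathbb{E}_{x \sim p_{\text{data}}}[\log p_{\theta}(x)]$, the expected log-likelihood; replacing the expectation with an average over the training samples gives exactly the MLE objective.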
Minimizing the KL divergence between two distributions also corresponds to minimizing the cross entropy between them, and the two objectives can be used interchangeably whenever the entropy of the target distribution is constant, e.g. classification with hard labels, where it is zero.
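In symbols, cross entropy decomposes as

$$H(p, q)=H(p)+D_{\mathrm{KL}}(p \| q)$$

Since $H(p)$ does not depend on the model $q$, minimizing cross entropy and minimizing the KL divergence have the same minimizer; with hard (one-hot) labels $H(p)=0$, so the two quantities are numerically equal.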
References
- Notes on GAN objective functions by Daniel C Elton http://www.moreisdifferent.com/assets/science_notes/notes_on_GAN_objective_functions.pdf