Normalizing Flows
Often a simple posterior approximation is not enough. Imagine that the data is generated from two modes; approximating it with a single standard Gaussian would then be problematic.
If we apply a transformation to a simple input density, e.g., a Gaussian, we can morph it into a more complicated density. The key idea behind normalizing flows is that by composing many such transformations, we can model any complex density.
Change of variables
For 1-d variables we know that
$$\int f(u)\, d u=\int f(g(x))\, g^{\prime}(x)\, d x, \quad u=g(x)$$
This is called change of variables (or integration by substitution), where $d u=g^{\prime}(x)\, d x$.
For multivariate cases
$$p_{x}(x)=p_{u}(u)\left|\operatorname{det} \frac{\partial u}{\partial x}\right|, \quad u=g(x)$$
Note that both $u$ and $x$ have the same dimensionality. The magnitude of the determinant of the Jacobian tells us how much the volume of the input probability space expands or contracts as a result of the transformation. The volumes must change in a compensating way so that $\int p(z)\, d z=1$ and $\int p(x)\, d x=1$ both hold. In this sense, normalizing flows expand or contract the density.
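As a sanity check, the multivariate formula can be verified numerically on an invertible affine map. The matrix `A`, offset `b`, and test point `x` below are illustrative values, not from the notes:

```python
import numpy as np

# Sketch: check p_x(x) = p_u(u) |det du/dx| for an affine map u = A x + b,
# with a standard Gaussian density on u. A, b, x are illustrative values.
def gauss_pdf(v):
    """Standard multivariate normal density at v."""
    d = v.shape[0]
    return np.exp(-0.5 * v @ v) / (2.0 * np.pi) ** (d / 2)

A = np.array([[2.0, 0.5], [0.0, 1.5]])   # invertible, det = 3
b = np.array([1.0, -1.0])
x = np.array([0.3, -0.7])

u = A @ x + b
p_x = gauss_pdf(u) * abs(np.linalg.det(A))   # induced density on x
```

Because the map is affine, `p_x` agrees exactly with the density of the Gaussian that `x` follows when `u` is standard normal.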
Normalizing flows model
Our model is an encoder $f: x \rightarrow z$. It maps the input $x$ with density $p(x)$ to the latent $z$ with density $p(z)$.
The inverse model is the decoder $f^{-1}: z \rightarrow x$. It maps the latent $z$ back to the input $x$.
For our forward model we have
$$p_{x}(x)=p_{z}(f(x))\left|\operatorname{det} \frac{\partial f(x)}{\partial x}\right|$$
Thus we have an explicit connection between the density of the input and the density of the latent. This allows us to compute the exact data likelihood.
Stacking normalizing flows
The change of variables can be applied recursively: with $z_{0} \sim p_{0}$ and $z_{k}=f_{k}(z_{k-1})$ for $k=1, \ldots, K$, the data is $x=z_{K}$.
The log density of our data is
$$\log p(x)=\log p_{0}(z_{0})-\sum_{k=1}^{K} \log \left|\operatorname{det} \frac{\partial f_{k}}{\partial z_{k-1}}\right|$$
which we can optimize easily with maximum likelihood estimation.
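The stacked log-likelihood can be sketched for a chain of 1-d affine flows. The `(a_k, c_k)` parameter values below are illustrative, not learned:

```python
import numpy as np

# Sketch: exact log-likelihood through a stack of K invertible 1-d affine
# flows f_k(z) = a_k * z + c_k. The (a_k, c_k) values are illustrative.
flows = [(2.0, 1.0), (0.5, -3.0), (1.5, 0.25)]

def log_prob(x):
    z = x
    log_det_sum = 0.0
    for a, c in reversed(flows):           # invert the last flow first
        z = (z - c) / a                    # f_k^{-1}
        log_det_sum += np.log(abs(a))      # log |det df_k / dz_{k-1}|
    # log p(x) = log p_0(z_0) - sum_k log |det df_k / dz_{k-1}|
    log_base = -0.5 * z ** 2 - 0.5 * np.log(2.0 * np.pi)
    return log_base - log_det_sum
```

In a real model the parameters would be trained by maximizing `log_prob` over the data.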
We are guaranteed that, in the limit of enough transformations, the two distributions will match.
Transformations
We want smooth, differentiable transformations $f_{k}$
- for which the inverse $f_{k}^{-1}$ is easy to compute,
- and for which the determinant of the Jacobian $\operatorname{det} \frac{\partial f_{k}}{\partial z_{k-1}}$ is easy to compute.
Example transformations
Planar flows
Radial flows
Coupling layers
Planar flow
$$f(z)=z+u\, h\left(w^{\top} z+b\right)$$
$u, w, b$ are free parameters and $h$ is an element-wise non-linearity (element-wise so that it is easy to invert).
The log-determinant of the Jacobian is
$$\log \left|\operatorname{det} \frac{\partial f}{\partial z}\right|=\log \left|1+u^{\top} \psi(z)\right|, \quad \psi(z)=h^{\prime}\left(w^{\top} z+b\right) w$$
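A minimal sketch of a planar flow with $h=\tanh$; the parameter values are chosen arbitrarily for illustration, and the log-determinant is checked against a numerical Jacobian:

```python
import numpy as np

# Sketch of a planar flow f(z) = z + u * h(w^T z + b) with h = tanh.
# The parameter values below are illustrative, not learned.
def planar_forward(z, u, w, b):
    a = w @ z + b                        # scalar pre-activation
    f = z + u * np.tanh(a)
    # log |det df/dz| = log |1 + u^T psi(z)|, psi(z) = h'(a) * w
    psi = (1.0 - np.tanh(a) ** 2) * w
    log_det = np.log(abs(1.0 + u @ psi))
    return f, log_det

u = np.array([0.3, -0.2])
w = np.array([0.5, 0.1])
b = 0.4
f, log_det = planar_forward(np.array([0.2, -1.0]), u, w, b)
```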
Radial flow
$$f(z)=z+\beta h(\alpha, r)\left(z-z_{0}\right), \quad r=\left\|z-z_{0}\right\|$$
captures more non-linear transformations around a reference point $z_{0}$, where $h(\alpha, r)=1 /(\alpha+r)$.
The log-determinant of the Jacobian is
$$\log \left|\operatorname{det} \frac{\partial f}{\partial z}\right|=(d-1) \log (1+\beta h(\alpha, r))+\log \left(1+\beta h(\alpha, r)+\beta h^{\prime}(\alpha, r)\, r\right)$$
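A corresponding sketch for the radial flow, again with the log-determinant checked against a numerical Jacobian; all parameter values are illustrative:

```python
import numpy as np

# Sketch of a radial flow f(z) = z + beta * h(alpha, r) * (z - z0),
# r = ||z - z0||, h(alpha, r) = 1 / (alpha + r). Parameters are illustrative.
def radial_forward(z, z0, alpha, beta):
    r = np.linalg.norm(z - z0)
    h = 1.0 / (alpha + r)
    f = z + beta * h * (z - z0)
    d = z.shape[0]
    # det = (1 + beta h)^(d-1) * (1 + beta h + beta h' r), h' = -1/(alpha+r)^2
    log_det = (d - 1) * np.log(1.0 + beta * h) + np.log(
        1.0 + beta * h - beta * r / (alpha + r) ** 2)
    return f, log_det

z0 = np.zeros(2)
f, log_det = radial_forward(np.array([0.6, 0.8]), z0, alpha=1.0, beta=0.5)
```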
Coupling layers
Given input $z$, the output of the transformation is
$$z^{\prime}=\left[\begin{array}{c}z_{1: j} \\ z_{j+1: d} \odot \sigma_{\theta}\left(z_{1: j}\right)+\mu_{\theta}\left(z_{1: j}\right)\end{array}\right]$$
Basically, we split the input $z$ into two parts: the first part passes through unchanged, and the second part undergoes an affine transformation conditioned on the first. $\mu_{\theta}, \sigma_{\theta}$ are neural networks (possibly with shared parameters).
They are nice because they have an easy inverse:
$z=\left[\begin{array}{c}z_{1: j} \\ \frac{z_{j+1: d}^{\prime}-\mu_{\theta}\left(z_{1: j}\right)}{\sigma_{\theta}\left(z_{1: j}\right)}\end{array}\right]$
The Jacobian is also easy to compute, as it has triangular structure
$\frac{\partial z^{\prime}}{\partial z}=\left[\begin{array}{cc}\mathbb{I}_{j} & 0 \\ \frac{\partial z_{j+1: d}^{\prime}}{\partial z_{1: j}} & \operatorname{diag}\left(\sigma_{\theta}\left(z_{1: j}\right)\right)\end{array}\right]$
The log-determinant is $\sum_{i} \log \left[\sigma_{\theta}\left(z_{1: j}\right)\right]_{i}$, the sum of the logs of the diagonal entries.
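A minimal coupling-layer sketch, with simple fixed functions standing in for the conditioner networks $\mu_{\theta}, \sigma_{\theta}$ (in practice these are learned):

```python
import numpy as np

# Sketch of an affine coupling layer on a d=4 vector split at j=2.
# mu and sigma are simple fixed functions standing in for the
# conditioner networks mu_theta, sigma_theta (illustrative, not learned).
def mu(z1):
    return np.tanh(z1)

def sigma(z1):
    return np.exp(0.1 * z1) + 0.5        # kept strictly positive

def coupling_forward(z, j):
    z1, z2 = z[:j], z[j:]
    out = np.concatenate([z1, z2 * sigma(z1) + mu(z1)])
    log_det = np.sum(np.log(sigma(z1)))  # sum of log diagonal entries
    return out, log_det

def coupling_inverse(z_out, j):
    z1, z2p = z_out[:j], z_out[j:]
    return np.concatenate([z1, (z2p - mu(z1)) / sigma(z1)])

z = np.array([0.5, -1.2, 0.3, 2.0])
out, log_det = coupling_forward(z, j=2)
```

The inverse needs no inversion of the networks themselves, only of the elementwise affine map, which is what makes coupling layers cheap in both directions.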
VAE with Normalizing Flows
Paper: Rezende and Mohamed, Variational Inference with Normalizing Flows
In Variational Autoencoders, the approximate posterior is typically a simple Gaussian. We can combine VAEs with NFs, which can produce highly complex posterior densities instead of simple ones.
The evidence lower bound is
$$\mathcal{L}=\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)]-\mathrm{KL}(q(z \mid x) \| p(z))$$
Replace the simple approximate posterior $q(z \mid x)$ by one transformed through normalizing flows.
With this combination, the KL divergence between the approximate and the true posterior can shrink towards zero, so we approach the true posterior.
Effect of number of transformations
The number of transformations has a significant effect on the final density.
Normalizing flows in images
Normalizing flows are continuous transformations, but images contain discrete values.
So an NF model will assign $\delta$-peak probabilities on integer (pixel) values only. These probabilities are nonsensical: there is no smoothness between values. We have to convert these peaked probabilities into a continuous density.
(Variational) Dequantization
Dequantization ensures that the input values are continuously distributed. If this step is left out, the model will place all the probability mass on the discrete values, resulting in $\delta$-peaks around them. This means that we can't interpret these probabilities as likelihoods (they are infinite at the discrete points and only take finite values when integrated over an interval), rendering the output useless for further processing.
We do this by adding (continuous) noise $u \sim q(u \mid x)$ to the input variables: $v=x+u$. The data log-likelihood is then changed to integrate out the noise variable:
$$\log p(x)=\log \int p(x+u)\, d u \geq \mathbb{E}_{q(u \mid x)}\left[\log \frac{p(x+u)}{q(u \mid x)}\right]$$
If $q(u \mid x)$ is the uniform distribution, then we have standard dequantization
- The probability between two consecutive integer values is constant, which yields box-like boundaries between values
It is better to learn $q(u \mid x)$ in a variational manner, which is called variational dequantization.
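A minimal sketch of uniform dequantization for 8-bit pixels; the helper name `dequantize` is illustrative, and variational dequantization would sample `u` from a learned $q(u \mid x)$ instead of a uniform distribution:

```python
import numpy as np

# Sketch of uniform dequantization: add u ~ U[0,1) to the integer pixel
# values and rescale to [0,1). With uniform noise this is the standard
# scheme; variational dequantization learns q(u|x) instead.
rng = np.random.default_rng(0)

def dequantize(x, num_bits=8):
    u = rng.uniform(size=x.shape)            # continuous noise in [0, 1)
    return (x.astype(np.float64) + u) / 2 ** num_bits

pixels = np.array([[0, 255], [128, 64]])
v = dequantize(pixels)                        # continuous values in [0, 1)
```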
Splitting dimensions for coupling layer
Use masking: a checkerboard pattern, or splitting across channels
We have to alternate which dimensions are transformed between consecutive layers, so that it is not always the same $1:j$ dimensions that remain untouched.
Multi-scale architecture
One disadvantage of normalizing flows is that they operate on exactly the same dimensions as the input. If the input is high-dimensional, so is the latent space, which requires larger computational cost to learn suitable transformations. However, particularly in the image domain, many pixels carry little information, in the sense that we could remove them without losing the semantic content of the image. Based on this intuition, deep normalizing flows on images commonly apply a multi-scale architecture. After the first N flow transformations, we split off half of the latent dimensions and directly evaluate them on the prior. The other half is run through N more flow transformations and, depending on the size of the input, is either split again in half or left whole at this point. The two operations involved in this setup are Squeeze and Split.
Squeeze operation: e.g., an input of 4×4×1 is reshaped to 2×2×4 by grouping the pixels into 2×2×1 subsquares.
The split operation then divides the input into two parts, and evaluates one part directly on the prior.
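The squeeze and split operations can be sketched with plain array reshaping (the function names are illustrative):

```python
import numpy as np

# Sketch of the squeeze and split operations (function names illustrative).
# Squeeze groups each 2x2x1 subsquare into channels: H x W x C -> H/2 x W/2 x 4C.
def squeeze(x):
    h, w, c = x.shape
    x = x.reshape(h // 2, 2, w // 2, 2, c)
    x = x.transpose(0, 2, 1, 3, 4)           # move the 2x2 block next to channels
    return x.reshape(h // 2, w // 2, 4 * c)

def split(x):
    # evaluate the first half of the channels on the prior,
    # pass the second half on to further flow transformations
    c = x.shape[-1]
    return x[..., : c // 2], x[..., c // 2:]

img = np.arange(16).reshape(4, 4, 1)          # a 4x4x1 "image"
sq = squeeze(img)                             # shape (2, 2, 4)
z_prior, z_next = split(sq)                   # each of shape (2, 2, 2)
```

Each output channel group of `sq` holds one 2×2 spatial subsquare of the input, matching the 4×4×1 → 2×2×4 example above.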
Advantages and disadvantages
- Starting from a simple density like a unit Gaussian, we can obtain any complex density that matches our data, without even knowing its analytic form
- Tractable density estimation
- Efficient parallel sampling and learning
- Often very many transformations are required, so very large networks are needed
- Constrained to invertible transformations with tractable determinant
- Tied encoder and decoder weights
- Transformations cannot easily introduce bottlenecks; we have to resort to solutions like the multi-scale architecture.
References
- What are normalizing flows? video by Ari Seff https://www.youtube.com/watch?v=i7LjDvsLWCg&t=490s
- Some good blogs on flows: https://maurocamaraescudero.netlify.app/ai-blogs/
- VAEs with normalizing flows https://arxiv.org/abs/1505.05770v6
- Lilian Weng's post https://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html
- Tutorial on NFs https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial11/NF_image_modeling.html
- Normalizing flow models notes CS236 https://deepgenerativemodels.github.io/notes/flow/#fnref:nf
- Lecture 11.3, UvA Deep Learning Course 2020