Normalizing Flows
Often a simple posterior approximation is not enough. Imagine that the data is generated from two modes; approximating it with a single standard Gaussian would then be problematic.
If we apply a transformation to a simple input density, e.g., a Gaussian, we can morph it into a more complicated density. The key idea behind normalizing flows is that by composing many such transformations, we can model any complex density.
Change of variables
For 1-d variables we know that
$$\int f(u)\, d u=\int f(g(x))\, g^{\prime}(x)\, d x, \quad u=g(x)$$
This is called change of variables (or integration by substitution), where $d u=g^{\prime}(x)\, d x$.
For multivariate cases
$$p_{x}(x)=p_{u}(u)\left|\operatorname{det} \frac{\partial u}{\partial x}\right|, \quad u=g(x)$$
Note that both $u$ and $x$ have the same dimensionality. The magnitude of the determinant of the Jacobian tells us how much the volume of the input probability space expands or contracts as a result of the transformation. The volumes must change in a compensating way so that $\int p(z)\, d z=1$ and $\int p(x)\, d x=1$ both hold. In this sense, normalizing flows expand or contract the density.
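As a sanity check, the multivariate formula can be verified numerically on an invertible affine map. The matrix `A`, offset `b`, and test point `x` below are illustrative values, not from the notes:

```python
import numpy as np

# Sketch: check p_x(x) = p_u(u) |det du/dx| for an affine map u = A x + b,
# with a standard Gaussian density on u. A, b, x are illustrative values.
def gauss_pdf(v):
    """Standard multivariate normal density at v."""
    d = v.shape[0]
    return np.exp(-0.5 * v @ v) / (2.0 * np.pi) ** (d / 2)

A = np.array([[2.0, 0.5], [0.0, 1.5]])   # invertible, det = 3
b = np.array([1.0, -1.0])
x = np.array([0.3, -0.7])

u = A @ x + b
p_x = gauss_pdf(u) * abs(np.linalg.det(A))   # induced density on x
```

Because the map is affine, `p_x` agrees exactly with the density of the Gaussian that `x` follows when `u` is standard normal.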
Normalizing flows model
Our model is an encoder $f: x \rightarrow z$. It maps the input $x$ with density $p(x)$ to the latent $z$ with density $p(z)$.
The inverse model is the decoder $f^{-1}: z \rightarrow x$. It maps the latent $z$ back to the input $x$.
For our forward model we have
$$p_{x}(x)=p_{z}(f(x))\left|\operatorname{det} \frac{\partial f(x)}{\partial x}\right|$$
Thus we have an explicit connection between the density of the input and the density of the latent. This allows us to compute the exact data likelihood.
Stacking normalizing flows
The change of variables can be applied recursively: with $z_{0} \sim p_{0}$ and $z_{k}=f_{k}(z_{k-1})$ for $k=1, \ldots, K$, the data is $x=z_{K}$.
The log density of our data is
$$\log p(x)=\log p_{0}(z_{0})-\sum_{k=1}^{K} \log \left|\operatorname{det} \frac{\partial f_{k}}{\partial z_{k-1}}\right|$$
which we can optimize easily with maximum likelihood estimation.
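The stacked log-likelihood can be sketched for a chain of 1-d affine flows. The `(a_k, c_k)` parameter values below are illustrative, not learned:

```python
import numpy as np

# Sketch: exact log-likelihood through a stack of K invertible 1-d affine
# flows f_k(z) = a_k * z + c_k. The (a_k, c_k) values are illustrative.
flows = [(2.0, 1.0), (0.5, -3.0), (1.5, 0.25)]

def log_prob(x):
    z = x
    log_det_sum = 0.0
    for a, c in reversed(flows):           # invert the last flow first
        z = (z - c) / a                    # f_k^{-1}
        log_det_sum += np.log(abs(a))      # log |det df_k / dz_{k-1}|
    # log p(x) = log p_0(z_0) - sum_k log |det df_k / dz_{k-1}|
    log_base = -0.5 * z ** 2 - 0.5 * np.log(2.0 * np.pi)
    return log_base - log_det_sum
```

In a real model the parameters would be trained by maximizing `log_prob` over the data.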
We are guaranteed that, in the limit of enough transformations, the two distributions will match.
Transformations
We want smooth, differentiable transformations $f_{k}$
- for which the inverse $f_{k}^{-1}$ is easy to compute,
- and for which the determinant of the Jacobian $\operatorname{det} \frac{\partial f_{k}}{\partial z_{k-1}}$ is easy to compute.
Example transformations
Planar flows
Radial flows
Coupling layers
Planar flow
$$f(z)=z+u\, h\left(w^{\top} z+b\right)$$
$u, w, b$ are free parameters and $h$ is an element-wise non-linearity (element-wise so that it is easy to invert).
The log-determinant of the Jacobian is
$$\log \left|\operatorname{det} \frac{\partial f}{\partial z}\right|=\log \left|1+u^{\top} \psi(z)\right|, \quad \psi(z)=h^{\prime}\left(w^{\top} z+b\right) w$$
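A minimal sketch of a planar flow with $h=\tanh$; the parameter values are chosen arbitrarily for illustration, and the log-determinant is checked against a numerical Jacobian:

```python
import numpy as np

# Sketch of a planar flow f(z) = z + u * h(w^T z + b) with h = tanh.
# The parameter values below are illustrative, not learned.
def planar_forward(z, u, w, b):
    a = w @ z + b                        # scalar pre-activation
    f = z + u * np.tanh(a)
    # log |det df/dz| = log |1 + u^T psi(z)|, psi(z) = h'(a) * w
    psi = (1.0 - np.tanh(a) ** 2) * w
    log_det = np.log(abs(1.0 + u @ psi))
    return f, log_det

u = np.array([0.3, -0.2])
w = np.array([0.5, 0.1])
b = 0.4
f, log_det = planar_forward(np.array([0.2, -1.0]), u, w, b)
```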
Radial flow
$$f(z)=z+\beta h(\alpha, r)\left(z-z_{0}\right), \quad r=\left\|z-z_{0}\right\|$$
captures more non-linear transformations around a reference point $z_{0}$, where $h(\alpha, r)=1 /(\alpha+r)$.
The log-determinant of the Jacobian is
$$\log \left|\operatorname{det} \frac{\partial f}{\partial z}\right|=(d-1) \log (1+\beta h(\alpha, r))+\log \left(1+\beta h(\alpha, r)+\beta h^{\prime}(\alpha, r)\, r\right)$$
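A corresponding sketch for the radial flow, again with the log-determinant checked against a numerical Jacobian; all parameter values are illustrative:

```python
import numpy as np

# Sketch of a radial flow f(z) = z + beta * h(alpha, r) * (z - z0),
# r = ||z - z0||, h(alpha, r) = 1 / (alpha + r). Parameters are illustrative.
def radial_forward(z, z0, alpha, beta):
    r = np.linalg.norm(z - z0)
    h = 1.0 / (alpha + r)
    f = z + beta * h * (z - z0)
    d = z.shape[0]
    # det = (1 + beta h)^(d-1) * (1 + beta h + beta h' r), h' = -1/(alpha+r)^2
    log_det = (d - 1) * np.log(1.0 + beta * h) + np.log(
        1.0 + beta * h - beta * r / (alpha + r) ** 2)
    return f, log_det

z0 = np.zeros(2)
f, log_det = radial_forward(np.array([0.6, 0.8]), z0, alpha=1.0, beta=0.5)
```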
Coupling layers
Given input $z$, the output of the transformation is
$$z^{\prime}=\left[\begin{array}{c}z_{1: j} \\ z_{j+1: d} \odot \sigma_{\theta}\left(z_{1: j}\right)+\mu_{\theta}\left(z_{1: j}\right)\end{array}\right]$$
Basically, we split the input $z$ into two parts: the first part passes through unchanged, and the second part undergoes an affine transformation conditioned on the first. $\mu_{\theta}, \sigma_{\theta}$ are neural networks (possibly with shared parameters).
They are nice because they have an easy inverse:
$z=\left[\begin{array}{c}z_{1: j} \\ \frac{z_{j+1: d}^{\prime}-\mu_{\theta}\left(z_{1: j}\right)}{\sigma_{\theta}\left(z_{1: j}\right)}\end{array}\right]$
The Jacobian is also easy to compute, as it has triangular structure
$\frac{\partial z^{\prime}}{\partial z}=\left[\begin{array}{cc}\mathbb{I}_{j} & 0 \\ \frac{\partial z_{j+1: d}^{\prime}}{\partial z_{1: j}} & \operatorname{diag}\left(\sigma_{\theta}\left(z_{1: j}\right)\right)\end{array}\right]$
The log-determinant is $\sum_{i} \log \left[\sigma_{\theta}\left(z_{1: j}\right)\right]_{i}$, the sum of the logs of the diagonal entries.
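A minimal coupling-layer sketch, with simple fixed functions standing in for the conditioner networks $\mu_{\theta}, \sigma_{\theta}$ (in practice these are learned):

```python
import numpy as np

# Sketch of an affine coupling layer on a d=4 vector split at j=2.
# mu and sigma are simple fixed functions standing in for the
# conditioner networks mu_theta, sigma_theta (illustrative, not learned).
def mu(z1):
    return np.tanh(z1)

def sigma(z1):
    return np.exp(0.1 * z1) + 0.5        # kept strictly positive

def coupling_forward(z, j):
    z1, z2 = z[:j], z[j:]
    out = np.concatenate([z1, z2 * sigma(z1) + mu(z1)])
    log_det = np.sum(np.log(sigma(z1)))  # sum of log diagonal entries
    return out, log_det

def coupling_inverse(z_out, j):
    z1, z2p = z_out[:j], z_out[j:]
    return np.concatenate([z1, (z2p - mu(z1)) / sigma(z1)])

z = np.array([0.5, -1.2, 0.3, 2.0])
out, log_det = coupling_forward(z, j=2)
```

The inverse needs no inversion of the networks themselves, only of the elementwise affine map, which is what makes coupling layers cheap in both directions.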
VAE with Normalizing Flows
Paper: Rezende and Mohamed, Variational Inference with Normalizing Flows
In Variational Autoencoders, the approximate posterior is typically a simple Gaussian. We can combine VAEs with NFs, which can produce highly complex posterior densities instead of simple ones.
The evidence lower bound is
$$\mathcal{L}=\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)]-\mathrm{KL}(q(z \mid x) \| p(z))$$
Replace the simple approximate posterior $q(z \mid x)$ by one transformed through normalizing flows.
With this combination, the KL divergence between the approximate and the true posterior can shrink towards zero, so we approach the true posterior.
Effect of number of transformations
The number of transformations has a significant effect on the final density.
Normalizing flows in images
Normalizing flows are continuous transformations, but images contain discrete values.
So an NF model will assign $\delta$-peak probabilities on integer (pixel) values only. These probabilities are nonsensical: there is no smoothness between values. We have to convert these peaked probabilities into a continuous density.
(Variational) Dequantization
Dequantization ensures that the input values are continuously distributed. If this step is left out, the model will place all the probability mass on the discrete values, resulting in $\delta$-peaks around them. This means that we can't interpret these probabilities as likelihoods (they are infinite at the discrete points and only take finite values when integrated over an interval), rendering the output useless for further processing.
We do this by adding (continuous) noise $u \sim q(u \mid x)$ to the input variables: $v=x+u$. The data log-likelihood is then changed to integrate out the noise variable:
$$\log p(x)=\log \int p(x+u)\, d u \geq \mathbb{E}_{q(u \mid x)}\left[\log \frac{p(x+u)}{q(u \mid x)}\right]$$
If $q(u \mid x)$ is the uniform distribution, then we have standard dequantization
- The probability between two consecutive integer values is constant, which yields box-like boundaries between values
It is better to learn $q(u \mid x)$ in a variational manner, which is called variational dequantization.
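A minimal sketch of uniform dequantization for 8-bit pixels; the helper name `dequantize` is illustrative, and variational dequantization would sample `u` from a learned $q(u \mid x)$ instead of a uniform distribution:

```python
import numpy as np

# Sketch of uniform dequantization: add u ~ U[0,1) to the integer pixel
# values and rescale to [0,1). With uniform noise this is the standard
# scheme; variational dequantization learns q(u|x) instead.
rng = np.random.default_rng(0)

def dequantize(x, num_bits=8):
    u = rng.uniform(size=x.shape)            # continuous noise in [0, 1)
    return (x.astype(np.float64) + u) / 2 ** num_bits

pixels = np.array([[0, 255], [128, 64]])
v = dequantize(pixels)                        # continuous values in [0, 1)
```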
Splitting dimensions for coupling layer
Use masking: a checkerboard pattern, or splitting across channels
We have to alternate which dimensions are transformed between consecutive layers, so that it is not always the same $1:j$ dimensions that remain untouched.
Multi-scale architecture
One disadvantage of normalizing flows is that they operate on exactly the same dimensions as the input. If the input is high-dimensional, so is the latent space, which requires larger computational cost to learn suitable transformations. However, particularly in the image domain, many pixels carry little information, in the sense that we could remove them without losing the semantic content of the image. Based on this intuition, deep normalizing flows on images commonly apply a multi-scale architecture. After the first N flow transformations, we split off half of the latent dimensions and directly evaluate them on the prior. The other half is run through N more flow transformations and, depending on the size of the input, is either split again in half or left whole at this point. The two operations involved in this setup are Squeeze and Split.
Squeeze operation: e.g., an input of 4×4×1 is reshaped to 2×2×4 by grouping the pixels into 2×2×1 subsquares.
The split operation then divides the input into two parts, and evaluates one part directly on the prior.
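The squeeze and split operations can be sketched with plain array reshaping (the function names are illustrative):

```python
import numpy as np

# Sketch of the squeeze and split operations (function names illustrative).
# Squeeze groups each 2x2x1 subsquare into channels: H x W x C -> H/2 x W/2 x 4C.
def squeeze(x):
    h, w, c = x.shape
    x = x.reshape(h // 2, 2, w // 2, 2, c)
    x = x.transpose(0, 2, 1, 3, 4)           # move the 2x2 block next to channels
    return x.reshape(h // 2, w // 2, 4 * c)

def split(x):
    # evaluate the first half of the channels on the prior,
    # pass the second half on to further flow transformations
    c = x.shape[-1]
    return x[..., : c // 2], x[..., c // 2:]

img = np.arange(16).reshape(4, 4, 1)          # a 4x4x1 "image"
sq = squeeze(img)                             # shape (2, 2, 4)
z_prior, z_next = split(sq)                   # each of shape (2, 2, 2)
```

Each output channel group of `sq` holds one 2×2 spatial subsquare of the input, matching the 4×4×1 → 2×2×4 example above.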
Advantages and disadvantages
- Starting from a simple density like a unit Gaussian, we can obtain any complex density that matches our data, without even knowing its analytic form
- Tractable density estimation
- Efficient parallel sampling and learning
- Often very many transformations are required, so very large networks are needed
- Constrained to invertible transformations with tractable determinant
- Tied encoder and decoder weights
- Transformations cannot easily introduce bottlenecks; we have to resort to solutions like the multi-scale architecture.
References
- What are normalizing flows? video by Ari Seff https://www.youtube.com/watch?v=i7LjDvsLWCg&t=490s
- Some good blogs on flows: https://maurocamaraescudero.netlify.app/ai-blogs/
- VAEs with normalizing flows https://arxiv.org/abs/1505.05770v6
- Lilian Weng's post https://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html
- Tutorial on NFs https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial11/NF_image_modeling.html
- Normalizing flow models notes CS236 https://deepgenerativemodels.github.io/notes/flow/#fnref:nf
- Lecture 11.3, UvA Deep Learning Course 2020