Pathwise Gradient Estimator
Also known as the 'reparameterization trick'.
Often a complex probability density can be rewritten as a deterministic transformation of a simpler one. Pathwise estimators exploit this: they transform simple random samples (like standard normal) into samples from complex distributions using a deterministic function.
Because of this, the stochasticity now flows through a simple probability density, while the complexity flows through the deterministic transformation. For neural networks this means backprop, which works through deterministic functions only, becomes possible.
At the heart of this method is the change of variables formula.
We have seen Normalizing Flows use the same property.
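As a reminder (standard change of variables, assuming an invertible, differentiable $g$): if $z = g(\varepsilon)$ with $\varepsilon \sim p(\varepsilon)$, then

$$ q(z)=p(\varepsilon)\left|\operatorname{det} \nabla_{\varepsilon} g(\varepsilon)\right|^{-1} \quad \text{and} \quad \mathbb{E}_{z \sim q(z)}[f(z)]=\mathbb{E}_{\varepsilon \sim p(\varepsilon)}[f(g(\varepsilon))] $$

The second identity is the one pathwise estimators rely on: the expectation over the complex $q(z)$ turns into an expectation over the simple base density $p(\varepsilon)$.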
Deriving the pathwise gradient estimator
As a use case, consider the following expectation from the VAE objective: $\nabla_{\varphi} \mathbb{E}_{\mathbf{z} \sim q_{\varphi}(z \mid x)}[\log p(x \mid z)]$
We also have $z=g(\varepsilon, \varphi \mid x)=\mu_{x}+\varepsilon \cdot \sigma_{x}$ with $\varepsilon \sim \mathcal{N}(0,1)$, where $\varphi=\left(\mu_{x}, \sigma_{x}\right) \Rightarrow d z=\sigma_{x}\, d \varepsilon$
and $\left|\operatorname{det} \nabla_{\varepsilon} g(\varepsilon, \varphi \mid x)\right|=\sigma_{x}$
Now, using pathwise gradient estimation,
$$ \nabla_{\varphi} \mathbb{E}_{\mathbf{z} \sim q_{\varphi}(z \mid x)}[\log p(x \mid z)]=\nabla_{\varphi} \mathbb{E}_{\varepsilon \sim \mathcal{N}(0,1)}[\log p(x \mid g(\varepsilon, \varphi \mid x))]=\mathbb{E}_{\varepsilon \sim \mathcal{N}(0,1)}\left[\nabla_{\varphi} \log p(x \mid g(\varepsilon, \varphi \mid x))\right] $$
The gradient moves inside the expectation because the base density $\mathcal{N}(0,1)$ no longer depends on $\varphi$; a Monte Carlo average over samples $\varepsilon^{(i)}$ then gives an unbiased estimate.
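A minimal PyTorch sketch of this Gaussian reparameterization. The decoder likelihood $\log p(x \mid z)$ is a toy stand-in here (a real VAE would use a neural network), and all names are illustrative:

```python
import torch

# Variational parameters phi = (mu_x, sigma_x) of q_phi(z | x).
mu = torch.tensor([0.5], requires_grad=True)
log_sigma = torch.tensor([0.0], requires_grad=True)  # sigma = exp(log_sigma) > 0

# Hypothetical stand-in for the decoder likelihood log p(x | z).
x = torch.tensor([1.0])
def log_p_x_given_z(z):
    return -0.5 * (x - z) ** 2  # Gaussian log-likelihood up to a constant

# Pathwise estimator: sample parameter-free noise, transform it deterministically.
eps = torch.randn(1)                    # eps ~ N(0, 1), independent of phi
z = mu + eps * torch.exp(log_sigma)     # z = g(eps, phi | x)

# Backprop flows through the deterministic map g into mu and log_sigma.
loss = -log_p_x_given_z(z).mean()
loss.backward()
print(mu.grad, log_sigma.grad)
```

Note that a single noise sample is drawn; gradients reach $\mu_{x}$ and $\sigma_{x}$ only because $z$ is a deterministic function of them.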
Properties
No need to know the pdf explicitly: only the deterministic transformation and the base sampling distribution are required.
They require differentiable cost functions: pathwise estimators work by reparameterizing $x=g(\varepsilon, \theta)$, where $\varepsilon$ is parameter-free noise, and then computing
$$ \nabla_{\theta} \mathbb{E}[f(x)]=\mathbb{E}\left[\nabla_{\theta} f(g(\varepsilon, \theta))\right]=\mathbb{E}\left[\nabla_{x} f(x) \cdot \nabla_{\theta} g(\varepsilon, \theta)\right] $$
This requires $\nabla_{x} f(x)$ to exist, so $f(x)$ must be differentiable.
The REINFORCE / score function estimator circumvents this by using the identity $\nabla_{\theta} \mathbb{E}[f(x)]=\mathbb{E}\left[f(x) \nabla_{\theta} \log p(x \mid \theta)\right]$. Here, the gradient is taken only of the log-probability $\log p(x \mid \theta)$, while the cost function $f(x)$ appears as a multiplicative weight in the expectation. Since $f(x)$ is never differentiated, it can be any function: discontinuous, discrete-valued, or non-differentiable.
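A small NumPy check of this property under an assumed toy setup: $x \sim \mathcal{N}(\theta, 1)$, so $\nabla_{\theta} \log p(x \mid \theta)=x-\theta$, and the cost $f$ is a non-differentiable indicator:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 0.3, 200_000

# Non-differentiable cost: an indicator function.
f = lambda x: (x > 0).astype(float)

# x ~ N(theta, 1)  =>  grad_theta log p(x | theta) = x - theta
x = rng.normal(theta, 1.0, size=n)
score_grad = np.mean(f(x) * (x - theta))

# Ground truth: d/dtheta P(x > 0) = d/dtheta Phi(theta) = standard normal pdf at theta
true_grad = np.exp(-0.5 * theta**2) / np.sqrt(2 * np.pi)
print(score_grad, true_grad)  # the two should roughly agree
```

A pathwise estimator would fail here, since $\nabla_{x} f(x)$ is zero almost everywhere.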
Low variance in general
- Lower than the REINFORCE / score function estimator.
- Example: comparing the VAE score-function and pathwise gradient estimates below, the score-function estimate has an extra multiplicative term, $\log p(x \mid \mathbf{z}^{(i)})$, which increases variance (see the numerical check after the equation).
$$ \underbrace{\frac{1}{n} \sum_{i} \log p\left(x \mid \mathbf{z}^{(i)}\right) \nabla_{\varphi} \log q_{\varphi}\left(\mathbf{z}^{(i)} \mid x\right)}_{\text{score function}} \qquad \underbrace{\frac{1}{n} \sum_{i} \nabla_{\varphi} \log p\left(x \mid g\left(\varepsilon^{(i)}, \varphi\right)\right)}_{\text{pathwise}} $$
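A quick numerical illustration on a toy objective rather than the VAE itself (assumed setup: $f(x)=x^{2}$ with $x \sim \mathcal{N}(\theta, 1)$, whose true gradient is $2 \theta$):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 100_000

# Toy objective: grad_theta E_{x ~ N(theta, 1)}[x^2] = 2 * theta = 3.
eps = rng.normal(size=n)
x = theta + eps                  # reparameterization: x = g(eps, theta) = theta + eps

score = x**2 * (x - theta)       # f(x) * grad_theta log p(x | theta)
pathwise = 2 * x                 # grad_x f(x) * grad_theta g(eps, theta) = 2x * 1

print(f"score:    mean {score.mean():.3f}  var {score.var():.3f}")
print(f"pathwise: mean {pathwise.mean():.3f}  var {pathwise.var():.3f}")
```

Both estimators are unbiased (both means are close to 3), but the per-sample variance of the score-function estimate is roughly an order of magnitude larger here.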
Very efficient: this is the reason they were proposed for the VAE. Even a single sample suffices, no matter the dimensionality.