Pathwise Gradient Estimator
Also known as the 'reparameterization trick'.
Often a complex probability density can be rewritten as a deterministic transformation of a simpler one. Pathwise estimators exploit this: they transform simple random samples (like standard normal) into samples from complex distributions using a deterministic function.
Because of this, the stochasticity now flows through a simple probability density, while the complexity flows through the deterministic transformation. For neural networks this means backprop, which works through deterministic functions only, becomes possible.
At the heart of this method is the change of variables formula.
We have seen Normalizing Flows use the same property.
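As a reminder (standard change of variables, assuming an invertible, differentiable $g$): if $z = g(\varepsilon)$ with $\varepsilon \sim p(\varepsilon)$, then

$$ q(z)=p(\varepsilon)\left|\operatorname{det} \nabla_{\varepsilon} g(\varepsilon)\right|^{-1} \quad \text{and} \quad \mathbb{E}_{z \sim q(z)}[f(z)]=\mathbb{E}_{\varepsilon \sim p(\varepsilon)}[f(g(\varepsilon))] $$

The second identity is the one pathwise estimators rely on: the expectation over the complex $q(z)$ turns into an expectation over the simple base density $p(\varepsilon)$.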
Deriving the pathwise gradient estimator
As a use case, consider the following expectation from the VAE objective: $\nabla_{\varphi} \mathbb{E}_{\mathbf{z} \sim q_{\varphi}(z \mid x)}[\log p(x \mid z)]$
We also have $z=g(\varepsilon, \varphi \mid x)=\mu_{x}+\varepsilon \cdot \sigma_{x}$ with $\varepsilon \sim \mathcal{N}(0,1)$, where $\varphi=\left(\mu_{x}, \sigma_{x}\right) \Rightarrow d z=\sigma_{x}\, d \varepsilon$
and $\left|\operatorname{det} \nabla_{\varepsilon} g(\varepsilon, \varphi \mid x)\right|=\sigma_{x}$
Now, using pathwise gradient estimation,
$$ \nabla_{\varphi} \mathbb{E}_{\mathbf{z} \sim q_{\varphi}(z \mid x)}[\log p(x \mid z)]=\nabla_{\varphi} \mathbb{E}_{\varepsilon \sim \mathcal{N}(0,1)}[\log p(x \mid g(\varepsilon, \varphi \mid x))]=\mathbb{E}_{\varepsilon \sim \mathcal{N}(0,1)}\left[\nabla_{\varphi} \log p(x \mid g(\varepsilon, \varphi \mid x))\right] $$
The gradient moves inside the expectation because the base density $\mathcal{N}(0,1)$ no longer depends on $\varphi$; a Monte Carlo average over samples $\varepsilon^{(i)}$ then gives an unbiased estimate.
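A minimal PyTorch sketch of this Gaussian reparameterization. The decoder likelihood $\log p(x \mid z)$ is a toy stand-in here (a real VAE would use a neural network), and all names are illustrative:

```python
import torch

# Variational parameters phi = (mu_x, sigma_x) of q_phi(z | x).
mu = torch.tensor([0.5], requires_grad=True)
log_sigma = torch.tensor([0.0], requires_grad=True)  # sigma = exp(log_sigma) > 0

# Hypothetical stand-in for the decoder likelihood log p(x | z).
x = torch.tensor([1.0])
def log_p_x_given_z(z):
    return -0.5 * (x - z) ** 2  # Gaussian log-likelihood up to a constant

# Pathwise estimator: sample parameter-free noise, transform it deterministically.
eps = torch.randn(1)                    # eps ~ N(0, 1), independent of phi
z = mu + eps * torch.exp(log_sigma)     # z = g(eps, phi | x)

# Backprop flows through the deterministic map g into mu and log_sigma.
loss = -log_p_x_given_z(z).mean()
loss.backward()
print(mu.grad, log_sigma.grad)
```

Note that a single noise sample is drawn; gradients reach $\mu_{x}$ and $\sigma_{x}$ only because $z$ is a deterministic function of them.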
Properties
No need to know the pdf explicitly: only the deterministic transformation and the base sampling distribution are required.
They require differentiable cost functions: pathwise estimators work by reparameterizing $x=g(\varepsilon, \theta)$, where $\varepsilon$ is parameter-free noise, and then computing
$$ \nabla_{\theta} \mathbb{E}[f(x)]=\mathbb{E}\left[\nabla_{\theta} f(g(\varepsilon, \theta))\right]=\mathbb{E}\left[\nabla_{x} f(x) \cdot \nabla_{\theta} g(\varepsilon, \theta)\right] $$
This requires $\nabla_{x} f(x)$ to exist, so $f(x)$ must be differentiable.
The REINFORCE / score function estimator circumvents this by using the identity $\nabla_{\theta} \mathbb{E}[f(x)]=\mathbb{E}\left[f(x) \nabla_{\theta} \log p(x \mid \theta)\right]$. Here, the gradient is taken only of the log-probability $\log p(x \mid \theta)$, while the cost function $f(x)$ appears as a multiplicative weight in the expectation. Since $f(x)$ is never differentiated, it can be any function: discontinuous, discrete-valued, or non-differentiable.
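A small NumPy check of this property under an assumed toy setup: $x \sim \mathcal{N}(\theta, 1)$, so $\nabla_{\theta} \log p(x \mid \theta)=x-\theta$, and the cost $f$ is a non-differentiable indicator:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 0.3, 200_000

# Non-differentiable cost: an indicator function.
f = lambda x: (x > 0).astype(float)

# x ~ N(theta, 1)  =>  grad_theta log p(x | theta) = x - theta
x = rng.normal(theta, 1.0, size=n)
score_grad = np.mean(f(x) * (x - theta))

# Ground truth: d/dtheta P(x > 0) = d/dtheta Phi(theta) = standard normal pdf at theta
true_grad = np.exp(-0.5 * theta**2) / np.sqrt(2 * np.pi)
print(score_grad, true_grad)  # the two should roughly agree
```

A pathwise estimator would fail here, since $\nabla_{x} f(x)$ is zero almost everywhere.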
Low variance in general
- Lower than the REINFORCE / score function estimator.
- Example: comparing the VAE score-function and pathwise gradient estimates below, the score-function estimate has an extra multiplicative term, $\log p(x \mid \mathbf{z}^{(i)})$, which increases variance (see the numerical check after the equation).
$$ \underbrace{\frac{1}{n} \sum_{i} \log p\left(x \mid \mathbf{z}^{(i)}\right) \nabla_{\varphi} \log q_{\varphi}\left(\mathbf{z}^{(i)} \mid x\right)}_{\text{score function}} \qquad \underbrace{\frac{1}{n} \sum_{i} \nabla_{\varphi} \log p\left(x \mid g\left(\varepsilon^{(i)}, \varphi\right)\right)}_{\text{pathwise}} $$
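A quick numerical illustration on a toy objective rather than the VAE itself (assumed setup: $f(x)=x^{2}$ with $x \sim \mathcal{N}(\theta, 1)$, whose true gradient is $2 \theta$):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 100_000

# Toy objective: grad_theta E_{x ~ N(theta, 1)}[x^2] = 2 * theta = 3.
eps = rng.normal(size=n)
x = theta + eps                  # reparameterization: x = g(eps, theta) = theta + eps

score = x**2 * (x - theta)       # f(x) * grad_theta log p(x | theta)
pathwise = 2 * x                 # grad_x f(x) * grad_theta g(eps, theta) = 2x * 1

print(f"score:    mean {score.mean():.3f}  var {score.var():.3f}")
print(f"pathwise: mean {pathwise.mean():.3f}  var {pathwise.var():.3f}")
```

Both estimators are unbiased (both means are close to 3), but the per-sample variance of the score-function estimate is roughly an order of magnitude larger here.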
Very efficient: this is the reason they were proposed for the VAE. Even a single sample suffices, no matter the dimensionality.