Stochastic Gradients

Created December 26, 2020 · Updated August 3, 2025

Stochastic gradients are gradient estimates computed from random samples, used when dealing with stochastic objectives - objectives that inherently involve randomness.

Examples include:

Variational Autoencoders: $\mathbb{E}_{q(z|x)}[\log p(x|z)] - \text{KL}(q(z|x)||p(z))$ - gradients involve sampling $$z$$ from encoder
Policy Gradient: $\mathbb{E}_{\pi}[R(\tau)]$ - gradients require sampling trajectories $\tau$ from policy $\pi$
Variational Inference: $\mathbb{E}_{q(z)}[\log p(x,z) - \log q(z)]$ - gradients involve sampling from approximate posterior $$q(z)$$

In these cases, the gradients are stochastic because the objective function itself contains random variables. Various techniques like Monte-Carlo Estimation, reparameterization tricks, and REINFORCE - Score Function Estimator are used to estimate these gradients.

Challenges

\eta=\nabla_{\varphi} \mathcal{F}(\varphi)=\nabla_{\varphi} \mathbb{E}_{\boldsymbol{x} \sim p_{\varphi}(\boldsymbol{x})}\left[f_{\theta}(\boldsymbol{x})\right]

$$x$$ is typically high dimensional
The parameters $\varphi$ are often in the order of thousands
The cost function is often not differentiable or even unknown
That is, the expectation (integral) is often intractable, we must estimate it instead, with MC integration.

Desired properties of MC estimators for gradients

Consistency: When sampling more samples the estimator $\hat{y}$ should get closer to the true $$y$$
Unbiasedness: Guarantees convergence of stochastic optimization
Low variance
- Few samples should suffice
- Less jiggling i.e. gradient updates in consistent direction which results in more efficient learning
Computational efficiency: Should be easy to sample and estimate

Stochastic optimization loop

Qualitative comparision between estimators

Pathwise gradients have consistently lower variance

For complex functions the pathwise gradient might have higher variance
pathwise-comparision

Straight-through gradients

Often, gradients are hard or impossible to compute. For instance, if we have binary stochastic variables $z \sim f(x), z \in\{0,1\}$
If we compute the derivative on the sample we would have $\frac{d z}{d x}=0$
$$z$$ is a constant value (not a function).

A popular alternative is straight-through gradients

We set the gradient is $\frac{d z}{d x}=1$
Another alternative is to set the gradient $\frac{d z}{d x}=\frac{d f}{d x}$

However, straight-through gradients introduce bias as our estimated gradient is different from the true gradient.

Variance reduction in deep networks

Control Variates

REBAR (Tucker et al.) - https://arxiv.org/abs/1703.07370

Low variance, unbiased gradient estimates for discrete latent variables
Inspired by REINFORCE and continuous relaxations
Removing the bias from the continuous relaxation

RELAX (Grathwohl et al.) - https://arxiv.org/pdf/1711.00123.pdf

Low variance, unbiased gradient estimates for black box functions
Applicable to discrete and continuous settings

Low bias low variance gradients

Paper: Pervez, Cohen and Gavves, Low Bias Low Variance Gradient Estimates for Hierarchical Boolean Stochastic Neticorks

Existing methods have troubles with deep Boolean stochastic nets

Successive straight-through in multiple layers fails

Efficient but the bias accumulates over multiple layers
Optimization quickly gets stuck and learning stops

Using unbiased estimates (REBAR, RELAX) is too inefficient

Expand boolean networks with harmonic analysis (Fourier)

Bias and variance is caused by higher order coefficients
Manipulates those coefficients to reduce bias and variance

Can train up to 80 layers instead of 2

References

Monte Carlo Gradient Estimation in Machine Learning https://arxiv.org/pdf/1906.10652.pdf (awesome paper)