Stochastic Gradients
Stochastic gradients are gradient estimates computed from random samples, used when dealing with stochastic objectives - objectives that inherently involve randomness.
Examples include:
- Variational Autoencoders: $\mathbb{E}_{q(z|x)}[\log p(x|z)] - \text{KL}(q(z|x)||p(z))$ - gradients involve sampling $z$ from encoder
- Policy Gradient: $\mathbb{E}_{\pi}[R(\tau)]$ - gradients require sampling trajectories $\tau$ from policy $\pi$
- Variational Inference: $\mathbb{E}_{q(z)}[\log p(x,z) - \log q(z)]$ - gradients involve sampling from approximate posterior $q(z)$
In these cases, the gradients are stochastic because the objective function itself contains random variables. Various techniques like Monte-Carlo Estimation, reparameterization tricks, and REINFORCE - Score Function Estimator are used to estimate these gradients.
Challenges
- $x$ is typically high dimensional
- The parameters $\varphi$ are often in the order of thousands
- The cost function is often not differentiable or even unknown
- That is, the expectation (integral) is often intractable, we must estimate it instead, with MC integration.
Desired properties of MC estimators for gradients
- Consistency: When sampling more samples the estimator $\hat{y}$ should get closer to the true $y$
- Unbiasedness: Guarantees convergence of stochastic optimization
- Low variance
- Few samples should suffice
- Less jiggling i.e. gradient updates in consistent direction which results in more efficient learning
- Computational efficiency: Should be easy to sample and estimate
Stochastic optimization loop
Qualitative comparision between estimators
- Pathwise gradients have consistently lower variance

For complex functions the pathwise gradient might have higher variance

Straight-through gradients
Often, gradients are hard or impossible to compute. For instance, if we have binary stochastic variables $z \sim f(x), z \in\{0,1\}$
If we compute the derivative on the sample we would have $\frac{d z}{d x}=0$
$z$ is a constant value (not a function).
A popular alternative is straight-through gradients
- We set the gradient is $\frac{d z}{d x}=1$
- Another alternative is to set the gradient $\frac{d z}{d x}=\frac{d f}{d x}$
However, straight-through gradients introduce bias as our estimated gradient is different from the true gradient.
Variance reduction in deep networks
REBAR (Tucker et al.) - https://arxiv.org/abs/1703.07370
- Low variance, unbiased gradient estimates for discrete latent variables
- Inspired by REINFORCE and continuous relaxations
- Removing the bias from the continuous relaxation
RELAX (Grathwohl et al.) - https://arxiv.org/pdf/1711.00123.pdf
- Low variance, unbiased gradient estimates for black box functions
- Applicable to discrete and continuous settings
Low bias low variance gradients
Paper: Pervez, Cohen and Gavves, Low Bias Low Variance Gradient Estimates for Hierarchical Boolean Stochastic Neticorks
Existing methods have troubles with deep Boolean stochastic nets
Successive straight-through in multiple layers fails
- Efficient but the bias accumulates over multiple layers
- Optimization quickly gets stuck and learning stops
Using unbiased estimates (REBAR, RELAX) is too inefficient
Expand boolean networks with harmonic analysis (Fourier)
- Bias and variance is caused by higher order coefficients
- Manipulates those coefficients to reduce bias and variance
Can train up to 80 layers instead of 2
References
- Monte Carlo Gradient Estimation in Machine Learning https://arxiv.org/pdf/1906.10652.pdf (awesome paper)