Importance Sampling
Importance sampling is a Monte Carlo technique for estimating expectations under one probability distribution using only samples drawn from a different distribution.
Suppose you want to estimate the expected value of a function $f(x)$ under a probability distribution $p(x)$, but you only have samples from a different distribution $q(x)$. Importance sampling can be used to provide an unbiased estimate of this expected value.
Mathematically, the expected value of $f(x)$ under $p(x)$ is given by:

$$\mathbb{E}_{p}[f(x)] = \int f(x)\,p(x)\,dx$$
However, you only have samples from $q(x)$. The key identity is that the expectation under $p$ can be rewritten as an expectation under $q$:

$$\mathbb{E}_{p}[f(x)] = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx = \mathbb{E}_{q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right]$$

The importance sampling estimator is therefore a weighted average of the function values at the sampled points, with weights given by the ratio of the densities under the two distributions:

$$\hat{\mu}_N = \frac{1}{N} \sum_{i=1}^{N} \frac{p(x_i)}{q(x_i)}\, f(x_i)$$
where $x_i$ are samples drawn from $q(x)$, and $N$ is the total number of samples.
The ratio $\frac{p(x_i)}{q(x_i)}$ is known as the importance weight and corrects for the fact that the samples are drawn from $q(x)$ rather than $p(x)$. This ratio adjusts the contribution of each sample to reflect how likely it would be under the distribution of interest $p(x)$.
An important aspect to consider in importance sampling is the choice of $q(x)$. Ideally, $q(x)$ should be close to $p(x)$, and its tails should be at least as heavy as those of $p(x)$ wherever $f(x)$ contributes to the expectation. If $q(x)$ is very different from $p(x)$, the importance weights can vary greatly, leading to high variance in the estimate. This is a common challenge in implementing importance sampling effectively.
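As a concrete sketch of the estimator above (a hypothetical setup: target $p = \mathcal{N}(0, 1)$, proposal $q = \mathcal{N}(0, 1.5^2)$, and $f(x) = x^2$, so the true answer is $1$):

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

N = 100_000
x = rng.normal(0.0, 1.5, size=N)                       # samples from q, not p
w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 1.5)  # importance weights p/q
estimate = np.mean(w * x**2)                           # (1/N) * sum_i w_i f(x_i)

print(estimate)  # close to 1.0, the true E_p[x^2]
```

Note that the proposal here is deliberately wider than the target, so the weights stay bounded; a narrower proposal would produce occasional huge weights and a much noisier estimate.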
Application Examples
Off-policy RL: Evaluate a target policy π using data collected by a behavior policy μ. For trajectories $\tau^{(i)}$ with returns $R(\tau^{(i)})$ sampled under μ, the importance-weighted estimate is:

$$\mathbb{E}_{\tau\sim\pi}[R(\tau)] \approx \frac{1}{N}\sum_{i=1}^{N}\left(\prod_{t}\frac{\pi(a_t^{(i)}\mid s_t^{(i)})}{\mu(a_t^{(i)}\mid s_t^{(i)})}\right) R(\tau^{(i)})$$

The environment's transition probabilities cancel in the ratio, so only the policy probabilities are needed.
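To make the correction concrete, here is a minimal hypothetical sketch for the one-step (bandit) case: rewards logged under a behavior policy μ are reweighted to estimate the value of a target policy π. All numbers are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.8, 0.2])            # behavior policy over actions {0, 1}
pi = np.array([0.2, 0.8])            # target policy we want to evaluate
true_rewards = np.array([1.0, 2.0])  # deterministic reward per action

N = 200_000
actions = rng.choice(2, size=N, p=mu)       # data collected under mu
rewards = true_rewards[actions]
weights = pi[actions] / mu[actions]         # importance weights pi(a)/mu(a)
v_pi_estimate = np.mean(weights * rewards)  # unbiased estimate of E_pi[r]

true_v_pi = float(np.dot(pi, true_rewards)) # 0.2*1.0 + 0.8*2.0 = 1.8
print(v_pi_estimate, true_v_pi)
```

Because μ rarely takes the action π prefers, those samples carry a large weight (0.8/0.2 = 4), which is exactly the variance issue discussed above.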
Policy optimization methods:
TRPO - Trust Region Policy Optimization maximizes an importance-sampled surrogate objective under a KL-divergence constraint:

$$\max_{\theta}\; \mathbb{E}_{s,a\sim\pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_\theta(a\mid s)}{\pi_{\theta_{\text{old}}}(a\mid s)}\,\hat{A}(s,a)\right] \quad \text{subject to} \quad \mathbb{E}_{s}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot\mid s)\,\|\,\pi_\theta(\cdot\mid s)\big)\right] \le \delta$$
PPO - Proximal Policy Optimization uses a clipped importance sampling ratio $r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}$:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right]$$
where
- π_θ: The current policy being updated (parameterized by θ)
- π_θ_old: The policy from the previous iteration
- π* (optimal policy): What we're ultimately trying to converge to
The process works iteratively:
- Collect data using π_θ_old
- Use importance sampling to estimate how π_θ would perform on that data
- Update θ to improve the policy
- Repeat: π_θ becomes the new π_θ_old
The importance sampling ratio π_θ(a|s)/π_θ_old(a|s) lets us reuse old data to evaluate the new policy, avoiding the need to collect fresh data at every step (which would be expensive).
So π_θ is our current best guess at the optimal policy, which we keep improving until hopefully π_θ ≈ π*.
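The clipped objective above can be sketched in a few lines of NumPy (the function name and the made-up inputs are illustrative; in practice the log-probabilities come from the policy network and the advantages from an estimator such as GAE):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1-eps, 1+eps) * A), with r = pi_theta / pi_theta_old."""
    ratio = np.exp(logp_new - logp_old)               # importance sampling ratio r_t
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)    # keep r_t inside [1-eps, 1+eps]
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Tiny usage example with made-up probabilities and advantages:
logp_old = np.log(np.array([0.50, 0.50, 0.25]))
logp_new = np.log(np.array([0.80, 0.40, 0.25]))  # ratios: 1.6, 0.8, 1.0
adv = np.array([1.0, -1.0, 2.0])
obj = ppo_clip_objective(logp_new, logp_old, adv)
print(obj)  # ≈ 0.8: the 1.6 ratio is clipped to 1.2, the others pass through
```

The `min` with the clipped term removes the incentive to push the ratio outside $[1-\epsilon,\, 1+\epsilon]$, which is what keeps π_θ close to π_θ_old without TRPO's explicit KL constraint.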