Importance Sampling
Importance sampling is a Monte Carlo technique for estimating expectations under one probability distribution using only samples drawn from a different distribution.
Suppose you want to estimate the expected value of a function $f(x)$ under a probability distribution $p(x)$, but you only have samples from a different distribution $q(x)$. Importance sampling can be used to provide an unbiased estimate of this expected value.
Mathematically, the expected value of $f(x)$ under $p(x)$ is given by:

$$\mathbb{E}_{p}[f(x)] = \int f(x)\,p(x)\,dx$$
However, you only have samples from $q(x)$. The key identity is that the expectation under $p$ can be rewritten as an expectation under $q$:

$$\mathbb{E}_{p}[f(x)] = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx = \mathbb{E}_{q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right]$$

The importance sampling estimator is therefore a weighted average of the function values at the sampled points, with weights given by the ratio of the densities under the two distributions:

$$\hat{\mu}_N = \frac{1}{N} \sum_{i=1}^{N} \frac{p(x_i)}{q(x_i)}\, f(x_i)$$
where $x_i$ are samples drawn from $q(x)$, and $N$ is the total number of samples.
The ratio $\frac{p(x_i)}{q(x_i)}$ is known as the importance weight and corrects for the fact that the samples are drawn from $q(x)$ rather than $p(x)$. This ratio adjusts the contribution of each sample to reflect how likely it would be under the distribution of interest $p(x)$.
An important aspect to consider in importance sampling is the choice of $q(x)$. Ideally, $q(x)$ should be close to $p(x)$, and its tails should be at least as heavy as those of $p(x)$ wherever $f(x)$ contributes to the expectation. If $q(x)$ is very different from $p(x)$, the importance weights can vary greatly, leading to high variance in the estimate. This is a common challenge in implementing importance sampling effectively.
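As a concrete sketch of the estimator above (a hypothetical setup: target $p = \mathcal{N}(0, 1)$, proposal $q = \mathcal{N}(0, 1.5^2)$, and $f(x) = x^2$, so the true answer is $1$):

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

N = 100_000
x = rng.normal(0.0, 1.5, size=N)                       # samples from q, not p
w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 1.5)  # importance weights p/q
estimate = np.mean(w * x**2)                           # (1/N) * sum_i w_i f(x_i)

print(estimate)  # close to 1.0, the true E_p[x^2]
```

Note that the proposal here is deliberately wider than the target, so the weights stay bounded; a narrower proposal would produce occasional huge weights and a much noisier estimate.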
Application Examples
Off-policy RL: Evaluate a target policy π using data collected by a behavior policy μ. For trajectories $\tau^{(i)}$ with returns $R(\tau^{(i)})$ sampled under μ, the importance-weighted estimate is:

$$\mathbb{E}_{\tau\sim\pi}[R(\tau)] \approx \frac{1}{N}\sum_{i=1}^{N}\left(\prod_{t}\frac{\pi(a_t^{(i)}\mid s_t^{(i)})}{\mu(a_t^{(i)}\mid s_t^{(i)})}\right) R(\tau^{(i)})$$

The environment's transition probabilities cancel in the ratio, so only the policy probabilities are needed.
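To make the correction concrete, here is a minimal hypothetical sketch for the one-step (bandit) case: rewards logged under a behavior policy μ are reweighted to estimate the value of a target policy π. All numbers are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.8, 0.2])            # behavior policy over actions {0, 1}
pi = np.array([0.2, 0.8])            # target policy we want to evaluate
true_rewards = np.array([1.0, 2.0])  # deterministic reward per action

N = 200_000
actions = rng.choice(2, size=N, p=mu)       # data collected under mu
rewards = true_rewards[actions]
weights = pi[actions] / mu[actions]         # importance weights pi(a)/mu(a)
v_pi_estimate = np.mean(weights * rewards)  # unbiased estimate of E_pi[r]

true_v_pi = float(np.dot(pi, true_rewards)) # 0.2*1.0 + 0.8*2.0 = 1.8
print(v_pi_estimate, true_v_pi)
```

Because μ rarely takes the action π prefers, those samples carry a large weight (0.8/0.2 = 4), which is exactly the variance issue discussed above.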
Policy optimization methods:
TRPO - Trust Region Policy Optimization maximizes an importance-sampled surrogate objective under a KL-divergence constraint:

$$\max_{\theta}\; \mathbb{E}_{s,a\sim\pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_\theta(a\mid s)}{\pi_{\theta_{\text{old}}}(a\mid s)}\,\hat{A}(s,a)\right] \quad \text{subject to} \quad \mathbb{E}_{s}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot\mid s)\,\|\,\pi_\theta(\cdot\mid s)\big)\right] \le \delta$$
PPO - Proximal Policy Optimization uses a clipped importance sampling ratio $r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}$:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right]$$
where
- π_θ: The current policy being updated (parameterized by θ)
- π_θ_old: The policy from the previous iteration
- π* (optimal policy): What we're ultimately trying to converge to
The process works iteratively:
- Collect data using π_θ_old
- Use importance sampling to estimate how π_θ would perform on that data
- Update θ to improve the policy
- Repeat: π_θ becomes the new π_θ_old
The importance sampling ratio π_θ(a|s)/π_θ_old(a|s) lets us reuse old data to evaluate the new policy, avoiding the need to collect fresh data at every step (which would be expensive).
So π_θ is our current best guess at the optimal policy, which we keep improving until hopefully π_θ ≈ π*.
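The clipped objective above can be sketched in a few lines of NumPy (the function name and the made-up inputs are illustrative; in practice the log-probabilities come from the policy network and the advantages from an estimator such as GAE):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1-eps, 1+eps) * A), with r = pi_theta / pi_theta_old."""
    ratio = np.exp(logp_new - logp_old)               # importance sampling ratio r_t
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)    # keep r_t inside [1-eps, 1+eps]
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Tiny usage example with made-up probabilities and advantages:
logp_old = np.log(np.array([0.50, 0.50, 0.25]))
logp_new = np.log(np.array([0.80, 0.40, 0.25]))  # ratios: 1.6, 0.8, 1.0
adv = np.array([1.0, -1.0, 2.0])
obj = ppo_clip_objective(logp_new, logp_old, adv)
print(obj)  # ≈ 0.8: the 1.6 ratio is clipped to 1.2, the others pass through
```

The `min` with the clipped term removes the incentive to push the ratio outside $[1-\epsilon,\, 1+\epsilon]$, which is what keeps π_θ close to π_θ_old without TRPO's explicit KL constraint.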