Monte-Carlo Estimation

Motivation

We are often interested in computing quantities (statistics) of random variables. For example,

  • the average response to a drug
  • probability of a particular sum when throwing two dice
  • the average reconstruction in a VAE given an input.

These statistics are often intractable to compute:

  • Cannot derive a perfect drug response model (too complex)
  • Cannot enumerate all possible dice combinations (too lazy)
  • Computationally infeasible (intractable integrals)

Monte Carlo Integration

Use random sampling instead of analytical computation. A single random sample might not be enough, but many random samples can give us a reliable estimate. E.g., by throwing dice many times we can build a histogram of probabilities for each possible sum. If we throw the dice once, the histogram will be very wrong (just a single bar), but if we repeat hundreds of times and average, we get very close (law of large numbers).
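The dice example can be sketched in a few lines of NumPy (the function name and seed are illustrative choices, not part of any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

def dice_sum_histogram(n_throws):
    """Estimate P(sum) for two dice from n_throws random throws."""
    sums = rng.integers(1, 7, size=(n_throws, 2)).sum(axis=1)  # sums in 2..12
    counts = np.bincount(sums, minlength=13)[2:]               # counts for sums 2..12
    return counts / n_throws                                   # empirical probabilities

few = dice_sum_histogram(1)         # degenerate histogram: a single bar
many = dice_sum_histogram(100_000)  # close to the exact distribution
print(many[7 - 2])                  # empirical P(sum = 7), near 6/36 ≈ 0.1667
```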

More formally, in MC integration we treat the input $x$ as a random variable with pdf $p(x)$. Our desired statistic $y$ is obtained by integrating a function $f$ over all possible values of $x$:

$$ y=\int_{x} f(x) p(x) d x $$

This integral is equivalent to an expectation

$$ y = \int_{x} f(x) p(x) d x = \mathbb{E}_{x \sim p(x)}[f(x)] $$

We can approximate this expectation by random sampling and summation

$$ y=\mathbb{E}_{x \sim p(x)}[f(x)] \approx \frac{1}{n} \sum_{i} f\left(x_{i}\right)=\hat{y} $$
where $x_{i}$ is sampled from $p(x)$.

Note that $\hat{y}$ is called an estimator because it only approximates the true value of $y$.

If we want to compute a quantity $y$ that we can express as an integral of a function $f$ over a probability space $x$ with a known and easy-to-sample pdf $p(x)$, we can replace the exact but intractable computation with a tractable MC estimator:

$$ y=\mathbb{E}_{x \sim p(x)}[f(x)] \approx \frac{1}{n} \sum_{i} f\left(x_{i}\right), x_{i} \sim p(x) $$
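As a minimal sketch of this recipe (the choice of $f$ and of $p(x)$ here are illustrative, picked so the true answer is known): estimating $\mathbb{E}[x^2]$ for $x \sim \mathcal{N}(0, 1)$, whose exact value is 1.

```python
import numpy as np

rng = np.random.default_rng(42)

def mc_estimate(f, sampler, n):
    """Generic MC estimator: y_hat = (1/n) * sum_i f(x_i), with x_i ~ p(x)."""
    return f(sampler(n)).mean()

# Known test case: E[x^2] = Var[x] = 1 for x ~ N(0, 1).
y_hat = mc_estimate(lambda x: x**2, rng.standard_normal, n=100_000)
print(y_hat)  # ≈ 1.0
```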

If we can't translate the quantity into such an integral, we can't estimate it with MC. For instance, we cannot apply MC to the following, because neither $\log p(\boldsymbol{x} \mid \mathbf{z})$ nor $\nabla_{\varphi} q_{\varphi}(\mathbf{z} \mid \boldsymbol{x})$ is a probability density:

$$ \nabla_{\varphi} \mathbb{E}_{\mathbf{z} \sim q_{\varphi}(\mathbf{z} \mid x)}[\log p(\boldsymbol{x} \mid \mathbf{z})]=\int_{Z} \log p(\boldsymbol{x} \mid \mathbf{z}) \nabla_{\varphi} q_{\varphi}(\mathbf{z} \mid \boldsymbol{x}) d \mathbf{z} $$

Estimator mean and variance

Our estimator is itself a random variable: it has its own mean $\mu_{\hat{y}}=\mathbb{E}[\hat{y}]$ and variance $\operatorname{Var}[\hat{y}]=\mathbb{E}\left[\left(\hat{y}-\mu_{\hat{y}}\right)^{2}\right]$.

The higher the variance, the more the estimation fluctuates after every new experiment.

Estimator bias

An estimator is unbiased if its expected value equals the true quantity. For example, the sample mean $\bar{x}$ is an unbiased estimator of the population mean $\mu$ because $\mathbb{E}[\bar{x}] = \mu$. In general,

$$ \mathbb{E}[\hat{y}]=y $$

Otherwise, the estimator is biased, with

$$ \text { bias }=\mathbb{E}[\hat{y}]-y $$

It is better to have unbiased estimators, but in some cases a bit of bias is acceptable: we trade some accuracy for tractability.
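A classic example of a (slightly) biased estimator is the sample variance that divides by $n$ instead of $n-1$. A quick empirical check, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(7)

# The 1/n sample variance underestimates sigma^2:
# E[var_hat] = (n - 1)/n * sigma^2, i.e. bias = -sigma^2 / n.
n = 5
samples = rng.standard_normal((200_000, n))        # many datasets of size n, sigma^2 = 1
var_biased = samples.var(axis=1, ddof=0).mean()    # divides by n
var_unbiased = samples.var(axis=1, ddof=1).mean()  # divides by n - 1
print(var_biased)    # ≈ (n - 1)/n = 0.8
print(var_unbiased)  # ≈ 1.0
```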

MC estimators are unbiased due to linearity of expectation.

For estimating $\mathbb{E}[f(X)]$ using $\hat{\theta} = \frac{1}{n}\sum_{i=1}^n f(X_i)$:

$$ \mathbb{E}[\hat{\theta}] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n f(X_i)\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[f(X_i)] = \mathbb{E}[f(X)] $$

The expectation operator "pushes through" the summation and averaging operations, giving us the true value. This holds for any sample size $n$, making the estimator unbiased regardless of how many samples we use.
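This can be verified empirically: even an MC estimator with only two samples per estimate is unbiased, since averaging many independent copies of it recovers the true value (here $\mathbb{E}[x^2] = 1$ for $x \sim \mathcal{N}(0,1)$; a sketch assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)

# 500,000 independent MC estimators of E[x^2], each using only n = 2 samples.
estimates = (rng.standard_normal((500_000, 2)) ** 2).mean(axis=1)
print(estimates.mean())  # ≈ 1.0: zero bias even at n = 2
print(estimates.std())   # large: each individual estimate is very noisy
```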

This is why MC methods are so powerful: you get this theoretical guarantee essentially for free, regardless of how complex your function $f(X)$ is or what distribution you're sampling from (though of course it can take a LOT of samples for good accuracy).

Note: The Law of Large Numbers ensures the estimator is also consistent (converges to the true value as $n \to \infty$), but unbiasedness is a separate property.

Standard error of MC estimator

The MC estimator is a sample mean

$$ \mathbb{E}_{x \sim p(x)}[f(x)] \approx \frac{1}{n} \sum_{i} f\left(x_{i}\right) $$

The standard error of a sample mean is

$$ \sigma_{\hat{f}}=\frac{\sigma}{\sqrt{n}} $$

The more samples we take, the less the estimator deviates. But the deviation shrinks only as $\sqrt{n}$: with $4\times$ more samples we only halve the error.
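The $\sqrt{n}$ scaling is easy to check by measuring the spread of the estimator across repeated trials (a sketch assuming NumPy, again using $\mathbb{E}[x^2] = 1$ for $x \sim \mathcal{N}(0,1)$):

```python
import numpy as np

rng = np.random.default_rng(3)

def estimator_std(n, n_trials=2_000):
    """Empirical std of the n-sample MC estimator of E[x^2], x ~ N(0, 1)."""
    estimates = (rng.standard_normal((n_trials, n)) ** 2).mean(axis=1)
    return estimates.std()

# Quadrupling the sample size roughly halves the standard error.
s1, s4 = estimator_std(1_000), estimator_std(4_000)
print(s1 / s4)  # ≈ 2
```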