Maximum Likelihood Estimation

The maximum likelihood principle states that the most likely explanation of the data $D$ is given by the parameter vector $\mathbf{w}_{\mathrm{ML}}$ that maximizes the likelihood function.

$$ \mathbf{w}_{\mathrm{ML}} = \underset{\mathbf{w}}{\arg \max }\; p(D \mid \mathbf{w}) $$

Let's assume the data are i.i.d. This means the joint probability factorizes into a product of individual PDFs. (For correlated data, e.g. time series, we cannot make this assumption.)

$$ p(D \mid \mathbf{w})=p\left(x_{1}, x_{2}, \ldots, x_{N} \mid \mathbf{w}\right)=\prod_{i=1}^{N} p\left(x_{i} \mid \mathbf{w}\right) $$

So, maximum likelihood estimation is given as:

$$ \mathbf{w}_{\mathrm{ML}}=\underset{\mathbf{w}}{\arg \max } p(D \mid \mathbf{w})=\underset{\mathbf{w}}{\arg \max } \prod_{i=1}^{N} p\left(x_{i} \mid \mathbf{w}\right) $$

Since the logarithm is a monotonically increasing function of its argument, maximizing the log of a function is equivalent to maximizing the function itself. Using the logarithm also helps prevent numerical underflow, since a product of many small numbers shrinks toward zero very quickly. So,

$$ \mathbf{w}_{\mathrm{ML}}=\underset{\mathbf{w}}{\arg \max } \sum_{i=1}^{N} \log p\left(x_{i} \mid \mathbf{w}\right) $$

We find analytical ML estimates of the parameters by taking the derivative of the log-likelihood with respect to each parameter and setting it to zero.
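To make the recipe concrete, here is a minimal sketch (assuming NumPy and SciPy are available; the synthetic data and the `neg_log_likelihood` helper are illustrative choices, not part of the derivation) that maximizes a Gaussian log-likelihood numerically by minimizing its negative. The result should agree with the analytical estimates derived in the next section.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)  # i.i.d. samples with true mu=2.0, sigma=1.5

def neg_log_likelihood(params, x):
    """Negative Gaussian log-likelihood; sigma is parameterized via log_sigma so it stays positive."""
    mu, log_sigma = params
    sigma2 = np.exp(2 * log_sigma)
    return 0.5 * len(x) * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(data,))
mu_ml, sigma_ml = result.x[0], np.exp(result.x[1])
print(mu_ml, sigma_ml)  # close to the sample mean and the (biased) sample standard deviation
```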

MLE of Gaussian Distribution

Let's assume the dataset consists of i.i.d. Gaussian-distributed real variables. Then the likelihood function is given by the product of the individual PDFs,

$$ p(x \mid \mathbf{w})=\mathcal{N}\left(x \mid \mu, \sigma^{2}\right), \qquad p\left(D \mid \mu, \sigma^{2}\right)=\frac{1}{\left(2 \pi \sigma^{2}\right)^{N / 2}} \prod_{i=1}^{N} \exp \left[-\frac{1}{2 \sigma^{2}}\left(x_{i}-\mu\right)^{2}\right] $$

Now, taking the log of the likelihood,

$$ \begin{align} \log p\left(D \mid \mu, \sigma^{2}\right)&=\log \left(2 \pi \sigma^{2}\right)^{-N / 2}+\sum_{i=1}^{N} \log \exp \left[-\frac{1}{2 \sigma^{2}}\left(x_{i}-\mu\right)^{2}\right] \\ &= -\frac{N}{2} \log \left(2 \pi \sigma^{2}\right)+\sum_{i=1}^{N}-\frac{1}{2 \sigma^{2}}\left(x_{i}-\mu\right)^{2} \end{align} $$

Now, to maximize the likelihood with respect to $\mu$, we take the derivative with respect to $\mu$ and set it to zero,

$$ \begin{align} \frac{\partial}{\partial \mu} \log p\left(D \mid \mu, \sigma^{2}\right) &= 0 \\ \frac{1}{2\sigma^2} \sum_{i=1}^{N}2(x_i - \mu) &= 0 \\ \sum_{i=1}^N(x_i - \mu) &= 0 \\ \sum_{i=1}^N\mu &= \sum_{i=1}^{N}x_i \\ N\mu &= \sum_{i=1}^N x_i \\ \mu &= \frac{1}{N}\sum_{i=1}^N x_i \end{align} $$

which is the sample mean. Therefore, $\mu_{ML}$ is equal to the sample mean.

Now, maximizing the likelihood for $\sigma^2$, we take the derivative with respect to $\sigma^2$,

$$ \frac{\partial}{\partial \sigma^{2}} \log p\left(D \mid \mu, \sigma^{2}\right) = -\frac{N}{2}\cdot\frac{1}{2\pi\sigma^2}\cdot 2\pi + \frac{1}{2\sigma^4}\sum_{i=1}^{N}(x_i - \mu)^2 = 0 $$

Multiplying through by $2\sigma^4$,

$$ \begin{align} -N\sigma^2 + \sum_{i=1}^{N}(x_i - \mu)^2 &= 0 \\ \sigma^2 &= \frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2 \end{align} $$

which is the sample variance, computed using $\mu_{ML}$ in place of the true mean. Therefore, $\sigma^2_{ML}$ is equal to the sample variance.
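As a sanity check on the closed-form results, here is a small sketch (assuming NumPy; the sample size and distribution parameters are arbitrary) that computes $\mu_{ML}$ and $\sigma^2_{ML}$ directly from data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1000)

mu_ml = np.mean(x)                     # mu_ML = (1/N) * sum(x_i)
sigma2_ml = np.mean((x - mu_ml) ** 2)  # sigma^2_ML = (1/N) * sum((x_i - mu_ML)^2)

print(mu_ml, sigma2_ml)
print(np.var(x, ddof=0))  # NumPy's default (ddof=0) variance is exactly the ML estimate
```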

MLE of Binomial Distribution

The likelihood of observing $m$ successes in $n$ trials with success probability $p$ is

$$ f(m \mid n, p)=\left(\begin{array}{c} n \\ m \end{array}\right) p^{m}(1-p)^{n-m} $$

We find the extremum by setting the derivative of the log-likelihood to zero.

$$ \frac{d}{d p} \ln f(m \mid n, p)=\frac{m}{p}-\frac{n-m}{1-p}=0 \Longleftrightarrow m(1-p)=p(n-m) \Longleftrightarrow m=p n \Longleftrightarrow p=\frac{m}{n} $$
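A quick numerical check (assuming NumPy; the counts $n$ and $m$ are made up for illustration) scans a grid of candidate $p$ values and confirms that the log-likelihood peaks at $p = m/n$:

```python
import numpy as np

n, m = 50, 18                           # n trials, m observed successes
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = m * np.log(p_grid) + (n - m) * np.log(1 - p_grid)  # constant binomial coefficient omitted

print(p_grid[np.argmax(log_lik)])  # ~0.36
print(m / n)                       # closed-form MLE: 0.36
```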

Bias in Maximum Likelihood Estimators

Bias is the difference between the expected value of an estimator and the true value of the quantity it estimates. To check the ML estimators for bias, let's do a sanity check.

Let's suppose we draw two datasets $D_1$ and $D_2$ from the same Gaussian distribution. Since $\mu_{ML}$ and $\sigma^2_{ML}$ are functions of the dataset values, the two datasets will generally give different estimates, so the relevant question is what these estimators equal on average over datasets.

Now, for the case of $\mu_{ML}$,

$$ \begin{align} \mathbb{E}_{D \sim p\left(D \mid \mu, \sigma^{2}\right)}\left[\mu_{M L}\right] &=\mathbb{E}\left[\frac{1}{N} \sum_{i=1}^{N} x_{i}\right] \\ &= \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{D \sim p\left(D \mid \mu, \sigma^{2}\right)}[x_{i}] \\ &= \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{x_i \sim p\left(x_i \mid \mu, \sigma^{2}\right)}[x_{i}] \\ &= \frac{1}{N} \sum_{i=1}^{N} \mu \\ &= \mu \end{align} $$

Therefore, the bias of the estimator is $\mathbb{E}[\mu_{ML}] - \mu = 0$: the ML estimate of the mean is unbiased.

In the case of the variance,

$$ \begin{align} \mathbb{E}_{D \sim p\left(D \mid \mu, \sigma^{2}\right)}\left[\sigma_{M L}^{2}\right]&=\mathbb{E}\left[\frac{1}{N} \sum_{i=1}^{N}\left(x_{i}-\frac{1}{N} \sum_{n=1}^{N} x_{n}\right)^{2}\right] \\ &=\frac{1}{N} \sum_{i=1}^{N} \mathbb{E}\left[\left(x_{i}-\frac{1}{N} \sum_{n=1}^{N} x_{n}\right)^{2}\right]\\ &= \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}\left[x_{i}^{2}-\frac{2 x_{i}}{N} \sum_{n=1}^{N} x_{n}+\frac{1}{N^{2}} \sum_{m=1}^{N} \sum_{n=1}^{N} x_{m} x_{n}\right]\\ &=\frac{1}{N} \sum_{i=1}^{N}\left\{\mathbb{E}\left[x_{i}^{2}\right]-\frac{2}{N} \sum_{n=1}^{N} \mathbb{E}\left[x_{i} x_{n}\right]+\frac{1}{N^{2}} \sum_{m=1}^{N} \sum_{n=1}^{N} \mathbb{E}\left[x_{m} x_{n}\right]\right\} \end{align} $$

Now,

$$ \begin{align} \operatorname{cov}[x_i, x_i] = \sigma^2 \;\Rightarrow\; &\mathbb{E}[x_i^2] - \mathbb{E}[x_i]^2 = \sigma^2 \\ & \mathbb{E}[x_i^2] = \mu^2 + \sigma^2\\ \\ \operatorname{cov}[x_i, x_j] = 0 \;\;(i \neq j) \;\Rightarrow\; &\mathbb{E}[x_i x_j] - \mathbb{E}[x_i]\,\mathbb{E}[x_j] = 0 \\ & \mathbb{E}[x_i x_j] = \mu^2 \end{align} $$

Using these values in the expression above,

$$ \begin{align} \mathbb{E}_{D \sim p\left(D \mid \mu, \sigma^{2}\right)}\left[\sigma_{M L}^{2}\right]&= \frac{1}{N} \sum_{i=1}^{N}\left\{\mu^{2}+\sigma^{2}-\frac{2}{N}\left(N \mu^{2}+\sigma^{2}\right)+\frac{1}{N^{2}}\left(N^{2} \mu^{2}+N \sigma^{2}\right)\right\} \\ &= \frac{N-1}{N}\sigma^2 \end{align} $$

Therefore, $\mathbb{E}[\sigma^2_{ML}] = \frac{N-1}{N}\sigma^2$, so the bias of the estimator is $\mathbb{E}[\sigma^2_{ML}] - \sigma^2 = -\frac{\sigma^2}{N}$: the ML estimator systematically underestimates the variance.

This means that the ML estimate of the variance is noticeably biased when $N$ is small, but the bias shrinks as $N \to \infty$. It can also be corrected by multiplying $\sigma^2_{ML}$ by $\frac{N}{N-1}$ (Bessel's correction), which gives an unbiased estimate of the variance.
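The effect is easy to see in simulation. Here is a minimal sketch (assuming NumPy; the sample size $N$, the true variance, and the number of trials are arbitrary choices) that averages $\sigma^2_{ML}$ over many small datasets and applies the $\frac{N}{N-1}$ correction:

```python
import numpy as np

rng = np.random.default_rng(2)
N, true_sigma2, trials = 5, 4.0, 200_000

datasets = rng.normal(loc=0.0, scale=np.sqrt(true_sigma2), size=(trials, N))
mu_ml = datasets.mean(axis=1, keepdims=True)
sigma2_ml = ((datasets - mu_ml) ** 2).mean(axis=1)  # ML (biased) variance of each dataset

print(sigma2_ml.mean())                  # ~ (N-1)/N * true_sigma2 = 3.2
print((N / (N - 1)) * sigma2_ml.mean())  # ~ true_sigma2 = 4.0 after the correction
```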