Expectation Maximization

The EM algorithm is a two-stage iterative optimization technique for finding maximum likelihood solutions for probabilistic models having latent variables.

Consider a probabilistic model in which we collectively denote all of the observed variables by $\mathbf{X}$ and all of the hidden variables by $\mathbf{Z}$. The joint distribution $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$ is governed by a set of parameters denoted $\boldsymbol{\theta} .$ Our goal is to maximize the likelihood function that is given by

$$ p(\mathbf{X} \mid \boldsymbol{\theta})=\sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) $$

Here, let's assume that the direct optimization of $p(\mathbf{X} \mid \boldsymbol{\theta})$ is difficult, but the optimization using $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$ is significantly easier.

Introduce a distribution $q(\mathbf{Z})$ over the latent variables. From the product rule of probability,

$$ \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})=\ln p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})+\ln p(\mathbf{X} \mid \boldsymbol{\theta}) $$

it follows that the decomposition below always holds:

$$ \ln p(\mathbf{X} \mid \boldsymbol{\theta})=\mathcal{L}(q, \boldsymbol{\theta})+\mathrm{KL}(q \| p) $$

where we have defined
$$ \begin{aligned} \mathcal{L}(q, \boldsymbol{\theta}) &=\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \left\{\frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})}\right\} \\ \mathrm{KL}(q \| p) &=-\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \left\{\frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})}\right\} \end{aligned} $$

Notice the difference in sign between the two terms, and that $\mathcal{L}$ contains the joint distribution while the KL term contains the posterior. Since $\mathrm{KL}(q \| p) \geqslant 0$, $\mathcal{L}(q, \boldsymbol{\theta})$ is a lower bound on the log likelihood $\ln p(\mathbf{X} \mid \boldsymbol{\theta})$.
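The decomposition can be checked numerically. The sketch below uses a toy discrete model with a single binary latent variable and arbitrary (hypothetical) numbers for the joint $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$, and verifies that $\ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}(q \| p)$ for any choice of $q$:

```python
import math

# Toy discrete model: one binary latent variable Z, observed X held fixed.
# The joint p(X, Z=k | theta) is given by arbitrary illustrative numbers.
joint = [0.12, 0.28]                      # p(X, Z=k | theta), k = 0, 1
p_x = sum(joint)                          # marginal p(X | theta)
posterior = [j / p_x for j in joint]      # p(Z=k | X, theta)

q = [0.7, 0.3]                            # any distribution over Z

# Lower bound L(q, theta) and KL(q || p), exactly as defined above
L = sum(qk * math.log(jk / qk) for qk, jk in zip(q, joint))
KL = -sum(qk * math.log(pk / qk) for qk, pk in zip(q, posterior))

print(math.log(p_x), L + KL)   # the two sides of the decomposition agree
print(KL >= 0)                 # and the KL term is non-negative
```

Changing `q` changes how the log likelihood splits between the two terms, but their sum stays fixed, which is the key fact behind both EM steps.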

In the E step, the lower bound $\mathcal{L}\left(q, \boldsymbol{\theta}^{\text {old }}\right)$ is maximized with respect to $q(\mathbf{Z})$ while holding $\boldsymbol{\theta}^{\text {old }}$ fixed. The solution to this maximization problem is easily seen by noting that the value of $\ln p\left(\mathbf{X} \mid \boldsymbol{\theta}^{\mathrm{old}}\right)$ does not depend on $q(\mathbf{Z})$ and so the largest value occurs when the KL divergence vanishes i.e. when $q(\mathbf{Z})$ is equal to the posterior $p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\mathrm{old}}\right)$.

In the M step, $q(\mathbf{Z})$ is held fixed and the lower bound $\mathcal{L}(q, \boldsymbol{\theta})$ is maximized with respect to $\boldsymbol{\theta}$ to give some new value $\boldsymbol{\theta}^{\text {new }}$. This causes the lower bound to increase (unless it is already at a maximum), which in turn causes the log likelihood to increase; $q$ will then no longer equal the new posterior, so the KL divergence becomes nonzero again. Substituting $q(\mathbf{Z})=p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text {old }}\right)$, we see that after each E step the lower bound takes the form

$$ \begin{aligned} \mathcal{L}(q, \boldsymbol{\theta}) &=\sum_{\mathbf{Z}} p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text {old }}\right) \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})-\sum_{\mathbf{Z}} p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text {old }}\right) \ln p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text {old }}\right) \\ &=\mathcal{Q}\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text {old }}\right)+\text { const } \end{aligned} $$

where the constant is simply the negative entropy of q and is therefore independent of $\theta$. Thus in the M step, the quantity that is being maximized is the expectation of the complete-data log likelihood.

For i.i.d. data, $\mathbf{X}$ comprises $N$ data points $\left\{\mathbf{x}_{n}\right\}$ and $\mathbf{Z}$ comprises $N$ corresponding latent variables $\left\{\mathbf{z}_{n}\right\},$ where $n=1, \ldots, N$. The posterior in the E step therefore takes the form

$$ p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})=\frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{\sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}=\frac{\prod_{n=1}^{N} p\left(\mathbf{x}_{n}, \mathbf{z}_{n} \mid \boldsymbol{\theta}\right)}{\sum_{\mathbf{Z}} \prod_{n=1}^{N} p\left(\mathbf{x}_{n}, \mathbf{z}_{n} \mid \boldsymbol{\theta}\right)}=\prod_{n=1}^{N} p\left(\mathbf{z}_{n} \mid \mathbf{x}_{n}, \boldsymbol{\theta}\right) $$
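This factorization can be verified by brute force on a tiny example. The sketch below uses a hypothetical two-component 1D Gaussian mixture with $N=2$ points, enumerates all joint assignments $\mathbf{Z} = (z_1, z_2)$, and compares the full posterior to the product of per-point posteriors:

```python
import math
from itertools import product

# Hypothetical two-component 1D Gaussian mixture (unit variance)
pi = [0.4, 0.6]
mu = [-1.0, 2.0]

def normal_pdf(x, m):
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)

def joint(x, k):                           # p(x_n, z_n = k | theta)
    return pi[k] * normal_pdf(x, mu[k])

X = [0.5, -0.3]                            # N = 2 i.i.d. observations

# Full posterior p(Z | X, theta) by enumerating all Z = (z_1, z_2)
num = {Z: math.prod(joint(x, k) for x, k in zip(X, Z))
       for Z in product([0, 1], repeat=len(X))}
total = sum(num.values())
full_post = {Z: v / total for Z, v in num.items()}

def point_post(x, k):                      # per-point posterior p(z_n = k | x_n, theta)
    return joint(x, k) / (joint(x, 0) + joint(x, 1))

# The full posterior equals the product of the per-point posteriors
for Z, p in full_post.items():
    print(Z, round(p, 6), round(math.prod(point_post(x, k) for x, k in zip(X, Z)), 6))
```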

Thus the algorithm is given as,

Initialize parameters $\boldsymbol{\theta}$
Until convergence, repeat
Set $\boldsymbol{\theta^{old}}$ = $\boldsymbol{\theta}$

  1. Expectation step: $\mathcal{Q}\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text {old }}\right) = \mathbb{E}_{p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text {old }}\right)}\left[\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})\right]$, i.e. fix the parameters and compute the posterior $p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text {old }}\right)$ (also called the responsibility)
  2. Maximization step: $\boldsymbol{\theta} = \underset{\boldsymbol{\theta}}{\text{argmax}}\ \mathcal{Q}\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text {old }}\right)$, i.e. fix the posterior and update the parameters
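As a concrete sketch of these steps, the code below runs EM on a two-component 1D Gaussian mixture. To keep it short, the variance is fixed to 1 so that $\boldsymbol{\theta}$ consists only of the means and mixing weights, and the data are hypothetical:

```python
import math

def em_gmm_1d(X, n_iter=50):
    """EM for a two-component 1D Gaussian mixture with unit variance,
    so theta = (mu, pi). A minimal sketch, not a general GMM."""
    mu = [min(X), max(X)]                  # initialize parameters theta
    pi = [0.5, 0.5]
    log_liks = []
    for _ in range(n_iter):
        # E step: responsibilities gamma[n][k] = p(z_n = k | x_n, theta_old)
        gamma = []
        for x in X:
            w = [pi[k] * math.exp(-0.5 * (x - mu[k]) ** 2) for k in (0, 1)]
            gamma.append([wk / sum(w) for wk in w])
        # M step: closed-form argmax of Q(theta, theta_old)
        Nk = [sum(g[k] for g in gamma) for k in (0, 1)]
        mu = [sum(g[k] * x for g, x in zip(gamma, X)) / Nk[k] for k in (0, 1)]
        pi = [Nk[k] / len(X) for k in (0, 1)]
        # track the marginal log likelihood ln p(X | theta) after this cycle
        ll = sum(math.log(sum(pi[k] / math.sqrt(2 * math.pi)
                              * math.exp(-0.5 * (x - mu[k]) ** 2)
                              for k in (0, 1))) for x in X)
        log_liks.append(ll)
    return mu, pi, log_liks

X = [-2.1, -1.9, -2.0, 3.0, 3.2, 2.8]      # two well-separated clusters
mu, pi, lls = em_gmm_1d(X)
print([round(m, 2) for m in mu], [round(p, 2) for p in pi])
```

The recovered means land near the two cluster centres, and the tracked log likelihoods are non-decreasing across cycles, which illustrates the convergence property discussed next.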

Properties of EM

From our above discussion, it follows that EM has the following
properties:

  • The marginal likelihood increases (or stays the same) after each EM cycle.
  • Since the marginal likelihood is bounded above by its global maximum and is non-decreasing, the sequence of likelihood values produced by EM must converge.

However, since we are optimizing a non-convex objective, there is no guarantee of finding the global optimum. In practice, EM almost always converges to a local optimum, and that optimum depends heavily on the choice of initialization, which is the algorithm's main downside.

References

  1. Bishop, *Pattern Recognition and Machine Learning*, 2006, Section 9.4
  2. https://people.cs.pitt.edu/~milos/courses/cs2750-Spring04/lectures/class22.pdf
  3. Stanford CS228 notes on latent variable models: https://ermongroup.github.io/cs228-notes/learning/latent/