Expectation Maximization
The EM algorithm is a two-stage iterative optimization technique for finding maximum likelihood solutions for probabilistic models having latent variables.
Consider a probabilistic model in which we collectively denote all of the observed variables by $\mathbf{X}$ and all of the hidden variables by $\mathbf{Z}$. The joint distribution $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$ is governed by a set of parameters denoted $\boldsymbol{\theta}$. Our goal is to maximize the likelihood function given by

$$p(\mathbf{X} \mid \boldsymbol{\theta})=\sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$$
Here, let's assume that the direct optimization of $p(\mathbf{X} \mid \boldsymbol{\theta})$ is difficult, but the optimization using $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$ is significantly easier.
Introducing a distribution $q(\mathbf{Z})$ over the latent variables, from the product rule of probability,

$$\ln p(\mathbf{X} \mid \boldsymbol{\theta})=\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})-\ln p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})$$
Then the following decomposition always holds:

$$\ln p(\mathbf{X} \mid \boldsymbol{\theta})=\mathcal{L}(q, \boldsymbol{\theta})+\mathrm{KL}(q \| p)$$
where we have defined

$$\mathcal{L}(q, \boldsymbol{\theta})=\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \left\{\frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})}\right\}$$

$$\mathrm{KL}(q \| p)=-\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \left\{\frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})}\right\}$$
Notice the difference in sign between the two terms, and that $\mathcal{L}(q, \boldsymbol{\theta})$ contains the joint distribution while the KL divergence contains the conditional. Since $\mathrm{KL}(q \| p) \geqslant 0$, $\mathcal{L}(q, \boldsymbol{\theta})$ is a lower bound on the log likelihood $\ln p(\mathbf{X} \mid \boldsymbol{\theta})$.
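As a sanity check, the decomposition can be verified numerically on a tiny discrete model. Everything below (the prior, the single-observation likelihood values, and the choice of $q$) is an illustrative assumption, not something from the text:

```python
import numpy as np

# Toy model: binary latent Z, one observed x. All numbers are made up.
prior = np.array([0.6, 0.4])        # p(Z = k)
lik = np.array([0.7, 0.2])          # p(X = x_obs | Z = k)

joint = prior * lik                 # p(X = x_obs, Z = k)
marginal = joint.sum()              # p(X = x_obs)
posterior = joint / marginal        # p(Z | X = x_obs)

q = np.array([0.5, 0.5])            # an arbitrary distribution over Z

# L(q, theta) = sum_Z q(Z) ln[ p(X, Z | theta) / q(Z) ]
L = np.sum(q * np.log(joint / q))
# KL(q || p) = -sum_Z q(Z) ln[ p(Z | X, theta) / q(Z) ]
KL = -np.sum(q * np.log(posterior / q))

# The decomposition: ln p(X | theta) = L(q, theta) + KL(q || p)
assert np.isclose(L + KL, np.log(marginal))
```

Since this $q$ differs from the posterior, the KL term is strictly positive and $\mathcal{L}$ sits strictly below the log likelihood.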
In the E step, the lower bound $\mathcal{L}\left(q, \boldsymbol{\theta}^{\text{old}}\right)$ is maximized with respect to $q(\mathbf{Z})$ while holding $\boldsymbol{\theta}^{\text{old}}$ fixed. The solution to this maximization problem is easily seen by noting that the value of $\ln p\left(\mathbf{X} \mid \boldsymbol{\theta}^{\text{old}}\right)$ does not depend on $q(\mathbf{Z})$, so the largest value of the bound occurs when the KL divergence vanishes, i.e. when $q(\mathbf{Z})$ equals the posterior $p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}\right)$.
In the M step, $q(\mathbf{Z})$ is held fixed and the lower bound $\mathcal{L}(q, \boldsymbol{\theta})$ is maximized with respect to $\boldsymbol{\theta}$ to give some new value $\boldsymbol{\theta}^{\text{new}}$. This causes the lower bound to increase, which in turn causes the log likelihood to increase; $q$ will then no longer equal the new posterior, so the KL divergence becomes nonzero again. Substituting $q(\mathbf{Z})=p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}\right)$, we see that after each E step the lower bound takes the form

$$\mathcal{L}(q, \boldsymbol{\theta})=\sum_{\mathbf{Z}} p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}\right) \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})-\sum_{\mathbf{Z}} p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}\right) \ln p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}\right)=\mathcal{Q}\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}\right)+\mathrm{const}$$
where the constant is simply the negative entropy of $q$ and is therefore independent of $\boldsymbol{\theta}$. Thus in the M step, the quantity being maximized, $\mathcal{Q}\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}\right)$, is the expectation of the complete-data log likelihood under the posterior.
For i.i.d. data, $\mathbf{X}$ comprises $N$ data points $\left\{\mathbf{x}_{n}\right\}$ and $\mathbf{Z}$ comprises the corresponding $N$ latent variables $\left\{\mathbf{z}_{n}\right\}$, where $n=1, \ldots, N$. The posterior computed in the E step then factorizes across data points:

$$p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})=\prod_{n=1}^{N} p\left(\mathbf{z}_{n} \mid \mathbf{x}_{n}, \boldsymbol{\theta}\right)$$
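The factorization of the posterior across i.i.d. data points can be checked numerically on a toy model with two observations and binary latents; the prior and per-point likelihood values below are illustrative assumptions:

```python
import numpy as np

# Two i.i.d. observations, each with its own binary latent z_n.
prior = np.array([0.6, 0.4])       # p(z_n = k), shared across n
lik = np.array([[0.7, 0.2],        # p(x_1 | z_1 = k)
                [0.1, 0.5]])       # p(x_2 | z_2 = k)

# Joint posterior over (z_1, z_2): proportional to p(z1)p(x1|z1) p(z2)p(x2|z2)
joint = np.outer(prior * lik[0], prior * lik[1])
joint_post = joint / joint.sum()

# Per-point posteriors p(z_n | x_n)
post1 = prior * lik[0]; post1 /= post1.sum()
post2 = prior * lik[1]; post2 /= post2.sum()

# For i.i.d. data the posterior factorizes across data points
assert np.allclose(joint_post, np.outer(post1, post2))
```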
Thus the algorithm is given as,
- Initialize parameters $\boldsymbol{\theta}$
- Until convergence, repeat:
  - Set $\boldsymbol{\theta}^{\text{old}} = \boldsymbol{\theta}$
  - Expectation step: compute the posterior $p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}\right)$ (also called the responsibility) with the parameters held fixed, and form $\mathcal{Q}\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}\right)=\mathbb{E}_{p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}\right)}\left[\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})\right]$
  - Maximization step: $\boldsymbol{\theta} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}} \, \mathcal{Q}\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}\right)$, i.e. fix the posterior and update the parameters
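As a concrete instance of the loop above, here is a minimal sketch of EM for a two-component one-dimensional Gaussian mixture. The data, initialization, and stopping threshold are illustrative choices, not from the text; the E step computes responsibilities, the M step applies the standard closed-form updates, and the assertion checks that the marginal log likelihood never decreases:

```python
import numpy as np

# Synthetic data: two well-separated Gaussian clusters (illustrative).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])

pi = np.array([0.5, 0.5])      # mixing coefficients
mu = np.array([-1.0, 1.0])     # means (arbitrary initialization)
var = np.array([1.0, 1.0])     # variances

def log_lik(x, pi, mu, var):
    """Marginal log likelihood ln p(X | theta) of the mixture."""
    comp = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.log(comp.sum(axis=1)).sum()

prev = -np.inf
for _ in range(200):
    # E step: responsibilities gamma_nk = p(z_n = k | x_n, theta_old)
    comp = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    gamma = comp / comp.sum(axis=1, keepdims=True)

    # M step: maximize Q(theta, theta_old) in closed form
    Nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / len(x)

    # EM property: the marginal log likelihood never decreases
    cur = log_lik(x, pi, mu, var)
    assert cur >= prev - 1e-9
    if cur - prev < 1e-8:
        break
    prev = cur
```

With this well-separated data the estimated means land near the true cluster centers, though in general the fixed point reached depends on the initialization.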
Properties of EM
From the discussion above, it follows that EM has the following properties:
- The marginal likelihood increases after each EM cycle.
- Since the marginal likelihood is upper-bounded by its value at the global maximum and increases at every step, EM must converge.
However, since we are optimizing a non-convex objective, there is no guarantee of finding the global optimum. In practice EM almost always converges to a local optimum, and that optimum depends heavily on the choice of initialization, which is the algorithm's main downside.
References
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Section 9.4
- https://people.cs.pitt.edu/~milos/courses/cs2750-Spring04/lectures/class22.pdf
- Stanford CS228 notes on latent variable models https://ermongroup.github.io/cs228-notes/learning/latent/