# The EM algorithm

### Jensen's inequality

Let $f$ be a concave function and let $X$ be a random variable. Then:

$$ E[f(X)] \leq f(E[X]) $$

### EM

Suppose we have $N$ observations, and we wish to estimate the parameters $\theta$ of a model that maximize the following log likelihood:

$$\begin{align} l(\theta) &= \sum_{i=1}^{N} \log p(x_i;\theta) \\ &= \sum_{i=1}^{N} \log \sum_{z_i} p(x_i, z_i;\theta) \end{align}$$

Since estimating $p(x;\theta)$ directly might be hard, we introduce a latent variable $Z$. The assumption here is that $Z$ is an underlying random variable that influences our observations. Therefore, when $Z$ is observed, the maximum likelihood calculation becomes easier (i.e., our observations become "complete"). But since $Z$ is unobserved, we cannot directly calculate the complete-data log likelihood $p(X,Z;\theta)$. Instead, we iteratively construct a lower bound by taking its expected value for a fixed set of $\theta$ (E-step). Once we do this, we can find a new set of $\theta$ that maximizes that lower bound (M-step).

Consider some auxiliary distribution $q$ over $Z$ ($\sum_{z_i} q(z_i) = 1$ and $q(z_i) \geq 0$). Using Jensen's inequality we can obtain the lower bound as follows:

$$\begin{aligned} \sum_{i} \log p(x_i; \theta) &= \sum_i \log \sum_{z_i} p(x_i, z_i;\theta) \\ &= \sum_i \log \sum_{z_i} q(z_i) \frac{p(x_i, z_i;\theta)}{q(z_i)} \\ &\geq \sum_i \sum_{z_i} q(z_i) \log \frac{p(x_i, z_i;\theta)}{q(z_i)} \\ &= \sum_i E_{z_i \sim q} \log \frac{p(x_i, z_i;\theta)}{q(z_i)} \\ &= \mathcal{L}(q;\theta) \end{aligned}$$

That is, we can lower-bound the marginal log likelihood by the expectation, with respect to the auxiliary distribution $q$, of the complete-data log likelihood $\log p(X,Z;\theta)$ (plus the entropy of $q$).

We want to choose $q$ so that the lower bound is tight for a given $\theta$. To do this, notice that Jensen's inequality holds with equality when the quantity inside the expectation is constant, i.e., $\frac{p(x_i, z_i;\theta)}{q(z_i)} = c$ for some constant $c$, which means $q(z_i) \propto p(x_i, z_i; \theta)$. Since $q$ must normalize to 1, we obtain:

$$\begin{aligned} q(z_i) &= \frac{p(x_i, z_i; \theta)}{\sum_{z_i} p(x_i, z_i;\theta)} \\ &= \frac{p(x_i, z_i; \theta)}{p(x_i;\theta)} \\ &= p(z_i|x_i;\theta) \end{aligned}$$

That is, setting $q$ to be the posterior distribution of $Z$ given $X$, we obtain a lower bound $\mathcal{L}(q;\theta)$ that is equal to the incomplete-data log likelihood $\log p(x;\theta)$ at the given value of $\theta$. This is the E-step of EM.

For the M-step, we fix $q$ from the previous E-step and perform maximum likelihood estimation to obtain a new set of values for $\theta$:

$$ \theta^{new} = \operatorname{argmax}_{\theta} \mathcal{L}(q;\theta). $$

### Relation to KL divergence

Let's consider, for a single observation $x_i$, the difference between the true marginal log likelihood and our lower bound:

$$\begin{align} \log p(x_i;\theta) - \mathcal{L}(q;\theta) &= \log p(x_i;\theta) - E_{z_i \sim q} \log \frac{p(x_i, z_i;\theta)}{q(z_i)} \\ &= - E_{q(z_i)} \log \frac{p(x_i, z_i;\theta)}{p(x_i;\theta)\, q(z_i)} \\ &= - E_{q(z_i)} \log \frac{p(z_i|x_i;\theta)}{q(z_i)} \\ &= E_{q(z_i)} \log \frac{q(z_i)}{p(z_i \vert x_i;\theta)} \\ &= KL(q(z_i) \,\|\, p(z_i|x_i;\theta)) \end{align}$$

We see that the lower bound equals the marginal log likelihood when the KL divergence between $q(z_i)$ and the true posterior $p(z_i|x_i;\theta)$ is 0. As we showed previously, the KL divergence is zero when we set $q(z_i)=p(z_i|x_i;\theta)$. This is exactly what we do at each E-step.
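To make the E- and M-steps concrete, here is a minimal sketch of EM for a one-dimensional Gaussian mixture (the mixture model itself and names such as `em_gmm_1d` are illustrative choices, not taken from the text): the E-step sets $q(z_i)$ to the posterior responsibilities $p(z_i|x_i;\theta)$, and the M-step re-estimates the mixture weights, means, and variances by weighted maximum likelihood.

```python
import numpy as np

def em_gmm_1d(x, K=2, n_iter=50, seed=0):
    """EM for a 1-D Gaussian mixture: a concrete instance of the generic
    E-step / M-step derivation above. Here z_i is the component of x_i."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    N = len(x)

    # Initialise theta = (mixture weights, means, variances).
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, K, replace=False)
    var = np.full(K, np.var(x))

    for _ in range(n_iter):
        # E-step: set q(z_i = k) to the posterior p(z_i = k | x_i; theta),
        # which makes the lower bound L(q; theta) tight at the current theta.
        log_prob = (
            np.log(pi)
            - 0.5 * np.log(2 * np.pi * var)
            - 0.5 * (x[:, None] - mu) ** 2 / var
        )                                   # (N, K): log pi_k + log N(x_i | mu_k, var_k)
        log_prob -= log_prob.max(axis=1, keepdims=True)   # numerical stability
        q = np.exp(log_prob)
        q /= q.sum(axis=1, keepdims=True)   # responsibilities q(z_i)

        # M-step: maximise L(q; theta) over theta with q fixed
        # (weighted maximum likelihood estimates).
        Nk = q.sum(axis=0) + 1e-12          # small constant guards empty components
        pi = Nk / N
        mu = (q * x[:, None]).sum(axis=0) / Nk
        var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

    return pi, mu, var
```

For example, running `em_gmm_1d` on data drawn from two Gaussians centred at -2 and 3 should typically recover means close to those values, though EM only converges to a local maximum and the result depends on initialisation.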
However, when we update our parameters $\theta$ in the next M-step - assuming we have not yet reached a local maximum - the KL term becomes non-zero for the new estimate $\theta^{new}$, as shown below (Bishop, p. 452).

![](https://i.imgur.com/KWhlOAT.png)

We then repeat the E-step and update $q(z_i)$ according to $p(z_i|x_i;\theta^{new})$. The EM algorithm keeps alternating between constructing the lower bound and maximizing it until we converge to a (local) maximum of the marginal likelihood.

![](https://i.imgur.com/iUErzQ6.png)

## Guaranteed improvement under exact E-step

If the E-step is exact, meaning we can set $q(z_i)$ to be exactly the posterior $p(z_i\vert x_i;\theta_{old})$, then each full iteration of the EM algorithm is guaranteed not to decrease the likelihood. To see this, note that if $q(z_i)=p(z_i\vert x_i;\theta_{old})$, the lower bound is tight at $\theta_{old}$, so $\mathcal{L}(q; \theta_{old}) = \sum_i \log p(x_i;\theta_{old})$. The M-step then maximizes the bound over $\theta$, and since $\mathcal{L}(q;\theta)$ remains a lower bound on the marginal log likelihood for any $\theta$, we get

$$ \sum_i \log p(x_i;\theta_{new}) \geq \mathcal{L}(q; \theta_{new}) \geq \mathcal{L}(q; \theta_{old}) = \sum_i \log p(x_i;\theta_{old}). $$

## Example: k-means algorithm
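As a hedged sketch of the example the heading points to: k-means (Lloyd's algorithm) can be viewed as EM with hard assignments, where the E-step is replaced by assigning each point to its nearest centroid (a degenerate, one-hot $q(z_i)$) and the M-step re-estimates each centroid as the mean of its assigned points. The function name `kmeans` and the default parameters below are illustrative, not from the text.

```python
import numpy as np

def kmeans(x, K=3, n_iter=20, seed=0):
    """Lloyd's algorithm, read as EM with hard (one-hot) assignments."""
    x = np.asarray(x, dtype=float)          # shape (N, D)
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), K, replace=False)]

    for _ in range(n_iter):
        # "E-step" (hard): assign each point to its nearest centroid.
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1)
        z = dists.argmin(axis=1)            # shape (N,)

        # "M-step": each centroid becomes the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        centroids = np.array([
            x[z == k].mean(axis=0) if np.any(z == k) else centroids[k]
            for k in range(K)
        ])

    return centroids, z
```

For instance, `kmeans(np.random.default_rng(0).normal(size=(300, 2)), K=3)` returns the fitted centroids and the hard assignment of each point; like EM, the result depends on initialisation and only a local optimum is guaranteed.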