# The EM algorithm

### Jensen's inequality

Let $f$ be a concave function and let $X$ be a random variable. Then:

$$ E[f(X)] \leq f(E[X]) $$

### EM

Suppose we have $N$ observations, and we wish to estimate the parameters of a model by maximizing the following log likelihood:

$$\begin{align} \ell(\theta) &= \sum_{i=1}^{N} \log p(x_i;\theta) \\ &= \sum_{i=1}^{N} \log \sum_{z_i} p(x_i, z_i;\theta) \end{align}$$

Since maximizing $p(x;\theta)$ directly might be hard, we introduce a latent variable $Z$. The assumption here is that $Z$ is an underlying random variable that influences our observations, so if $Z$ were observed, maximum likelihood estimation would become easy (i.e., our observations would be "complete"). But since $Z$ is unobserved, we instead iterate between computing its posterior distribution, which gives a tight lower bound on the log likelihood (the E-step), and maximizing that lower bound over $\theta$ (the M-step).

Consider some auxiliary distribution $q$ over $Z$ (with $\sum_{z_i} q(z_i) = 1$ and $q(z_i) \geq 0$). Since $\log$ is concave and $\sum_{z_i} q(z_i) f(z_i) = E_q[f(z_i)]$, Jensen's inequality gives the following lower bound:

$$\begin{equation}\begin{aligned} \sum_{i} \log p(x_i; \theta) &= \sum_i \log \sum_{z_i} p(x_i, z_i;\theta) \\ &= \sum_i \log \sum_{z_i} q(z_i) \frac{p(x_i, z_i;\theta)}{q(z_i)} \\ &\geq \sum_i \sum_{z_i} q(z_i) \log \frac{p(x_i, z_i;\theta)}{q(z_i)} \\ &= \mathcal{L}(x;\theta) \end{aligned}\end{equation}$$

We want to choose $q$ so that the lower bound is tight at the current value of $\theta$. To do this, notice that Jensen's inequality holds with equality when the expectation is taken over a constant value $c$, i.e., when $\frac{p(x_i, z_i;\theta)}{q(z_i)} = c$ for all $z_i$, which makes $q(z_i) \propto p(x_i, z_i; \theta)$. Normalizing $q$, we then obtain the following:

$$\begin{equation}\begin{aligned} q(z_i) &= \frac{p(x_i, z_i; \theta)}{\sum_{z_i} p(x_i, z_i;\theta)} \\ &= \frac{p(x_i, z_i; \theta)}{p(x_i;\theta)} \\ &= p(z_i|x_i;\theta) \end{aligned}\end{equation}$$

That is, by setting $q$ to be the posterior distribution of $Z$ given $X$, we obtain a lower bound $\mathcal{L}(x;\theta)$ that is equal to the "incomplete" log likelihood $\sum_i \log p(x_i;\theta)$ at the current value of $\theta$. This is the E-step of EM. For the M-step, we fix $q$ from the previous E-step and perform maximum likelihood estimation, obtaining a new value of $\theta$ by maximizing $\mathcal{L}(x;\theta)$.

### Relation to KL divergence

Let's consider the difference between the true marginal log likelihood and our lower bound:

$$\begin{equation}\begin{aligned} \log p(x_i;\theta) - \mathcal{L}(x_i;\theta) &= \log p(x_i;\theta) - E_{q(z_i)}\left[\log \frac{p(x_i, z_i;\theta)}{q(z_i)}\right] \\ &= - E_{q(z_i)}\left[\log \frac{p(x_i, z_i;\theta)}{p(x_i;\theta)\, q(z_i)}\right] \\ &= - E_{q(z_i)}\left[\log \frac{p(z_i|x_i;\theta)}{q(z_i)}\right] \\ &= E_{q(z_i)}\left[\log \frac{q(z_i)}{p(z_i|x_i;\theta)}\right] \\ &= KL(q(z_i)\,\|\,p(z_i|x_i;\theta)) \end{aligned}\end{equation}$$

We see that the lower bound equals the marginal log likelihood when the KL divergence between $q(z_i)$ and the true posterior $p(z_i|x_i;\theta)$ is 0. As we showed previously, we can make the KL divergence zero by setting $q(z_i)=p(z_i|x_i;\theta)$.
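
To make the E-step/M-step recipe concrete, here is a minimal sketch of EM for a two-component 1-D Gaussian mixture, a standard example that is not part of the derivation above. The function names (`e_step`, `m_step`), the toy data, and the initialization are all illustrative choices, not a prescribed implementation; the E-step computes $q(z_i)=p(z_i|x_i;\theta)$ (the responsibilities) and the M-step maximizes $\mathcal{L}(x;\theta)$ in closed form with $q$ held fixed.

```python
import numpy as np

def e_step(x, pi, mu, sigma):
    """E-step: compute q(z_i) = p(z_i | x_i; theta) (the responsibilities)."""
    # Joint p(x_i, z_i; theta) for each component z_i (Gaussian likelihood times prior pi).
    joint = np.stack([
        pi[k] * np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2) / (sigma[k] * np.sqrt(2 * np.pi))
        for k in range(len(pi))
    ], axis=1)                                    # shape (N, K)
    marginal = joint.sum(axis=1, keepdims=True)   # p(x_i; theta)
    return joint / marginal, np.log(marginal).sum()   # q(z_i), incomplete log likelihood

def m_step(x, q):
    """M-step: maximize the lower bound L(x; theta) with q held fixed (closed form)."""
    nk = q.sum(axis=0)                            # effective counts per component
    pi = nk / len(x)
    mu = (q * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((q * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma

# Toy data: two Gaussian clusters (illustrative only).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 200)])

# Arbitrary initialization of theta = (pi, mu, sigma).
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for it in range(50):
    q, ll = e_step(x, pi, mu, sigma)     # E-step: tighten the bound at the current theta
    pi, mu, sigma = m_step(x, q)         # M-step: maximize the bound over theta
    if it % 10 == 0:
        print(f"iter {it:2d}  log likelihood {ll:.3f}")
```

Because each E-step makes the bound tight and each M-step maximizes it, the printed log likelihood never decreases across iterations.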
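
The identity $\log p(x_i;\theta) - \mathcal{L}(x_i;\theta) = KL(q(z_i)\,\|\,p(z_i|x_i;\theta))$ can also be checked numerically. The sketch below uses an arbitrary made-up joint table $p(x, z;\theta)$ for a single observation and a three-state latent variable, purely to illustrate that the gap equals the KL divergence and vanishes when $q$ is set to the posterior.

```python
import numpy as np

# Arbitrary joint p(x, z; theta) for one observation x and a discrete latent z
# with three states (values chosen purely for illustration).
p_xz = np.array([0.10, 0.25, 0.05])        # p(x, z) for z = 0, 1, 2
p_x = p_xz.sum()                           # marginal p(x; theta)
posterior = p_xz / p_x                     # p(z | x; theta)

def lower_bound(q):
    """L(x; theta) = E_q[log p(x, z; theta) / q(z)]."""
    return np.sum(q * np.log(p_xz / q))

def kl(q, p):
    """KL(q || p) = E_q[log q(z) / p(z)]."""
    return np.sum(q * np.log(q / p))

# For any valid q: log p(x) - L(q) = KL(q || posterior) >= 0.
q = np.array([0.2, 0.5, 0.3])
print(np.log(p_x) - lower_bound(q), kl(q, posterior))   # the two numbers agree

# Setting q to the posterior (the E-step) makes the bound tight: KL = 0.
q = posterior
print(np.log(p_x) - lower_bound(q), kl(q, posterior))   # both ~0
```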