# The EM algorithm
### Jensen's inequality
Let $f$ be a concave function and let $X$ be a random variable. Then:
$$
E[f(X)] \leq f(E[X])
$$
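As a quick numerical sanity check (a sketch, not part of the derivation; the Exponential(1) variable below is an arbitrary choice for illustration), the inequality can be verified for the concave function $\log$, which is the case used throughout the derivation below:

```python
# Numerical check of Jensen's inequality for the concave function log.
# The Exponential(1) random variable is an arbitrary, illustrative choice.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)

lhs = np.mean(np.log(x))     # E[f(X)] with f = log
rhs = np.log(np.mean(x))     # f(E[X])
print(lhs, rhs, lhs <= rhs)  # E[log X] <= log E[X]
```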
### EM
Suppose we have $N$ observations $x_1,\dots,x_N$, and we wish to estimate the parameters $\theta$ of a model by maximizing the following log likelihood:
$$\begin{align}
\ell(\theta) &= \sum_{i=1}^{N} \log p(x_i;\theta) \\
&= \sum_{i=1}^{N} \log \sum_{z_i} p(x_i, z_i;\theta)
\end{align}
$$
Since estimating $p(x;\theta)$ directly might be hard, we introduce a latent variable Z. The assumption here is that Z is an underlying random variable that influences our observations. Therefore, when Z is observed, maximizing the likelihood becomes easier (i.e., our observations become "complete"). But since Z is unobserved, we cannot directly work with the complete-data log likelihood $\log p(X,Z;\theta)$. Instead, we iteratively construct a lower bound on the marginal log likelihood by taking an expectation of the complete-data log likelihood for a fixed $\theta$ (E-step). Once we have this bound, we find a new $\theta$ that maximizes it (M-step).
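For concreteness, take a mixture of $K$ Gaussians as a running example (an illustrative assumption; nothing above commits to a particular model), where $z_i \in \{1,\dots,K\}$ indicates which component generated $x_i$ and $\mathcal{N}(x;\mu,\sigma^2)$ denotes the normal density:
$$\begin{align}
p(x_i;\theta) &= \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i;\mu_k,\sigma_k^2), \\
p(x_i, z_i = k;\theta) &= \pi_k\, \mathcal{N}(x_i;\mu_k,\sigma_k^2).
\end{align}
$$
The first expression puts a sum inside the logarithm, which has no closed-form maximizer; the second is a simple product, so if every $z_i$ were observed, the log likelihood would split into per-component terms with closed-form maximum likelihood solutions.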
Consider some auxiliary distribution $q$ over Z ($\sum_{z_i} q(z_i) = 1$ and $q(z_i) \geq 0$). Using Jensen's inequality (the logarithm is concave), we can obtain the lower bound as follows:
$$\begin{equation}\begin{aligned}
\sum_{i} \log p(x_i; \theta) &= \sum_i \log \sum_{z_i} p(x_i, z_i;\theta) \\
&= \sum_i \log \sum_{z_i} q(z_i) \frac{p(x_i, z_i;\theta)}{q(z_i)} \\
&\geq \sum_i \sum_{z_i} q(z_i) \log \frac{p(x_i, z_i;\theta)}{q(z_i)} \\
&= \sum_i E_{z_i \sim q} \log \frac{p(x_i, z_i;\theta)}{q(z_i)} \\
&= \mathcal{L}(q;\theta)
\end{aligned}\end{equation}
$$
That is, we obtain a lower bound $\mathcal{L}(q;\theta)$ on the marginal log likelihood: the expectation, with respect to the auxiliary distribution $q$, of the log ratio between the complete-data likelihood $p(X,Z;\theta)$ and $q$.
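A minimal sketch of how this lower bound could be evaluated, continuing the assumed two-component Gaussian mixture from above (the function names and the dictionary layout of `theta` are illustrative choices, not anything fixed by the text):

```python
# Sketch: evaluate L(q; theta) = sum_i sum_{z_i} q(z_i) * log( p(x_i, z_i; theta) / q(z_i) )
# for an assumed 1-D two-component Gaussian mixture and an arbitrary q.
import numpy as np
from scipy.stats import norm

def log_joint(x_i, z_i, theta):
    """log p(x_i, z_i; theta) with z_i in {0, 1} indexing the mixture component."""
    return np.log(theta["pi"][z_i]) + norm.logpdf(x_i, loc=theta["mu"][z_i],
                                                  scale=theta["sigma"][z_i])

def lower_bound(x, q, theta):
    """q is an (N, 2) array; row i holds q(z_i = 0), q(z_i = 1)."""
    total = 0.0
    for i, x_i in enumerate(x):
        for z_i in (0, 1):
            if q[i, z_i] > 0:  # treat 0 * log 0 as 0
                total += q[i, z_i] * (log_joint(x_i, z_i, theta) - np.log(q[i, z_i]))
    return total
```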
We want to choose $q$ so that the lower bound is tight at the current $\theta$. To do this, notice that Jensen's inequality holds with equality when the expectation is taken over a constant, i.e., when $\frac{p(x_i, z_i;\theta)}{q(z_i)} = c$ for a constant $c$ that does not depend on $z_i$, which means $q(z_i) \propto p(x_i, z_i; \theta)$. Since $q$ must sum to one, we obtain the following:
$$
\begin{equation}\begin{aligned}
q(z_i) &= \frac{p(x_i, z_i; \theta)}{\sum_{z_i} p(x_i, z_i;\theta)} \\
&= \frac{p(x_i, z_i; \theta)}{p(x_i;\theta)} \\
&= p(z_i|x_i;\theta)
\end{aligned}\end{equation}
$$
That is, setting $q(z_i)$ to be the posterior distribution of $z_i$ given $x_i$, we obtain a lower bound $\mathcal{L}(q;\theta)$ that equals the incomplete-data log likelihood $\log p(x;\theta)$ at the current value of $\theta$. This is the E-step of EM.
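A sketch of this E-step for the same assumed mixture, computed in log space for numerical stability:

```python
# Sketch of the E-step for the assumed mixture: q(z_i = k) = p(z_i = k | x_i; theta),
# obtained by normalizing the joint p(x_i, z_i = k; theta) over k.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def e_step(x, theta):
    # log p(x_i, z_i = k; theta) for every observation i and component k
    log_joint = np.log(theta["pi"]) + norm.logpdf(x[:, None], loc=theta["mu"],
                                                  scale=theta["sigma"])
    # q(z_i = k) = p(x_i, z_i = k; theta) / sum_k p(x_i, z_i = k; theta), in log space
    log_posterior = log_joint - logsumexp(log_joint, axis=1, keepdims=True)
    return np.exp(log_posterior)  # shape (N, K): the responsibilities q(z_i)
```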
For the M-step, we fix $q$ from the previous E-step and maximize the lower bound with respect to $\theta$ to obtain a new set of parameter values:
$$
\theta^{new} = \operatorname{argmax}_{\theta} \mathcal{L}(q;\theta).
$$
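For the assumed Gaussian mixture this maximization has a closed form; a sketch, again with illustrative names:

```python
# Sketch of the M-step for the assumed mixture: with q fixed, maximizing L(q; theta)
# over theta reduces to responsibility-weighted maximum likelihood estimates.
import numpy as np

def m_step(x, q):
    n_k = q.sum(axis=0)                                   # effective count per component
    pi = n_k / len(x)                                     # new mixing weights
    mu = (q * x[:, None]).sum(axis=0) / n_k               # weighted means
    var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / n_k  # weighted variances
    return {"pi": pi, "mu": mu, "sigma": np.sqrt(var)}
```

Note that only the $E_{q}[\log p(x_i, z_i;\theta)]$ part of $\mathcal{L}(q;\theta)$ depends on $\theta$; the $-E_{q}[\log q(z_i)]$ entropy term is constant during the M-step.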
### Relation to KL divergence
Let's consider the difference between the true marginal log likelihood and our lower bound:
$$\begin{align}
\log p(x_i;\theta) - \mathcal{L}(q;\theta) &=
\log p(x_i;\theta) - E_{z_i \sim q} \log \frac{p(x_i, z_i;\theta)}{q(z_i)} \\
&= - E_{q(z_i)} \log \frac{p(x_i, z_i;\theta)}{p(x_i;\theta)\, q(z_i)} \\
&= - E_{q(z_i)} \log \frac{p(z_i|x_i;\theta)}{q(z_i)} \\
&= E_{q(z_i)} \log \frac{q(z_i)}{p(z_i|x_i;\theta)} \\
&= KL(q(z_i) \,\|\, p(z_i|x_i;\theta))
\end{align}
$$
We see that the lower bound equals the marginal log likelihood exactly when the KL divergence between $q(z_i)$ and the true posterior $p(z_i|x_i;\theta)$ is 0. As we showed previously, the KL divergence is zero when we set $q(z_i)=p(z_i|x_i;\theta)$, which is exactly what we do at each E-step. However, when we update the parameters in the next M-step (assuming we are not at a local maximum yet), the KL term becomes non-zero for the new estimate $\theta^{new}$, since $q$ was computed from the old parameters (Bishop, PRML, p. 452).

We then repeat the E-step and update $q(z_i)$ to the new posterior $p(z_i|x_i;\theta^{new})$.
The EM algorithm keeps alternating between constructing the lower bound (E-step) and maximizing it (M-step) until we converge to a (local) maximum of the marginal likelihood.
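The gap identity derived above can be checked numerically for the assumed mixture with a single observation and an arbitrary $q$ (all values below are made up for illustration):

```python
# Numerical check: log p(x_i; theta) - L_i(q; theta) == KL(q(z_i) || p(z_i | x_i; theta)).
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

theta = {"pi": np.array([0.3, 0.7]), "mu": np.array([-1.0, 2.0]),
         "sigma": np.array([1.0, 0.5])}
x_i = 0.4
q = np.array([0.5, 0.5])                            # an arbitrary q, not the posterior

log_joint = np.log(theta["pi"]) + norm.logpdf(x_i, theta["mu"], theta["sigma"])
log_marginal = logsumexp(log_joint)                 # log p(x_i; theta)
posterior = np.exp(log_joint - log_marginal)        # p(z_i | x_i; theta)

bound = np.sum(q * (log_joint - np.log(q)))         # per-observation lower bound L_i(q; theta)
kl = np.sum(q * np.log(q / posterior))              # KL(q || posterior)
print(np.isclose(log_marginal - bound, kl))         # True
```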

### Guaranteed improvement under exact E-step
If the E-step is exact, meaning we can set $q(z_i)$ to be exactly the posterior $p(z_i\vert x_i;\theta^{old})$, then each full iteration of the EM algorithm is guaranteed not to decrease the likelihood. This can be seen by realising that if $q(z_i)=p(z_i\vert x_i;\theta^{old})$, the lower bound is tight at $\theta^{old}$, so $\mathcal{L}(q;\theta^{old}) = \sum_i \log p(x_i;\theta^{old})$. The M-step then picks $\theta^{new}$ with $\mathcal{L}(q;\theta^{new}) \geq \mathcal{L}(q;\theta^{old})$, and since $\mathcal{L}(q;\theta)$ is a lower bound on the marginal log likelihood for any $\theta$, we get $\sum_i \log p(x_i;\theta^{new}) \geq \mathcal{L}(q;\theta^{new}) \geq \mathcal{L}(q;\theta^{old}) = \sum_i \log p(x_i;\theta^{old})$.
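A minimal loop tying the sketches above together and checking this guarantee numerically for the assumed mixture (it reuses the hypothetical `e_step` and `m_step` sketches from earlier, and the data are synthetic):

```python
# Sketch: alternate the e_step / m_step sketches above and assert that the
# incomplete-data log likelihood never decreases.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def log_likelihood(x, theta):
    log_joint = np.log(theta["pi"]) + norm.logpdf(x[:, None], theta["mu"], theta["sigma"])
    return logsumexp(log_joint, axis=1).sum()       # sum_i log p(x_i; theta)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 300)])  # toy data
theta = {"pi": np.array([0.5, 0.5]), "mu": np.array([-1.0, 1.0]),
         "sigma": np.array([1.0, 1.0])}

previous = -np.inf
for _ in range(50):
    q = e_step(x, theta)                            # E-step: q(z_i) = p(z_i | x_i; theta_old)
    theta = m_step(x, q)                            # M-step: maximize L(q; theta) over theta
    current = log_likelihood(x, theta)
    assert current >= previous - 1e-6               # non-decrease under an exact E-step
    previous = current
```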
### Example: k-means algorithm