# InfoMax derivation of $\beta$-VAE
#### Some notation:
* $p_\mathcal{D}(x)$: data distribution
* $q_\psi(z\vert x)$: representation distribution
* $q_\psi(z) = \int p_\mathcal{D}(x)q_\psi(z\vert x)\,\mathrm{d}x$: aggregate posterior - marginal distribution of the representation $Z$
* $q_\psi(x\vert z) = \frac{q_\psi(z\vert x)p_\mathcal{D}(x)}{q_\psi(z)}$: "inverted posterior"
#### Setup
We'll start from just the representation $q_\psi(z\vert x)$, with no generative model of the data. We'd like this representation to satisfy two properties:
1. Independence: We'd like the aggregate posterior $q_\psi(z)$ to exhibit coordinate-wise independence, and in particular to be close to a fixed, factorized prior distribution $p(z) = \prod_i p(z_i)$.
2. Maximum Information: We'd like the representation $Z$ to retain as much information as possible about the input data $X$.
Note that without (1), (2) is insufficient: any deterministic and invertible function of $Z$ preserves the mutual information and so would also satisfy (2), no matter how entangled its coordinates become. Similarly, without (2), (1) is insufficient: $q_\psi(z\vert x) = p(z)$ would satisfy (1) exactly but would be a pretty useless representation of the data, since $Z$ wouldn't depend on $X$ at all.
#### Deriving a practical objective
We can achieve a combination of (1) and (2) by minimizing an objective that is a weighted combination of two terms, one for each of the goals we set out above:
$$
\mathcal{L}(\psi) = \operatorname{KL}[q_\psi(z)\| p(z)] - \lambda \mathbb{I}_{q_\psi(z\vert x)p_\mathcal{D}(x)}[X, Z]
$$
Now we're going to show how this objective can be related to the $\beta$-VAE objective. Let's look at the first term of this:
\begin{align}
\operatorname{KL}[q_\psi(z)\| p(z)] &= \mathbb{E}_{q_\psi(z)} \log \frac{q_\psi(z)}{p(z)}\\
&= \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log \frac{q_\psi(z)}{p(z)}\\
&= \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log \frac{q_\psi(z)}{q_\psi(z\vert x)} + \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log \frac{q_\psi(z\vert x)}{p(z)}\\
&= \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log \frac{q_\psi(z)p_\mathcal{D}(x)}{q_\psi(z\vert x)p_\mathcal{D}(x)} + \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log \frac{q_\psi(z\vert x)}{p(z)}\\
&= -\mathbb{I}_{q_\psi(z\vert x)p_\mathcal{D}(x)}[X,Z] + \mathbb{E}_{p_\mathcal{D}}\operatorname{KL}[q_\psi(z\vert x)\| p(z)]
\end{align}
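As a quick numerical sanity check of this decomposition (not part of the derivation; the toy distributions below are made up for illustration), here it is verified for a small discrete example, where all the expectations become finite sums:

```python
import numpy as np

# Toy discrete setup: 3 possible values of x, 4 possible values of z.
p_D = np.array([0.2, 0.5, 0.3])                  # data distribution p_D(x)
q_z_given_x = np.array([[0.7, 0.1, 0.1, 0.1],    # q_psi(z|x), one row per x, rows sum to 1
                        [0.2, 0.4, 0.3, 0.1],
                        [0.1, 0.1, 0.2, 0.6]])
p_z = np.array([0.25, 0.25, 0.25, 0.25])         # factorized prior p(z)

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return np.sum(p * np.log(p / q))

# Aggregate posterior q_psi(z) = sum_x p_D(x) q_psi(z|x)
q_z = p_D @ q_z_given_x

# Mutual information I[X, Z] under the joint q_psi(z|x) p_D(x)
joint = p_D[:, None] * q_z_given_x
I_xz = np.sum(joint * np.log(joint / (p_D[:, None] * q_z[None, :])))

# Left-hand side vs. right-hand side of the decomposition above
lhs = kl(q_z, p_z)
rhs = -I_xz + np.sum(p_D * np.array([kl(row, p_z) for row in q_z_given_x]))

print(lhs, rhs)                  # the two values agree up to floating point error
assert np.isclose(lhs, rhs)
```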
Putting this back together, we have that
\begin{align}
\mathcal{L}(\psi) &= \operatorname{KL}[q_\psi(z)\| p(z)] - \lambda \mathbb{I}_{q_\psi(z\vert x)p_\mathcal{D}(x)}[X, Z]\\
&= \mathbb{E}_{p_\mathcal{D}}\operatorname{KL}[q_\psi(z\vert x)\| p(z)] - (\lambda + 1) \mathbb{I}_{q_\psi(z\vert x)p_\mathcal{D}(x)}[X, Z]\\
\end{align}
Now we have the KL-divergence term from the $\beta$-VAE objective, but we're still missing the reconstruction term (and we haven't even defined a generative model $p_\theta(x\vert z)$). As we will see, we can recover this term, too, by using a variational approximation to the mutual information.
#### Variational bound on mutual information
Note the following equality:
\begin{equation}
\mathbb{I}[X,Z] = \mathbb{H}[X] - \mathbb{H}[X\vert Z]
\end{equation}
The first term, the entropy of $X$, is constant with respect to $\psi$, since $X$ is sampled from the data distribution $p_\mathcal{D}$. The second term, the conditional entropy, is upper-bounded by the cross-entropy under any auxiliary conditional model of $x$ given $z$ (a consequence of the non-negativity of the KL divergence, which follows from Jensen's inequality):
$$
\mathbb{H}[X\vert Z] = - \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log q_\psi(x\vert z) \leq \inf_\theta - \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log p_\theta(x\vert z)
$$
In this step, we introduce $p_\theta(x\vert z)$ as an auxiliary distribution: upper-bounding $\mathbb{H}[X\vert Z]$ in this way yields a variational lower bound on the mutual information.
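To spell out why the inequality holds (a standard step, written out here for completeness): for any $\theta$, the gap between the cross-entropy and the conditional entropy is exactly a KL divergence, which is non-negative:
\begin{align}
-\mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log p_\theta(x\vert z)
&= -\mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log q_\psi(x\vert z) + \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log \frac{q_\psi(x\vert z)}{p_\theta(x\vert z)}\\
&= \mathbb{H}[X\vert Z] + \mathbb{E}_{q_\psi(z)}\operatorname{KL}[q_\psi(x\vert z)\| p_\theta(x\vert z)] \geq \mathbb{H}[X\vert Z]
\end{align}
The bound is tight when $p_\theta(x\vert z)$ matches the inverted posterior $q_\psi(x\vert z)$.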
#### Putting this bound back together:
For any $\theta$, and up to the additive constant $-(1+\lambda)\mathbb{H}[X]$ (which depends on neither $\psi$ nor $\theta$), we get
$$
\mathcal{L}(\psi) \leq \mathbb{E}_{p_\mathcal{D}}\operatorname{KL}[q_\psi(z\vert x)\| p(z)] - (1 + \lambda) \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log p_\theta(x\vert z)
$$
Minimizing the right-hand side jointly over $\psi$ and $\theta$ therefore minimizes an upper bound on $\mathcal{L}(\psi)$, and this is essentially the $\beta$-VAE objective: dividing through by $(1+\lambda)$ gives $\beta\,\mathbb{E}_{p_\mathcal{D}}\operatorname{KL}[q_\psi(z\vert x)\| p(z)] - \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)}\log p_\theta(x\vert z)$ with $\beta = 1/(1+\lambda)$.
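As a concrete illustration of the resulting loss (a minimal sketch, not from the derivation above; the function name, argument names and distributional choices are assumptions made for this example), here is the per-batch objective for the common choice of a diagonal-Gaussian encoder, a standard normal prior and a Bernoulli decoder:

```python
import numpy as np

def beta_vae_loss(x, recon_logits, mu, logvar, beta):
    """Monte-Carlo estimate of  beta * E KL[q(z|x) || N(0, I)]  -  E log p(x|z).

    x            : (batch, dim_x) binary data
    recon_logits : (batch, dim_x) decoder logits for p_theta(x|z), with z ~ q_psi(z|x)
    mu, logvar   : (batch, dim_z) encoder outputs parametrizing the Gaussian q_psi(z|x)
    beta         : KL weight; in the derivation above, beta = 1 / (1 + lambda)
    """
    # Closed-form KL between N(mu, diag(exp(logvar))) and N(0, I), one value per example.
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)

    # Bernoulli negative log-likelihood of x under the decoder logits (reconstruction term):
    # -log p(x|z) = sum_i [ log(1 + exp(l_i)) - x_i * l_i ]
    nll = np.sum(np.logaddexp(0.0, recon_logits) - x * recon_logits, axis=1)

    return np.mean(beta * kl + nll)

# Tiny usage example with random numbers standing in for encoder/decoder outputs.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(8, 5)).astype(float)
loss = beta_vae_loss(x, rng.normal(size=(8, 5)),
                     rng.normal(size=(8, 3)), rng.normal(size=(8, 3)), beta=0.5)
print(loss)
```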
### Additional ramblings
Conceptually, this is interesting because the recognition model $q_\psi(z\vert x)$ is now the main object of interest.
The "latent variable model" $q_\psi(z\vert x)p_\mathcal{D}(x)$ parametrizes LVMs which has a marginal distribution on observable $x$ that is exactly the same as the data distribution $p_\mathcal{D}$. So one can say $q_\psi(z\vert x)p_\mathcal{D}(x)$ is a parametric family of latent variable models with whose likelihood is maximal.
We then ask: out of the models of this form, which one should we choose? The generative model $p_\theta$ is introduced as an auxiliary distribution while constructing a lower bound on the mutual information, but that's perhaps not the best way to do this.
So there are two families of joint distributions over latents and observables here. On one hand we have $q_\psi(z\vert x)p_\mathcal{D}(x)$, and on the other we have $p(z)p_\theta(x\vert z)$. The $\beta$-VAE (or plain VAE) objective tries to move these two models closer to one another. From the perspective of $q_\psi(z\vert x)p_\mathcal{D}(x)$, this can be understood as trying to maximise mutual information while reproducing the prior $p(z)$. From the perspective of $p(z)p_\theta(x\vert z)$, it can be understood as trying to maximise the data likelihood, i.e. to reproduce $p_\mathcal{D}$, and, if the $\beta$-VAE objective is used, to additionally maximise information, too.
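One way to make the "moving the two models closer" statement precise (spelled out here for completeness; it is not needed for the derivation above) is to look at the KL divergence between the two joint distributions, which decomposes in two ways:
\begin{align}
\operatorname{KL}[q_\psi(z\vert x)p_\mathcal{D}(x)\| p(z)p_\theta(x\vert z)]
&= \mathbb{E}_{p_\mathcal{D}}\operatorname{KL}[q_\psi(z\vert x)\| p(z)] - \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)}\log p_\theta(x\vert z) - \mathbb{H}[X]\\
&= \operatorname{KL}[q_\psi(z)\| p(z)] + \mathbb{E}_{q_\psi(z)}\operatorname{KL}[q_\psi(x\vert z)\| p_\theta(x\vert z)]
\end{align}
The first line is, up to the constant $\mathbb{H}[X]$, the plain VAE objective ($\beta = 1$) averaged over the data; the second line reads the same quantity from the encoder side: reproduce the prior in aggregate and make the decoder match the inverted posterior.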
This symmetry of variational learning has been noted a few times:
* [ying-yang machines](https://link.springer.com/chapter/10.1007/978-3-662-07952-2_22)
* [adversarially learned inference](https://arxiv.org/abs/1606.00704)