We'll start from just the representation, a recognition model $q_\psi(z|x)$ that maps data $x \sim p_\mathcal{D}$ to a latent code $z$, with no generative model of the data. We'd like this representation to satisfy two properties:

1. marginally, the representation should follow a fixed prior: the aggregate posterior $q_\psi(z) = \mathbb{E}_{x\sim p_\mathcal{D}}\, q_\psi(z|x)$ should match $p(z)$;
2. the representation should retain as much information about the data as possible: the mutual information $I_{q_\psi}[X; Z]$ should be as large as possible.
Note that without (1), (2) is insufficient, because then any deterministic and invertible function of $X$ would satisfy (2), even though its marginal over $z$ could look nothing like the prior. Similarly, without (2), (1) is insufficient, because an encoder that ignores its input, $q_\psi(z|x) = p(z)$, would satisfy (1) but would be a pretty useless representation of the data, since $Z$ doesn't depend on $X$ at all…
We can achieve a combination of (1) and (2) by optimizing an objective that is a weighted combination of two terms, one for each of the goals we set out above:
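for example, with a trade-off weight $\lambda \ge 0$ (this particular parametrisation is an illustrative choice, not the only possible one), we can minimise

$$\mathcal{L}(\psi) \;=\; \operatorname{KL}\big[\,q_\psi(z)\,\|\,p(z)\,\big] \;-\; \lambda\, I_{q_\psi}[X; Z],$$

where the mutual information $I_{q_\psi}[X; Z]$ is computed under the joint $q_\psi(x, z) = p_\mathcal{D}(x)\, q_\psi(z|x)$.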
Now we're going to show how this objective can be related to the $\beta$-VAE objective. Let's look at the first term of this:
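$$\operatorname{KL}\big[q_\psi(z)\,\|\,p(z)\big] \;=\; \mathbb{E}_{x\sim p_\mathcal{D}} \operatorname{KL}\big[q_\psi(z|x)\,\|\,p(z)\big] \;-\; I_{q_\psi}[X; Z].$$

This is the standard decomposition of the aggregate-posterior KL into an average per-datapoint KL minus the mutual information between $X$ and $Z$ under $q_\psi$.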
Putting this back together, we have that
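$$\operatorname{KL}\big[q_\psi(z)\,\|\,p(z)\big] \;-\; \lambda\, I_{q_\psi}[X; Z] \;=\; \mathbb{E}_{x\sim p_\mathcal{D}} \operatorname{KL}\big[q_\psi(z|x)\,\|\,p(z)\big] \;-\; (1+\lambda)\, I_{q_\psi}[X; Z].$$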
Now we have the KL-divergence term from the $\beta$-VAE, but we're missing the reconstruction term (and we haven't even defined a generative model $p_\theta(x|z)$ yet). As we will see, we can recover this term, too, by using a variational approximation to the mutual information.
Note the following equality:
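$$I_{q_\psi}[X; Z] \;=\; H[X] \;-\; H[X \mid Z],$$

with both entropies taken under the joint $q_\psi(x, z) = p_\mathcal{D}(x)\, q_\psi(z|x)$.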
The first term, the entropy of $X$, is constant with respect to $\psi$, since we sample $X$ from the data distribution $p_\mathcal{D}$. The second term can be bounded by the cross entropy of any classifier (using Jensen's inequality):
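$$-H[X \mid Z] \;=\; \mathbb{E}_{q_\psi(x, z)} \log q_\psi(x|z) \;\geq\; \mathbb{E}_{q_\psi(x, z)} \log p_\theta(x|z),$$

which holds for any conditional distribution $p_\theta(x|z)$: the gap is $\mathbb{E}_{q_\psi(z)} \operatorname{KL}\big[q_\psi(x|z)\,\|\,p_\theta(x|z)\big]$, whose non-negativity is where Jensen's inequality enters.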
In this step, we introduce $p_\theta(x|z)$ as an auxiliary distribution to make a variational approximation to the mutual information.
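Plugging this bound back into the objective and dropping the constant $H[X]$, we are left with minimising the upper bound

$$\mathbb{E}_{x\sim p_\mathcal{D}}\Big[\operatorname{KL}\big[q_\psi(z|x)\,\|\,p(z)\big] \;-\; (1+\lambda)\,\mathbb{E}_{q_\psi(z|x)} \log p_\theta(x|z)\Big],$$

or, dividing through by $1+\lambda$, a reconstruction term plus a KL term weighted by $\beta = 1/(1+\lambda)$ (under the particular weighting assumed above).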
And this is essentially the $\beta$-VAE objective function, where $\beta$ is related to the trade-off weight $\lambda$ from before.
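To make this concrete, here is a minimal sketch of the resulting loss, assuming a Gaussian encoder, a Bernoulli decoder and PyTorch; the architecture and hyperparameters are illustrative placeholders, not something prescribed by the derivation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaVAE(nn.Module):
    """Encoder q_psi(z|x), decoder p_theta(x|z), and the beta-weighted objective."""

    def __init__(self, x_dim=784, z_dim=16, h_dim=256, beta=4.0):
        super().__init__()
        self.beta = beta
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))  # Bernoulli logits

    def loss(self, x):
        # q_psi(z|x): diagonal Gaussian, sampled with the reparameterisation trick
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

        # reconstruction term: -E_{q_psi(z|x)} log p_theta(x|z)
        recon = F.binary_cross_entropy_with_logits(
            self.dec(z), x, reduction="none").sum(dim=1)

        # KL[q_psi(z|x) || p(z)] with p(z) = N(0, I), in closed form
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1)

        # beta-VAE objective: reconstruction + beta * KL, averaged over the batch
        return (recon + self.beta * kl).mean()
```

Setting `beta=1.0` recovers the plain VAE loss; other values trade reconstruction against matching the prior, which is exactly the trade-off the two desiderata encode.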
Conceptually, this is interesting because here the recognition model $q_\psi(z|x)$ is now the main object of interest.
The "latent variable model" parametrizes LVMs which has a marginal distribution on observable that is exactly the same as the data distribution . So one can say is a parametric family of latent variable models with whose likelihood is maximal.
We then ask the question: out of models of this form, which one should we choose? The generative model $p_\theta(x|z)$ is introduced as an auxiliary distribution while constructing a lower bound on the mutual information, but that's perhaps not the best way to do this.
So there are two families of joint distributions over latents and observables here. On one hand we have $q_\psi(x, z) = p_\mathcal{D}(x)\, q_\psi(z|x)$ and on the other we have $p_\theta(x, z) = p(z)\, p_\theta(x|z)$. The $\beta$-VAE (or just VAE) objective tries to move these two models closer to one another. From the perspective of $q_\psi$ this can be understood as trying to maximise mutual information while reproducing the prior $p(z)$. From the perspective of $p_\theta$ it can be understood as trying to maximise the data likelihood, i.e. to reproduce $p_\mathcal{D}$, and, if the $\beta$-VAE objective is used, to additionally maximise information, too.
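A standard identity makes this concrete (written in the notation above): for $\beta = 1$, the objective equals, up to the constant entropy of the data, the KL divergence between the two joints,

$$\mathbb{E}_{x\sim p_\mathcal{D}}\Big[\operatorname{KL}\big[q_\psi(z|x)\,\|\,p(z)\big] \;-\; \mathbb{E}_{q_\psi(z|x)}\log p_\theta(x|z)\Big] \;=\; \operatorname{KL}\big[\,q_\psi(x, z)\,\big\|\,p_\theta(x, z)\,\big] \;+\; H[X],$$

so minimising it pulls the two factorizations towards one another.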
This symmetry of variational learning has been noted a few times:
- Ying-Yang machines
- adversarially learned inference