# InfoMax derivation of $\beta$-VAE
#### Some notation:
* $p_\mathcal{D}(x)$: data distribution
* $q_\psi(z\vert x)$: representation distribution
* $q_\psi(z) = \int p_\mathcal{D}(x)q_\psi(z\vert x)\,\mathrm{d}x$: aggregate posterior - marginal distribution of the representation $Z$
* $q_\psi(x\vert z) = \frac{q_\psi(z\vert x)p_\mathcal{D}(x)}{q_\psi(z)}$: "inverted posterior"
#### Setup
We'll start from just the representation $q_\psi(z\vert x)$, with no generative model of the data. We'd like this representation to satisfy two properties:
1. Independence: We'd like the aggregate posterior $q_\psi(z)$ to exhibit coordinate-wise independence, and in particular to be close to a fixed, factorized prior distribution $p(z) = \prod_i p(z_i)$.
2. Maximum Information: We'd like the representation $Z$ to retain as much information as possible about the input data $X$.
Note that without (1), (2) is insufficient: any deterministic and invertible function of $Z$ preserves the mutual information and so would also satisfy (2), no matter how entangled its coordinates become. Similarly, without (2), (1) is insufficient: $q_\psi(z\vert x) = p(z)$ would satisfy (1) exactly but would be a pretty useless representation of the data, since $Z$ wouldn't depend on $X$ at all.
#### Deriving a practical objective
We can achieve a combination of (1) and (2) by minimizing an objective that is a weighted combination of two terms, one for each of the goals we set out above:
$$
\mathcal{L}(\psi) = \operatorname{KL}[q_\psi(z)\| p(z)] - \lambda \mathbb{I}_{q_\psi(z\vert x)p_\mathcal{D}(x)}[X, Z]
$$
Now we're going to show how this objective can be related to the $\beta$-VAE objective. Let's look at the first term of this:
\begin{align}
\operatorname{KL}[q_\psi(z)\| p(z)] &= \mathbb{E}_{q_\psi(z)} \log \frac{q_\psi(z)}{p(z)}\\
&= \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log \frac{q_\psi(z)}{p(z)}\\
&= \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log \frac{q_\psi(z)}{q_\psi(z\vert x)} + \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log \frac{q_\psi(z\vert x)}{p(z)}\\
&= \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log \frac{q_\psi(z)p_\mathcal{D}(x)}{q_\psi(z\vert x)p_\mathcal{D}(x)} + \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log \frac{q_\psi(z\vert x)}{p(z)}\\
&= -\mathbb{I}_{q_\psi(z\vert x)p_\mathcal{D}(x)}[X,Z] + \mathbb{E}_{p_\mathcal{D}}\operatorname{KL}[q_\psi(z\vert x)\| p(z)]
\end{align}
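As a quick numerical sanity check of this decomposition (not part of the derivation; the toy distributions below are made up for illustration), here it is verified for a small discrete example, where all the expectations become finite sums:

```python
import numpy as np

# Toy discrete setup: 3 possible values of x, 4 possible values of z.
p_D = np.array([0.2, 0.5, 0.3])                  # data distribution p_D(x)
q_z_given_x = np.array([[0.7, 0.1, 0.1, 0.1],    # q_psi(z|x), one row per x, rows sum to 1
                        [0.2, 0.4, 0.3, 0.1],
                        [0.1, 0.1, 0.2, 0.6]])
p_z = np.array([0.25, 0.25, 0.25, 0.25])         # factorized prior p(z)

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return np.sum(p * np.log(p / q))

# Aggregate posterior q_psi(z) = sum_x p_D(x) q_psi(z|x)
q_z = p_D @ q_z_given_x

# Mutual information I[X, Z] under the joint q_psi(z|x) p_D(x)
joint = p_D[:, None] * q_z_given_x
I_xz = np.sum(joint * np.log(joint / (p_D[:, None] * q_z[None, :])))

# Left-hand side vs. right-hand side of the decomposition above
lhs = kl(q_z, p_z)
rhs = -I_xz + np.sum(p_D * np.array([kl(row, p_z) for row in q_z_given_x]))

print(lhs, rhs)                  # the two values agree up to floating point error
assert np.isclose(lhs, rhs)
```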
Putting this back together, we have that
\begin{align}
\mathcal{L}(\psi) &= \operatorname{KL}[q_\psi(z)\| p(z)] - \lambda \mathbb{I}_{q_\psi(z\vert x)p_\mathcal{D}(x)}[X, Z]\\
&= \mathbb{E}_{p_\mathcal{D}}\operatorname{KL}[q_\psi(z\vert x)\| p(z)] - (\lambda + 1) \mathbb{I}_{q_\psi(z\vert x)p_\mathcal{D}(x)}[X, Z]\\
\end{align}
Now we have the KL-divergence term from the $\beta$-VAE objective, but we're still missing the reconstruction term (and we haven't even defined a generative model $p_\theta(x\vert z)$). As we will see, we can recover this term, too, by using a variational approximation to the mutual information.
#### Variational bound on mutual information
Note the following equality:
\begin{equation}
\mathbb{I}[X,Z] = \mathbb{H}[X] - \mathbb{H}[X\vert Z]
\end{equation}
The first term, the entropy of $X$, is constant with respect to $\psi$, since $X$ is sampled from the data distribution $p_\mathcal{D}$. The second term, the conditional entropy, is upper-bounded by the cross-entropy under any auxiliary conditional model of $x$ given $z$ (a consequence of the non-negativity of the KL divergence, which follows from Jensen's inequality):
$$
\mathbb{H}[X\vert Z] = - \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log q_\psi(x\vert z) \leq \inf_\theta - \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log p_\theta(x\vert z)
$$
In this step, we introduce $p_\theta(x\vert z)$ as an auxiliary distribution: upper-bounding $\mathbb{H}[X\vert Z]$ in this way yields a variational lower bound on the mutual information.
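To spell out why the inequality holds (a standard step, written out here for completeness): for any $\theta$, the gap between the cross-entropy and the conditional entropy is exactly a KL divergence, which is non-negative:
\begin{align}
-\mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log p_\theta(x\vert z)
&= -\mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log q_\psi(x\vert z) + \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log \frac{q_\psi(x\vert z)}{p_\theta(x\vert z)}\\
&= \mathbb{H}[X\vert Z] + \mathbb{E}_{q_\psi(z)}\operatorname{KL}[q_\psi(x\vert z)\| p_\theta(x\vert z)] \geq \mathbb{H}[X\vert Z]
\end{align}
The bound is tight when $p_\theta(x\vert z)$ matches the inverted posterior $q_\psi(x\vert z)$.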
#### Putting this bound back together:
For any $\theta$, and up to the additive constant $-(1+\lambda)\mathbb{H}[X]$ (which depends on neither $\psi$ nor $\theta$), we get
$$
\mathcal{L}(\psi) \leq \mathbb{E}_{p_\mathcal{D}}\operatorname{KL}[q_\psi(z\vert x)\| p(z)] - (1 + \lambda) \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log p_\theta(x\vert z)
$$
Minimizing the right-hand side jointly over $\psi$ and $\theta$ therefore minimizes an upper bound on $\mathcal{L}(\psi)$, and this is essentially the $\beta$-VAE objective: dividing through by $(1+\lambda)$ gives $\beta\,\mathbb{E}_{p_\mathcal{D}}\operatorname{KL}[q_\psi(z\vert x)\| p(z)] - \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)}\log p_\theta(x\vert z)$ with $\beta = 1/(1+\lambda)$.
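As a concrete illustration of the resulting loss (a minimal sketch, not from the derivation above; the function name, argument names and distributional choices are assumptions made for this example), here is the per-batch objective for the common choice of a diagonal-Gaussian encoder, a standard normal prior and a Bernoulli decoder:

```python
import numpy as np

def beta_vae_loss(x, recon_logits, mu, logvar, beta):
    """Monte-Carlo estimate of  beta * E KL[q(z|x) || N(0, I)]  -  E log p(x|z).

    x            : (batch, dim_x) binary data
    recon_logits : (batch, dim_x) decoder logits for p_theta(x|z), with z ~ q_psi(z|x)
    mu, logvar   : (batch, dim_z) encoder outputs parametrizing the Gaussian q_psi(z|x)
    beta         : KL weight; in the derivation above, beta = 1 / (1 + lambda)
    """
    # Closed-form KL between N(mu, diag(exp(logvar))) and N(0, I), one value per example.
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)

    # Bernoulli negative log-likelihood of x under the decoder logits (reconstruction term):
    # -log p(x|z) = sum_i [ log(1 + exp(l_i)) - x_i * l_i ]
    nll = np.sum(np.logaddexp(0.0, recon_logits) - x * recon_logits, axis=1)

    return np.mean(beta * kl + nll)

# Tiny usage example with random numbers standing in for encoder/decoder outputs.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(8, 5)).astype(float)
loss = beta_vae_loss(x, rng.normal(size=(8, 5)),
                     rng.normal(size=(8, 3)), rng.normal(size=(8, 3)), beta=0.5)
print(loss)
```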
### Additional ramblings
Conceptually, this is interesting because the recognition model $q_\psi(z\vert x)$ is now the main object of interest.
The "latent variable model" $q_\psi(z\vert x)p_\mathcal{D}(x)$ parametrizes LVMs which has a marginal distribution on observable $x$ that is exactly the same as the data distribution $p_\mathcal{D}$. So one can say $q_\psi(z\vert x)p_\mathcal{D}(x)$ is a parametric family of latent variable models with whose likelihood is maximal.
We then ask: out of the models of this form, which one should we choose? The generative model $p_\theta$ is introduced as an auxiliary distribution while constructing a lower bound on the mutual information, but that's perhaps not the best way to do this.
So there are two families of joint distributions over latents and observables here. On one hand we have $q_\psi(z\vert x)p_\mathcal{D}(x)$, and on the other we have $p(z)p_\theta(x\vert z)$. The $\beta$-VAE (or plain VAE) objective tries to move these two models closer to one another. From the perspective of $q_\psi(z\vert x)p_\mathcal{D}(x)$, this can be understood as trying to maximise mutual information while reproducing the prior $p(z)$. From the perspective of $p(z)p_\theta(x\vert z)$, it can be understood as trying to maximise the data likelihood, i.e. to reproduce $p_\mathcal{D}$, and, if the $\beta$-VAE objective is used, to additionally maximise information, too.
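One way to make the "moving the two models closer" statement precise (spelled out here for completeness; it is not needed for the derivation above) is to look at the KL divergence between the two joint distributions, which decomposes in two ways:
\begin{align}
\operatorname{KL}[q_\psi(z\vert x)p_\mathcal{D}(x)\| p(z)p_\theta(x\vert z)]
&= \mathbb{E}_{p_\mathcal{D}}\operatorname{KL}[q_\psi(z\vert x)\| p(z)] - \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)}\log p_\theta(x\vert z) - \mathbb{H}[X]\\
&= \operatorname{KL}[q_\psi(z)\| p(z)] + \mathbb{E}_{q_\psi(z)}\operatorname{KL}[q_\psi(x\vert z)\| p_\theta(x\vert z)]
\end{align}
The first line is, up to the constant $\mathbb{H}[X]$, the plain VAE objective ($\beta = 1$) averaged over the data; the second line reads the same quantity from the encoder side: reproduce the prior in aggregate and make the decoder match the inverted posterior.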
This symmetry of variational learning has been noted a few times:
* [ying-yang machines](https://link.springer.com/chapter/10.1007/978-3-662-07952-2_22)
* [adversarially learned inference](https://arxiv.org/abs/1606.00704)