Auto-Encoding Variational Bayes
===
**Authors**: Durk Kingma, Max Welling
**Year**: 2014
**Link**: https://arxiv.org/pdf/1312.6114.pdf
## Background
Motivating example: adding home-court advantage to [TrueSkill, Minka et al 2006](https://papers.nips.cc/paper/2006/file/f44ee263952e65b3610b8ba51229d1f9-Paper.pdf).

The goal is to estimate the latent variables (i.e. the skill parameters $\mu, \sigma$), given the observed data (i.e. win/loss events).
$p(z \mid x)$
* This model is quite complex.
* Exact inference is not possible -- posterior is intractable.
* All those truncated distributions at the end...
Variational methods offer a solution:
* Devise an approximate model $q_\phi(z \mid x)$ (aka recognition model) that is well known and tractable.
* Find parameters $\phi$ that minimize $\text{KL}(q \mid\mid p)$.
* In many cases you cannot minimize the KL directly (it contains the intractable posterior), so you optimize an equivalent formulation instead.
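For reference, the divergence being minimized is the standard KL between the approximation and the true posterior:
$$
\text{KL}(q \mid\mid p) = \text{E}_{q_\phi(z \mid x)}[\text{log }q_\phi(z \mid x) - \text{log }p(z \mid x)]
$$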
This is *NOT* straightforward.
* You have to make very conscious choices about the form of $q$ so that it remains tractable.
* Even then, all updates have to be derived manually.
* Update derivations for relatively trivial problems are fairly involved ... [Clutter Problem, Minka's thesis, page 21](https://tminka.github.io/papers/ep/minka-thesis.pdf)
What else?
* Minimizing the KL is equivalent to maximizing the evidence lower-bound (ELBO) -- you can show that the log-likelihood of the data equals the ELBO plus the KL ([Understanding the Variational Lower-Bound](http://users.umiacs.umd.edu/~xyang35/files/understanding-variational-lower.pdf)).
$$
L = \text{log }p(X) - \text{KL}[q(Z) \mid\mid p(Z \mid X)]
$$
* ELBO is defined as
$$
L = \text{E}_q[\text{log }p(X, Z)] + H[Z] \\
H[Z] = - \text{E}_q[\text{log }q(Z)]
$$
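Quick check that these two statements agree: expanding the KL against the true posterior,
$$
\text{KL}[q(Z) \mid\mid p(Z \mid X)] = \text{E}_q[\text{log }q(Z)] - \text{E}_q[\text{log }p(X, Z)] + \text{log }p(X) = -L + \text{log }p(X)
$$
so $\text{log }p(X) = L + \text{KL} \geq L$: with the evidence fixed, maximizing the ELBO is the same as minimizing the KL.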
Black-box inference using gradients?
* Why not just take gradients of the ELBO and update using SGD?
* The catch: the expectation is taken w.r.t. $q_\phi$ itself, so the naive Monte Carlo (score-function) gradient estimator w.r.t. $\phi$ has very high variance -- this is the problem the paper addresses.
## AEVB
AEVB is a general and flexible framework for automatic variational inference.
## Setup
You have a dataset $X$ consisting of $N$ i.i.d. samples. Each point $x^i$ can be continuous or discrete.
Data is generated in a 2 step process:
1. A value $z^i$ is sampled from the latent variable $z$ with prior probability $p(z)$.
2. A value $x^i$ is generated according to conditional probability distribution $p(x \mid z)$.
$z$ is never observed; all parameters of the densities are unknown. If $p(x \mid z)$ is parameterized by a neural network, its weights are the parameters $\theta$.
You want to estimate the parameters, and the latent variables.
AEVB makes *NO* simplifying assumptions about the posterior $p(z \mid x)$, or the marginal $p(x)$.
Interested in a general algorithm that works in the following cases:
1. Intractability: the marginal $p(x)$ is intractable, the posterior is intractable, and any integrals required for mean-field approaches are also intractable.
2. Large datasets: full-batch optimization is too costly, and sampling-based methods (e.g. MCMC / Monte Carlo EM) are too slow. We would like to work with small mini-batches.
$$
p(z \mid x) = \frac{p(x \mid z) p(z)}{p(x)}
$$
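The denominator is the problem: it requires marginalizing over $z$,
$$
p(x) = \int p(x \mid z)\, p(z)\, dz
$$
which has no closed form when $p(x \mid z)$ is parameterized by a neural network.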
$p(z \mid x)$ is intractable, so we introduce an approximation of the true posterior, $q_{\phi}(z \mid x)$. In classical VI, you ensure that $q_{\phi}$ factorizes into well-known distributions that are tractable (i.e. admit closed-form updates). AEVB assumes the posterior is intractable and cannot be factored easily.
## Why Encoder-Decoder?
> From coding theory perspective, $z$ can be interpreted as a latent representation, or a *code*. Hence, the approximation $q(z \mid x)$ can be thought of as an *encoder* -- given datapoint $x$, the encoder generates a distribution over values of $z$ from which the datapoint $x$ could have been generated.
> Similarly, we refer to $p(x \mid z)$ as a *decoder* that generates a distribution over $x$.
In effect, we are trying to estimate the parameters and the latent variables by:
1. Estimating the (distribution over the) latent variables.
2. Sampling from that latent distribution.
3. Reconstructing the input from the sample using the likelihood $p(x \mid z)$.
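A minimal sketch of these three steps in PyTorch (the layer sizes, module names, and the Gaussian-encoder / Bernoulli-decoder choice here are illustrative assumptions, not prescribed by the paper):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        # Encoder q_phi(z|x): outputs mean and log-variance of a diagonal Gaussian.
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        # Decoder p_theta(x|z): outputs Bernoulli parameters per dimension.
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def encode(self, x):                   # step 1: distribution over z
        h = self.enc(x)
        return self.enc_mu(h), self.enc_logvar(h)

    def reparameterize(self, mu, logvar):  # step 2: sample z = mu + sigma * eps
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * logvar) * eps

    def decode(self, z):                   # step 3: reconstruct x from z
        return self.dec(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
```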
## ELBO
We want to find $q_\phi$ that is closest to $p_{\theta}(z \mid x)$. The KL divergence gives us one measure of closeness, but minimizing it directly is not possible.

We only care about Eq. 3 of the paper. The expectation is w.r.t. $q_\phi$: $\sum_{z} q_\phi (z \mid x)\ \text{log}\ p_\theta(x \mid z)$.
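For reference, the form being discussed (a KL term plus a reconstruction term, written per datapoint) is:
$$
L(\theta, \phi; x^{i}) = -\text{KL}[q_\phi(z \mid x^{i}) \mid\mid p_\theta(z)] + \text{E}_{q_\phi(z \mid x^{i})}[\text{log }p_\theta(x^{i} \mid z)]
$$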
**How to interpret this formulation?**
The 2nd term is the log-likelihood of the given input under the decoder, for the $z$ that we have sampled. It is maximized only when high probability is assigned to the original/true value of $x$. Think of this as (the negative of) the *reconstruction error*.
The first term (KL divergence) is a form of a regularizer.

There is a chance that the network learns to simply *copy* the input into the *code*. Classic autoencoders deal with this in a few ways:
* Limit the number of units -- forced to learn only the most representative features.
* Corrupt the input, but use the original in the reconstruction -- downsample, noise, etc.
In a VAE, the KL term plays this role: it forces the encoding to be similar to the prior distribution -- if your prior $p(z)$ is a Normal, it will force the codes to resemble Normals.
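With a standard Normal prior and a diagonal-Gaussian encoder, this KL has a closed form (Appendix B of the paper):
$$
-\text{KL}[q_\phi(z \mid x) \mid\mid p(z)] = \frac{1}{2} \sum_{j=1}^{K} \left(1 + \text{log }\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)
$$
where $K$ is the dimensionality of $z$ and $\mu_j, \sigma_j$ are outputs of the encoder.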
There are also some other advantages of sampling:
* It automatically acts as a noise inducer -- nearby samples of a code are expected to produce similar outputs.
* This is also what ensures a smooth interpolation between two points in the code space.
## SGVB estimator
We want to take derivatives of the ELBO w.r.t $\theta, \phi$ and then optimise using SGD, Adam etc.
One issue is that $z$ is still a random variable: $z \sim q_\phi(z \mid x)$. The paper proposes transforming it into a deterministic, differentiable function of an auxiliary noise variable:
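$$
z = g_\phi(\epsilon, x), \quad \epsilon \sim p(\epsilon)
$$
where $\epsilon$ is an auxiliary noise variable and $g_\phi$ is differentiable in $\phi$ (this is the reparameterization from Sec. 2.4 of the paper).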

This leads to the following approximation of the ELBO:
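From the paper, the version where the KL term is kept analytic:
$$
\tilde{L}(\theta, \phi; x^{i}) = -\text{KL}[q_\phi(z \mid x^{i}) \mid\mid p_\theta(z)] + \frac{1}{L}\sum_{l=1}^{L} \text{log }p_\theta(x^{i} \mid z^{i,l}), \qquad z^{i,l} = g_\phi(\epsilon^{i,l}, x^{i}),\ \epsilon^{i,l} \sim p(\epsilon)
$$
and, for a mini-batch of $M$ points drawn from the full dataset of size $N$,
$$
L(\theta, \phi; X) \simeq \frac{N}{M} \sum_{i=1}^{M} \tilde{L}(\theta, \phi; x^{i})
$$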

which can be used with mini-batches and a single sample per datapoint ($L=1$), as long as the mini-batch is large enough (the paper uses $M = 100$).

## Re-parameterization Trick
In some conditions, it is possible to express the random variable $z$ as a deterministic variable $z = g_\phi(\epsilon, x)$, where $\epsilon$ is an auxiliary noise variable with an independent marginal $p(\epsilon)$ and $g_\phi$ is a differentiable function.
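The paper's running example is the Gaussian case: with $q_\phi(z \mid x) = \mathcal{N}(\mu, \sigma^2 I)$ (both $\mu$ and $\sigma$ produced by the encoder), a valid choice is
$$
z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
$$
All randomness now lives in $\epsilon$, so gradients flow through $\mu$ and $\sigma$ -- this is exactly the `reparameterize` step in the earlier sketch.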

## AEVB algorithm
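Algorithm 1 in the paper amounts to: sample a mini-batch, sample the noise $\epsilon$, take gradients of the estimator above w.r.t. both $\theta$ and $\phi$, and update with any SGD-style optimizer. A minimal sketch in PyTorch, assuming the hypothetical `VAE` class from the earlier sketch and Bernoulli (e.g. binarized-image) data; `data_loader` is a stand-in for your mini-batch iterator:

```python
import torch
import torch.nn.functional as F

def negative_elbo(x_recon, x, mu, logvar):
    # -E_q[log p(x|z)] for a Bernoulli decoder (summed over the mini-batch).
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # Analytic KL[q_phi(z|x) || N(0, I)] for a diagonal-Gaussian encoder (Appendix B).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # minimizing this maximizes the ELBO

model = VAE()  # hypothetical class from the earlier sketch
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for x in data_loader:               # stand-in mini-batch iterator, x in [0, 1]
    x = x.view(x.size(0), -1)       # flatten to (batch, x_dim)
    x_recon, mu, logvar = model(x)  # forward pass samples eps ~ N(0, I) internally
    loss = negative_elbo(x_recon, x, mu, logvar)
    opt.zero_grad()
    loss.backward()                 # gradients w.r.t. both theta (decoder) and phi (encoder)
    opt.step()
```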

* Note that we **did not** need NNs specifically anywhere above (the sketches are just one instantiation). VAE is the most popular instantiation of AEVB, but you can do more.
* The decoder can have any functional form as long as it's differentiable. You can even use standard distributions: [TF Probability](https://www.tensorflow.org/probability/overview).
## Things I skipped from this paper
* Other ELBO estimators, and their tradeoffs. There is an estimator that does not compute the KL divergence analytically (it samples it instead). Another does not use the reparameterization trick and has higher variance.
* Suggestions on the deterministic differentiable transform function -- you can use more than just Gaussians as priors.
* The encoder has a diagonal covariance -- it assumes all "codes" are independent. This might make some sense for simple datasets like MNIST (code0 represents the digit, code1 the size, code2 the stroke thickness, code3 the tilt, etc.), but might not make sense for others. It also simplifies computation -- the encoder output is just $2 \times K$ values, where $K$ is the dimensionality of $z$: the first $K$ values are the means, and the next $K$ are the standard deviations (see the small snippet after this list).
* This only works for continuous latent variables, since the reparameterization requires differentiating through the sampling step. The paper mentions one other work with comparable time complexity that is also applicable to discrete variables.
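A tiny illustration of that $2 \times K$ split (all names and sizes here are made up):

```python
import torch

K = 20                                   # dimensionality of z (illustrative)
h = torch.randn(32, 2 * K)               # hypothetical encoder output for a batch of 32
mu, log_sigma = h[:, :K], h[:, K:]       # first K values: means, next K: (log) standard deviations
z = mu + torch.exp(log_sigma) * torch.randn_like(mu)  # reparameterized sample, one per input
```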
## Auto-Encoding Sentences!
[Generating Sentences from a Continuous Space, Bowman et al 2016](https://arxiv.org/pdf/1511.06349.pdf)
Can you learn a *global* latent code for sentences? And use this for generating meaningful sentences?
