Auto-Encoding Variational Bayes
===

**Authors**: Durk Kingma, Max Welling
**Year**: 2014
**Link**: https://arxiv.org/pdf/1312.6114.pdf

## Background

Adding home-court advantage in [TrueSkill, Minka et al 2006](https://papers.nips.cc/paper/2006/file/f44ee263952e65b3610b8ba51229d1f9-Paper.pdf).

![](https://i.imgur.com/BiQAReZ.png)

The goal is to estimate the latent variables (i.e. the skills $\mu, \sigma$), given the observed data (i.e. win/loss events): $p(z \mid x)$.

* This model is quite complex.
* Exact inference is not possible -- the posterior is intractable.
* All those truncated distributions at the end...

Variational methods offer a solution:

* Devise an approximate model $q_\phi(z \mid x)$ (aka recognition model) that is well known and tractable.
* Find parameters $\phi$ that minimize $\text{KL}(q \mid\mid p)$.
* In many cases, you cannot minimize the KL directly, so you optimize a different formulation.

This is *NOT* straightforward:

* You have to make very conscious choices about the form of $q$ so that it remains tractable.
* Even then, all updates have to be derived manually.
* Update derivations for relatively trivial problems are fairly involved ... [Clutter Problem, Minka's thesis, page 21](https://tminka.github.io/papers/ep/minka-thesis.pdf)

What else?

* Minimizing the KL is equivalent to maximizing the evidence lower bound (ELBO) -- you can show that the log-likelihood of the data equals ELBO + KL ([Understanding the Variational Lower Bound](http://users.umiacs.umd.edu/~xyang35/files/understanding-variational-lower.pdf)):

$$
L = \text{log }p(X) - \text{KL}[q(Z) \mid\mid p(Z \mid X)]
$$

* The ELBO is defined as

$$
L = \text{E}_q[\text{log }p(X, Z)] + H[Z] \\
H[Z] = -\text{E}_q[\text{log }q(Z)]
$$

Black-box inference using gradients?

* Why not just take gradients of the ELBO and update using SGD?

## AEVB

AEVB is a general and flexible framework for automatic variational inference.

## Setup

You have a dataset $X$ consisting of $N$ i.i.d. samples. Each point $x^i$ can be continuous or discrete. Data is generated in a two-step process:

1. A value $z^i$ is sampled from the latent variable $z$ with prior probability $p(z)$.
2. A value $x^i$ is generated according to the conditional distribution $p(x \mid z)$.

$z$ is unseen; all parameters of the densities are unknown; if $p(x \mid z)$ is a neural network, then its weights are the parameters $\theta$. You want to estimate the parameters and the latent variables.

AEVB makes *NO* simplifying assumptions about the posterior $p(z \mid x)$ or the marginal $p(x)$. We are interested in a general algorithm that works in the following cases:

1. Intractability: the marginal $p(x)$ is intractable, the posterior is intractable, and any integrals required for mean-field approaches are also intractable.
2. Large datasets: full-batch optimization is not possible, and expensive sampling methods are also not possible. We would like to work with small mini-batches.

$$
p(z \mid x) = \frac{p(x \mid z) p(z)}{p(x)}
$$

$p(z \mid x)$ is intractable, so we introduce an approximation of the true posterior, $q_{\phi}(z \mid x)$. In classical VI, you ensure that $q_{\phi}$ can be factorized into well-known distributions that are tractable (i.e. admit closed-form updates). AEVB assumes the form is intractable and cannot be factored easily.
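To make the setup concrete, here is a tiny numpy sketch of the two-step generative process above. Everything in it (dimensions, the toy tanh decoder, the noise scale) is an assumption for illustration, not something specified by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" decoder p(x | z): a fixed one-hidden-layer network
# mapping a K-dim latent code to the mean of a D-dim Gaussian over x.
K, D, H = 2, 5, 16
W1, b1 = rng.normal(size=(H, K)), np.zeros(H)
W2, b2 = rng.normal(size=(D, H)), np.zeros(D)

def decode(z):
    h = np.tanh(W1 @ z + b1)
    return W2 @ h + b2                      # mean of p(x | z)

# Step 1: z^i ~ p(z), a standard Normal prior.
# Step 2: x^i ~ p(x | z^i), here a Gaussian centred at decode(z^i).
N = 100
Z = rng.standard_normal(size=(N, K))
X = np.stack([decode(z) + 0.1 * rng.standard_normal(D) for z in Z])

# In practice only X is observed; Z and the network weights are unknown.
```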
## Why Encoder-Decoder?

> From a coding theory perspective, $z$ can be interpreted as a latent representation, or a *code*. Hence, the approximation $q(z \mid x)$ can be thought of as an *encoder* -- given a datapoint $x$, the encoder produces a distribution over the values of $z$ from which the datapoint $x$ could have been generated.
>
> Similarly, we refer to $p(x \mid z)$ as a *decoder* that produces a distribution over $x$.

In effect, we are trying to estimate the parameters and the latent variables by:

1. Estimating the distribution over the latent variables.
2. Sampling from the latent distribution.
3. Reconstructing the sample using the likelihood $p(x \mid z)$.

## ELBO

We want to find the $q_\phi$ that is closest to $p_{\theta}(z \mid x)$. The KL divergence gives us one measure, but minimizing it directly is not possible.

![](https://i.imgur.com/rMPIlkK.png)

We only care about Eq. 3. The expectation is w.r.t. $q_\phi$: $\sum_{z} q_\phi (z \mid x)\ \text{log}\ p_\theta(x \mid z)$.

**How to interpret this formulation?**

The second term is the log-likelihood of the given input, for the $z$ that we have sampled. This term is maximized only when the highest probability is assigned to the original/true value of $x$. Think of this as the *reconstruction error*.

The first term (the KL divergence) acts as a regularizer.

![](https://i.imgur.com/b0AGdWC.png)

There is a chance that the network learns to *copy* the input into the *code*. Common remedies:

* Limit the number of units -- the network is forced to learn only the most representative features.
* Corrupt the input, but use the original in the reconstruction -- downsample, add noise, etc.

The KL term forces the encoding to be similar to the prior distribution -- if your prior $p(z)$ is a Normal, then this will force the codes to resemble Normals.

There are also some other advantages of sampling:

* It automatically acts as a noise inducer -- because you expect similar outputs from nearby samples of a code.
* This is also what ensures a smooth interpolation between two points in the code space.

## SGVB estimator

We want to take derivatives of the ELBO w.r.t. $\theta, \phi$ and then optimize using SGD, Adam, etc. One issue is that $z$ is still a random variable: $z \sim q_\phi(z \mid x)$. The paper proposes transforming it into a differentiable function:

![](https://i.imgur.com/FHnW0Lj.png)

This leads to the following approximation of the ELBO:

![](https://i.imgur.com/8U3SopM.png)

which can be used with mini-batches, and a single sample $L=1$.

![](https://i.imgur.com/Ck8pr24.png)

## Re-parameterization Trick

Under some conditions, it is possible to express the random variable $z$ as a deterministic variable $z = g_\phi(\epsilon, x)$.

![](https://i.imgur.com/teA1iZI.png)

## AEVB algorithm

![](https://i.imgur.com/UzNJdye.png)

* We **did not** talk specifically about NNs. The VAE is the most popular instantiation of AEVB, but you can do more.
* The decoder can have any functional form as long as it's differentiable. You can even use standard distributions: [TF Probability](https://www.tensorflow.org/probability/overview).
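A quick way to see what the trick buys us: below is a tiny PyTorch check (the numbers and the toy objective are made up for illustration) showing that once $z$ is written as a deterministic function of the parameters plus external noise, gradients flow through the sample to $\mu$ and $\sigma$.

```python
import torch

# Sampling z ~ N(mu, sigma^2) directly gives no gradient path to (mu, sigma).
# With the reparameterization z = mu + sigma * eps, eps ~ N(0, 1), it does.
mu = torch.tensor(0.5, requires_grad=True)
log_sigma = torch.tensor(-1.0, requires_grad=True)

eps = torch.randn(1000)                 # external noise, carries no parameters
z = mu + eps * log_sigma.exp()          # z = g_phi(eps, x), deterministic in (mu, sigma)
loss = (z ** 2).mean()                  # stand-in for any downstream objective
loss.backward()

print(mu.grad, log_sigma.grad)          # well-defined gradients through the sample
```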
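Putting the pieces together, here is a minimal PyTorch sketch of one AEVB update for the most common instantiation: a diagonal-Gaussian encoder, a Bernoulli decoder, the closed-form Gaussian KL, and a single-sample ($L=1$) reconstruction estimate. Layer sizes, the optimizer, and the random mini-batch are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal Gaussian-encoder / Bernoulli-decoder VAE (one AEVB instantiation)."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, 2 * z_dim))   # outputs [mu, log-variance]
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim))       # outputs Bernoulli logits

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        eps = torch.randn_like(mu)                 # eps ~ N(0, I)
        z = mu + eps * torch.exp(0.5 * logvar)     # reparameterization: z = g_phi(eps, x)
        return self.dec(z), mu, logvar

def neg_elbo(x, logits, mu, logvar):
    # Reconstruction term: -E_q[log p(x|z)], estimated with a single sample (L = 1).
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # KL(q(z|x) || N(0, I)) for a diagonal Gaussian, in closed form.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# One gradient step on a random mini-batch (shapes are assumptions for illustration).
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(128, 784).round()        # stand-in for binarized image data
logits, mu, logvar = model(x)
loss = neg_elbo(x, logits, mu, logvar)
opt.zero_grad()
loss.backward()
opt.step()
```

Note the encoder's last layer has $2K$ outputs: the first $K$ are the means and the last $K$ parameterize the (log-)variances, matching the diagonal-covariance form discussed below.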
## Things I skipped from this paper

* Other ELBO estimators, and their tradeoffs. There is an estimator that does not include the KL divergence. Another does not use the reparameterization trick, and has higher variance.
* Suggestions for the deterministic differentiable transform function -- you can use more than just Gaussians as priors.
* The encoder has a diagonal covariance -- it assumes all "codes" are independent. This might make some sense for simple datasets like MNIST: code0 represents the digit, code1 represents size, code2 represents thickness, code3 represents tilt, etc. But it might not make sense for others. It also simplifies computation -- the encoder output is just $2 \times K$, where $K$ is the dimensionality of $z$: the first $K$ values are the means, and the next $K$ are the standard deviations.
* This only works for continuous latent variables -- as with SGD. The paper mentions one other paper that has comparable time complexity and is also applicable to discrete variables.

## Auto-Encoding Sentences!

[Generating Sentences from a Continuous Space, Bowman et al 2016](https://arxiv.org/pdf/1511.06349.pdf)

Can you learn a *global* latent code for sentences? And use it to generate meaningful sentences?

![](https://i.imgur.com/NmkaDEF.png)
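For flavour, here is a rough sketch of what such a model could look like: an RNN encoder compresses the sentence into a Gaussian code $z$, and an RNN decoder is conditioned on that single global code. All architectural details (GRUs, sizes, the way $z$ initializes the decoder) are assumptions for illustration, not the exact Bowman et al. setup.

```python
import torch
import torch.nn as nn

class SentenceVAE(nn.Module):
    """Sketch of a sentence VAE: RNN encoder -> Gaussian code z -> RNN decoder."""
    def __init__(self, vocab=10000, emb=128, hid=256, z_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.to_mu_logvar = nn.Linear(hid, 2 * z_dim)
        self.z_to_h = nn.Linear(z_dim, hid)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, tokens):                        # tokens: (batch, seq_len) ids
        _, h = self.encoder(self.embed(tokens))       # h: (1, batch, hid)
        mu, logvar = self.to_mu_logvar(h[-1]).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterized code
        h0 = torch.tanh(self.z_to_h(z)).unsqueeze(0)  # global code conditions the decoder
        # Teacher forcing with unshifted inputs, just to keep the sketch short.
        dec_h, _ = self.decoder(self.embed(tokens), h0)
        return self.out(dec_h), mu, logvar            # per-token logits + code stats

model = SentenceVAE()
tokens = torch.randint(0, 10000, (4, 12))             # toy batch of token ids
logits, mu, logvar = model(tokens)
print(logits.shape)                                    # (4, 12, 10000)
```

Training would use the same ELBO as before (token-level cross-entropy plus the Gaussian KL); interpolating between two codes $z$ and decoding is what gives the smooth sentence interpolations the paper is known for.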