Variational autoencoders 1 - the basics
==
$$\DeclareMathOperator\supp{supp}
\DeclareMathOperator*{\argmax}{arg\,max}$$
The Variational Autoencoder (VAE) is a not-so-new-anymore Latent Variable Model ([Kingma & Welling, 2014](https://arxiv.org/pdf/1312.6114.pdf)) which, by introducing a probabilistic interpretation of autoencoders, allows us not only to estimate the variance/uncertainty in the predictions, but also to inject domain knowledge through the use of informative priors, and possibly to make the latent space more interpretable. VAEs have various applications:
- density estimation and sampling from the estimated density, i.e., data generation (e.g., image/sound/text generation and _missing data imputation_)[^1]
- semi-supervised learning
- representation learning for downstream tasks
## VAEs are generative models ##
Unlike classical (sparse, denoising, etc.) autoencoders, VAEs are probabilistic *generative* models, like GANs (Generative Adversarial Networks). With *generative* we denote a model which learns the probability distribution $p(\mathbf{x})$ over the input space $\mathcal{X}$. This means that after training such a model, we can sample from our approximation of $p(\mathbf{x})$. If our training set contains handwritten digits ([MNIST](http://yann.lecun.com/exdb/mnist/)), the trained generative model is able to create images which look like handwritten digits, even though they're not "copies" of the images in the training set. Similarly, if our training set contains natural images ([CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html)) or celebrity faces ([CelebA](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html)), it will generate images which look like photographs or faces.
Learning the distribution of the images in the training set implies that images which look like handwritten digits (for example) have a high probability of being generated, while images which look like the Jolly Roger or random noise have a low probability. In other words, it means learning the dependencies among pixels: if our image is a $28\times 28=784$ pixels grayscale image from MNIST, the model should learn that if a pixel is very bright, then there's a significant probability that some neighboring pixels are bright too; that if we have a long, slanted line of bright pixels, we may have another smaller, horizontal line of pixels above it (a 7); and so on.

## VAEs are Latent Variable models ##
The VAE is a _Latent Variable Model_ (LVM): this means that $\mathbf{x}$, the random vector of the 784 pixel intensities (the _observed_ variables), is modeled as a (possibly very complicated) function of a random vector $\mathbf{z}\in\mathcal{Z}$ of lower dimensionality, whose components are unobserved (_latent_) variables. Concisely, we assume a data generating process with the following DAG (Directed Acyclic Graph):
$$ \begin{align}
\mathbf{z} &\sim p_{\theta^*}(\mathbf{z}) \\
\mathbf{x}\vert\mathbf{z} &\sim p_{\theta^*}(\mathbf{x}\vert\mathbf{z})
\end{align}$$

And we assume that both the prior $p_{\theta^*}(\mathbf{z})$ and the likelihood $p_{\theta^*}(\mathbf{x}\vert\mathbf{z})$ come from some parametric families (possibly different).
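To make the data generating process concrete, here's a minimal sketch of ancestral sampling from such a two-level model. The specific prior and likelihood below (a standard normal prior, and a Gaussian likelihood whose mean is a fixed nonlinear function of $\mathbf{z}$) are made up for illustration; they are not the choices we'll make later for the VAE.

```python
# Ancestral sampling from a toy two-level latent variable model:
# first z ~ p(z), then x | z ~ p(x | z).
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x = 2, 5
W = rng.standard_normal((d_z, d_x))   # fixed, made-up "decoder" weights (playing the role of theta*)

def sample_x(n):
    z = rng.standard_normal((n, d_z))                     # z ~ p(z) = N(0, I)
    mean_x = np.tanh(z @ W)                               # a nonlinear function of z
    return mean_x + 0.1 * rng.standard_normal((n, d_x))   # x | z ~ N(tanh(zW), 0.1^2 I)

print(sample_x(3))                                        # three samples of x; z is never observed
```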
Latent variable models are also sometimes called _hierarchical_ or _multilevel_ models, and they are models that use the rules of conditional probability to specify complicated probability distributions over high dimensional spaces, by composing simpler probability density functions. The original VAE has only two levels, but recently [a paper on a deep hierarchical VAE](https://arxiv.org/abs/1902.02102) with multiple levels of latent variables has been published (note that the hierarchy of latent variables in a probabilistic model has nothing to do with the layers of a neural network).
When does a latent variable model make intuitive sense? For example, in the MNIST case we think that the handwritten digits belong to a manifold of dimension much smaller than the dimension of $\mathbf{x}$, because the vast majority of random arrangements of 784 pixel intensities don't look at all like handwritten digits. The representation of the image should be equivariant to certain transformations (e.g., rotations, translations and small deformations) but not to others. So in this case it makes sense to think that the image samples are generated by taking samples (_which we don't observe_) in a sample space of much smaller dimension, and then transforming them according to some complicated function.
## Estimating the model
We are now given a dataset consisting of $N$ i.i.d. samples, $\mathbf{X}=\{\mathbf{x}_i\}_{i=1}^N$ (note again that we do **not** observe $\mathbf{z}$) and we want to estimate $\theta^*$. The standard "recipe" would be to maximize the likelihood of the data (marginalized over the latent variables), i.e., Maximum Likelihood Estimation:
$$ \hat{\theta}=\argmax_{\theta\in\Theta}p_{\theta}(\mathbf{X}) =\argmax_{\theta\in\Theta}\sum_{i=1}^N\log{p_{\theta}(\mathbf{x}_i)} $$
However, we're immediately faced with two challenges:
1. computing $p_{\theta}(\mathbf{x}_i)$ requires solving the integral
$$ p_{\theta}(\mathbf{x}_i) = \int p_{\theta}(\mathbf{x}_i\vert\mathbf{z})p_{\theta}(\mathbf{z})d\mathbf{z} \label{a}\tag{1} $$
which is often intractable [^2], even in the simple case of a one-hidden-layer nonlinear neural network.
2. we need to compute the integral for _all_ $N$ data samples $\mathbf{x}_i$ of a large dataset, which rules out batch optimization, as well as sampling-based solutions such as [Monte Carlo EM](https://arxiv.org/abs/1206.4768), which would require an expensive sampling step for each $\mathbf{x}_i$.
We need to get _waaay_ smarter.
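To see the first problem in action, here's a naive Monte Carlo estimate of the integral in Eq. $\ref{a}$, sampling $\mathbf{z}$ from the prior. The one-dimensional toy prior and likelihood below are stand-ins I made up for illustration: even in this easy case the estimate is noisy for small sample sizes, and in high dimensions most prior samples explain $\mathbf{x}_i$ so poorly that the number of samples required (times $N$ data points) becomes prohibitive.

```python
# Naive Monte Carlo estimate of the marginal likelihood in Eq. (1):
# p(x) = E_{z ~ p(z)}[p(x | z)] ~ (1/K) * sum_k p(x | z_k),  z_k ~ p(z).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def log_marginal_naive(x, K):
    z = rng.standard_normal(K)                               # z_k ~ p(z) = N(0, 1)
    log_lik = norm.logpdf(x, loc=np.tanh(3 * z), scale=0.1)   # log p(x | z_k), a made-up likelihood
    return np.logaddexp.reduce(log_lik) - np.log(K)           # log-mean-exp, for numerical stability

for K in (10, 1_000, 100_000):
    print(K, log_marginal_naive(x=0.9, K=K))                  # the estimate only settles down for large K
```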
### Enter Variational Inference ###
To solve the first problem, i.e., the intractability of the marginal likelihood, we use an age-old (in ML years) trick: Variational Inference (introduced in [Jordan et al., 1999](https://people.eecs.berkeley.edu/~jordan/papers/variational-intro.pdf) and nicely reviewed in [Blei et al., 2018](https://arxiv.org/pdf/1601.00670.pdf)). VI transforms the inference problem into an optimization problem. This way, we get rid of integration, but we pay a price, as we'll see later. **Fair warning**: this section is more mathy than the rest. I'll introduce necessary concepts such as the [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback-Leibler_divergence), but you may want to skip to the next section on a first read.
#### The Kullback-Leibler divergence
VI starts by introducing a family $\mathcal{Q}$ of parametric approximations $q_{\phi}(\mathbf{z}\vert\mathbf{x})$ to the true posterior distribution $p_{\theta}(\mathbf{z}\vert\mathbf{x})$, indexed by $\phi$. The goal of VI is to find the value(s) of $\phi$ such that $q_{\phi}(\mathbf{z}\vert\mathbf{x})\in\mathcal{Q}$ is "closest" to $p_{\theta}(\mathbf{z}\vert\mathbf{x})$ in some sense. We only assume that $q_{\phi}(\mathbf{z}\vert\mathbf{x})$ is differentiable with respect to $\phi$ (_black box VI_); we don't make further assumptions, such as a fully factorized form for $q_{\phi}$, as in mean field variational inference.
The measure of dissimilarity we seek to minimize is the _Kullback-Leibler divergence_
$$D_{KL}(q_{\phi}(\mathbf{z}\vert\mathbf{x})\vert\vert p_{\theta}(\mathbf{z}\vert\mathbf{x}))=\int q_{\phi}(\mathbf{z}\vert\mathbf{x}) \log{\frac{q_{\phi}(\mathbf{z}\vert\mathbf{x})}{p_{\theta}(\mathbf{z}\vert\mathbf{x})}}d\mathbf{z} $$
which quantifies how different two probability distributions are. For the following, we'll need two properties of $D_{KL}$:
1. $D_{KL}(q\vert\vert p)\ge0 \ \forall p,q$
2. $D_{KL}(q\vert\vert p) = 0 \iff p = q \ \text{a.e.}$
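A quick numerical illustration of these two properties, using one-dimensional Gaussians (for which $D_{KL}$ is also available in closed form); this is just a sanity check, not part of the VAE machinery.

```python
# D_KL(q || p) for 1-D Gaussians: closed form vs. a Monte Carlo estimate of the integral above.
import numpy as np

rng = np.random.default_rng(0)

def kl_gauss(mu_q, sig_q, mu_p, sig_p):
    # closed-form KL between N(mu_q, sig_q^2) and N(mu_p, sig_p^2)
    return np.log(sig_p / sig_q) + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5

def kl_mc(mu_q, sig_q, mu_p, sig_p, n=200_000):
    z = mu_q + sig_q * rng.standard_normal(n)                 # z ~ q
    log_q = -0.5 * ((z - mu_q) / sig_q)**2 - np.log(sig_q)    # additive constants cancel in the difference
    log_p = -0.5 * ((z - mu_p) / sig_p)**2 - np.log(sig_p)
    return np.mean(log_q - log_p)                             # E_q[log q - log p]

print(kl_gauss(1.0, 0.5, 0.0, 1.0), kl_mc(1.0, 0.5, 0.0, 1.0))  # property 1: positive (and the two agree)
print(kl_gauss(0.0, 1.0, 0.0, 1.0))                             # property 2: exactly 0 when q = p
```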
#### Some ELBO grease
However, our _primary_ goal is to estimate the generative model parameters $\theta^*$ through Maximum (Marginal) Likelihood Estimation. Thus, let's try to rewrite the marginal log-likelihood of a data point in terms of $D_{KL}(q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\vert\vert p_{\theta}(\mathbf{z}\vert\mathbf{x}_i))$:
$$ \begin{multline}\log{p_{\theta}(\mathbf{x}_i)} = \log{p_{\theta}(\mathbf{x}_i)}\int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)d\mathbf{z} = \int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\log{p_{\theta}(\mathbf{x}_i)} d\mathbf{z} = \\ \int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\log{\frac{p_{\theta}(\mathbf{x}_i, \mathbf{z})}{p_{\theta}(\mathbf{z}\vert\mathbf{x}_i)}} d\mathbf{z} = \int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i) \log{\frac{p_{\theta}(\mathbf{x}_i, \mathbf{z})}{p_{\theta}(\mathbf{z}\vert\mathbf{x}_i)}\frac{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}} d\mathbf{z} \end{multline}$$
where in the second-to-last equality we used Bayes' rule, and the last equality holds if $\supp {p_{\theta}(\mathbf{x}_i, \mathbf{z})}\subset\supp{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}$ (otherwise it becomes an inequality). We can now see a term similar to $D_{KL}(q_{\phi}(\mathbf{z}\vert\mathbf{x})\vert\vert p_{\theta}(\mathbf{z}\vert\mathbf{x}))$ inside the integral, so let's pop it out:
$$ \begin{multline}\log{p_{\theta}(\mathbf{x}_i)} = \int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i) \log{\frac{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}{p_{\theta}(\mathbf{z}\vert\mathbf{x}_i)}\frac{p_{\theta}(\mathbf{x}_i, \mathbf{z})}{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}}d\mathbf{z} = \\ \int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i) \log{\frac{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}{p_{\theta}(\mathbf{z}\vert\mathbf{x}_i)}}d\mathbf{z}+\int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\log{\frac{p_{\theta}(\mathbf{x}_i, \mathbf{z})}{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}} d\mathbf{z}=\\ D_{KL}(q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\vert\vert p_{\theta}(\mathbf{z}\vert\mathbf{x}_i))+ \mathcal{L}(\phi,\theta;\mathbf{x}_i) \end{multline}$$
Summarizing:
$$ \log{p_{\theta}(\mathbf{x}_i)} = D_{KL}(q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\vert\vert p_{\theta}(\mathbf{z}\vert\mathbf{x}_i))+ \mathcal{L}(\phi,\theta;\mathbf{x}_i) \label{b}\tag{2}$$
The term $\mathcal{L}(\phi,\theta;\mathbf{x}_i)$ is called the _**E**vidence_ (or _variational_) _**L**ower **BO**und_ (_**ELBO**_ for friends), because it's always _no greater than_ the marginal likelihood (or _evidence_) for datapoint $\mathbf{x}_i$, since $D_{KL}(q\vert\vert p)\ge0$:
$$ \log{p_{\theta}(\mathbf{x}_i)} \geq \mathcal{L}(\phi,\theta;\mathbf{x}_i) $$
Thus, maximizing the ELBO pushes in the direction of maximizing the marginal likelihood, our original goal. Crucially, the ELBO doesn't contain the pesky marginal evidence, which is what made the problem intractable: it only involves the joint $p_{\theta}(\mathbf{x}_i, \mathbf{z})$ and the approximate posterior $q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)$.
Dropping the $D_{KL}(q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\vert\vert p_{\theta}(\mathbf{z}\vert\mathbf{x}_i))$ term and summing over all data points, the learning objective of **BBVI** (Black Box Variational Inference) becomes
$$ \max_{\theta}\sum_{i=1}^N\max_{\phi}\mathcal{L}(\phi,\theta;\mathbf{x}_i) \label{c}\tag{3}$$
Summarizing, we learn an LVM by maximizing the ELBO with respect to both the model parameters $\theta$ and the variational parameters $\phi_i$ for each data point $\mathbf{x}_i$.
**Note** that until now, we haven't mentioned either neural networks or VAE. The approach has been very general, and it could apply to any Latent Variable Model which has the DAG representation shown above.
The ELBO can be written as
$$\int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\log{\frac{p_{\theta}(\mathbf{x}_i, \mathbf{z})}{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}} d\mathbf{z}=\mathbb{E}_{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}[\log{p_{\theta}(\mathbf{x}_i, \mathbf{z})}-\log{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}] \label{d}\tag{4}$$
or alternatively:
$$\begin{multline} \int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\log{\frac{p_{\theta}(\mathbf{x}_i, \mathbf{z})}{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}} d\mathbf{z}=\int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\log{\frac{p_{\theta}(\mathbf{x}_i| \mathbf{z})p_{\theta}(\mathbf{z})}{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}} d\mathbf{z} = \\ \int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\log{p_{\theta}(\mathbf{x}_i\vert\mathbf{z})}d\mathbf{z} - \int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\log{\frac{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}{p_{\theta}(\mathbf{z})}} d\mathbf{z} =\\
= \mathbb{E}_{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}[\log{p_{\theta}(\mathbf{x}_i\vert\mathbf{z})}] - D_{KL}(q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\vert\vert p_{\theta}(\mathbf{z}))
\end{multline} \label{e}\tag{5}$$
This last form lends itself to some nice interpretations, as we'll see later[^3].
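To convince ourselves that Eq. $\ref{d}$ and Eq. $\ref{e}$ really are the same quantity, and that the ELBO really is a lower bound, here's a small check on a toy model I picked because everything is tractable in closed form: $p(z)=\mathcal{N}(0,1)$, $p(x\vert z)=\mathcal{N}(z,1)$, so $p(x)=\mathcal{N}(0,2)$ and the true posterior is $p(z\vert x)=\mathcal{N}(x/2,1/2)$. This is only a sanity check of the algebra above, not a VAE.

```python
# The ELBO of Eq. (4) and Eq. (5), estimated by Monte Carlo on a fully tractable toy model.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def elbo(x, m, s, n=200_000):
    z = m + s * rng.standard_normal(n)                         # z ~ q(z|x) = N(m, s^2)
    log_joint = norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1)    # log p(x, z) = log p(z) + log p(x|z)
    log_q = norm.logpdf(z, m, s)
    form4 = np.mean(log_joint - log_q)                         # Eq. (4)
    recon = np.mean(norm.logpdf(x, z, 1))                      # E_q[log p(x|z)]
    kl = np.log(1 / s) + (s**2 + m**2) / 2 - 0.5               # D_KL(q || p(z)), closed form for Gaussians
    return form4, recon - kl                                   # Eq. (4), Eq. (5)

x = 1.3
print(elbo(x, m=0.0, s=1.0))                # the two forms agree up to Monte Carlo noise
print(norm.logpdf(x, 0, np.sqrt(2)))        # log p(x), always >= the ELBO
print(elbo(x, m=x / 2, s=np.sqrt(0.5)))     # q = true posterior: the bound is tight, ELBO ~ log p(x)
```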
#### Bend your ELBO
We list here some useful properties of the ELBO. Not all of them will be needed in the following, but we list them anyway as a useful companion for reading other tutorials or papers on VAEs.
- for each data point $\mathbf{x}_i$, the true posterior distribution $p_{\theta}(\mathbf{z}\vert\mathbf{x}_i)$ may or may not belong to $\mathcal{Q}$. If it does, then property 2 of $D_{KL}$ implies that $\mathcal{L}(\phi_i,\theta;\mathbf{x}_i)$ is maximized for $\phi_i=\phi_i^*\mid q_{\phi_i^*}(\mathbf{z}\vert\mathbf{x}_i)=p_{\theta}(\mathbf{z}\vert\mathbf{x}_i)$, and maximizing the ELBO is equivalent to maximizing the marginal likelihood. If it doesn't, then, even for the optimal value $\phi^*_i$, there will be a nonzero difference between the ELBO and the actual marginal likelihood of the data point, equal to $D_{KL}(q_{\phi^*_i}(\mathbf{z}\vert \mathbf{x}_i)\vert\vert p_{\theta}(\mathbf{z}\vert \mathbf{x}_i))$.

###### adapted from https://deepgenerativemodels.github.io/notes/vae/
The residual error is called the _approximation gap_ ([Cremer et al., 2017](https://arxiv.org/abs/1801.03558)) and it can only be reduced by expanding the variational family $\mathcal{Q}$ ([Kingma et al., 2016](https://arxiv.org/abs/1606.04934), [Tomczak & Welling, 2017](https://arxiv.org/abs/1706.02326) and many others). However, overly flexible inference models can also have side effects ([Shu et al., 2018](https://arxiv.org/abs/1805.08913)).
- combining Eq.$\ref{c}$ and Eq.$\ref{e}$, we see that maximizing the term $\sum_{i=1}^N\mathbb{E}_{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}[\log{p_{\theta}(\mathbf{x}_i\vert\mathbf{z})}]$ maximizes the probability that, under repeated sampling from $\mathbf{z}\sim q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)$, we generate samples $\mathbf{x}$ which are similar to the training samples $\mathbf{x}_i$. For this reason it's often interpreted as a _reconstruction quality_ (or its opposite is interpreted as a reconstruction loss). The term $-D_{KL}(q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\vert\vert p_{\theta}(\mathbf{z}))$ penalizes the flexible $q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)$ for being too dissimilar from the prior $p_{\theta}(\mathbf{z})$. In other words, it's a _regularizer_. Thus the ELBO can be interpreted as the sum of a reconstruction term and a regularization term. This is legit, but let's not forget that the goal of VI is still to **maximize the marginal likelihood of the training data**.
### The SGVB estimator and the AEVB algorithm
By introducing BBVI we solved the intractability issue, but we still need a way to maximize the BBVI objective in a scalable way. To this end, we introduce a stochastic, gradient-based estimator for the ELBO and its gradient with respect to $\theta$ and $\phi$: the Stochastic Gradient Variational Bayes (SGVB) estimator. We want to use stochastic gradient ascent, thus we need the gradient of Eq.$\ref{d}$:
$$ \nabla_{\theta,\phi}\mathbb{E}_{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}[\log{p_{\theta}(\mathbf{x}_i, \mathbf{z})}-\log{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}] $$
The gradient with respect to $\theta$ is immediate:
$$ \mathbb{E}_{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}[\nabla_{\theta}\log{p_{\theta}(\mathbf{x}_i, \mathbf{z})}] $$
and we can estimate the expectation using Monte Carlo.
The gradient with respect to $\phi$ is trickier: since the expectation is taken with respect to $q_{\phi}$, which itself depends on $\phi$, we cannot simply swap gradient and expectation. As shown in [Mnih & Gregor, 2014](https://www.cs.toronto.edu/~amnih/papers/nvil.pdf), we can prove that
$$ \nabla_{\phi}\mathbb{E}_{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}[\log{p_{\theta}(\mathbf{x}_i, \mathbf{z})}-\log{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}] = \mathbb{E}_{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}[(\log{p_{\theta}(\mathbf{x}_i, \mathbf{z})}-\log{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)})\nabla_{\phi}\log{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}]$$
Again, now that the gradient is inside the expectation, we could use Monte Carlo to estimate it. However, the resulting estimator, called the _score function_ estimator, has high variance. The key contribution of Kingma and Welling, 2014, is an alternative estimator which is much better behaved: the SGVB estimator, based on the _reparametrization trick_. For many differentiable parametric families, it's possible to draw samples of the random variable $\tilde{\mathbf{z}}\sim q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)$ with a two-step generative process:
- generate samples $\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})$ from a simple distribution $p(\boldsymbol{\epsilon})$, such as $\mathcal{N}(0,I)$
- apply a differentiable, deterministic function $g_{\phi}(\boldsymbol{\epsilon}, \mathbf{x})$ to the random noise $\boldsymbol{\epsilon}$. The resulting random variable $\tilde{\mathbf{z}} = g_{\phi}(\boldsymbol{\epsilon}, \mathbf{x})$ is indeed distributed as $q_{\phi}(\mathbf{z}\vert\mathbf{x})$
A classic example is the univariate Gaussian. Let $q_{\phi}(z\vert x) = q_{\mu,\sigma}(z)=\mathcal{N}(\mu,\sigma^2)$. Then of course if $\epsilon\sim \mathcal{N}(0,1)$ and $g_{\phi}(\epsilon)=\mu+\sigma \epsilon$, we know that $z=g_{\phi}(\epsilon)\sim\mathcal{N}(\mu,\sigma^2)$, as desired. There are many other families which can be similarly reparametrized. Another way to see the relevance of the reparametrization trick is to note that, by moving the random node outside the part of the computational graph containing the variational parameters, it allows us to backpropagate through the graph. See this image from a presentation by Durk Kingma:
<p align="center">
<img width="600" src="https://i.imgur.com/xvH1onJ.png">
</p>
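Here is the Gaussian example in code, using PyTorch autograd purely as an illustration (the numbers are arbitrary): because $z$ is written as a deterministic, differentiable function of $(\mu,\sigma)$ and the parameter-free noise $\epsilon$, gradients flow back through the sample to the variational parameters.

```python
# The reparametrization trick: z = mu + sigma * eps, with eps ~ N(0, 1).
import torch

mu = torch.tensor(1.5, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)   # parametrize sigma = exp(log_sigma) > 0

eps = torch.randn(100_000)                # the only source of randomness; no parameters involved
z = mu + torch.exp(log_sigma) * eps       # z ~ N(mu, sigma^2) by construction
loss = (z ** 2).mean()                    # a stand-in for the per-sample objective f(z)
loss.backward()
print(mu.grad, log_sigma.grad)            # ~ 2*mu and ~ 2*sigma^2, the analytic gradients
```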
The biggest selling point of the reparametrization trick is that we can now write the gradient of the expectation with respect to $q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)$ of any function $f(\mathbf{z})$ as
$$ \nabla_{\phi}\mathbb{E}_{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}[f(\mathbf{z})]=\nabla_{\phi}\mathbb{E}_{p(\boldsymbol{\epsilon})}[f(g_{\phi}(\boldsymbol{\epsilon},\mathbf{x}_i))]=\mathbb{E}_{p(\boldsymbol{\epsilon})}[\nabla_{\phi}f(g_{\phi}(\boldsymbol{\epsilon},\mathbf{x}_i))]$$
Using Monte Carlo to estimate this expectation we obtain an estimator (the SGVB estimator) which has lower variance than the score function estimator[^4], allowing us to learn more complicated models. The reparametrization trick was later extended to discrete variables ([Maddison et al., 2016](https://arxiv.org/abs/1611.00712), [Jang et al., 2016](https://arxiv.org/abs/1611.01144)), which allowed training VAEs with discrete latent variables. None of these works, though, closed the performance gap with VAEs with continuous latent variables. This was recently achieved by [van den Oord et al., 2017](https://arxiv.org/abs/1711.00937), with their famous VQ-VAE.
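Coming back to the variance claim: here's an empirical comparison of the two estimators on a toy problem of my choosing (not the full ELBO), namely estimating $\nabla_{\mu}\mathbb{E}_{z\sim\mathcal{N}(\mu,1)}[z^2]$, whose true value is $2\mu$. Both estimators are unbiased, but the spread of the per-sample reparametrized estimates is much smaller.

```python
# Score function estimator vs. reparametrized estimator of d/dmu E_{N(mu,1)}[z^2] = 2*mu.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.5, 1.0, 1_000_000
eps = rng.standard_normal(n)
z = mu + sigma * eps                        # z ~ N(mu, sigma^2)

score_fn = z**2 * (z - mu) / sigma**2       # f(z) * d/dmu log q(z), one sample per entry
reparam = 2 * (mu + sigma * eps)            # d/dmu f(g(eps)) with g(eps) = mu + sigma * eps
print("score function:", score_fn.mean(), "+/-", score_fn.std())   # right mean, large spread
print("reparametrized:", reparam.mean(), "+/-", reparam.std())     # right mean, much smaller spread
```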
SGVB allows us to estimate the ELBO _for a single datapoint_, but we need to estimate it for all $N$. To do this, we use minibatches of size $M$: the gradients are computed with automatic differentiation, and the parameter values are updated with some gradient ascent method, such as SGD, RMSProp or Adam. The combination of the SGVB estimator with minibatch stochastic gradient ascent on $\phi,\theta$ is the famous Auto-Encoding Variational Bayes (AEVB) algorithm, which gives the Kingma and Welling paper its title.

## Amortized Variational Inference
At this point, you may feel cheated: I haven't mentioned VAEs even once (except in the click-baity introduction). This was done on purpose, to show that the AEVB algorithm is much more general than Variational Autoencoders, which are simply one instantiation of it. This lets you use AEVB to learn models well beyond the original VAE. For example, now you know:
- that $q_{\phi}(\mathbf{z})$ doesn't have to be in the Gaussian family
- that you could in principle learn a _different_ $q_{\phi}(\mathbf{z})$ for each datapoint (that's why we condition on $\mathbf{x}_i$)
- why the reparametrization trick is so great (it reduces the variance of the estimate of the gradient of the ELBO with respect to $\phi$)
- why just creating a bigger & badder encoder-decoder NN won't, by itself, reduce the approximation gap
Contrast that with some introductions which simply start from Eq.($\ref{e}$), and then dive straight into the architecture of encoder & decoder.
Until now, we haven't specified the form of $p_{\theta}(\mathbf{z})$, $p_{\theta}(\mathbf{x}_i\vert\mathbf{z})$, and $q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)$. Also, we saw that we're learning a different variational (approximate) posterior $q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)$ for each datapoint. We could do that with nonparametric density estimation, but we wouldn't expect it to scale well to large datasets. It's smarter to **learn** a mapping from the data points to the optimal variational parameters, $f_{\phi}:\mathcal{X}\to\mathcal{Q}$, for each value of $\theta$. Now, which tool do we know that can efficiently learn complicated, nonlinear mappings between high-dimensional spaces? Neural networks, of course! If at each $\phi$ optimization step we had to retrain this neural network over the whole data set, we wouldn't have saved any computation: the effort would still be proportional to the size of the dataset. In practice, though, we interleave the optimization of $\theta$ and $\phi$ over each minibatch. This way, by introducing a neural network we _amortize_ the cost of variational inference (learning $q_{\phi}(\mathbf{z}\vert\mathbf{x}_1),\dots,q_{\phi}(\mathbf{z}\vert\mathbf{x}_N)$) across data points. If we also use neural networks to parametrize $p_{\theta}(\mathbf{x}\vert\mathbf{z})$ (and, if desired, $p_{\theta}(\mathbf{z})$), the resulting model is called the **Variational Autoencoder**.
In the Kingma & Welling paper, the following choices were made for the case of real data:
- $p_{\theta}(\mathbf{z})=\mathcal{N}(0,I)$ (thus the prior has no parameters)
- $p_{\theta}(\mathbf{x}\vert\mathbf{z})=\mathcal{N}(\mathbf{x};\mu_{\theta}(\mathbf{z}),\boldsymbol{\sigma}^2_{\theta}(\mathbf{z})\odot I)$ where $\mu_{\theta}(\mathbf{z})$, $\sigma_{\theta}^2(\mathbf{z})$ are two single-hidden-layer MLPs (500 units each in the original paper). For a latent sample $\mathbf{z}$, this neural network _decodes_ the parameters of $p_{\theta}(\mathbf{x}\vert \mathbf{z})$ which give optimal input reconstruction. For this reason, it's called the _decoder_[^5].
- $q_{\phi}(\mathbf{z}\vert\mathbf{x})=\mathcal{N}(\mathbf{z};\mu_{\phi}(\mathbf{x}),\boldsymbol{\sigma}^2_{\phi}(\mathbf{x})\odot I)$, i.e., a neural network with the same architecture as the decoder. For an input sample $\mathbf{x}$, this neural network learns the variational parameters which correspond to an optimal _latent code_ $\mathbf{z}$, thus we call it the _encoder_ (see the sketch right after this list).
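Here's what this pair of networks could look like in PyTorch. It's a sketch under my own assumptions (a 20-dimensional latent space, 784-dimensional inputs, `tanh` activations), not a faithful reproduction of the original code; both networks output the mean and log-variance of a diagonal Gaussian.

```python
# A sketch of the VAE encoder/decoder pair: both map their input to the
# (mu, log_var) of a diagonal Gaussian, as described in the list above.
import torch
import torch.nn as nn

class GaussianMLP(nn.Module):
    """Single-hidden-layer MLP returning the mean and log-variance of a diagonal Gaussian."""
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(d_in, d_hidden), nn.Tanh())
        self.mu = nn.Linear(d_hidden, d_out)
        self.log_var = nn.Linear(d_hidden, d_out)

    def forward(self, v):
        h = self.hidden(v)
        return self.mu(h), self.log_var(h)

class VAE(nn.Module):
    def __init__(self, d_x=784, d_z=20, d_hidden=500):
        super().__init__()
        self.encoder = GaussianMLP(d_x, d_hidden, d_z)   # q_phi(z | x)
        self.decoder = GaussianMLP(d_z, d_hidden, d_x)   # p_theta(x | z)

    def forward(self, x):
        mu_z, log_var_z = self.encoder(x)
        eps = torch.randn_like(mu_z)                     # reparametrization trick
        z = mu_z + torch.exp(0.5 * log_var_z) * eps
        mu_x, log_var_x = self.decoder(z)
        return mu_x, log_var_x, mu_z, log_var_z
```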

#### Learning the VAE
The weights of both neural networks are learnt at the same time using AEVB: note that with this simple choice of $p_{\theta}(\mathbf{z})$ and $q_{\phi}(\mathbf{z}\vert\mathbf{x})$, the term $D_{KL}(q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\vert\vert p_{\theta}(\mathbf{z}))$ (the regularization term) has an analytical expression, thus the Monte Carlo estimate is only needed for the reconstruction term and its gradient.
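Continuing the sketch above (same assumptions), the per-minibatch objective then looks roughly like this: the Gaussian-vs-Gaussian KL term in closed form, the reconstruction term from the single Monte Carlo sample drawn inside `VAE.forward`, and the negative ELBO minimized with a stochastic optimizer.

```python
# One AEVB training step on a (random, stand-in) minibatch, using the VAE class sketched above.
import math
import torch

def negative_elbo(x, mu_x, log_var_x, mu_z, log_var_z):
    # E_q[log p(x|z)] for a diagonal Gaussian decoder, estimated with one sample of z
    recon = -0.5 * (log_var_x + (x - mu_x) ** 2 / log_var_x.exp()
                    + math.log(2 * math.pi)).sum(dim=1)
    # D_KL(q_phi(z|x) || N(0, I)) in closed form
    kl = -0.5 * (1 + log_var_z - mu_z ** 2 - log_var_z.exp()).sum(dim=1)
    return (kl - recon).mean()                       # average negative ELBO over the minibatch

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(128, 784)                             # stand-in minibatch of "images"
mu_x, log_var_x, mu_z, log_var_z = model(x)
loss = negative_elbo(x, mu_x, log_var_x, mu_z, log_var_z)
opt.zero_grad()
loss.backward()
opt.step()
```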
#### Generating samples
To generate new data, we sample a latent code $\mathbf{z}\sim\mathcal{N}(0,I)$ and propagate it through the decoder; the encoder is not used anymore.
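In code (again continuing the sketch above), generation is just two lines: sample from the prior and decode.

```python
# Generating new samples: z from the prior, then the decoder; the encoder is never called.
import torch

with torch.no_grad():
    z = torch.randn(16, 20)                 # z ~ N(0, I), with the d_z = 20 assumed above
    mu_x, log_var_x = model.decoder(z)      # parameters of p_theta(x | z)
    samples = mu_x                          # the decoder mean is what's usually displayed
```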
### Experimental results
The samples generated by the original VAE on MNIST have the classic, blurry "VAE" feel:

###### from [Kingma & Welling, 2014](https://arxiv.org/pdf/1312.6114.pdf)
More recent results training a deep hierarchical Variational Autoencoder ([Maaløe et al., 2019](https://arxiv.org/abs/1902.02102)) on CelebA are much better:

We're still a far cry from GANs, but at least the performance of autoregressive models is now matched. The Two-Stage VAE ([Dai & Wipf, 2019](https://arxiv.org/abs/1903.05789)) is even better:

This matches the quality of smaller GAN models, even though it's not as good as models like StyleGAN or BigGAN.
The SenVAE ([Pelsmaeker & Aziz, 2019](https://arxiv.org/abs/1904.08194)) is a first step towards a usable VAE language model:

### Wrapping up
VAEs are deep latent variable models, which can learn complex input distributions. The training algorithm is a specific instantiation of a more general algorithm, the AEVB, which is a "Swiss Army knife" to learn models with intractable likelihood over large datasets. The (vanilla) VAE is an instantiation of AEVB, where all the distributions involved are parametrized as multivariate Gaussians.
With respect to other generative models (GANs), the VAE enjoys stable training and an interpretable encoder/inference network, but lower sample quality. However, the latest VAE models reduce the quality gap considerably. And they rock! :tada:
[^1]: of course, none of these are the real reasons why people use VAEs. VAEs are cool because they are one of the very few cases in Deep Learning where theory actually informs architecture design (the only other example that comes to my mind is [gauge equivariant neural networks](https://arxiv.org/pdf/1902.04615.pdf)).
[^2]: if this is the first time you meet the term "intractable" and you're anything like me, you'll want to know what it actually means (roughly, that the expression takes exponential time to compute) and why _exactly_ this integral should be intractable. Read section 2.1 of [this great paper](https://arxiv.org/pdf/1601.00670.pdf) for the proof that even the marginal likelihood of a simple LVM such as a Bayesian GMM (Gaussian Mixture Model) is intractable.
[^3]: If at this point you feel a crushing weight over your soul, that's fine: it's just the normal reaction to seeing the derivation of the VAE objective function for the first time. If it's any consolation, just think that in their original paper, Kingma & Welling casually introduce the ELBO in the first equation after $\ref{a}$, and summarize all the VI part of the VAE model in a very short paragraph. Such are the heights to which great minds soar.
[^4]: as proved in the appendix of [Rezende et al., 2014](https://arxiv.org/abs/1401.4082)
[^5]: don't be fooled by the apparent simplicity of $p_{\theta}(\mathbf{z})$ and $p_{\theta}(\mathbf{x}\vert\mathbf{z})$. Even though they're "simple" multivariate Gaussians (with a diagonal covariance matrix, even!), $p_{\theta}(\mathbf{x}) = \int p_{\theta}(\mathbf{x}\vert\mathbf{z})p_{\theta}(\mathbf{z})d\mathbf{z}$ is a mixture of _an infinite number of multivariate Gaussians_, each of them with its own mean and variance vector. Thus it's a much more complicated and flexible distribution, with an essentially arbitrary number of modes, which can model data distributions as complicated as that of celebrity faces.