The Variational Autoencoder (VAE) is a not-so-new-anymore Latent Variable Model (Kingma & Welling, 2014) which, by introducing a probabilistic interpretation of autoencoders, allows us not only to estimate the variance/uncertainty in the predictions, but also to inject domain knowledge through the use of informative priors, and possibly to make the latent space more interpretable[1]. VAEs have found a wide variety of applications.
Unlike classical (sparse, denoising, etc.) autoencoders, VAEs are probabilistic generative models, like GANs (Generative Adversarial Networks). By generative we denote a model which learns the probability distribution $p(\mathbf{x})$ over the input space $\mathcal{X}$. This means that after training such a model, we can sample from our approximation of $p(\mathbf{x})$. If our training set contains handwritten digits (MNIST), then the trained generative model is able to create images which look like handwritten digits, even though they're not "copies" of the images in the training set. If our training set contains natural images (CIFAR-10) or celebrity faces (CelebA), it will generate images which look like photographs.
Learning the distribution of the images in the training set implies that images which look like handwritten digits (for example) have a high probability of being generated, while images which look like the Jolly Roger or random noise have a low probability. In other words, it means learning about the dependencies among pixels: if our image is a $28 \times 28$ pixel grayscale image from MNIST, the model should learn that if a pixel is very bright, then there's a significant probability that some neighboring pixels are bright too; that if we have a long, slanted line of bright pixels, we may have another smaller, horizontal line of pixels above it (a 7); etc.
The VAE is a Latent Variable Model (LVM): this means that $\mathbf{x}$, the random vector of the 784 pixel intensities (the observed variables), is modeled as a (possibly very complicated) function of a random vector $\mathbf{z}$ of lower dimensionality, whose components are unobserved (latent) variables. Concisely, we assume a data generating process with the following DAG (Directed Acyclic Graph):
And we assume that both the prior $p_\theta(\mathbf{z})$ and the likelihood $p_\theta(\mathbf{x}|\mathbf{z})$ come from some parametric families (possibly different ones).
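To make the generative direction concrete, here is a minimal sketch of ancestral sampling from such a model. This is my own illustration: the two-dimensional latent, the tiny `decoder` MLP and the Bernoulli likelihood over binarized pixels are arbitrary choices, not those of the paper.

```python
import torch
import torch.nn as nn

# Hypothetical decoder: maps a 2-dimensional latent z to the 784 pixel means.
decoder = nn.Sequential(
    nn.Linear(2, 256), nn.Tanh(),
    nn.Linear(256, 784), nn.Sigmoid(),
)

# Ancestral sampling along the DAG: first sample the latent from the prior p(z),
# then sample the observed variables from the likelihood p(x|z).
z = torch.randn(1, 2)          # z ~ p(z) = N(0, I)
x_mean = decoder(z)            # parameters of p(x|z)
x = torch.bernoulli(x_mean)    # x ~ p(x|z), here a Bernoulli over binarized pixels
```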
Latent variable models are also sometimes called hierarchical or multilevel models; they are models that use the rules of conditional probability to specify complicated probability distributions over high-dimensional spaces, by composing simpler probability density functions. The original VAE has only two levels, but recently a paper on a deep hierarchical VAE with multiple levels of latent variables has been published (note that the hierarchy of latent variables in a probabilistic model has nothing to do with the layers of a neural network).
When does a latent variable model make intuitive sense? In the MNIST case, for example, we think that the handwritten digits belong to a manifold of dimension much smaller than the dimension of $\mathcal{X}$, because the vast majority of random arrangements of 784 pixel intensities don't look at all like handwritten digits. The representation of the image should be equivariant to certain transformations (e.g., rotations, translations and small deformations) but not to others. So in this case it makes sense to think that the image samples are generated by taking samples $\mathbf{z}$ (which we don't observe) in a sample space of much smaller dimension, and then transforming them according to some complicated function.
We are now given a dataset $\mathbf{X} = \{\mathbf{x}^{(i)}\}_{i=1}^N$ consisting of $N$ i.i.d. samples (note again that we do not observe the corresponding $\mathbf{z}^{(i)}$), and we want to estimate $\theta$. The standard "recipe" would be to maximize the likelihood of the data (marginalized over the latent variables), i.e., Maximum Likelihood Estimation:

$$\theta^* = \arg\max_\theta \sum_{i=1}^N \log p_\theta(\mathbf{x}^{(i)})$$
However, we're immediately faced with two challenges:
1. computing $p_\theta(\mathbf{x}^{(i)})$ requires solving the integral

   $$p_\theta(\mathbf{x}^{(i)}) = \int p_\theta(\mathbf{x}^{(i)}|\mathbf{z})\, p_\theta(\mathbf{z})\, d\mathbf{z},$$

   which is often intractable [2], even in the simple case of a neural network with a single nonlinear hidden layer;
2. we need to compute this integral for every data point of a large dataset, which rules out both batch optimization and sampling-based solutions such as Monte Carlo EM, which would require an expensive sampling step for each $\mathbf{x}^{(i)}$.
We need to get waaay smarter.
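Just to fix ideas, this is what the brute-force approach would look like: a naive Monte Carlo estimate of the marginal likelihood, using the hypothetical `decoder` from the sketch above. For latent spaces of realistic dimensionality, almost all prior samples fall where $p_\theta(\mathbf{x}|\mathbf{z})$ is vanishingly small, so the estimator needs an astronomical number of samples — and we'd need it for every $\mathbf{x}^{(i)}$ at every optimization step.

```python
import math
import torch

def log_marginal_naive(x, decoder, n_samples=10_000, latent_dim=2):
    """Naive Monte Carlo estimate of log p(x) = log E_{p(z)}[p(x|z)].
    Only viable for toy latent spaces: the variance explodes with the
    dimensionality of z."""
    z = torch.randn(n_samples, latent_dim)                 # z ~ p(z)
    x_mean = decoder(z)                                    # parameters of p(x|z)
    # log p(x|z) for a Bernoulli likelihood over binarized pixels
    log_px_given_z = (x * torch.log(x_mean + 1e-8)
                      + (1 - x) * torch.log(1 - x_mean + 1e-8)).sum(dim=1)
    # log( (1/S) * sum_s p(x|z_s) ), computed stably in log space
    return torch.logsumexp(log_px_given_z, dim=0) - math.log(n_samples)
```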
To solve the first problem, i.e., the intractability of the marginal likelihood, we use an age-old (in ML years) trick: Variational Inference (introduced in Jordan et al., 1999 and nicely reviewed in Blei et al., 2018). VI transforms the inference problem into an optimization problem. This way we get rid of integration, but we pay a price, as we'll see later. Fair warning: this section is more mathy than the rest. I'll introduce the necessary concepts, such as the Kullback-Leibler divergence, but you may want to skip to the next section on a first read.
VI starts by introducing a family of parametric approximations $q_\phi(\mathbf{z}|\mathbf{x})$ to the true posterior distribution $p_\theta(\mathbf{z}|\mathbf{x})$, indexed by $\phi$. The goal of VI is to find the value(s) of $\phi$ such that $q_\phi(\mathbf{z}|\mathbf{x})$ is "closest" to $p_\theta(\mathbf{z}|\mathbf{x})$ in some sense. We only assume that $q_\phi(\mathbf{z}|\mathbf{x})$ is differentiable with respect to $\phi$ (black box VI); we don't make further assumptions, such as a fully factorized form for $q_\phi(\mathbf{z}|\mathbf{x})$, as in the case of mean field variational inference.
The measure of dissimilarity we seek to minimize is the Kullback-Leibler divergence

$$D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x})\right) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log \frac{q_\phi(\mathbf{z}|\mathbf{x})}{p_\theta(\mathbf{z}|\mathbf{x})}\right],$$

which quantifies how different two probability distributions are. For the following, we'll need two properties of $D_{KL}$:

1. $D_{KL}(q \,\|\, p) \geq 0$ for any two distributions $q$ and $p$;
2. $D_{KL}(q \,\|\, p) = 0$ if and only if $q = p$ (almost everywhere).
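As a quick sanity check of these two properties, here's how the KL divergence between two diagonal Gaussians can be computed with `torch.distributions` (the means and scales below are arbitrary):

```python
import torch
from torch.distributions import Normal, kl_divergence

q = Normal(loc=torch.tensor([0.5, -1.0]), scale=torch.tensor([1.0, 0.5]))
p = Normal(loc=torch.zeros(2), scale=torch.ones(2))

print(kl_divergence(q, p).sum())   # property 1: always >= 0 (here > 0, since q != p)
print(kl_divergence(p, p).sum())   # property 2: exactly 0 when the distributions coincide
```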
However, our primary goal is to estimate the generative model parameters $\theta$ through Maximum (Marginal) Likelihood Estimation. Thus, let's try to rewrite the marginal log-likelihood of a data point in terms of $q_\phi(\mathbf{z}|\mathbf{x}^{(i)})$:

$$
\begin{aligned}
\log p_\theta(\mathbf{x}^{(i)}) &= \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x}^{(i)})}\left[\log p_\theta(\mathbf{x}^{(i)})\right] \\
&= \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x}^{(i)})}\left[\log \frac{p_\theta(\mathbf{x}^{(i)}|\mathbf{z})\, p_\theta(\mathbf{z})}{p_\theta(\mathbf{z}|\mathbf{x}^{(i)})}\right] \\
&= \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x}^{(i)})}\left[\log \left( \frac{p_\theta(\mathbf{x}^{(i)}|\mathbf{z})\, p_\theta(\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x}^{(i)})} \cdot \frac{q_\phi(\mathbf{z}|\mathbf{x}^{(i)})}{p_\theta(\mathbf{z}|\mathbf{x}^{(i)})} \right)\right]
\end{aligned}
$$
where in the second equality we used Bayes' rule, and the last equality holds as long as $q_\phi(\mathbf{z}|\mathbf{x}^{(i)}) > 0$ wherever $p_\theta(\mathbf{z}|\mathbf{x}^{(i)}) > 0$. We can now see a term similar to $D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x}^{(i)}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x}^{(i)})\right)$ inside the expectation, so let's pop it out:

$$\log p_\theta(\mathbf{x}^{(i)}) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x}^{(i)})}\left[\log \frac{p_\theta(\mathbf{x}^{(i)}, \mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x}^{(i)})}\right] + D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x}^{(i)}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x}^{(i)})\right)$$
Summarizing:

$$\log p_\theta(\mathbf{x}^{(i)}) = \mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}) + D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x}^{(i)}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x}^{(i)})\right), \qquad \mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}) \equiv \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x}^{(i)})}\left[\log \frac{p_\theta(\mathbf{x}^{(i)}, \mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x}^{(i)})}\right]$$
The term $\mathcal{L}(\theta, \phi; \mathbf{x}^{(i)})$ is called the Evidence (or variational) Lower BOund (ELBO for friends), because it's always no greater than the marginal log-likelihood (or evidence) of data point $\mathbf{x}^{(i)}$, since $D_{KL} \geq 0$ (property 1):

$$\mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}) \leq \log p_\theta(\mathbf{x}^{(i)})$$
Thus, maximizing the ELBO pushes in the direction of maximizing the marginal likelihood, our original goal. However, the ELBO doesn't contain the pesky marginal evidence $p_\theta(\mathbf{x}^{(i)})$, which is what made the problem intractable.
Dropping the $D_{KL}$ term and summing over all data points, the learning objective of BBVI (Black Box Variational Inference) becomes

$$\max_{\theta,\, \phi^{(1)}, \ldots,\, \phi^{(N)}} \sum_{i=1}^N \mathcal{L}(\theta, \phi^{(i)}; \mathbf{x}^{(i)})$$
Summarizing, we learn an LVM by maximizing the ELBO with respect to both the model parameters $\theta$ and the variational parameters $\phi^{(i)}$ for each data point $\mathbf{x}^{(i)}$.
Note that until now, we haven't mentioned either neural networks or VAEs. The approach has been very general, and it could apply to any Latent Variable Model which has the DAG representation shown above.
The ELBO can be written as

$$\mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x}^{(i)})}\left[\log p_\theta(\mathbf{x}^{(i)}, \mathbf{z}) - \log q_\phi(\mathbf{z}|\mathbf{x}^{(i)})\right]$$
or alternatively:

$$\mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x}^{(i)})}\left[\log p_\theta(\mathbf{x}^{(i)}|\mathbf{z})\right] - D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x}^{(i)}) \,\|\, p_\theta(\mathbf{z})\right)$$
This last form lends itself to some nice interpretations, as we'll see later[3].
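To make the second form concrete, here is a sketch of a Monte Carlo estimate of the ELBO, assuming a diagonal Gaussian $q_\phi(\mathbf{z}|\mathbf{x})$, a standard normal prior, and a placeholder `log_likelihood` callable standing in for $\log p_\theta(\mathbf{x}|\mathbf{z})$ (all names here are mine, for illustration only):

```python
import torch
from torch.distributions import Normal

def elbo_estimate(x, q_mu, q_logvar, log_likelihood, n_samples=1):
    """Monte Carlo estimate of E_q[log p(x|z)] - KL(q(z|x) || p(z)),
    with p(z) = N(0, I) and q(z|x) a diagonal Gaussian."""
    q = Normal(q_mu, torch.exp(0.5 * q_logvar))
    z = q.sample((n_samples,))                      # z ~ q(z|x)
    reconstruction = log_likelihood(x, z).mean()    # Monte Carlo estimate of E_q[log p(x|z)]
    # KL(q || p) in closed form for a diagonal Gaussian vs. a standard normal prior
    kl = 0.5 * torch.sum(q_logvar.exp() + q_mu**2 - 1.0 - q_logvar)
    return reconstruction - kl
```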
We list here some useful properties of the ELBO. Not all of them will be needed in the following, but we list them anyway as a useful companion for reading other tutorials or papers on VAEs.
For each data point $\mathbf{x}^{(i)}$, the true posterior distribution $p_\theta(\mathbf{z}|\mathbf{x}^{(i)})$ may or may not belong to the variational family $\{q_\phi\}$. If it does, then property 2 of $D_{KL}$ implies that the ELBO is maximized for $q_{\phi^*}(\mathbf{z}|\mathbf{x}^{(i)}) = p_\theta(\mathbf{z}|\mathbf{x}^{(i)})$, and maximizing the ELBO is equivalent to maximizing the marginal likelihood. If it doesn't, then even for the optimal value $\phi^*$ there will be a nonzero difference between the ELBO and the actual marginal log-likelihood of the data point, equal to $D_{KL}\left(q_{\phi^*}(\mathbf{z}|\mathbf{x}^{(i)}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x}^{(i)})\right)$.
The residual error is called the approximation gap (Cremer et al., 2017) and it can only be reduced by expanding the variational family (Kingma et al., 2016, Tomczak & Welling, 2017, and many others). However, overly flexible inference models can also have side effects (Rui et al., 2018).
Looking at the second form of the ELBO above, we see that maximizing the term $\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x}^{(i)})}\left[\log p_\theta(\mathbf{x}^{(i)}|\mathbf{z})\right]$ maximizes the probability that, under repeated sampling from $q_\phi(\mathbf{z}|\mathbf{x}^{(i)})$, we generate samples which are similar to the training sample $\mathbf{x}^{(i)}$. For this reason it's often interpreted as a reconstruction quality (or its opposite is interpreted as a reconstruction loss). The term $D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x}^{(i)}) \,\|\, p_\theta(\mathbf{z})\right)$ penalizes the flexible $q_\phi(\mathbf{z}|\mathbf{x}^{(i)})$ for being too dissimilar from the prior $p_\theta(\mathbf{z})$. In other words, it's a regularizer. Thus the ELBO can be interpreted as the sum of a reconstruction term and a regularization term. This is legit, but let's not forget that the goal of VI is still to maximize the marginal likelihood of the training data.
Introducing BBVI, we solved the intractability problem, but we still need a way to maximize the BBVI objective in a scalable way. To this end, we introduce a stochastic gradient-based estimator of the ELBO and of its gradients with respect to $\theta$ and $\phi$, called Stochastic Gradient Variational Bayes (SGVB). We want to use stochastic gradient ascent, thus we need the gradients of the ELBO with respect to $\theta$ and $\phi$.
The gradient with respect to $\theta$ is immediate:

$$\nabla_\theta\, \mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x}^{(i)})}\left[\nabla_\theta \log p_\theta(\mathbf{x}^{(i)}, \mathbf{z})\right]$$

and we can estimate the expectation using Monte Carlo.
The gradient with respect to $\phi$ is more badass: since the expectation and the gradient are both with respect to $q_\phi$, we cannot simply swap them. As shown in Mnih & Gregor, 2014, we can prove that

$$\nabla_\phi\, \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x}^{(i)})}\left[f(\mathbf{z})\right] = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x}^{(i)})}\left[f(\mathbf{z})\, \nabla_\phi \log q_\phi(\mathbf{z}|\mathbf{x}^{(i)})\right]$$

(here $f(\mathbf{z})$ stands for the ELBO integrand; the extra term coming from the $\phi$-dependence of $\log q_\phi$ has zero expectation, since $\mathbb{E}_{q_\phi}\left[\nabla_\phi \log q_\phi\right] = 0$).
Again, now that the gradient is inside the expectation, we could use Monte Carlo to estimate it. However, the resulting estimator, called the score function estimator, has high variance. The key contribution of Kingma and Welling, 2014, is an alternative estimator which is much better behaved: the SGVB estimator, based on the reparametrization trick. For many differentiable parametric families, it's possible to draw samples of the random variable $\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})$ with a two-step generative process:

1. sample a noise variable $\boldsymbol{\epsilon}$ from a fixed distribution $p(\boldsymbol{\epsilon})$ which doesn't depend on $\phi$;
2. transform it through a deterministic, differentiable function $\mathbf{z} = g_\phi(\boldsymbol{\epsilon}, \mathbf{x})$.
A classic example is the univariate Gaussian. Let $\epsilon \sim \mathcal{N}(0, 1)$. Then of course if $z = \mu + \sigma\epsilon$, we know that $z \sim \mathcal{N}(\mu, \sigma^2)$, as desired. There are many other families which can be similarly reparametrized. Another way to see the relevance of the reparametrization trick is to note that, by moving the random node outside the part of the computational graph containing the variational parameters, it allows us to backpropagate through the graph. See this image from a presentation by Durk Kingma:
The biggest selling point of the reparametrization trick is that we can now write the gradient with respect to $\phi$ of the expectation of any function $f(\mathbf{z})$ as

$$\nabla_\phi\, \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[f(\mathbf{z})\right] = \nabla_\phi\, \mathbb{E}_{p(\boldsymbol{\epsilon})}\left[f(g_\phi(\boldsymbol{\epsilon}, \mathbf{x}))\right] = \mathbb{E}_{p(\boldsymbol{\epsilon})}\left[\nabla_\phi\, f(g_\phi(\boldsymbol{\epsilon}, \mathbf{x}))\right]$$
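In an autograd framework this is exactly what lets gradients reach the variational parameters: the randomness enters only through $\boldsymbol{\epsilon}$, and the sample is a deterministic, differentiable function of $\mu$ and $\sigma$. A minimal PyTorch sketch, with an arbitrary differentiable `f` standing in for the ELBO integrand:

```python
import torch

mu = torch.tensor([0.0], requires_grad=True)
log_sigma = torch.tensor([0.0], requires_grad=True)
f = lambda z: (z - 3.0) ** 2                 # arbitrary differentiable stand-in

eps = torch.randn(1000)                      # eps ~ N(0, 1), independent of phi
z = mu + torch.exp(log_sigma) * eps          # z = g_phi(eps): reparametrized sample
torch.mean(f(z)).backward()                  # Monte Carlo estimate of E_q[f(z)], then backprop
print(mu.grad, log_sigma.grad)               # gradient estimates w.r.t. the variational parameters
```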
Using Monte Carlo to estimate this expectation, we obtain an estimator (the SGVB estimator) which has lower variance than the score function estimator[4], allowing us to learn more complicated models. The reparametrization trick was later extended to discrete variables (Maddison et al., 2016, Jang et al., 2016), which allowed training VAEs with discrete latent variables. None of these works, though, closed the performance gap with VAEs with continuous latent variables. This was recently solved by van den Oord et al., 2017, with their famous VQ-VAE.
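To see the variance gap mentioned above concretely, here's a toy comparison of the two estimators (my own NumPy illustration, not taken from any of the papers above): both target $\nabla_\mu\, \mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}[z^2] = 2\mu$, but their spread around the true value is very different.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n_samples, n_trials = 1.0, 1.0, 10, 5000

score_fn, reparam = [], []
for _ in range(n_trials):
    # Score function (REINFORCE) estimator: f(z) * d/dmu log q(z; mu, sigma)
    z = rng.normal(mu, sigma, n_samples)
    score_fn.append(np.mean(z**2 * (z - mu) / sigma**2))
    # Reparametrized estimator: d/dmu f(mu + sigma*eps) = 2 * (mu + sigma*eps)
    eps = rng.normal(0.0, 1.0, n_samples)
    reparam.append(np.mean(2.0 * (mu + sigma * eps)))

print("true gradient:", 2 * mu)
print("score function estimator: mean %.3f, std %.3f" % (np.mean(score_fn), np.std(score_fn)))
print("reparametrized estimator: mean %.3f, std %.3f" % (np.mean(reparam), np.std(reparam)))
```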
SGVB allows us to estimate the ELBO for a single data point, but we need to estimate it for the whole dataset $\mathbf{X}$. To do this, we use minibatches of size $M$: the gradients are computed with automatic differentiation, and the parameter values are updated with some gradient ascent method, such as SGD, RMSProp or Adam. The combination of the SGVB estimator with minibatch gradient ascent is the famous Auto-Encoding Variational Bayes (AEVB) algorithm, which gives the Kingma and Welling paper its title.
At this point, you may feel cheated: I haven't mentioned VAEs even once (except in the click-baity introduction). This was done on purpose, to show that the AEVB algorithm is much more general than just Variational Autoencoders, which are simply an instantiation of it. This gives you the possibility of using AEVB to learn more general models than just the original VAE, because now you know exactly which ingredients (the prior, the likelihood and the variational posterior) you are free to change.
Contrast that with some introductions which simply start from the ELBO, and then dive straight into the architecture of the encoder & decoder.
Until now, we haven't specified the form of $p_\theta(\mathbf{z})$, $p_\theta(\mathbf{x}|\mathbf{z})$, and $q_\phi(\mathbf{z}|\mathbf{x})$. Also, we saw that we're learning a different variational (approximate) posterior $q_{\phi^{(i)}}(\mathbf{z}|\mathbf{x}^{(i)})$ for each data point. We could do that by using nonparametric density estimation, but of course we wouldn't expect that to scale well to large datasets. It would probably be smarter to learn a mapping from the data points to the optimal variational parameters, $\phi = f(\mathbf{x})$, for each value of $\mathbf{x}$. Now, which tool do we know that allows us to efficiently learn complicated, nonlinear mappings between high-dimensional spaces? Neural networks, of course! If at each optimization step we had to retrain the neural network over the whole dataset, we wouldn't have saved any computation: the effort would still be proportional to the size of the dataset. But in practice we interleave the optimization of $\theta$ and $\phi$ over each minibatch. This way, by introducing neural networks, we have amortized the cost of variational inference. If we also use neural networks to parametrize $p_\theta(\mathbf{x}|\mathbf{z})$ and $q_\phi(\mathbf{z}|\mathbf{x})$, the resulting model is called the Variational Autoencoder.
In the Kingma & Welling paper, the following choices were made for the case of real-valued data:

- the prior is a standard multivariate Gaussian, $p_\theta(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{I})$;
- the variational posterior is a multivariate Gaussian with diagonal covariance, $q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}\left(\mathbf{z}; \boldsymbol{\mu}_\phi(\mathbf{x}), \operatorname{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{x}))\right)$, where $\boldsymbol{\mu}_\phi$ and $\boldsymbol{\sigma}^2_\phi$ are computed by a neural network (the encoder);
- the likelihood is also a multivariate Gaussian with diagonal covariance, $p_\theta(\mathbf{x}|\mathbf{z}) = \mathcal{N}\left(\mathbf{x}; \boldsymbol{\mu}_\theta(\mathbf{z}), \operatorname{diag}(\boldsymbol{\sigma}^2_\theta(\mathbf{z}))\right)$, where $\boldsymbol{\mu}_\theta$ and $\boldsymbol{\sigma}^2_\theta$ are computed by another neural network (the decoder)[5].
The weights of both neural networks are learnt at the same time using AEVB. Note that with this simple choice of $q_\phi(\mathbf{z}|\mathbf{x})$ and $p_\theta(\mathbf{z})$, the term $D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z})\right)$ (the regularization term) has an analytical expression, thus the Monte Carlo estimate is only needed for the reconstruction term and its gradient.
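Putting the pieces together, here is a compact PyTorch-style sketch of such a VAE. The layer sizes, the single-hidden-layer MLPs and the training-step snippet are illustrative choices of mine, not a faithful reproduction of the paper's exact setup:

```python
import math
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Gaussian VAE sketch: N(0, I) prior, diagonal Gaussian encoder q(z|x)
    and diagonal Gaussian decoder p(x|z), both parametrized by MLPs."""

    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh())
        self.dec_mu = nn.Linear(h_dim, x_dim)
        self.dec_logvar = nn.Linear(h_dim, x_dim)

    def forward(self, x):
        h = self.enc(x)
        z_mu, z_logvar = self.enc_mu(h), self.enc_logvar(h)
        z = z_mu + torch.exp(0.5 * z_logvar) * torch.randn_like(z_mu)  # reparametrization
        hd = self.dec(z)
        return self.dec_mu(hd), self.dec_logvar(hd), z_mu, z_logvar

def negative_elbo(x, x_mu, x_logvar, z_mu, z_logvar):
    # Reconstruction term: Gaussian log p(x|z), single Monte Carlo sample
    rec = -0.5 * torch.sum(x_logvar + (x - x_mu) ** 2 / x_logvar.exp()
                           + math.log(2 * math.pi))
    # Regularization term: analytical KL( q(z|x) || N(0, I) )
    kl = 0.5 * torch.sum(z_logvar.exp() + z_mu ** 2 - 1.0 - z_logvar)
    return kl - rec

# One AEVB step on a minibatch x_batch of shape (M, 784):
# model = VAE(); opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = negative_elbo(x_batch, *model(x_batch))
# opt.zero_grad(); loss.backward(); opt.step()
```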
At inference time, we sample a latent code $\mathbf{z} \sim p_\theta(\mathbf{z})$ and propagate it through the decoder; the encoder is not used anymore.
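With the sketch above, generation is just a couple of lines (again assuming the hypothetical `model` from the previous block has been trained):

```python
with torch.no_grad():
    z = torch.randn(16, 20)                 # sample 16 latent codes from the prior N(0, I)
    samples = model.dec_mu(model.dec(z))    # means of p(x|z); the encoder is never used
```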
The samples generated by the original VAE trained on MNIST and Frey Face have the classic, blurry "VAE" feeling:
More recent results training a deep hierarchical Variational Autoencoder (Maaløe et al., 2019) on CelebA are much better:
We're still a far cry from GANs, but at least the performance of autoregressive models is now matched. The Two-Stage VAE (Dai & Wipf, 2019) is even better:
This matches the quality of smaller GAN models, even though it's not as good as models like StyleGAN or BigGAN.
The SenVAE (Pelsmaeker & Aziz, 2019) is a first step towards a usable VAE language model:
VAEs are deep latent variable models, which can learn complex input distributions. The training algorithm is a specific instantiation of a more general algorithm, the AEVB, which is a "Swiss Army knife" to learn models with intractable likelihood over large datasets. The (vanilla) VAE is an instantiation of AEVB, where all the distributions involved are parametrized as multivariate Gaussians.
Compared to other generative models (GANs in particular), the VAE enjoys stable training and an interpretable encoder/inference network, but lower sample quality. However, the latest VAE models reduce the quality gap considerably. And they rock!
of course, none of these are the real reasons why people use VAEs. VAEs are cool because they are one of the very few cases in Deep Learning where theory actually informs architecture design (the only other one that comes to my mind is gauge equivariant neural networks). ↩︎
if this is the first time you meet the term "intractable" and you're anything like me, you'll want to know what it actually means (roughly, that the expression takes exponential time to compute) and why exactly this integral should be intractable. Read section 2.1 of this great paper for the proof that even the marginal likelihood of a simple LVM such as a Bayesian GMM (Gaussian Mixture Model) is intractable. ↩︎
If at this point you feel a crushing weight on your soul, that's fine: it's just the normal reaction to seeing the derivation of the VAE objective function for the first time. If it's any consolation, just think that in their original paper, Kingma & Welling casually introduce the ELBO in the very first equations, and summarize the whole VI part of the VAE model in a very short paragraph. Such are the heights to which great minds soar. ↩︎
as proved in the appendix of Rezende et al., 2014 ↩︎
don't be fooled by the apparent simplicity of $p_\theta(\mathbf{x}|\mathbf{z})$ and $q_\phi(\mathbf{z}|\mathbf{x})$. Even though they're "simple" multivariate Gaussians (with a diagonal covariance matrix, even!), the marginal $p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}|\mathbf{z})\, p_\theta(\mathbf{z})\, d\mathbf{z}$ is a mixture of an infinite number of multivariate Gaussians, each of them with its own mean and variance vector. Thus it's a much more complicated and flexible distribution, with an essentially arbitrary number of modes, that can model data distributions as complicated as that of celebrity faces. ↩︎