# An intro to Variational Autoencoders
<!-- Put the link to this slide here so people can follow -->
<sub><sub>slides: https://hackmd.io/p/SkR7GhwqN#/</sub></sub>
![](https://i.imgur.com/YC1zAsj.png =600x)
---
- VAEs are _deep latent variable generative models_
- They are applied to:
- density estimation (image/sound generation, missing data imputation, _graph generation_)
- automatic molecule design
- semi-supervised learning
- representation learning for downstream tasks
- model-based reinforcement learning (to build a **world model**)
$$\DeclareMathOperator\supp{supp}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}$$
<style>
.reveal {
font-size: 30px;
}
</style>
---
### _What's a generative model?_
- A model which learns the probability distribution $p(x)$ over the input space $\mathcal{X}$ (GANs are another example)
- after training, we can sample from (our estimate of) $p(x)$
![](https://i.imgur.com/n8tVcHc.png)
---
### _What's a latent variable model?_
- The random vector $x\in\mathcal{X}$ is modeled as a (possibly very complicated) function of a random vector $z\in\mathcal{Z}$, with $r=\dim{\mathcal{Z}} \ll d=\dim{\mathcal{X}}$
- Unlike $x$, $z$ is not observed, thus we call its components _latent variables_
<span>
- it makes sense if we think that input samples lie on a manifold of dimension $r \ll d$ (e.g., ImageNet samples) <!-- .element: class="fragment fade-up" --></span>
---
- in practice, we assume a data generating process represented in the following DAG (Directed Acyclic Graph)
$$ \begin{align}
z &\sim p_{\theta^*}(z) \\
x\vert z &\sim p_{\theta^*} (x\vert z)
\end{align}$$
```graphviz
digraph G {
splines=line;
subgraph cluster1 {
node [style=filled, shape=circle];
edge [color=blue]
z[fillcolor=white, color=black]
z -> x;
}
theta[label = "θ", shape=circle]
edge [color=black, style="dashed"]
theta -> z [constraint=false]
theta -> x
}
```
- $p_{\theta^*}(z)$ and $p_{\theta^*}(x\vert z)$ come from (possibly different) parametric families
---
### Estimating the model
- **Goal**: given an iid dataset of size $N$, $X=\{x_i\}_{i=1}^N$, estimate $\theta^*$
- standard approach: <span style="color:green">maximize data likelihood</span>, marginalized over latent variables
$$ \hat{\theta}=\argmax_{\theta\in\Theta}p_{\theta}(X) =\argmax_{\theta\in\Theta}\sum_{i=1}^N\log{p_{\theta}(x_i)} $$
---
### Two challenges
1. computing $p_{\theta}(x_i)$ requires computing the intractable integral
$$ p_{\theta}(x_i) = \int p_{\theta}(x_i\vert z)p_{\theta}(z)dz \label{a}\tag{1} $$
2. the integral must be recomputed for every one of the $N$ samples of a large dataset, which rules out batch optimization and sampling-based solutions such as [Monte Carlo EM](https://arxiv.org/abs/1206.4768) (whose per-datapoint sampling loop is too expensive)
We need to get _waaay_ smarter.
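As a concrete illustration of challenge 1 (reusing `W`, `b`, `r`, `rng` from the earlier toy sketch), a naive Monte Carlo estimate of Eq. $\ref{a}$ draws many prior samples for every single data point, and would have to be redone at every update of $\theta$:
```python
import numpy as np

def log_p_x_given_z(x, z):
    # toy diagonal-Gaussian likelihood matching the earlier sketch: x|z ~ N(tanh(z W^T + b), 0.1^2 I)
    mu = np.tanh(z @ W.T + b)
    return -0.5 * np.sum(((x - mu) / 0.1) ** 2 + np.log(2 * np.pi * 0.1 ** 2), axis=-1)

def naive_log_marginal(x, n_samples=100_000):
    # log p(x) = log E_{p(z)}[p(x|z)], estimated by sampling the prior (log-sum-exp for stability)
    z = rng.normal(size=(n_samples, r))
    log_w = log_p_x_given_z(x, z)
    return np.logaddexp.reduce(log_w) - np.log(n_samples)

log_px = naive_log_marginal(X[0])   # one (noisy, expensive) estimate, for a single data point
```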
---
## Enter Variational Inference
- To deal with the intractability of marginal likelihood, we use [Variational Inference](https://people.eecs.berkeley.edu/~jordan/papers/variational-intro.pdf)
- Introduce a family $\mathcal{Q}$ of approximations $q_{\phi}(z\vert x)$ to the true posterior $p_{\theta}(z\vert x)$, and find $\phi^*$ such that $q_{\phi^*}(z\vert x)$ is "closest" to $p_{\theta}(z\vert x)$ according to some similarity measure
<span>
- we only assume that $q_{\phi}(z\vert x)$ is differentiable with respect to $\phi$ (_black box VI_)<!-- .element: class="fragment fade-up" --></span>
---
#### [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback-Leibler_divergence)
- **Goal**: find $\phi^*$ which minimizes the KL-divergence between $q_{\phi^*}(z\vert x)\in\mathcal{Q}$ and $p_{\theta}(z\vert x)$
$$\begin{multline}\phi^* = \argmin_{\phi\in\Phi} D_{KL}[q_{\phi}(z\vert x)\vert\vert p_{\theta}(z\vert x)]= \\
\argmin_{\phi\in\Phi} \int q_{\phi}(z\vert x) \log{\frac{q_{\phi}(z\vert x)}{p_{\theta}(z\vert x)}}dz \end{multline}$$
- we note two properties of $D_{KL}$:
$$ \begin{align}
D_{KL}[q\vert\vert p] &\ge0 \ \forall p,q \label{b}\tag{2} \\
D_{KL}[q\vert\vert p] &= 0 \iff p = q \ \text{a.e.} \label{c}\tag{3} \end{align}$$
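A quick numerical check of these two properties, using PyTorch's `torch.distributions` purely for convenience (any library with a closed-form Gaussian KL would do):
```python
import torch
from torch.distributions import Normal, kl_divergence

p = Normal(loc=0.0, scale=1.0)
q = Normal(loc=1.0, scale=0.5)

print(kl_divergence(q, p))   # strictly positive, since q != p  (Eq. 2)
print(kl_divergence(p, p))   # exactly zero, since the distributions coincide  (Eq. 3)
```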
---
### ELBO
- Our primary goal is still to estimate $\theta^*$ through Maximum (Marginal) Likelihood Estimation
- We can rewrite $\log{p_{\theta}(x_i)}$ in terms of $D_{KL}[q_{\phi}\vert\vert p_{\theta}]$ as
$$ \log{p_{\theta}(x_i)} = D_{KL}[q_{\phi}(z\vert x_i)\vert\vert p_{\theta}(z\vert x_i)]+ \mathcal{L}(\phi,\theta;x_i) \label{d}\tag{4}$$
- $\mathcal{L}(\phi,\theta;x_i) = \int q_{\phi}(z\vert x_i)\log{\frac{p_{\theta}(x_i, z)}{q_{\phi}(z\vert x_i)}} dz$ is called ELBO (**E**vidence **L**ower **BO**und)
- see nested slides for proofs
----
### Simpler (indirect) proof
$$ \begin{multline}D_{KL}[q_{\phi}(z\vert x_i)\vert\vert p_{\theta}(z\vert x_i)]= \mathbb{E}_{q_{\phi}(z\vert x_i)}\left[\log{\frac{q_{\phi}( z\vert x_i)}{p_{\theta}(z\vert x_i)}}\right] = \\ \mathbb{E}_{q_{\phi}(z\vert x_i)}\left[\log{\frac{q_{\phi}(z\vert x_i)p_{\theta}(x_i)}{p_{\theta}(x_i, z)}}\right]= \\ \mathbb{E}_{q_{\phi}(z\vert x_i)}\left[\log{\frac{q_{\phi}(z\vert x_i)}{p_{\theta}(x_i, z)}}\right]+\mathbb{E}_{q_{\phi}(z\vert x_i)}\left[\log{p_{\theta}(x_i)}\right]=\\ -\mathcal{L}(\phi,\theta;x_i)+\log{p_{\theta}(x_i)} \end{multline} $$
----
### Alternative proof: step 1
$$ \begin{multline}\log{p_{\theta}(x_i)} = \log{p_{\theta}(x_i)}\int q_{\phi}(z\vert x_i)dz = \\
\int q_{\phi}(z\vert x_i)\log{p_{\theta}( x_i)} dz = \\
\int q_{\phi}(z\vert x_i)\log{\frac{p_{\theta}( x_i, z)}{p_{\theta}(z\vert x_i)}} dz = \\
\int q_{\phi}(z\vert x_i) \log{\frac{p_{\theta}( x_i, z)}{p_{\theta}(z\vert x_i)}\frac{q_{\phi}(z\vert x_i)}{q_{\phi}(z\vert x_i)}} dz \end{multline}$$
----
### Step 2
$$ \begin{multline}\log{p_{\theta}( x_i)} = \int q_{\phi}(z\vert x_i) \log{\frac{q_{\phi}(z\vert x_i)}{p_{\theta}(z\vert x_i)}\frac{p_{\theta}( x_i, z)}{q_{\phi}(z\vert x_i)}}dz = \\
\int q_{\phi}(z\vert x_i) \left( \log{\frac{q_{\phi}(z\vert x_i)}{p_{\theta}(z\vert x_i)}}+\log{\frac{p_{\theta}( x_i, z)}{q_{\phi}(z\vert x_i)}} \right) dz= \\
D_{KL}[q_{\phi}(z\vert x_i)\vert\vert p_{\theta}(z\vert x_i)]+ \mathcal{L}(\phi,\theta; x_i) \end{multline}$$
---
- The ELBO is so called because it's a lower bound on the marginal log-likelihood (or <span style="color:green">_evidence_</span>), since $D_{KL}[q\vert\vert p]\ge0$:
$$ \log{p_{\theta}( x_i)} \geq \mathcal{L}(\phi,\theta; x_i) $$
- Thus, maximizing the ELBO moves us in the direction of maximizing the marginal log-likelihood, without having to compute the intractable integral
---
### The BBVI (Black Box Variational Inference) objective
- Dropping the $D_{KL}$ term and summing over all data points, our learning objective becomes
$$ \max_{\theta}\sum_{i=1}^N\max_{\phi}\mathcal{L}(\phi,\theta;x_i) \label{e}\tag{5}$$
- we learn an LVM (Latent Variable Model) by maximizing the ELBO with respect to both the model parameters $\theta$ and the variational parameters $\phi_i$ for each data point $x_i$.
----
**Note** that until now, we haven't mentioned either neural networks or VAE. The approach has been very general, and it could apply to any Latent Variable Model which has the DAG representation shown [initially](#/4).
---
### ELBO properties (1/2)
- because of Eq.$\ref{c}$, maximizing the ELBO for a data point $x_i$ is equivalent to maximizing the marginal log-likelihood iff $\exists\phi_i^*\mid q_{\phi_i^*}(z\vert x_i)=p_{\theta}(z\vert x_i)$
- If such a $\phi_i^*$ doesn't exist, maximizing the ELBO no longer maximizes the marginal log-likelihood exactly, and the remaining gap (_approximation gap_) equals $D_{KL}[q_{\phi_i}(z\vert x_i)\vert\vert p_{\theta}(z\vert x_i)]$
<p align="center">
<img width="600" src="https://i.imgur.com/q6XMxtY.png">
</p>
###### from Stefano Ermon, Aditya Grover, [Latent Variable Models](https://deepgenerativemodels.github.io/assets/slides/cs236_lecture6.pdf/)
---
### ELBO properties (2/2)
- The ELBO can be rewritten as
$$ \mathbb{E}_{q_{\phi}(z \vert x_i)}[\log{p_{\theta}(x_i\vert z)}] - D_{KL}[q_{\phi}(z\vert x_i)\vert\vert p_{\theta}(z)] \label{f}\tag{6} $$
- combining Eq.$\ref{e}$ and Eq.$\ref{f}$, we can interpret the term
$$\sum_{i=1}^N\mathbb{E}_{q_{\phi}(z\vert x_i)}[\log{p_{\theta}( x_i\vert z)}]$$
as a _reconstruction quality_, and the term
$$-\sum_{i=1}^N D_{KL}[q_{\phi}(z\vert x_i)\vert\vert p_{\theta}(z)]$$
as a _regularizer_, since it penalizes $q_{\phi}(z\vert x_i)$ for being too dissimilar from the prior $p_{\theta}(z)$.
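To make Eq. $\ref{f}$ concrete, here is a minimal sketch of the two terms for one data point, assuming a diagonal-Gaussian $q_{\phi}(z\vert x)$, an $\mathcal{N}(0,I)$ prior and a Gaussian decoder; `encoder` and `decoder` are hypothetical networks returning means and log-variances:
```python
import torch
from torch.distributions import Normal, kl_divergence

def elbo(x, encoder, decoder, n_mc=1):
    mu_z, logvar_z = encoder(x)
    q = Normal(mu_z, torch.exp(0.5 * logvar_z))                    # q_phi(z|x)
    prior = Normal(torch.zeros_like(mu_z), torch.ones_like(mu_z))  # p_theta(z) = N(0, I)

    z = q.rsample((n_mc,))                                         # differentiable samples (see the reparametrization trick, later)
    mu_x, logvar_x = decoder(z)
    recon = Normal(mu_x, torch.exp(0.5 * logvar_x)).log_prob(x).sum(-1).mean()  # reconstruction quality
    kl = kl_divergence(q, prior).sum(-1)                           # regularizer, in closed form for Gaussians
    return recon - kl
```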
---
## The SGVB estimator
- We solved intractability, but how do we maximize the BBVI objective in a _scalable_ way?
- Stochastic Gradient Ascent!
- We need SG-based estimators for the ELBO and its gradient with respect to $\theta$ and $\phi$
$$ \nabla_{\theta,\phi}\mathbb{E}_{q_{\phi}(z\vert x_i)}[\log{p_{\theta}(x_i, z)}-\log{q_{\phi}(z\vert x_i)}] $$
- The gradient with respect to $\theta$ is immediate:
$$ \mathbb{E}_{q_{\phi}(z\vert x_i)}[\nabla_{\theta}\log{p_{\theta}(x_i, z)}] $$
we can estimate the expectation using Monte Carlo.
---
- The gradient with respect to $\phi$ is more badass:
$$ \nabla_{\phi}\mathbb{E}_{q_{\phi}(z\vert x_i)}[\log{p_{\theta}(x_i, z)}-\log{q_{\phi}(z\vert x_i)}] $$
- since the distribution over which the expectation is taken depends itself on $\phi$, we cannot simply swap the gradient and the expectation.
- The key contribution of [Kingma and Welling, 2014](https://arxiv.org/pdf/1312.6114.pdf) is the introduction of a low-variance estimator for this gradient, the SGVB (Stochastic Gradient Variational Bayes) estimator, based on the _reparametrization trick_.
---
### The reparametrization trick
- For many differentiable parametric families, it's possible to draw samples $\tilde{z}\sim q_{\phi}(z\vert x_i)$ by sampling from a simple distribution $p(\epsilon)$ (e.g., $\mathcal{N}(0,I)$) and then applying a differentiable, deterministic function $g_{\phi}(\epsilon, x)$ to $\epsilon$ (e.g., $g_{\phi}(\epsilon, x)=\mu_{\phi}(x)+\boldsymbol{\sigma}_{\phi}(x)\odot\epsilon$)
- The resulting random variable $\tilde{z} = g_{\phi}(\epsilon, x)$ is indeed distributed as $q_{\phi}(z\vert x)$; a code sketch follows the figure (image from Durk Kingma's NIPS 2015 workshop slides)
<p align="center">
<img width="600" src="https://i.imgur.com/xvH1onJ.png">
</p>
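A minimal location-scale sketch of the trick for a diagonal Gaussian (the values of `mu` and `log_var` would normally come from an encoder network; here they are made up):
```python
import torch

mu = torch.tensor([0.5, -1.0], requires_grad=True)       # variational parameters (illustrative values)
log_var = torch.tensor([0.1, 0.3], requires_grad=True)

eps = torch.randn_like(mu)                  # eps ~ p(eps) = N(0, I): no parameters involved
z = mu + torch.exp(0.5 * log_var) * eps     # z = g_phi(eps) = mu + sigma * eps, so z ~ N(mu, sigma^2)

# z is a deterministic, differentiable function of (mu, log_var),
# so gradients of any f(z) flow back to the variational parameters:
f = (z ** 2).sum()
f.backward()
print(mu.grad, log_var.grad)
```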
---
- The biggest selling point of the reparametrization trick is that we can now write $\nabla_{\phi}\mathbb{E}_{q_{\phi}(z\vert x_i)}[f(z)]$ for any function $f(z)$ as
$$ \nabla_{\phi}\mathbb{E}_{q_{\phi}(z\vert x_i)}[f(z)]=\nabla_{\phi}\mathbb{E}_{p(\epsilon)}[f(g_{\phi}(\epsilon,x_i))]=\mathbb{E}_{p(\epsilon)}[\nabla_{\phi}f(g_{\phi}(\epsilon,x_i))]$$
- Using Monte Carlo to estimate this expectation, we obtain the SGVB estimator, which [has lower variance than other SG-based estimators such as the score function estimator](https://arxiv.org/abs/1401.4082), allowing us to learn more complicated models.
- What about discrete latent variables? See [van den Oord et al., 2017](https://arxiv.org/abs/1711.00937), with their famous VQ-VAE.
---
## The AEVB algorithm
SGVB allows us to estimate the ELBO _for a single datapoint_, but we need to estimate it for all $N$. To do this, we use minibatches of size $M$ (algorithm below from Kingma and Welling, 2014)
<p align="center">
<img width="1000" src="https://i.imgur.com/15RcirL.png">
</p>
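In code, one AEVB step is just stochastic gradient ascent on the minibatch ELBO estimate; a sketch under the same assumptions as before, reusing the hypothetical `encoder`, `decoder` and `elbo` from earlier (`loader` and `num_epochs` are assumed to exist):
```python
import torch

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for epoch in range(num_epochs):
    for x_batch in loader:                                   # minibatch of size M
        loss = -torch.stack([elbo(x, encoder, decoder) for x in x_batch]).mean()
        opt.zero_grad()
        loss.backward()                                      # gradients w.r.t. both theta and phi
        opt.step()                                           # one joint SGVB update
```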
---
### Amortized inference
- rather than solving a separate optimization problem for each data point $x_i$, it's smarter to **learn** a single mapping $f_{\phi}:\mathcal{X}\to\mathcal{Q}$ which outputs the variational posterior $q_{\phi}(z\vert x_i)$ for any input (ideally, one such mapping for each value of $\theta$)
- we need an _encoding function_ which can efficiently learn complicated, nonlinear mappings between high-dimensional spaces, i.e., Neural Networks!
- To actually save computation, we interleave the optimization on $\theta$ and on $\phi$ for each minibatch.
- This way, by introducing neural networks we _amortize_ the cost of variational inference (computing $q_{\phi}(z\vert x_1),\dots,q_{\phi}(z\vert x_N)$)
- If we also use neural networks to parametrize $p_{\theta}(z)$ and $p_{\theta}(x\vert z)$, the result is the **Variational Autoencoder**
---
### The vanilla VAE
- $p_{\theta}(z)=\mathcal{N}(0,I)$ (thus the prior has no parameters)
- $p_{\theta}(x\vert z)=\mathcal{N}(x;\mu_{\theta}(z),\boldsymbol{\sigma}^2_{\theta}(z)\odot I)$, where $\mu_{\theta}(z)$, $\boldsymbol{\sigma}_{\theta}^2(z)$ are neural networks. For a latent sample $z$, this network _decodes_ the parameters of $p_{\theta}(x\vert z)$ which give the best input reconstruction (**decoder**).
- $q_{\phi}(z\vert x)=\mathcal{N}(z;\mu_{\phi}(x),\boldsymbol{\sigma}^2_{\phi}(x)\odot I)$, where $\mu_{\phi}(x)$, $\boldsymbol{\sigma}_{\phi}^2(x)$ are again neural networks. For an input sample $x$, this network learns the variational parameters which correspond to an optimal _latent code_ $z$ (**encoder**)
<p align="center">
<img width="600" src="https://i.imgur.com/YC1zAsj.png">
</p>
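A self-contained PyTorch sketch of this architecture (layer sizes, hidden width and the MLP structure are illustrative choices, not the ones from the paper):
```python
import torch
import torch.nn as nn

class VanillaVAE(nn.Module):
    def __init__(self, d=784, r=20, h=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, h), nn.ReLU())                 # encoder trunk
        self.enc_mu, self.enc_logvar = nn.Linear(h, r), nn.Linear(h, r)      # parameters of q_phi(z|x)
        self.dec = nn.Sequential(nn.Linear(r, h), nn.ReLU())                 # decoder trunk
        self.dec_mu, self.dec_logvar = nn.Linear(h, d), nn.Linear(h, d)      # parameters of p_theta(x|z)

    def encode(self, x):
        h = self.enc(x)
        return self.enc_mu(h), self.enc_logvar(h)

    def decode(self, z):
        h = self.dec(z)
        return self.dec_mu(h), self.dec_logvar(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)              # reparametrization trick
        return self.decode(z), mu, logvar
```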
----
**Note**: even if $p_{\theta}(z)$ and $p_{\theta}(x\vert z)$ are multivariate Gaussian, this doesn't prevent $p_{\theta}(x)$ from being very complex, because according to Eq. $\ref{a}$ it's a mixture of an _infinite_ number of Gaussians.
---
#### Learning the VAE
- The weights of both neural networks are learnt at the same time using AEVB: note that with this simple choice of $p_{\theta}(z)$ and $q_{\phi}(z\vert x)$, the term $D_{KL}[q_{\phi}(z\vert x_i)\vert\vert p_{\theta}(z)]$ (the regularization term) has an analytical expression, thus the Monte Carlo estimate is only needed for the reconstruction term and its gradient.
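For this Gaussian choice, $D_{KL}[\mathcal{N}(\mu,\boldsymbol{\sigma}^2\odot I)\,\Vert\,\mathcal{N}(0,I)] = -\tfrac12\sum_j\left(1+\log\sigma_j^2-\mu_j^2-\sigma_j^2\right)$, so the per-batch loss (negative ELBO) can be sketched as below, reusing the hypothetical `VanillaVAE` above:
```python
import math
import torch

def neg_elbo(model, x):
    (mu_x, logvar_x), mu, logvar = model(x)
    # reconstruction term: log N(x; mu_x, sigma_x^2), summed over dimensions (single-sample Monte Carlo)
    recon = -0.5 * (((x - mu_x) ** 2) / logvar_x.exp() + logvar_x + math.log(2 * math.pi)).sum(-1)
    # analytical KL[q_phi(z|x) || N(0, I)]
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
    return (kl - recon).mean()
```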
#### Generating samples
- To generate new samples, we draw a latent code $z\sim\mathcal{N}(0,I)$ and propagate it through the decoder; the encoder is not used anymore.
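A minimal generation sketch, again with the hypothetical `VanillaVAE` (keeping the decoder mean as the generated image):
```python
import torch

model = VanillaVAE()                    # in practice, a trained model
model.eval()
with torch.no_grad():
    z = torch.randn(16, 20)             # 16 latent codes z ~ N(0, I), with r = 20
    mu_x, _ = model.decode(z)           # decode; the encoder plays no role here
```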
---
### Experimental results
The samples generated by the original VAE on MNIST have the classic, blurry "VAE" feeling:
<p align="center">
<img width="400" src="https://i.imgur.com/40YNAg0.png">
</p>
---
More recent results training the Bidirectional-Inference Variational Autoencoder (BIVA) ([Maaløe et al., 2019](https://arxiv.org/abs/1902.02102)) on CelebA are much better:
<p align="center">
<img width="1000" src="https://i.imgur.com/wk54TIH.png">
</p>
---
The Two-Stage VAE ([Dai & Wipf, 2019](https://arxiv.org/abs/1903.05789)) does even better:
![](https://i.imgur.com/NfZ8yJV.png)
It still doesn't match the performance of models such as StyleGAN or BigGAN, but don't lose hope!
---
The impressive DeepMind VQ-VAE-2 from [Razavi et al., 2019](https://arxiv.org/abs/1906.00446) not only generates images of comparable visual quality to the most advanced GAN models...
![](https://i.imgur.com/to6HuLc.png =300x)
...but it also performs better according to the newly introduced Classification Accuracy Score (CAS):
![](https://i.imgur.com/c13jjeJ.png =400x)
---
### VAE can write! (sort of)
The topic-guided variational autoencoder (TGVAE) ([Wang et al., 2019](https://arxiv.org/abs/1903.07137)) was presented at ACL 2019 and is a nice example of a language model built by combining topic modelling with a VAE
![](https://i.imgur.com/FlXIWtO.png)
---
### Wrap up
- the VAE is a deep latent variable model, which can learn complex input distributions
- the training algorithm is a specific instantiation of a more general algorithm, the AEVB, which is a "Swiss Army knife" to learn models with intractable likelihood over large datasets
- the (vanilla) VAE is an instantiation of AEVB, where all the distributions involved are parametrized as multivariate Gaussians
- compared with other generative models (e.g., GANs), the VAE enjoys stable training and an interpretable encoder/inference network, but lower sample quality
- However, the latest VAE models reduce the quality gap considerably and seem more effective at representation learning for downstream tasks, at least as measured by the CAS metric
- VAE rocks :tada:
---
# Thank you for your attention! :sheep:
###### Original blog post with some more details: https://hackmd.io/JvOUGcFqR9SWpdA4P8FlLQ?view
{"metaMigratedAt":"2023-06-14T21:13:37.428Z","metaMigratedFrom":"YAML","title":"An intro to Variational Autoencoders","breaks":true,"description":"View the slide with \"Slide Mode\".","slideOptions":"{\"theme\":\"white\"}","contributors":"[{\"id\":\"a1d75923-b929-4ee9-a1de-d0625e64ec94\",\"add\":39814,\"del\":23580}]"}