An intro to Variational Autoencoders

# An intro to Variational Autoencoders

slides: https://hackmd.io/p/SkR7GhwqN#/

![](https://i.imgur.com/YC1zAsj.png =600x)

---

- VAEs are _deep latent variable generative models_
- They are applied to:
  - density estimation (image/sound generation, missing data imputation, _graph generation_)
  - automatic molecule design
  - semi-supervised learning
  - representation learning for downstream tasks
  - model-based reinforcement learning (to build a **world model**)

$$\DeclareMathOperator\supp{supp}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}$$

<style>
.reveal {
  font-size: 30px;
}
</style>

---

### _What's a generative model?_

- A model (such as a GAN) which learns the probability distribution $p(x)$ over the input space $\mathcal{X}$
- after training, we can sample from (our estimate of) $p(x)$

![](https://i.imgur.com/n8tVcHc.png)

---

### _What's a latent variable model?_

- The random vector $x\in\mathcal{X}$ is modeled as a (possibly very complicated) function of a random vector $z\in\mathcal{Z}$, with $r=\dim{\mathcal{Z}} \ll d=\dim{\mathcal{X}}$
- Unlike $x$, $z$ is not observed, thus we call its components _latent variables_
- it makes sense if we think that input samples lie on a manifold of dimension $r \ll d$ (e.g., ImageNet samples)

---

- in practice, we assume a data generating process represented in the following DAG (Directed Acyclic Graph)
$$
\begin{align}
z &\sim p_{\theta^*}(z) \\
x\vert z &\sim p_{\theta^*} (x\vert z)
\end{align}$$

```graphviz
digraph G {
  splines=line;
  subgraph cluster1 {
    node [style=filled, shape=circle];
    edge [color=blue]
    z[fillcolor=white, color=black]
    z -> x;
  }
  theta[label = "θ", shape=circle]
  edge [color=black, style="dashed"]
  theta -> z [constraint=false]
  theta -> x
}
```

- $p_{\theta^*}(z)$ and $p_{\theta^*}(x\vert z)$ come from (possibly different) parametric families

---

### Estimating the model

- **Goal**: given an iid dataset of size $N$, $X=\{x_i\}_{i=1}^N$, estimate $\theta^*$
- standard 
approach: maximize data likelihood, marginalized over latent variables
$$ \hat{\theta}=\argmax_{\theta\in\Theta}p_{\theta}(X) =\argmax_{\theta\in\Theta}\sum_{i=1}^N\log{p_{\theta}(x_i)} $$

---

### Two challenges

1. computing $p_{\theta}(x_i)$ requires computing the intractable integral $$ p_{\theta}(x_i) = \int p_{\theta}(x_i\vert z)p_{\theta}(z)dz \label{a}\tag{1} $$
2. the integral must be recomputed for all $N$ samples of a large dataset. This rules out batch optimization or sampling-based solutions such as [Monte Carlo EM](https://arxiv.org/abs/1206.4768)

We need to get _waaay_ smarter.

---

## Enter Variational Inference

- To deal with the intractability of the marginal likelihood, we use [Variational Inference](https://people.eecs.berkeley.edu/~jordan/papers/variational-intro.pdf)
- Introduce a family $\mathcal{Q}$ of approximations $q_{\phi}(z\vert x)$ to the true posterior $p_{\theta}(z\vert x)$, and find $\phi^*$ such that $q_{\phi^*}(z\vert x)$ is "closest" to $p_{\theta}(z\vert x)$ according to some measure of dissimilarity
- we only assume that $q_{\phi}(z\vert x)$ is differentiable with respect to $\phi$ (_black box VI_)

---

#### [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback-Leibler_divergence)

- **Goal**: find $\phi^*$ which minimizes the KL-divergence between $q_{\phi^*}(z\vert x)\in\mathcal{Q}$ and $p_{\theta}(z\vert x)$
$$\begin{multline}
\phi^* = \argmin_{\phi\in\Phi} D_{KL}[q_{\phi}(z\vert x)\vert\vert p_{\theta}(z\vert x)]= \\
\argmin_{\phi\in\Phi} \int q_{\phi}(z\vert x) \log{\frac{q_{\phi}(z\vert x)}{p_{\theta}(z\vert x)}}dz
\end{multline}$$
- we note two properties of $D_{KL}$:
$$ \begin{align}
D_{KL}[q\vert\vert p] &\ge0 \ \forall p,q \label{b}\tag{2} \\
D_{KL}[q\vert\vert p] &= 0 \iff p = q \ \text{a.e.} \label{c}\tag{3}
\end{align}$$

---

### ELBO

- Our primary goal is still to estimate $\theta^*$ through Maximum (Marginal) Likelihood Estimation
- We can rewrite $\log{p_{\theta}(x_i)}$ in terms of $D_{KL}[q_{\phi}\vert\vert 
p_{\theta}]$ as
$$ \log{p_{\theta}(x_i)} = D_{KL}[q_{\phi}(z\vert x_i)\vert\vert p_{\theta}(z\vert x_i)]+ \mathcal{L}(\phi,\theta;x_i) \label{d}\tag{4}$$
- $\mathcal{L}(\phi,\theta;x_i) = \int q_{\phi}(z\vert x_i)\log{\frac{p_{\theta}(x_i, z)}{q_{\phi}(z\vert x_i)}} dz$ is called the ELBO (**E**vidence **L**ower **BO**und)
- see nested slides for proofs

----

### Simpler (indirect) proof

$$ \begin{multline}
D_{KL}[q_{\phi}(z\vert x_i)\vert\vert p_{\theta}(z\vert x_i)]= \mathbb{E}_{q_{\phi}(z\vert x_i)}\left[\log{\frac{q_{\phi}(z\vert x_i)}{p_{\theta}(z\vert x_i)}}\right] = \\
\mathbb{E}_{q_{\phi}(z\vert x_i)}\left[\log{\frac{q_{\phi}(z\vert x_i)p_{\theta}(x_i)}{p_{\theta}(x_i, z)}}\right]= \\
\mathbb{E}_{q_{\phi}(z\vert x_i)}\left[\log{\frac{q_{\phi}(z\vert x_i)}{p_{\theta}(x_i, z)}}\right]+\mathbb{E}_{q_{\phi}(z\vert x_i)}\left[\log{p_{\theta}(x_i)}\right]=\\
-\mathcal{L}(\phi,\theta;x_i)+\log{p_{\theta}(x_i)}
\end{multline} $$

----

### Alternative proof: step 1

$$ \begin{multline}
\log{p_{\theta}(x_i)} = \log{p_{\theta}(x_i)}\int q_{\phi}(z\vert x_i)dz = \\
\int q_{\phi}(z\vert x_i)\log{p_{\theta}( x_i)} dz = \\
\int q_{\phi}(z\vert x_i)\log{\frac{p_{\theta}( x_i, z)}{p_{\theta}(z\vert x_i)}} dz = \\
\int q_{\phi}(z\vert x_i) \log{\frac{p_{\theta}( x_i, z)}{p_{\theta}(z\vert x_i)}\frac{q_{\phi}(z\vert x_i)}{q_{\phi}(z\vert x_i)}} dz
\end{multline}$$

----

### Step 2

$$ \begin{multline}
\log{p_{\theta}( x_i)} = \int q_{\phi}(z\vert x_i) \log{\frac{q_{\phi}(z\vert x_i)}{p_{\theta}(z\vert x_i)}\frac{p_{\theta}( x_i, z)}{q_{\phi}(z\vert x_i)}}dz = \\
\int q_{\phi}(z\vert x_i) \left( \log{\frac{q_{\phi}(z\vert x_i)}{p_{\theta}(z\vert x_i)}}+\log{\frac{p_{\theta}( x_i, z)}{q_{\phi}(z\vert x_i)}} \right) dz= \\
D_{KL}[q_{\phi}(z\vert x_i)\vert\vert p_{\theta}(z\vert x_i)]+ \mathcal{L}(\phi,\theta; x_i)
\end{multline}$$

---

- The ELBO is so called because it's a lower bound on the marginal log-likelihood (or _evidence_), since $D_{KL}[q\vert\vert p]\ge0$:
$$ 
\log{p_{\theta}( x_i)} \geq \mathcal{L}(\phi,\theta; x_i) $$
- Thus, maximizing the ELBO moves in the direction of maximizing the marginal log-likelihood, but without having to compute the intractable integral

---

### The BBVI (Black Box Variational Inference) objective

- Dropping the $D_{KL}$ term and summing over all data points, our learning objective becomes
$$ \max_{\theta}\sum_{i=1}^N\max_{\phi_i}\mathcal{L}(\phi_i,\theta;x_i) \label{e}\tag{5}$$
- we learn an LVM (Latent Variable Model) by maximizing the ELBO with respect to both the model parameters $\theta$ and the variational parameters $\phi_i$ for each data point $x_i$.

----

**Note** that until now, we haven't mentioned either neural networks or VAEs. The approach has been very general, and it could apply to any Latent Variable Model which has the DAG representation shown [initially](#/4).

---

### ELBO properties (1/2)

- because of Eq.$\ref{c}$, maximizing the ELBO for a data point $x_i$ is equivalent to maximizing the marginal log-likelihood iff $\exists\phi_i^*\mid q_{\phi_i^*}(z\vert x_i)=p_{\theta}(z\vert x_i)$
- If such a $\phi_i^*$ doesn't exist, then maximizing the ELBO doesn't also maximize the marginal log-likelihood, and the remaining gap (_approximation gap_) is equal to $D_{KL}[q_{\phi_i}(z\vert x_i)\vert\vert p_{\theta}(z\vert x_i)]$

<img width="600" src="https://i.imgur.com/q6XMxtY.png">

###### from Stefano Ermon, Aditya Grover, [Latent Variable Models](https://deepgenerativemodels.github.io/assets/slides/cs236_lecture6.pdf/)

---

### ELBO properties (2/2)

- The ELBO can be rewritten as
$$ \mathbb{E}_{q_{\phi}(z \vert x_i)}[\log{p_{\theta}(x_i\vert z)}] - D_{KL}[q_{\phi}(z\vert x_i)\vert\vert p_{\theta}(z)] \label{f}\tag{6} $$
- combining Eq.$\ref{e}$ and Eq.$\ref{f}$, we can interpret the term $$\sum_{i=1}^N\mathbb{E}_{q_{\phi}(z\vert x_i)}[\log{p_{\theta}( x_i\vert z)}]$$ as a _reconstruction quality_, and the term $$-\sum_{i=1}^N D_{KL}[q_{\phi}(z\vert x_i)\vert\vert p_{\theta}(z)]$$ as a 
_regularizer_, since it penalizes $q_{\phi}(z\vert x_i)$ for being too dissimilar from the prior $p_{\theta}(z)$.

---

## The SGVB estimator

- We solved intractability, but how do we maximize the BBVI objective in a _scalable_ way?
- Stochastic Gradient Ascent!
- We need SG-based estimators for the ELBO and its gradient with respect to $\theta$ and $\phi$
$$ \nabla_{\theta,\phi}\mathbb{E}_{q_{\phi}(z\vert x_i)}[\log{p_{\theta}(x_i, z)}-\log{q_{\phi}(z\vert x_i)}] $$
- The gradient with respect to $\theta$ is immediate:
$$ \mathbb{E}_{q_{\phi}(z\vert x_i)}[\nabla_{\theta}\log{p_{\theta}(x_i, z)}] $$
and we can estimate the expectation using Monte Carlo.

---

- The gradient with respect to $\phi$ is more badass:
$$ \nabla_{\phi}\mathbb{E}_{q_{\phi}(z\vert x_i)}[\log{p_{\theta}(x_i, z)}-\log{q_{\phi}(z\vert x_i)}] $$
- since the expectation is taken with respect to $q_{\phi}$, which itself depends on $\phi$, we cannot simply swap the gradient and the expectation.
- The key contribution of [Kingma and Welling, 2014](https://arxiv.org/pdf/1312.6114.pdf) is the introduction of a low-variance estimator for this gradient, the SGVB (Stochastic Gradient Variational Bayes) estimator, based on the _reparametrization trick_.

---

### The reparametrization trick

- For many differentiable parametric families, it's possible to draw samples $\tilde{z}\sim q_{\phi}(z\vert x_i)$ by sampling $\epsilon$ from a simple distribution $p(\epsilon)$ (e.g. 
$\mathcal{N}(0,I)$), and then applying a differentiable, deterministic function $g_{\phi}(\epsilon, x)$ to $\epsilon$ (e.g., $g_{\phi}(\epsilon, x)=\mu_{\phi}(x)+\sigma_{\phi}(x)\odot\epsilon$)
- The resulting random variable $\tilde{z} = g_{\phi}(\epsilon, x)$ is indeed distributed as $q_{\phi}(z\vert x)$

(image from Durk Kingma's NIPS 2015 workshop slides)

<img width="600" src="https://i.imgur.com/xvH1onJ.png">

---

- The biggest selling point of the reparametrization trick is that we can now write $\nabla_{\phi}\mathbb{E}_{q_{\phi}(z\vert x_i)}[f(z)]$ for any differentiable function $f(z)$ as
$$ \nabla_{\phi}\mathbb{E}_{q_{\phi}(z\vert x_i)}[f(z)]=\nabla_{\phi}\mathbb{E}_{p(\epsilon)}[f(g_{\phi}(\epsilon,x_i))]=\mathbb{E}_{p(\epsilon)}[\nabla_{\phi}f(g_{\phi}(\epsilon,x_i))]$$
- Using Monte Carlo to estimate this expectation, we obtain the SGVB estimator, which [has lower variance than other SG-based estimators such as the score function estimator](https://arxiv.org/abs/1401.4082), allowing us to learn more complicated models.
- What about discrete latent variables? See [van den Oord et al., 2017](https://arxiv.org/abs/1711.00937), with their famous VQ-VAE.

---

## The AEVB algorithm

SGVB allows us to estimate the ELBO _for a single datapoint_, but we need to estimate it for all $N$. To do this, we use minibatches of size $M$

(from Kingma and Welling, 2014)

<img width="1000" src="https://i.imgur.com/15RcirL.png">

---

### Amortized inference

- rather than having to solve an optimization problem for each data point $x_i$, it's smarter to **learn** a mapping $f_{\phi}:\mathcal{X}\to\mathcal{Q}$ (a different one for each value of $\theta$)
- we need an _encoding function_ which can efficiently learn complicated, nonlinear mappings between high-dimensional spaces, i.e., Neural Networks!!
- To actually save computation, we interleave the optimization on $\theta$ and on $\phi$ for each minibatch. 
- This way, by introducing neural networks we _amortized_ the cost of variational inference across all data points ($q_{\phi}(z\vert x_1),\dots,q_{\phi}(z\vert x_N)$)
- If we use neural networks also to parametrize $p_{\theta}(z)$ and $p_{\theta}(x_i\vert z)$, the result is the **Variational Autoencoder**

---

### The vanilla VAE

- $p_{\theta}(z)=\mathcal{N}(0,I)$ (thus the prior has no parameters)
- $p_{\theta}(x\vert z)=\mathcal{N}(x;\mu_{\theta}(z),\boldsymbol{\sigma}^2_{\theta}(z)\odot I)$ where $\mu_{\theta}(z)$, $\sigma_{\theta}^2(z)$ are the outputs of a neural network. For a latent sample $z$, this neural network _decodes_ the parameters of $p_{\theta}(x\vert z)$ which give optimal input reconstruction (**decoder**).
- $q_{\phi}(z\vert x)=\mathcal{N}(z;\mu_{\phi}(x),\boldsymbol{\sigma}^2_{\phi}(x)\odot I)$ (same form as for the decoder). For an input sample $x$, this neural network learns the variational parameters which correspond to an optimal _latent code_ $z$ (**encoder**)

<img width="600" src="https://i.imgur.com/YC1zAsj.png">

----

**Note**: even if $p_{\theta}(z)$ and $p_{\theta}(x\vert z)$ are multivariate Gaussian, this doesn't prevent $p_{\theta}(x)$ from being very complex, because according to Eq. $\ref{a}$ it's a mixture of an _infinite_ number of Gaussians.

---

#### Learning the VAE

- The weights of both neural networks are learnt at the same time using AEVB. Note that with this simple choice of $p_{\theta}(z)$ and $q_{\phi}(z\vert x)$, the term $D_{KL}[q_{\phi}(z\vert x_i)\vert\vert p_{\theta}(z)]$ (the regularization term) has an analytical expression, thus the Monte Carlo estimate is only needed for the reconstruction term and its gradient.

#### Generating samples

- At generation time, we sample a latent code $z\sim\mathcal{N}(0,I)$ and then we propagate it through the decoder, thus the encoder is not used anymore.
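
----

The two ELBO terms used when learning the vanilla VAE can be sketched in NumPy. This is only a minimal illustration under stated assumptions, not the implementation from Kingma and Welling: all function names are made up for this slide, and the fixed linear maps `toy_encoder` and `toy_decoder` are hypothetical stand-ins for the encoder and decoder neural networks. The KL term uses the closed form for a diagonal Gaussian against $\mathcal{N}(0,I)$, and the reconstruction term is a Monte Carlo estimate built with the reparametrization trick.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL[N(mu, diag(exp(log_var))) || N(0, I)]."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def reparametrize(mu, log_var, eps):
    """z = mu + sigma * eps, where eps is a draw from N(0, I)."""
    return mu + np.exp(0.5 * log_var) * eps

def gaussian_log_likelihood(x, mean, log_var):
    """log N(x; mean, diag(exp(log_var)))."""
    return -0.5 * np.sum(
        np.log(2.0 * np.pi) + log_var + (x - mean) ** 2 / np.exp(log_var)
    )

def elbo_estimate(x, encoder, decoder, rng, n_samples=1):
    """Single-datapoint ELBO: Monte Carlo reconstruction term minus analytic KL."""
    mu_z, log_var_z = encoder(x)
    reconstruction = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu_z.shape)
        z = reparametrize(mu_z, log_var_z, eps)
        mean_x, log_var_x = decoder(z)
        reconstruction += gaussian_log_likelihood(x, mean_x, log_var_x)
    return reconstruction / n_samples - kl_to_standard_normal(mu_z, log_var_z)

# Hypothetical stand-ins for the networks: fixed random linear maps (d=4, r=2).
_weights_rng = np.random.default_rng(0)
W_ENC = 0.1 * _weights_rng.standard_normal((2, 4))
W_DEC = 0.1 * _weights_rng.standard_normal((4, 2))

def toy_encoder(x):
    # Returns the mean and log-variance of q(z | x).
    return W_ENC @ x, np.full(2, -1.0)

def toy_decoder(z):
    # Returns the mean and log-variance of p(x | z).
    return W_DEC @ z, np.zeros(4)

x = np.ones(4)
print(elbo_estimate(x, toy_encoder, toy_decoder, np.random.default_rng(1), n_samples=10))
```

In a real VAE the two toy functions would be neural networks, and an automatic differentiation library would backpropagate through `reparametrize` to obtain the gradients with respect to $\theta$ and $\phi$.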
---

### Experimental results

The samples generated by the original VAE on MNIST have the classic, blurry "VAE" feeling:

<img width="400" src="https://i.imgur.com/40YNAg0.png">

---

More recent results, obtained by training the Bidirectional-Inference Variational Autoencoder (BIVA) ([Maaløe et al., 2019](https://arxiv.org/abs/1902.02102)) on CelebA, are much better:

<img width="1000" src="https://i.imgur.com/wk54TIH.png">

---

The Two-Stage VAE ([Dai & Wipf, 2019](https://arxiv.org/abs/1903.05789)) is even better

![](https://i.imgur.com/NfZ8yJV.png)

But it still doesn't match the performance of models such as StyleGAN or BigGAN. But don't lose hope!

---

The impressive DeepMind VQ-VAE-2 from [Razavi et al., 2019](https://arxiv.org/abs/1906.00446) not only generates images of comparable visual quality to the most advanced GAN models...

![](https://i.imgur.com/to6HuLc.png =300x)

...but it also performs better according to the newly introduced Classification Accuracy Score (CAS):

![](https://i.imgur.com/c13jjeJ.png =400x)

---

### VAE can write! (sort of)

The topic-guided variational autoencoder (TGVAE) ([Wang et al., 2019](https://arxiv.org/abs/1903.07137)) was presented at ACL 2019, and it's a nice example of a Language Model implemented using topic modelling and VAE

![](https://i.imgur.com/FlXIWtO.png)

---

### Wrap up

- the VAE is a deep latent variable model, which can learn complex input distributions
- the training algorithm is a specific instantiation of a more general algorithm, the AEVB, which is a "Swiss Army knife" to learn models with intractable likelihood over large datasets
- the (vanilla) VAE is an instantiation of AEVB, where all the distributions involved are parametrized as multivariate Gaussians
- compared to other generative models (e.g., GANs), the VAE enjoys stable training and an interpretable encoder/inference network, but has lower sample quality
- However, the latest VAE models reduce the quality gap considerably, and seem to be more effective at representation learning for downstream tasks, at least as measured by the CAS metric
- VAE rocks :tada:

---

# Thank you for your attention! :sheep:

###### Original blog post with some more details: https://hackmd.io/JvOUGcFqR9SWpdA4P8FlLQ?view