Variational autoencoders 1 - the basics
==

$$\DeclareMathOperator\supp{supp} \DeclareMathOperator*{\argmax}{arg\,max}$$

The Variational Autoencoder (VAE) is a not-so-new-anymore Latent Variable Model ([Kingma & Welling, 2014](https://arxiv.org/pdf/1312.6114.pdf)) which, by introducing a probabilistic interpretation of autoencoders, allows us not only to estimate the variance/uncertainty in the predictions, but also to inject domain knowledge through the use of informative priors, and possibly to make the latent space more interpretable. VAEs have various applications:

- density estimation and sampling from the estimated density, i.e., data generation (e.g., image/sound/text generation and _missing data imputation_)[^1]
- semi-supervised learning
- representation learning for downstream tasks

## VAEs are generative models ##

Unlike classical (sparse, denoising, etc.) autoencoders, VAEs are probabilistic *generative* models, like GANs (Generative Adversarial Networks). By *generative* we denote a model which learns the probability distribution $p(\mathbf{x})$ over the input space $\mathcal{X}$. This means that after training such a model, we can sample from our approximation of $p(\mathbf{x})$. If our training set contains handwritten digits ([MNIST](http://yann.lecun.com/exdb/mnist/)), then the trained generative model is able to create images which look like handwritten digits, even though they're not "copies" of the images in the training set. Similarly, if our training set contains natural images ([CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html)) or celebrity faces ([CelebA](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html)), it will generate images which look like photographs.

Learning the distribution of the images in the training set implies that images which look like handwritten digits (for example) have a high probability of being generated, while images which look like the Jolly Roger or random noise have a low probability. In other words, it means learning about the dependencies among pixels: if our image is a $28\times 28=784$ pixel grayscale image from MNIST, the model should learn that if a pixel is very bright, then there's a significant probability that some neighboring pixels are bright too; that if we have a long, slanted line of bright pixels, we may have another smaller, horizontal line of pixels above it (a 7); etc.

![](https://i.imgur.com/n8tVcHc.png)

## VAEs are Latent Variable models ##

The VAE is a _Latent Variable Model_ (LVM): this means that $\mathbf{x}$, the random vector of the 784 pixel intensities (the _observed_ variables), is modeled as a (possibly very complicated) function of a random vector $\mathbf{z}\in\mathcal{Z}$ of lower dimensionality, whose components are unobserved (_latent_) variables. Concisely, we assume a data generating process described by the following DAG (Directed Acyclic Graph):

$$
\begin{align}
\mathbf{z} &\sim p_{\theta^*}(\mathbf{z}) \\
\mathbf{x}\vert\mathbf{z} &\sim p_{\theta^*}(\mathbf{x}\vert\mathbf{z})
\end{align}
$$

![](https://i.imgur.com/gnwIrgv.png)

We assume that both the prior $p_{\theta^*}(\mathbf{z})$ and the likelihood $p_{\theta^*}(\mathbf{x}\vert\mathbf{z})$ come from some parametric families (possibly different).

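To make the data generating process concrete, here is a minimal NumPy sketch of ancestral sampling from such a two-level model. The specific choices (a standard Gaussian prior and a `tanh` map plus Gaussian noise for the likelihood) are arbitrary illustrations, not the ones used later in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_lvm(n_samples, d_z=2, d_x=784):
    """Ancestral sampling from a two-level latent variable model:
    first z ~ p(z), then x ~ p(x | z)."""
    # Prior over the latent code: a standard Gaussian (an arbitrary, common choice).
    z = rng.standard_normal((n_samples, d_z))
    # Likelihood: an arbitrary nonlinear map of z plus Gaussian noise, standing in
    # for the "possibly very complicated function" mentioned in the text.
    W = rng.standard_normal((d_z, d_x)) / np.sqrt(d_z)
    mean_x = np.tanh(z @ W)
    x = mean_x + 0.1 * rng.standard_normal((n_samples, d_x))
    return z, x

z, x = sample_lvm(5)
print(z.shape, x.shape)  # (5, 2) (5, 784): low-dimensional z, high-dimensional x
```
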
Latent variable models are also sometimes called _hierarchical_ or _multilevel_ models: they use the rules of conditional probability to specify complicated probability distributions over high-dimensional spaces, by composing simpler probability density functions. The original VAE has only two levels, but recently [a paper on a deep hierarchical VAE](https://arxiv.org/abs/1902.02102) with multiple levels of latent variables has been published (note that the hierarchy of latent variables in a probabilistic model has nothing to do with the layers of a neural network).

When does a latent variable model make intuitive sense? For example, in the MNIST case we think that the handwritten digits belong to a manifold of dimension much smaller than the dimension of $\mathbf{x}$, because the vast majority of random arrangements of 784 pixel intensities don't look at all like handwritten digits. The representation of the image should be equivariant to certain transformations (e.g., rotations, translations and small deformations) but not to others. So in this case it makes sense to think that the image samples are generated by taking samples (_which we don't observe_) in a sample space of much smaller dimension, and then transforming them according to some complicated function.

## Estimating the model

We are now given a dataset consisting of $N$ i.i.d. samples, $\mathbf{X}=\{\mathbf{x}_i\}_{i=1}^N$ (note again that we do **not** observe $\mathbf{z}$), and we want to estimate $\theta^*$. The standard "recipe" would be to maximize the likelihood of the data (marginalized over the latent variables), i.e., Maximum Likelihood Estimation:

$$ \hat{\theta}=\argmax_{\theta\in\Theta}p_{\theta}(\mathbf{X}) =\argmax_{\theta\in\Theta}\sum_{i=1}^N\log{p_{\theta}(\mathbf{x}_i)} $$

However, we're immediately faced with two challenges:

1. computing $p_{\theta}(\mathbf{x}_i)$ requires solving the integral
   $$ p_{\theta}(\mathbf{x}_i) = \int p_{\theta}(\mathbf{x}_i\vert\mathbf{z})p_{\theta}(\mathbf{z})d\mathbf{z} \label{a}\tag{1} $$
   which is often intractable[^2], even in the simple case of a one-hidden-layer nonlinear neural network.
2. we need to compute the integral for _all_ $N$ data samples $\mathbf{x}_i$ of a large dataset, which rules out batch optimization as well as sampling-based solutions such as [Monte Carlo EM](https://arxiv.org/abs/1206.4768), which would require an expensive sampling step for each $\mathbf{x}_i$.

We need to get _waaay_ smarter.

### Enter Variational Inference ###

To solve the first problem, i.e., the intractability of the marginal likelihood, we use an age-old (in ML years) trick: Variational Inference (introduced in [Jordan et al., 1999](https://people.eecs.berkeley.edu/~jordan/papers/variational-intro.pdf) and nicely reviewed in [Blei et al., 2018](https://arxiv.org/pdf/1601.00670.pdf)). VI transforms the inference problem into an optimization problem. This way, we get rid of integration, but we pay a price, as we'll see later.

**Fair warning**: this section is more mathy than the rest. I'll introduce necessary concepts such as the [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback-Leibler_divergence), but you may want to skip to the next section on a first read.

#### The Kullback-Leibler divergence

VI starts by introducing a family $\mathcal{Q}$ of parametric approximations $q_{\phi}(\mathbf{z}\vert\mathbf{x})$ to the true posterior distribution $p_{\theta}(\mathbf{z}\vert\mathbf{x})$, indexed by $\phi$.

The goal of VI is to find the value(s) of $\phi$ such that $q_{\phi}(\mathbf{z}\vert\mathbf{x})\in\mathcal{Q}$ is "closest" to $p_{\theta}(\mathbf{z}\vert\mathbf{x})$ in some sense. We only assume that $q_{\phi}(\mathbf{z}\vert\mathbf{x})$ is differentiable with respect to $\phi$ (_black box VI_), but we don't make further assumptions, such as, for example, a fully factorized form for $q_{\phi}$, as in the case of mean field variational inference. The measure of discrepancy we seek to minimize is the _Kullback-Leibler divergence_

$$D_{KL}(q_{\phi}(\mathbf{z}\vert\mathbf{x})\vert\vert p_{\theta}(\mathbf{z}\vert\mathbf{x}))=\int q_{\phi}(\mathbf{z}\vert\mathbf{x}) \log{\frac{q_{\phi}(\mathbf{z}\vert\mathbf{x})}{p_{\theta}(\mathbf{z}\vert\mathbf{x})}}d\mathbf{z} $$

which measures how different two probability distributions are. For the following, we'll need two properties of $D_{KL}$ (see the quick numerical check below):

1. $D_{KL}(q\vert\vert p)\ge0 \ \forall p,q$
2. $D_{KL}(q\vert\vert p) = 0 \iff p = q \ \text{a.e.}$

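As a quick numerical check of these two properties (not part of the original derivation), here is the closed-form KL divergence between two univariate Gaussians compared against a Monte Carlo estimate; it is non-negative, vanishes when the two distributions coincide, and is not symmetric in its arguments:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_gauss(mu_q, sig_q, mu_p, sig_p):
    """Closed-form KL( N(mu_q, sig_q^2) || N(mu_p, sig_p^2) )."""
    return np.log(sig_p / sig_q) + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5

def kl_mc(mu_q, sig_q, mu_p, sig_p, n=200_000):
    """Monte Carlo estimate: E_q[ log q(z) - log p(z) ] over samples z ~ q."""
    z = mu_q + sig_q * rng.standard_normal(n)
    log_q = -0.5 * np.log(2 * np.pi * sig_q**2) - (z - mu_q)**2 / (2 * sig_q**2)
    log_p = -0.5 * np.log(2 * np.pi * sig_p**2) - (z - mu_p)**2 / (2 * sig_p**2)
    return np.mean(log_q - log_p)

print(kl_gauss(1.0, 0.5, 0.0, 1.0), kl_mc(1.0, 0.5, 0.0, 1.0))  # both ~0.82: non-negative
print(kl_gauss(0.0, 1.0, 0.0, 1.0))                             # 0.0 exactly when q == p
print(kl_gauss(0.0, 1.0, 1.0, 0.5))                             # ~2.81: KL is not symmetric
```
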
#### Some ELBO grease

However, our _primary_ goal is to estimate the generative model parameters $\theta^*$ through Maximum (Marginal) Likelihood Estimation. Thus, let's try to rewrite the marginal log-likelihood of a data point in terms of $D_{KL}(q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\vert\vert p_{\theta}(\mathbf{z}\vert\mathbf{x}_i))$:

$$ \begin{multline}\log{p_{\theta}(\mathbf{x}_i)} = \log{p_{\theta}(\mathbf{x}_i)}\int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)d\mathbf{z} = \int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\log{p_{\theta}(\mathbf{x}_i)} d\mathbf{z} = \\ \int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\log{\frac{p_{\theta}(\mathbf{x}_i, \mathbf{z})}{p_{\theta}(\mathbf{z}\vert\mathbf{x}_i)}} d\mathbf{z} = \int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i) \log{\frac{p_{\theta}(\mathbf{x}_i, \mathbf{z})}{p_{\theta}(\mathbf{z}\vert\mathbf{x}_i)}\frac{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}} d\mathbf{z} \end{multline}$$

where in the second-to-last equality we used Bayes' rule, and the last equality holds if $\supp {p_{\theta}(\mathbf{x}_i, \mathbf{z})}\subset\supp{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}$ (otherwise it becomes an inequality). We can now see a term similar to $D_{KL}(q_{\phi}(\mathbf{z}\vert\mathbf{x})\vert\vert p_{\theta}(\mathbf{z}\vert\mathbf{x}))$ inside the integral, so let's pop it out:

$$ \begin{multline}\log{p_{\theta}(\mathbf{x}_i)} = \int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i) \log{\frac{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}{p_{\theta}(\mathbf{z}\vert\mathbf{x}_i)}\frac{p_{\theta}(\mathbf{x}_i, \mathbf{z})}{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}}d\mathbf{z} = \\ \int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i) \log{\frac{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}{p_{\theta}(\mathbf{z}\vert\mathbf{x}_i)}}d\mathbf{z}+\int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\log{\frac{p_{\theta}(\mathbf{x}_i, \mathbf{z})}{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}} d\mathbf{z}=\\ D_{KL}(q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\vert\vert p_{\theta}(\mathbf{z}\vert\mathbf{x}_i))+ \mathcal{L}(\phi,\theta;\mathbf{x}_i) \end{multline}$$

Summarizing:

$$ \log{p_{\theta}(\mathbf{x}_i)} = D_{KL}(q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\vert\vert p_{\theta}(\mathbf{z}\vert\mathbf{x}_i))+ \mathcal{L}(\phi,\theta;\mathbf{x}_i) \label{b}\tag{2}$$

The term $\mathcal{L}(\phi,\theta;\mathbf{x}_i)$ is called the _**E**vidence_ (or _variational_) _**L**ower **BO**und_ (_**ELBO**_ for friends), because it's always _no greater than_ the log marginal likelihood (or log-_evidence_) of datapoint $\mathbf{x}_i$, since $D_{KL}(q\vert\vert p)\ge0$:

$$ \log{p_{\theta}(\mathbf{x}_i)} \geq \mathcal{L}(\phi,\theta;\mathbf{x}_i) $$

Thus, maximizing the ELBO goes in the direction of maximizing the marginal likelihood, our original goal. Crucially, the ELBO doesn't contain the pesky marginal evidence, which is what made the problem intractable. Dropping the $D_{KL}(q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\vert\vert p_{\theta}(\mathbf{z}\vert\mathbf{x}_i))$ term and summing over all data points, the learning objective of **BBVI** (Black Box Variational Inference) becomes

$$ \max_{\theta}\sum_{i=1}^N\max_{\phi}\mathcal{L}(\phi,\theta;\mathbf{x}_i) \label{c}\tag{3}$$

Summarizing, we learn an LVM by maximizing the ELBO with respect to both the model parameters $\theta$ and the variational parameters $\phi_i$ for each data point $\mathbf{x}_i$.

**Note** that until now we haven't mentioned either neural networks or VAEs. The approach has been very general, and it could apply to any Latent Variable Model which has the DAG representation shown above.

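To make Eq. $\ref{b}$ less abstract, here is a small numerical sanity check on a toy model where everything is tractable (a hypothetical univariate linear-Gaussian model, not from the paper): with $p(z)=\mathcal{N}(0,1)$ and $p(x\vert z)=\mathcal{N}(z,1)$, the marginal is $p(x)=\mathcal{N}(0,2)$ and the posterior is $p(z\vert x)=\mathcal{N}(x/2,1/2)$, so we can verify that the log evidence equals the KL term plus the ELBO for an arbitrary Gaussian $q$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: p(z) = N(0, 1), p(x|z) = N(z, 1)  =>  p(x) = N(0, 2), p(z|x) = N(x/2, 1/2).
x = 1.3                  # an arbitrary "observed" datapoint
mu_q, sig_q = 0.4, 0.8   # an arbitrary Gaussian approximation q(z|x) = N(mu_q, sig_q^2)

def log_normal(v, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (v - mean) ** 2 / (2 * var)

log_px = log_normal(x, 0.0, 2.0)                  # exact log marginal likelihood

# Monte Carlo ELBO: E_q[ log p(x, z) - log q(z) ], using samples z ~ q.
z = mu_q + sig_q * rng.standard_normal(500_000)
elbo = np.mean(log_normal(z, 0.0, 1.0)            # log p(z)
               + log_normal(x, z, 1.0)            # log p(x|z)
               - log_normal(z, mu_q, sig_q**2))   # log q(z)

# Closed-form KL(q || p(z|x)) between two univariate Gaussians.
mu_p, var_p = x / 2, 0.5
kl = 0.5 * np.log(var_p / sig_q**2) + (sig_q**2 + (mu_q - mu_p)**2) / (2 * var_p) - 0.5

print(log_px, elbo + kl)   # the two agree (up to Monte Carlo noise), as Eq. (2) says
print(elbo <= log_px)      # True: the ELBO is indeed a lower bound
```
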
The ELBO can be written as

$$\int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\log{\frac{p_{\theta}(\mathbf{x}_i, \mathbf{z})}{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}} d\mathbf{z}=\mathbb{E}_{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}[\log{p_{\theta}(\mathbf{x}_i, \mathbf{z})}-\log{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}] \label{d}\tag{4}$$

or alternatively:

$$\begin{multline} \int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\log{\frac{p_{\theta}(\mathbf{x}_i, \mathbf{z})}{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}} d\mathbf{z}=\int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\log{\frac{p_{\theta}(\mathbf{x}_i| \mathbf{z})p_{\theta}(\mathbf{z})}{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}} d\mathbf{z} = \\ \int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\log{p_{\theta}(\mathbf{x}_i\vert\mathbf{z})}d\mathbf{z} - \int q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\log{\frac{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}{p_{\theta}(\mathbf{z})}} d\mathbf{z} =\\ \mathbb{E}_{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}[\log{p_{\theta}(\mathbf{x}_i\vert\mathbf{z})}] - D_{KL}(q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\vert\vert p_{\theta}(\mathbf{z})) \end{multline} \label{e}\tag{5}$$

This last form lends itself to some nice interpretations, as we'll see later[^3].

#### Bend your ELBO

We list here some useful properties of the ELBO. Not all of them will be needed in the following, but we list them anyway as a useful companion for reading other tutorials or papers on VAEs.

- for each data point $\mathbf{x}_i$, the true posterior distribution $p_{\theta}(\mathbf{z}\vert\mathbf{x}_i)$ may or may not belong to $\mathcal{Q}$. If it does, then property 2 of $D_{KL}$ implies that $\mathcal{L}(\phi_i,\theta;\mathbf{x}_i)$ is maximized for the $\phi_i=\phi_i^*$ such that $q_{\phi_i^*}(\mathbf{z}\vert\mathbf{x}_i)=p_{\theta}(\mathbf{z}\vert\mathbf{x}_i)$, and maximizing the ELBO is equivalent to maximizing the marginal likelihood. If it doesn't, then, even for the optimal value $\phi^*_i$, there will be a nonzero difference between the ELBO and the actual marginal likelihood of the data point, equal to $D_{KL}(q_{\phi^*_i}(\mathbf{z}\vert \mathbf{x}_i)\vert\vert p_{\theta}(\mathbf{z}\vert \mathbf{x}_i))$.

  ![](https://i.imgur.com/q6XMxtY.png)
  ###### adapted from https://deepgenerativemodels.github.io/notes/vae/

  This residual error is called the _approximation gap_ ([Cremer et al., 2018](https://arxiv.org/abs/1801.03558)) and it can only be reduced by expanding the variational family $\mathcal{Q}$ ([Kingma et al., 2016](https://arxiv.org/abs/1606.04934), [Tomczak & Welling, 2017](https://arxiv.org/abs/1706.02326) and many others). However, overly flexible inference models can also have side effects ([Shu et al., 2018](https://arxiv.org/abs/1805.08913)).

- combining Eq.$\ref{c}$ and Eq.$\ref{e}$, we see that maximizing the term $\sum_{i=1}^N\mathbb{E}_{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}[\log{p_{\theta}(\mathbf{x}_i\vert\mathbf{z})}]$ maximizes the probability that, under repeated sampling from $\mathbf{z}\sim q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)$, we generate samples $\mathbf{x}$ which are similar to the training samples $\mathbf{x}_i$. For this reason, it's often interpreted as a _reconstruction quality_ (and its negative as a reconstruction loss). The term $-D_{KL}(q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\vert\vert p_{\theta}(\mathbf{z}))$ penalizes the flexible $q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)$ for being too dissimilar from the prior $p_{\theta}(\mathbf{z})$. In other words, it's a _regularizer_.

Thus the ELBO can be interpreted as the sum of a reconstruction error term and a regularizer term. This is legit, but let's not forget that the goal of VI is still to **maximize the marginal likelihood of the training data**.

### The SGVB estimator and the AEVB algorithm

By introducing BBVI we solved the intractability problem, but we still need a way to maximize the BBVI objective in a scalable way. To this end, we introduce a stochastic gradient-based estimator of the ELBO and of its gradient with respect to $\theta$ and $\phi$: the Stochastic Gradient Variational Bayes (SGVB) estimator. We want to use stochastic gradient ascent, thus we need the gradient of Eq.$\ref{d}$:

$$ \nabla_{\theta,\phi}\mathbb{E}_{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}[\log{p_{\theta}(\mathbf{x}_i, \mathbf{z})}-\log{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}] $$

The gradient with respect to $\theta$ is immediate:

$$ \mathbb{E}_{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}[\nabla_{\theta}\log{p_{\theta}(\mathbf{x}_i, \mathbf{z})}] $$

and we can estimate the expectation using Monte Carlo. The gradient with respect to $\phi$ is more badass: since the expectation and the gradient are both taken with respect to $q_{\phi}$, we cannot simply swap them. As shown in [Mnih & Gregor, 2014](https://www.cs.toronto.edu/~amnih/papers/nvil.pdf), we can prove that

$$ \nabla_{\phi}\mathbb{E}_{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}[\log{p_{\theta}(\mathbf{x}_i, \mathbf{z})}-\log{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}] = \mathbb{E}_{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}[(\log{p_{\theta}(\mathbf{x}_i, \mathbf{z})}-\log{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)})\nabla_{\phi}\log{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}]$$

Now that the gradient is inside the expectation, we could again use Monte Carlo to estimate it. However, the resulting estimator, called the _score function_ estimator, has high variance. The key contribution of Kingma and Welling, 2014, is an alternative estimator which is much better behaved: the SGVB estimator, based on the _reparametrization trick_. For many differentiable parametric families, it's possible to draw samples of the random variable $\tilde{\mathbf{z}}\sim q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)$ with a two-step generative process:

- generate samples $\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})$ from a simple distribution $p(\boldsymbol{\epsilon})$, such as $\mathcal{N}(0,I)$
- apply a differentiable, deterministic function $g_{\phi}(\boldsymbol{\epsilon}, \mathbf{x})$ to the random noise $\boldsymbol{\epsilon}$. The resulting random variable $\tilde{\mathbf{z}} = g_{\phi}(\boldsymbol{\epsilon}, \mathbf{x})$ is indeed distributed as $q_{\phi}(\mathbf{z}\vert\mathbf{x})$.

A classic example is the univariate Gaussian. Let $q_{\phi}(z\vert x) = q_{\mu,\sigma}(z)=\mathcal{N}(\mu,\sigma^2)$. Then, if $\epsilon\sim \mathcal{N}(0,1)$ and $g_{\phi}(\epsilon)=\mu+\sigma\epsilon$, we know that $z=g_{\phi}(\epsilon)\sim\mathcal{N}(\mu,\sigma^2)$, as desired. There are many other families which can be similarly reparametrized. Another way to see the relevance of the reparametrization trick is to note that, by moving the random node outside the part of the computational graph containing the variational parameters, it allows us to backpropagate through the graph.

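To see the variance difference concretely, here is a small NumPy sketch (illustrative only, with an arbitrary test function $f(z)=z^2$): both estimators target $\nabla_{\mu}\mathbb{E}_{z\sim\mathcal{N}(\mu,\sigma^2)}[z^2]=2\mu$, but the reparametrized one gets there with far fewer samples.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 1.5, 1.0
n = 10_000

# Target: d/dmu E_{z ~ N(mu, sigma^2)}[z^2] = 2 * mu  (here 3.0).

# Score-function (REINFORCE) estimator: f(z) * d/dmu log q(z) = z^2 * (z - mu) / sigma^2
z = mu + sigma * rng.standard_normal(n)
score_grads = z**2 * (z - mu) / sigma**2

# Reparametrized estimator: z = mu + sigma * eps, so d/dmu f(z) = 2 * (mu + sigma * eps)
eps = rng.standard_normal(n)
reparam_grads = 2 * (mu + sigma * eps)

for name, g in [("score function", score_grads), ("reparametrization", reparam_grads)]:
    print(f"{name:>18}: mean {g.mean():.3f}, variance {g.var():.1f}")
# Both means are close to 2*mu = 3.0, but the score-function variance is roughly
# ten times larger for this toy problem.
```

Both estimators are unbiased; the difference is purely in variance, which is what makes the reparametrized one practical for training.
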
This slide from a presentation by Durk Kingma illustrates the computational-graph view of the trick:

<p align="center"> <img width="600" src="https://i.imgur.com/xvH1onJ.png"> </p>

The biggest selling point of the reparametrization trick is that we can now write the gradient of the expectation with respect to $q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)$ of any function $f(\mathbf{z})$ as

$$ \nabla_{\phi}\mathbb{E}_{q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)}[f(\mathbf{z})]=\nabla_{\phi}\mathbb{E}_{p(\boldsymbol{\epsilon})}[f(g_{\phi}(\boldsymbol{\epsilon},\mathbf{x}_i))]=\mathbb{E}_{p(\boldsymbol{\epsilon})}[\nabla_{\phi}f(g_{\phi}(\boldsymbol{\epsilon},\mathbf{x}_i))]$$

Using Monte Carlo to estimate this expectation we obtain an estimator (the SGVB estimator) which has lower variance than the score function estimator[^4], allowing us to learn more complicated models. The reparametrization trick was later extended to discrete variables ([Maddison et al., 2016](https://arxiv.org/abs/1611.00712), [Jang et al., 2016](https://arxiv.org/abs/1611.01144)), which allowed training VAEs with discrete latent variables. None of these works, though, closed the performance gap with VAEs based on continuous latent variables. This gap has recently been closed by [van den Oord et al., 2017](https://arxiv.org/abs/1711.00937), with their famous VQ-VAE.

SGVB allows us to estimate the ELBO _for a single datapoint_, but we need to estimate it for all $N$. To do this, we use minibatches of size $M$: the gradients are computed with automatic differentiation, and the parameter values are updated with some gradient ascent method, such as SGD, RMSProp or Adam. The combination of the SGVB estimator with minibatch stochastic gradient ascent on $\phi,\theta$ is the famous Auto-Encoding Variational Bayes (AEVB) algorithm, which gives the Kingma & Welling paper its title.

![](https://i.imgur.com/15RcirL.png)

## Amortized Variational Inference

At this point, you may feel cheated: I haven't mentioned VAEs even once (except in the click-baity introduction). This was done on purpose, to show that the AEVB algorithm is much more general than Variational Autoencoders, which are simply an instantiation of it. This gives you the possibility to use AEVB to learn more general models than the original VAE. For example, now you know:

- that $q_{\phi}(\mathbf{z})$ doesn't have to be in the Gaussian family
- that you could in principle learn a _different_ $q_{\phi}(\mathbf{z})$ for each datapoint (that's why we condition on $\mathbf{x}_i$)
- why the reparametrization trick is so great (it reduces the variance of the estimate of the gradient of the ELBO with respect to $\phi$)
- why just creating a bigger & badder encoder-decoder NN won't, by itself, reduce the approximation gap

Contrast that with some introductions which simply start from Eq.($\ref{e}$) and then dive straight into the architecture of the encoder & decoder.

Until now, we haven't specified the form of $p_{\theta}(\mathbf{z})$, $p_{\theta}(\mathbf{x}_i\vert\mathbf{z})$, and $q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)$. Also, we saw that we're learning a different variational (approximate) posterior $q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)$ for each datapoint. We could do that by using nonparametric density estimation, but of course we wouldn't expect that to scale well to large datasets. It would probably be smarter to **learn** a mapping from the data points to the optimal variational parameters, $f_{\phi}:\mathcal{X}\to\mathcal{Q}$, for each value of $\theta$.

Now, which tool do we know that can efficiently learn complicated, nonlinear mappings between high-dimensional spaces? Neural networks, of course! Admittedly, if at each $\phi$ optimization step we had to retrain the neural network over the whole data set, we wouldn't have saved any computation: the effort would still be proportional to the size of the dataset. But in practice we interleave the optimization of $\theta$ and of $\phi$ over each minibatch. This way, by introducing neural networks we have _amortized_ the cost of variational inference ($q_{\phi}(\mathbf{z}\vert\mathbf{x}_1),\dots,q_{\phi}(\mathbf{z}\vert\mathbf{x}_N)$).

If we use neural networks also to parametrize $p_{\theta}(\mathbf{z})$ and $p_{\theta}(\mathbf{x}_i\vert\mathbf{z})$, the resulting model is called the **Variational Autoencoder**. In the Kingma & Welling paper, the following choices were made for the case of real data (a toy code sketch follows below):

- $p_{\theta}(\mathbf{z})=\mathcal{N}(0,I)$ (thus the prior has no parameters)
- $p_{\theta}(\mathbf{x}\vert\mathbf{z})=\mathcal{N}(\mathbf{x};\mu_{\theta}(\mathbf{z}),\boldsymbol{\sigma}^2_{\theta}(\mathbf{z})\odot I)$, where $\mu_{\theta}(\mathbf{z})$ and $\boldsymbol{\sigma}_{\theta}^2(\mathbf{z})$ are two single-hidden-layer MLPs with 500 units each in the original paper. For a latent sample $\mathbf{z}$, this neural network _decodes_ the parameters of $p_{\theta}(\mathbf{x}\vert \mathbf{z})$ which give optimal input reconstruction. For this reason, it's called the _decoder_[^5].
- $q_{\phi}(\mathbf{z}\vert\mathbf{x})=\mathcal{N}(\mathbf{z};\mu_{\phi}(\mathbf{x}),\boldsymbol{\sigma}^2_{\phi}(\mathbf{x})\odot I)$, so basically the same kind of neural network as the decoder. For an input sample $\mathbf{x}$, this neural network outputs the variational parameters which correspond to an optimal _latent code_ $\mathbf{z}$, thus we call it the _encoder_.

![](https://i.imgur.com/YC1zAsj.png)

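Putting the pieces together, here is a minimal PyTorch sketch of this architecture (an illustrative toy version with hypothetical layer sizes, not the authors' code): the encoder outputs $\mu_{\phi}(\mathbf{x})$ and $\log\boldsymbol{\sigma}^2_{\phi}(\mathbf{x})$, a latent code is drawn with the reparametrization trick, and the decoder outputs the parameters of $p_{\theta}(\mathbf{x}\vert\mathbf{z})$:

```python
import torch
from torch import nn

class ToyVAE(nn.Module):
    """A minimal Gaussian-prior / Gaussian-posterior VAE (illustrative sizes)."""

    def __init__(self, d_x=784, d_hidden=500, d_z=20):
        super().__init__()
        # Encoder: x -> (mu_phi(x), log sigma^2_phi(x))
        self.enc = nn.Sequential(nn.Linear(d_x, d_hidden), nn.Tanh())
        self.enc_mu = nn.Linear(d_hidden, d_z)
        self.enc_logvar = nn.Linear(d_hidden, d_z)
        # Decoder: z -> mean of p(x | z) (a fixed-variance Gaussian, for brevity)
        self.dec = nn.Sequential(nn.Linear(d_z, d_hidden), nn.Tanh(),
                                 nn.Linear(d_hidden, d_x))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparametrization trick: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def neg_elbo(x, x_mean, mu, logvar):
    # Reconstruction term: -E_q[log p(x|z)] for a unit-variance Gaussian likelihood,
    # estimated with a single Monte Carlo sample (additive constants dropped).
    rec = 0.5 * ((x - x_mean) ** 2).sum(dim=1)
    # KL(q(z|x) || N(0, I)) in closed form: the "regularizer" term.
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
    return (rec + kl).mean()

vae = ToyVAE()
x = torch.rand(32, 784)          # a fake minibatch standing in for MNIST images
loss = neg_elbo(x, *vae(x))
loss.backward()                  # gradients flow thanks to the reparametrization trick
```

In the original paper the decoder also outputs a per-dimension variance $\boldsymbol{\sigma}^2_{\theta}(\mathbf{z})$; it is fixed to one here only to keep the sketch short.
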
#### Learning the VAE

The weights of both neural networks are learnt at the same time using AEVB. Note that with this simple choice of $p_{\theta}(\mathbf{z})$ and $q_{\phi}(\mathbf{z}\vert\mathbf{x})$, the term $D_{KL}(q_{\phi}(\mathbf{z}\vert\mathbf{x}_i)\vert\vert p_{\theta}(\mathbf{z}))$ (the regularization term) has an analytical expression, thus the Monte Carlo estimate is only needed for the reconstruction term and its gradient.

#### Generating samples

At inference time, we sample a latent code $\mathbf{z}\sim\mathcal{N}(0,I)$ and propagate it through the decoder; the encoder is not used anymore.

### Experimental results

The samples generated by the original VAE on MNIST have the classic, blurry "VAE" look:

![](https://i.imgur.com/40YNAg0.png)
###### from [Kingma & Welling, 2014](https://arxiv.org/pdf/1312.6114.pdf)

More recent results, obtained by training a deep hierarchical Variational Autoencoder ([Maaløe et al., 2019](https://arxiv.org/abs/1902.02102)) on CelebA, are much better:

![](https://i.imgur.com/wk54TIH.png)

We're still a far cry from GANs, but at least the performance of autoregressive models is now matched. The Two-Stage VAE ([Dai & Wipf, 2019](https://arxiv.org/abs/1903.05789)) is even better:

![](https://i.imgur.com/NfZ8yJV.png)

This matches the quality of smaller GAN models, even though it's not as good as models like StyleGAN or BigGAN. The SenVAE ([Pelsmaeker & Aziz, 2019](https://arxiv.org/abs/1904.08194)) is a first step towards a usable VAE language model:

![](https://i.imgur.com/FlXIWtO.png)

### Wrapping up

VAEs are deep latent variable models which can learn complex input distributions.

The training algorithm is a specific instantiation of a more general algorithm, AEVB, which is a "Swiss Army knife" for learning models with intractable likelihoods on large datasets. The (vanilla) VAE is an instantiation of AEVB where all the distributions involved are parametrized as multivariate Gaussians. Compared to other generative models (GANs in particular), the VAE enjoys stable training and an interpretable encoder/inference network, but lower sample quality. However, the latest VAE models reduce the quality gap considerably. And they rock! :tada:

[^1]: of course, none of these are the real reasons why people use VAEs. VAEs are cool because they are one of the very few cases in Deep Learning where theory actually informs architecture design (the only other example which comes to my mind is [gauge equivariant neural networks](https://arxiv.org/pdf/1902.04615.pdf)).
[^2]: if this is the first time you've met the term "intractable" and you're anything like me, you'll want to know what it actually means (roughly, that the expression takes exponential time to compute) and why _exactly_ this integral should be intractable. Read section 2.1 of [this great paper](https://arxiv.org/pdf/1601.00670.pdf) for the proof that even the marginal likelihood of a simple LVM such as a Gaussian mixture is intractable.
[^3]: If at this point you feel a crushing weight on your soul, that's fine: it's just the normal reaction to seeing the derivation of the VAE objective function for the first time. If it's any consolation, just think that in their original paper Kingma & Welling casually introduce the ELBO in the first equation after $\ref{a}$, and summarize the whole VI part of the VAE model in a very short paragraph. Such are the heights to which great minds soar.
[^4]: as proved in the appendix of [Rezende et al., 2014](https://arxiv.org/abs/1401.4082)
[^5]: don't be fooled by the apparent simplicity of $p_{\theta}(\mathbf{z})$ and $p_{\theta}(\mathbf{x}\vert\mathbf{z})$. Even though they're "simple" multivariate Gaussians (with a diagonal covariance matrix, even!), $p_{\theta}(\mathbf{x}) = \int p_{\theta}(\mathbf{x}\vert\mathbf{z})p_{\theta}(\mathbf{z})d\mathbf{z}$ is a mixture of _an infinite number of multivariate Gaussians_, each with its own mean and variance vectors. Thus it's a much more complicated and flexible distribution, with an essentially arbitrary number of modes, which can model data distributions as complicated as that of celebrity faces.
