# Variational AutoEncoder

###### tags:`Bayesian` `Variational Inference` `Auto-Encoder`

### --- Summarised by Cheng-Yu Hung ---

## Reference

- [Diederik P. Kingma and Max Welling (2019), “An Introduction to Variational Autoencoders”](https://arxiv.org/pdf/1906.02691.pdf)
- [The variational auto-encoder](https://ermongroup.github.io/cs228-notes/extras/vae/)

## Probabilistic Models

We assume the observed variable $x$ is a random sample from an **unknown underlying process**, whose true distribution $p^*(x)$ is unknown. We attempt to approximate this underlying process with a chosen model $p_\theta(x)$, with parameters $\theta$:
$$x\sim p_\theta(x)$$
Learning is, most commonly, the process of searching for a value of the parameters $\theta$ such that the probability distribution function given by the model, $p_\theta(x)$, approximates the true distribution of the data, denoted by $p^*(x)$, such that for any observed $x$:
$$p_\theta(x)\approx p^*(x)$$
Naturally, we wish $p_\theta(x)$ to be sufficiently **flexible** to adapt to the data, so that we have a chance of obtaining a sufficiently accurate model. At the same time, we wish to be able to incorporate knowledge about the distribution of the data that is known a priori into the model.

## Deep Latent Variable Models (DLVM)

*[DLVM]: Deep Latent Variable Models
*[DLVMs]: Deep Latent Variable Models

Consider a directed latent-variable model of the form
$$p_\theta(x,z) = p_\theta(x|z)p_\theta(z)$$
with observed data $x\in \mathcal{X}$, where $\mathcal{X}$ can be continuous or discrete, and latent $z\in \mathbb{R}^k$.

> The distribution $p_\theta(z)$ is often called the **prior distribution** over $z$, since it is not conditioned on any observation.

To make this clear, you may think of $x$ as an image of a face, and $z$ as latent factors that explain its features. For example, one coordinate of $z$ can encode whether the face is male or female.

### Intractabilities

The main difficulty of maximum likelihood learning in DLVMs is that the marginal probability of data under the model is typically intractable. This is due to the integral in the equation below:
$$p_\theta(x) = \int p_\theta(x,z)dz = \int p_\theta(x|z)p_\theta(z)dz$$
which has no analytic solution or efficient estimator.
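To make this concrete, here is a small numerical sketch (a toy construction of ours, not from the references). We pick a model $z\sim\mathcal{N}(0,1)$, $x|z\sim\mathcal{N}(z,\sigma^2)$ whose marginal happens to be tractable, so we can see how badly the naive Monte Carlo estimator $p_\theta(x)\approx \frac{1}{N}\sum_i p_\theta(x|z_i)$ with $z_i\sim p_\theta(z)$ behaves even in one dimension:

```python
import numpy as np

rng = np.random.default_rng(0)

sigma = 0.5    # likelihood std: x | z ~ N(z, sigma^2)
x_obs = 3.0    # an observation in the tail of the prior

# Analytic marginal, available only because this toy model is conjugate:
# p(x) = N(x; 0, 1 + sigma^2)
var = 1.0 + sigma**2
true_px = np.exp(-x_obs**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Naive Monte Carlo: p(x) = E_{p(z)}[p(x|z)] ~ mean of p(x|z_i), z_i ~ p(z)
for n in [10, 1_000, 100_000]:
    z = rng.standard_normal(n)  # z_i ~ p(z) = N(0, 1)
    px_given_z = np.exp(-(x_obs - z) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    print(n, px_given_z.mean(), "vs true", true_px)
```

The estimator is unbiased, but almost all prior samples land where $p(x|z)\approx 0$, so the estimate is dominated by rare lucky draws; in higher-dimensional latent spaces this mismatch grows exponentially worse.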
### Big Data

The datasets $\mathcal{D}$ we consider are too large for traditional approximate inference techniques such as MCMC.

## Evidence Lower Bound (ELBO)

*[ELBO]: Evidence Lower Bound

Let $p_\theta(x,z)$ be a latent-variable model with observed variables $x$ and latent variables $z$. We posit a family of approximate densities $\{q_\phi(z|x)\}$. This is a set of densities over the latent variables; each $q_\phi(z|x)$ is a candidate approximation to the exact conditional. Our goal is to find the best candidate, the one closest in $KL$ divergence to the exact conditional:
$$\underset{\phi}{\arg\min}\;D_{KL}(q_\phi(z|x)\Vert p_\theta(z|x))$$
However, this objective is not computable because it requires computing the evidence $\log p_\theta(x)$. (That the evidence is hard to compute is why we appeal to approximate inference in the first place.)

To see why, recall that the $KL$ divergence is
$$D_{KL}(q_\phi(z|x)\Vert p_\theta(z|x))=\mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x)]- \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(z|x)]$$
Expanding the conditional as $p_\theta(z|x)=p_\theta(x,z)/p_\theta(x)$,
$$\begin{align} D_{KL}(q_\phi(z|x)\Vert p_\theta(z|x))&=\mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x)]- \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x,z)] + \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x)]\\ &= \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x)]- \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x,z)] +\log p_\theta(x) \end{align}$$
This reveals its dependence on $\log p_\theta(x)$. Because we cannot compute the $KL$ divergence, we optimize an alternative objective that is equivalent to the $KL$ divergence up to an added constant:
$$\mathcal{L}_{\theta,\phi}(x)= \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x,z)] - \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x)]$$
This function is called the **evidence lower bound** (ELBO), and it satisfies the equation
$$\log p_\theta(x) = \mathcal{L}_{\theta,\phi}(x) + D_{KL}(q_\phi(z|x)\Vert p_\theta(z|x))$$
Due to the non-negativity of the $KL$ divergence, the ELBO is a lower bound on the log-likelihood of the data:
$$\begin{align} \tag{1}\mathcal{L}_{\theta,\phi}(x)&= \log p_\theta(x)-D_{KL}(q_\phi(z|x)\Vert p_\theta(z|x))\\ &\leq \log p_\theta(x) \end{align}$$
So, interestingly, the $KL$ divergence $D_{KL}(q_\phi(z|x)\Vert p_\theta(z|x))$ determines two 'distances':

1. By definition, the $KL$ divergence of the approximate posterior from the true posterior;
2. The gap between the ELBO $\mathcal{L}_{\theta,\phi}(x)$ and the marginal likelihood $\log p_\theta(x)$; this is also called the **tightness** of the bound.

The better $q_\phi(z|x)$ approximates the true posterior $p_\theta(z|x)$, in terms of the $KL$ divergence, the smaller the gap.
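The identity $\log p_\theta(x) = \mathcal{L}_{\theta,\phi}(x) + D_{KL}(q_\phi(z|x)\Vert p_\theta(z|x))$ is easy to verify numerically on a toy model where everything is computable. The sketch below (with made-up numbers of ours) uses a binary latent variable and a single fixed observation:

```python
import numpy as np

# A tiny model with binary latent z, for one fixed observation x.
prior = np.array([0.5, 0.5])   # p(z)
like = np.array([0.8, 0.1])    # p(x | z), evaluated at the observed x

joint = prior * like           # p(x, z)
px = joint.sum()               # evidence p(x)
posterior = joint / px         # exact p(z | x)

q = np.array([0.6, 0.4])       # an arbitrary approximate posterior

elbo = np.sum(q * (np.log(joint) - np.log(q)))
kl = np.sum(q * (np.log(q) - np.log(posterior)))

print(np.log(px))              # evidence
print(elbo + kl)               # identical: log p(x) = ELBO + KL(q || p(z|x))
print(elbo <= np.log(px))      # True: the ELBO is a lower bound
```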
### A Double-Edged Sword

Looking at equation $(1)$, we can see that maximization of the ELBO $\mathcal{L}_{\theta,\phi}(x)$ w.r.t. the parameters $\theta$ and $\phi$ will concurrently optimize the two things we care about:

1. It will approximately maximize the marginal likelihood $p_\theta(x)$. This means that our generative model will become better.
2. It will minimize the $KL$ divergence of the approximation $q_\phi(z|x)$ from the true posterior $p_\theta(z|x)$, so $q_\phi(z|x)$ becomes better.

### Black-box variational inference

The first important idea is a general-purpose approach for optimizing $q_\phi$ that works for large classes of $q_\phi$ (more complex than in mean field). We later combine this algorithm with specific choices of $q_\phi$. This approach, called *black-box variational inference*[^1], consists of maximizing the ELBO using *gradient descent* over $\phi$ (instead of, e.g., a coordinate descent algorithm like mean field). Hence, it only assumes that $q_\phi$ is differentiable in its parameters $\phi$.

## Stochastic Gradient-Based Optimization of the ELBO

An important property of the ELBO is that it allows joint optimization w.r.t. all parameters ($\theta$ and $\phi$) using stochastic gradient descent (SGD). We can start out with random initial values of $\phi$ and $\theta$, and stochastically optimize their values until convergence. Given a dataset of i.i.d. data, the ELBO objective is the sum (or average) of individual-datapoint ELBOs:
$$\mathcal{L}_{\theta,\phi}(\mathcal{D})=\sum_{x\in \mathcal{D}}\mathcal{L}_{\theta,\phi}(x)$$
The individual-datapoint ELBO, and its gradient $\nabla_{\theta,\phi}\mathcal{L}_{\theta,\phi}(x)$, is, in general, intractable. However, good unbiased estimators $\tilde{\nabla}_{\theta,\phi}\mathcal{L}_{\theta,\phi}(x)$ exist, as we will show, such that we can still perform minibatch SGD.

Unbiased gradients of the ELBO w.r.t. $\theta$ are simple to obtain:
$$\begin{align} \nabla_{\theta}\mathcal{L}_{\theta,\phi}(x)&= \nabla_{\theta}\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x,z)-\log q_\phi(z|x)]\\ \tag{2}&= \mathbb{E}_{q_\phi(z|x)}[\nabla_{\theta}(\log p_\theta(x,z)-\log q_\phi(z|x))]\\ &\simeq \nabla_{\theta}(\log p_\theta(x,z)-\log q_\phi(z|x))\\ \tag{3}&=\nabla_{\theta}(\log p_\theta(x,z)) \end{align}$$
The last line $(3)$ is a simple Monte Carlo estimator of the second line $(2)$, where $z$ in $(3)$ is a random sample from $q_\phi(z|x)$.

However, unbiased gradients w.r.t. $\phi$ are more difficult to obtain, since the ELBO's expectation is taken w.r.t. the distribution $q_\phi(z|x)$, which is a function of $\phi$. I.e., in general:
$$\begin{align} \nabla_{\phi}\mathcal{L}_{\theta,\phi}(x)&= \nabla_{\phi}\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x,z)-\log q_\phi(z|x)]\\ &\neq \mathbb{E}_{q_\phi(z|x)}[\nabla_{\phi}(\log p_\theta(x,z)-\log q_\phi(z|x))] \end{align}$$

### Score Function Estimator

An alternative unbiased stochastic gradient estimator of the ELBO is the **score function estimator** [Kleijnen and Rubinstein, 1996]:
$$\begin{align} \nabla_{\phi}\mathbb{E}_{q_\phi(z|x)}[f(z)]&=\mathbb{E}_{q_\phi(z|x)}[f(z)\nabla_{\phi}\log q_\phi(z|x)]\\ &\simeq f(z)\nabla_{\phi}\log q_\phi(z|x) \end{align}$$
where $z\sim q_\phi(z|x)$. This is also known as the *likelihood-ratio* estimator [Glynn, 1990; Fu, 2006] and the REINFORCE gradient estimator [Williams, 1992]. The method has been successfully used in various methods like *neural variational inference* [Mnih and Gregor, 2014], *black-box variational inference* [Ranganath et al., 2014], *automated variational inference* [Wingate and Weber, 2013], and *variational stochastic search* [Paisley et al., 2012], often in combination with various novel control variate techniques [Glasserman, 2013] for variance reduction.

Unfortunately, the score function estimator has an important shortcoming: it has high variance. Suppose you are using Monte Carlo to estimate some quantity with expected value equal to $1$. If your samples are $0.9, 1.1, 0.96, 1.05,\cdots$ and are close to $1$, then after a few samples you will get a good estimate of the true expectation. If, on the other hand, you sample zero $99$ times out of $100$ and you sample $100$ once, the expectation is still correct, but you will have to take a very large number of samples to figure out that the true expectation is actually one. We refer to the latter case as high variance. This is the kind of problem we often run into with the score function estimator; a small simulation after the list below makes it concrete. In fact, its variance is so high that we cannot use it to learn many models.

The key contribution of the VAE paper is to propose an alternative estimator that is much better behaved. This is done in two steps:

1. we first reformulate the ELBO so that parts of it can be computed in closed form (without Monte Carlo),
2. and then we use an alternative gradient estimator, based on the so-called reparametrization trick.
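Here is that simulation (a toy setup of ours, not from the references): we estimate $\nabla_\phi\,\mathbb{E}_{z\sim\mathcal{N}(\phi,1)}[z^2]$, whose true value is $2\phi$, with the score function estimator and, as a preview of the sections below, with the reparameterized estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: q_phi(z) = N(phi, 1), f(z) = z^2.
# Then E_q[f(z)] = phi^2 + 1, so the true gradient w.r.t. phi is 2*phi.
phi = 2.0
n = 10_000
z = phi + rng.standard_normal(n)   # z ~ q_phi(z)

# Score function (REINFORCE) estimator:
#   grad ~ f(z) * d/dphi log q_phi(z) = z^2 * (z - phi)
score_est = z**2 * (z - phi)

# Reparameterized estimator (previewed here, derived below):
#   z = phi + eps  =>  grad ~ d/dphi f(phi + eps) = 2 * z
reparam_est = 2 * z

print("true gradient:  ", 2 * phi)
print("score function:  mean %.3f, std %.2f" % (score_est.mean(), score_est.std()))
print("reparameterized: mean %.3f, std %.2f" % (reparam_est.mean(), reparam_est.std()))
```

Both estimators are unbiased, but the score-function samples are spread over a range roughly an order of magnitude wider, so far more samples are needed for a gradient estimate of usable quality.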
## The Stochastic Gradient Variational Bayes (SGVB) estimator

The reformulation of the ELBO is as follows:
$$\begin{align} \mathcal{L}_{\theta,\phi}(x)&= \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x,z)] - \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x)]\\ &= \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]+ \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(z)]- \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x)]\\ &=\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]-\mathbb{E}_{q_\phi(z|x)}\bigg[\log \frac{q_\phi(z|x)}{p_\theta(z)}\bigg]\\ &=\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]-D_{KL}(q_\phi(z|x)\Vert p_\theta(z)) \end{align}$$
This reformulation has a very interesting interpretation. First, think of $x$ as an observed data point. The right-hand side consists of two terms; both involve taking a sample $z\sim q_\phi(z|x)$, which we can interpret as a code describing $x$. $q_\phi(z|x)$ is thus called the **encoder**, also known as the **inference model**.

The first term, $\log p_\theta(x|z)$, is the log-likelihood of the observed $x$ given the code $z$ that we have sampled. This term is maximized when $p_\theta(x|z)$ assigns high probability to the original $x$; it is trying to reconstruct $x$ given the code $z$. For that reason we call $p_\theta(x|z)$ the **decoder** network, and the term is called the **reconstruction error**.

The second term is the divergence between $q_\phi(z|x)$ and the prior $p_\theta(z)$, which we will fix to be a unit normal. It encourages the code $z$ to look Gaussian. We call it the **regularization** term. It prevents $q_\phi(z|x)$ from simply encoding an identity mapping, and instead forces it to learn some more interesting representation (e.g. facial features in our first example).

Thus, our optimization objective is trying to fit a $q_\phi(z|x)$ that will map $x$ into a useful latent space $z$ from which we are able to reconstruct $x$ via $p_\theta(x|z)$. This type of objective is reminiscent of **auto-encoder** neural networks.

:::info
An auto-encoder is a pair of neural networks, $f$, $g$, that are composed as $\bar{x} = f(g(x))$. They are trained to minimize the reconstruction error $\Vert \bar{x} -x \Vert$. In practice, $g(x)$ learns to embed $x$ in a latent space that often has an intuitive interpretation.
:::
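For reference, here is a minimal sketch of such a deterministic auto-encoder (the layer sizes and the random stand-in minibatch are arbitrary choices of ours):

```python
import torch
import torch.nn as nn

# A minimal auto-encoder in the sense of the box above: g encodes x into a
# low-dimensional code, f decodes it back, and training minimizes the
# reconstruction error ||f(g(x)) - x||^2.
x_dim, z_dim = 784, 2

g = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, z_dim))  # encoder
f = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))  # decoder

opt = torch.optim.Adam(list(g.parameters()) + list(f.parameters()), lr=1e-3)

x = torch.rand(32, x_dim)         # stand-in for a minibatch of data
opt.zero_grad()
x_bar = f(g(x))                   # reconstruction
loss = ((x_bar - x) ** 2).mean()  # reconstruction error
loss.backward()
opt.step()
```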
## The Reparametrization Trick

As we have seen earlier, optimizing our objective requires a good estimate of the gradient. The main technical contribution of the VAE paper is a low-variance gradient estimator based on the **reparametrization trick**.

![alt text](https://i.imgur.com/kmEZotc.jpg "Figure 1" =500x)
:::success
> Figure from reference [1]

Figure 1: Illustration of the reparameterization trick. The variational parameters $\phi$ affect the objective $f$ through the random variable $z\sim q_\phi(z|x)$. We wish to compute gradients $\nabla_\phi f$ to optimize the objective with SGD. In the original form (left), we cannot differentiate $f$ w.r.t. $\phi$, because we cannot directly backpropagate gradients through the random variable $z$. We can 'externalize' the randomness in $z$ by re-parameterizing the variable as a deterministic and differentiable function of $\phi$, $x$, and a newly introduced random variable $\epsilon$. This allows us to 'backprop through $z$', and compute gradients $\nabla_\phi f$.
:::

Under certain mild conditions, we may express the distribution $q_\phi(z|x)$ as the following two-step generative process.

- First, we sample a noise variable $\epsilon$ from a simple distribution $p(\epsilon)$, like the standard Normal $\mathcal{N}(0,1)$:
$$\epsilon \sim p(\epsilon)$$
- Then we apply a deterministic transformation $g_\phi(\epsilon,x)$ that maps the random noise into a more complex distribution:
$$z = g_\phi(\epsilon,x)$$

For many interesting classes of $q_\phi$, it is possible to choose a $g_\phi(\epsilon,x)$ such that $z = g_\phi(\epsilon,x)$ will be distributed according to $q_\phi(z|x)$.

Gaussian variables provide the simplest example of the reparameterization trick. Instead of writing $z\sim q_{\mu,\sigma}(z) = \mathcal{N}(\mu,\sigma)$, we may write
$$z = g_\phi(\epsilon,x)=\mu+\epsilon\cdot\sigma,$$
where $\epsilon\sim \mathcal{N}(0,1)$. It is easy to check that the two ways of expressing the random variable $z$ lead to the same distribution.

The biggest advantage of this approach is that we may now write the gradient of an expectation with respect to $q_\phi(z)$ (for any $f$) as
$$\begin{align} \nabla_\phi\mathbb{E}_{q_\phi(z|x)}[f(z)] &= \nabla_\phi\mathbb{E}_{p(\epsilon)}[f(g_\phi(\epsilon,x))]\\ &= \mathbb{E}_{p(\epsilon)}[\nabla_\phi f(g_\phi(\epsilon,x))]\\ &\simeq \nabla_{\phi}f(g_\phi(\epsilon,x)) \end{align}$$
where the last line is a simple Monte Carlo estimator. This approach has much lower variance[^2] than the score function estimator, and will enable us to learn models that we otherwise couldn't learn.

## Choosing $q$ and $p$

Until now, we didn't specify the exact form of $q$ or $p$. The best $q(z|x)$ should be able to approximate the true posterior $p(z|x)$. Similarly, $p(x|z)$ should be flexible enough to represent the richness of the data. For these reasons, we are going to parameterize $q$ and $p$ by **neural networks**. These are extremely expressive function approximators that can be efficiently optimized over large datasets. This choice also draws a fascinating bridge between classical machine learning methods (approximate Bayesian inference in this case) and modern deep learning.

But what does it mean to parameterize a distribution with a neural network? Let's assume again that $q(z|x)$ and $p(x|z)$ are Normal distributions; we may write
$$q(z|x) = \mathcal{N}(z;\vec{\mu}(x),diag(\vec{\sigma}(x))^2)$$
where $\vec{\mu}(x)$ and $\vec{\sigma}(x)$ are deterministic vector-valued functions of $x$ parameterized by an arbitrarily complex neural network. More generally, the same technique can be applied to any exponential family distribution by parameterizing the sufficient statistics by a function of $x$.

## The Variational AutoEncoder

We are now ready to define the auto-encoding variational Bayes (AEVB) algorithm and the variational auto-encoder, its most popular instantiation.

The AEVB algorithm is simply the combination of
1. the auto-encoding ELBO reformulation,
2. the black-box variational inference approach,
3. the reparametrization-based low-variance gradient estimator.

It optimizes the auto-encoding ELBO using black-box variational inference with the reparameterized gradient estimator. This algorithm is applicable to any deep generative model $p_\theta$ with latent variables that is differentiable in $\theta$. A variational autoencoder uses the AEVB algorithm to learn a specific model $p$ using a particular encoder $q$. The model $p$ is parameterized as
$$\begin{align} p(x|z) &= \mathcal{N}(x;\vec{\mu}(z),diag(\vec{\sigma}(z))^2),\\ p(z) &= \mathcal{N}(z;0,I) \end{align}$$
where $\vec{\mu}(z)$, $\vec{\sigma}(z)$ are parameterized by a neural network. The model $q$ is similarly parameterized as
$$q(z|x) = \mathcal{N}(z;\vec{\mu}(x),diag(\vec{\sigma}(x))^2).$$
This choice of $p$ and $q$ allows us to further simplify the auto-encoding ELBO. In particular, we can use a closed-form expression to compute the regularization term (see the appendix), and we only use Monte Carlo estimates for the reconstruction term. These expressions are given in the paper.

We may interpret the variational autoencoder as a directed latent-variable probabilistic graphical model. We may also view it as a particular objective for training an auto-encoder neural network; unlike previous approaches, this objective derives the reconstruction and regularization terms from a more principled, Bayesian perspective.
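Putting the pieces together, here is a minimal sketch of the VAE objective (the architectures, sizes, and the unit-variance Gaussian decoder are simplifying assumptions of ours; the closed-form KL term is derived in the appendix):

```python
import torch
import torch.nn as nn

# Encoder outputs (mu, log sigma^2) of q(z|x); a sample is drawn with the
# reparameterization trick; the decoder gives the mean of p(x|z).
x_dim, z_dim, h = 784, 2, 128

enc = nn.Sequential(nn.Linear(x_dim, h), nn.ReLU(), nn.Linear(h, 2 * z_dim))
dec = nn.Sequential(nn.Linear(z_dim, h), nn.ReLU(), nn.Linear(h, x_dim))

def elbo(x):
    mu, logvar = enc(x).chunk(2, dim=-1)    # parameters of q(z|x)
    eps = torch.randn_like(mu)              # eps ~ N(0, I)
    z = mu + eps * torch.exp(0.5 * logvar)  # reparameterized z ~ q(z|x)
    x_mean = dec(z)                         # mean of p(x|z), unit variance assumed
    # Reconstruction term: log N(x; x_mean, I), up to an additive constant
    rec = -0.5 * ((x - x_mean) ** 2).sum(-1)
    # Regularization term: closed-form KL(q(z|x) || N(0, I)), see appendix
    kl = 0.5 * (mu**2 + logvar.exp() - 1 - logvar).sum(-1)
    return (rec - kl).mean()

x = torch.rand(32, x_dim)  # stand-in minibatch
loss = -elbo(x)            # maximizing the ELBO = minimizing -ELBO
loss.backward()            # gradients flow through z thanks to the reparameterization
```

In practice `-elbo(x)` would be minimized with SGD or Adam over many minibatches, and for binary data the Gaussian reconstruction term is typically replaced by a Bernoulli cross-entropy.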
[^1]: The term *black-box variational inference* was first coined by **Ranganath et al.**, with the ideas being inspired by earlier work, such as the Wake-sleep algorithm.
[^2]: For more details as to why, have a look at the appendix of the paper by **Rezende et al.**

---

# Appendix

## Regularization Term: $D_{KL}(q_\phi(z|x)\Vert p_\theta(z))$ Under Gaussian Distribution

As we assumed before, the usual choice is that $q_\phi(z|x)$ and $p_\theta(z)$ follow Gaussian distributions, in which case the $KL$ divergence can be computed in closed form:
$$D_{KL}(\mathcal{N}(\mu_0,\Sigma_0)\Vert\mathcal{N}(\mu_1,\Sigma_1))=\frac{1}{2}\bigg(tr(\Sigma_1^{-1}\Sigma_0)+(\mu_1-\mu_0)^T\Sigma_1^{-1}(\mu_1-\mu_0)-k+\log \frac{det\Sigma_1}{det\Sigma_0}\bigg)$$
where $k$ is the dimensionality of the distribution. In our case, with $p_\theta(z)=\mathcal{N}(0,I)$, this simplifies to:
$$D_{KL}(\mathcal{N}(\mu,\Sigma)\Vert\mathcal{N}(0,I))=\frac{1}{2}\big(tr(\Sigma)+\mu^T\mu-k-\log det\Sigma\big)$$
Moreover, if $q(z|x) = \mathcal{N}(z;\vec{\mu}(x),diag(\vec{\sigma}(x))^2)$, i.e. $\Sigma=diag(\vec{\sigma}^2)$, then it can be simplified further to:
$$D_{KL}(\mathcal{N}(\mu,\Sigma)\Vert\mathcal{N}(0,I))=\frac{1}{2}\sum_{i=1}^{k}\big(\mu_i^2+\sigma_i^2-1-\log\sigma_i^2\big).$$
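As a quick check of the last formula, the sketch below (ours) compares it against the result from `torch.distributions`:

```python
import torch
from torch.distributions import Normal, kl_divergence

# Check the diagonal-Gaussian formula above against the library.
torch.manual_seed(0)
k = 5
mu = torch.randn(k)
sigma = torch.rand(k) + 0.1  # keep scales strictly positive

# Elementwise formula: 1/2 * sum_i (mu_i^2 + sigma_i^2 - 1 - log sigma_i^2)
by_formula = 0.5 * (mu**2 + sigma**2 - 1 - torch.log(sigma**2)).sum()

# Library result: KL factorizes over independent coordinates, so we sum them.
by_library = kl_divergence(Normal(mu, sigma), Normal(torch.zeros(k), torch.ones(k))).sum()

print(by_formula.item(), by_library.item())  # identical up to float error
```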