slides: https://hackmd.io/p/SkR7GhwqN#/
in practice, we assume a data generating process represented in the following DAG (Directed Acyclic Graph)
\[ \begin{align} z &\sim p_{\theta^*}(z) \\ x\vert z &\sim p_{\theta^*} (x\vert z) \end{align}\]
```graphviz
digraph G {
    splines=line;
    subgraph cluster1 {
        node [style=filled, shape=circle];
        edge [color=blue];
        z [fillcolor=white, color=black];
        z -> x;
    }
    theta [label="θ", shape=circle];
    edge [color=black, style="dashed"];
    theta -> z [constraint=false];
    theta -> x;
}
```
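This generative story is just ancestral sampling: first draw \(z\) from the prior, then draw \(x\) from the conditional. A purely illustrative instance in PyTorch (a linear-Gaussian model; \(W\), \(b\), \(\sigma\) are hypothetical stand-ins for \(\theta^*\)):

```python
import torch

# Illustrative linear-Gaussian instance of the DAG above.
# theta* = (W, b, sigma) are the (hypothetical) true generative parameters.
torch.manual_seed(0)
W, b, sigma = torch.randn(5, 2), torch.zeros(5), 0.1

z = torch.randn(2)                        # z ~ p_theta*(z) = N(0, I)
x = W @ z + b + sigma * torch.randn(5)    # x | z ~ p_theta*(x|z) = N(W z + b, sigma^2 I)
```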
Goal: given an iid dataset of size \(N\), \(X=\{x_i\}_{i=1}^N\), estimate \(\theta^*\)
standard approach: maximize data likelihood, marginalized over latent variables
\[ \hat{\theta}=\argmax_{\theta\in\Theta}p_{\theta}(X) =\argmax_{\theta\in\Theta}\sum_{i=1}^N\log{p_{\theta}(x_i)} \]
computing \(p_{\theta}(x_i)\) requires computing the intractable integral
\[ p_{\theta}(x_i) = \int p_{\theta}(x_i\vert z)p_{\theta}(z)dz \label{a}\tag{1} \]
the integral would have to be evaluated for each of the \(N\) datapoints; for a large dataset this makes batch optimization too costly and rules out sampling-based solutions such as Monte Carlo EM, which require an expensive sampling loop per datapoint
We need to get waaay smarter.
Goal: find \(\phi^*\) which minimizes the KL-divergence between \(q_{\phi^*}(z\vert x)\in\mathcal{Q}\) and \(p_{\theta}(z\vert x)\)
\[\begin{multline}\phi^* = \argmin_{\phi\in\Phi} D_{KL}[q_{\phi}(z\vert x)\vert\vert p_{\theta}(z\vert x)]= \\ \argmin_{\phi\in\Phi} \int q_{\phi}(z\vert x) \log{\frac{q_{\phi}(z\vert x)}{p_{\theta}(z\vert x)}}dz \end{multline}\]
we note two properties of \(D_{KL}\):
\[ \begin{align}
D_{KL}[q\vert\vert p] &\ge0 \ \forall p,q \label{b}\tag{2} \\
D_{KL}[q\vert\vert p] &= 0 \iff p = q \ \text{a.e.} \label{c}\tag{3} \end{align}\]
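As a quick numeric sanity check of Eq. \(\ref{b}\) and \(\ref{c}\), a toy sketch with two univariate Gaussians (using `torch.distributions`):

```python
import torch
from torch.distributions import Normal, kl_divergence

# Toy check of properties (2) and (3) for two univariate Gaussians.
q = Normal(loc=0.0, scale=1.0)
p = Normal(loc=1.0, scale=2.0)

print(kl_divergence(q, p))   # strictly positive, since p != q
print(kl_divergence(q, q))   # exactly zero, since the distributions coincide
```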
Our primary goal is still to estimate \(\theta^*\) through Maximum (Marginal) Likelihood Estimation
We can rewrite \(\log{p_{\theta}(x_i)}\) in terms of \(D_{KL}[q_{\phi}\vert\vert p_{\theta}]\) as
\[ \log{p_{\theta}(x_i)} = D_{KL}[q_{\phi}(z\vert x_i)\vert\vert p_{\theta}(z\vert x_i)]+ \mathcal{L}(\phi,\theta;x_i) \label{d}\tag{4}\]
\(\mathcal{L}(\phi,\theta;x_i) = \int q_{\phi}(z\vert x_i)\log{\frac{p_{\theta}(x_i, z)}{q_{\phi}(z\vert x_i)}} dz\) is called ELBO (Evidence Lower BOund)
see nested slides for proofs
\[ \begin{multline}D_{KL}[q_{\phi}(z\vert x_i)\vert\vert p_{\theta}(z\vert x_i)]= \mathbb{E}_{q_{\phi}(z\vert x_i)}\left[\log{\frac{q_{\phi}(z\vert x_i)}{p_{\theta}(z\vert x_i)}}\right] = \\ \mathbb{E}_{q_{\phi}(z\vert x_i)}\left[\log{\frac{q_{\phi}(z\vert x_i)p_{\theta}(x_i)}{p_{\theta}(x_i, z)}}\right]= \\ \mathbb{E}_{q_{\phi}(z\vert x_i)}\left[\log{\frac{q_{\phi}(z\vert x_i)}{p_{\theta}(x_i, z)}}\right]+\mathbb{E}_{q_{\phi}(z\vert x_i)}\left[\log{p_{\theta}(x_i)}\right]=\\ -\mathcal{L}(\phi,\theta;x_i)+\log{p_{\theta}(x_i)} \end{multline} \]
\[ \begin{multline}\log{p_{\theta}(x_i)} = \log{p_{\theta}(x_i)}\int q_{\phi}(z\vert x_i)dz = \\ \int q_{\phi}(z\vert x_i)\log{p_{\theta}( x_i)} dz = \\ \int q_{\phi}(z\vert x_i)\log{\frac{p_{\theta}( x_i, z)}{p_{\theta}(z\vert x_i)}} dz = \\ \int q_{\phi}(z\vert x_i) \log{\frac{p_{\theta}( x_i, z)}{p_{\theta}(z\vert x_i)}\frac{q_{\phi}(z\vert x_i)}{q_{\phi}(z\vert x_i)}} dz \end{multline}\]
\[ \begin{multline}\log{p_{\theta}(x_i)} = \int q_{\phi}(z\vert x_i) \log{\frac{q_{\phi}(z\vert x_i)}{p_{\theta}(z\vert x_i)}\frac{p_{\theta}(x_i, z)}{q_{\phi}(z\vert x_i)}}dz = \\ \int q_{\phi}(z\vert x_i) \left( \log{\frac{q_{\phi}(z\vert x_i)}{p_{\theta}(z\vert x_i)}}+\log{\frac{p_{\theta}(x_i, z)}{q_{\phi}(z\vert x_i)}} \right) dz= \\ D_{KL}[q_{\phi}(z\vert x_i)\vert\vert p_{\theta}(z\vert x_i)]+ \mathcal{L}(\phi,\theta; x_i) \end{multline}\]
By Eq. \(\ref{b}\), the ELBO is indeed a lower bound on the log-evidence:
\[ \log{p_{\theta}(x_i)} \geq \mathcal{L}(\phi,\theta; x_i) \]
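Expanding \(p_{\theta}(x_i,z)=p_{\theta}(x_i\vert z)p_{\theta}(z)\), the ELBO can equivalently be written as a reconstruction term minus a KL term to the prior, the form most implementations actually optimize:
\[ \mathcal{L}(\phi,\theta;x_i) = \mathbb{E}_{q_{\phi}(z\vert x_i)}\left[\log{p_{\theta}(x_i\vert z)}\right] - D_{KL}[q_{\phi}(z\vert x_i)\vert\vert p_{\theta}(z)] \]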
so instead of the intractable marginal likelihood, we maximize the ELBO as a surrogate objective:
\[ \max_{\theta}\sum_{i=1}^N\max_{\phi}\mathcal{L}(\phi,\theta;x_i) \label{e}\tag{5}\]
Note that until now, we haven't mentioned either neural networks or VAE. The approach has been very general, and it could apply to any Latent Variable Model which has the DAG representation shown initially.
We have solved the intractability problem, but how do we maximize the BBVI objective in a scalable way?
Stochastic Gradient Ascent!
We need SG-based estimators for the ELBO and its gradient with respect to \(\theta\) and \(\phi\)
\[ \nabla_{\theta,\phi}\mathbb{E}_{q_{\phi}(z\vert x_i)}[\log{p_{\theta}(x_i, z)}-\log{q_{\phi}(z\vert x_i)}] \]
The gradient with respect to \(\theta\) is immediate, since \(q_{\phi}\) does not depend on \(\theta\):
\[ \mathbb{E}_{q_{\phi}(z\vert x_i)}[\nabla_{\theta}\log{p_{\theta}(x_i, z)}] \]
we can estimate the expectation using Monte Carlo.
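A minimal, self-contained sketch of this Monte Carlo estimate on a toy model (not the paper's setup: here \(p_{\theta}(x,z)=\mathcal{N}(z;0,1)\,\mathcal{N}(x;\theta z,1)\), and a fixed Gaussian stands in for \(q_{\phi}(z\vert x_i)\); all names illustrative):

```python
import torch

# Monte Carlo estimate of grad_theta E_q[log p_theta(x_i, z)] on a 1-D toy model.
theta = torch.tensor(0.5, requires_grad=True)          # scalar "generative parameter"
x_i = torch.tensor(1.3)                                # one observed datapoint
q = torch.distributions.Normal(loc=0.8, scale=0.6)     # stands in for q_phi(z | x_i)

z = q.sample((1000,))                                  # z ~ q_phi(z | x_i); q does not depend on theta
log_joint = (torch.distributions.Normal(0.0, 1.0).log_prob(z)             # log p_theta(z)
             + torch.distributions.Normal(theta * z, 1.0).log_prob(x_i))  # log p_theta(x_i | z)
mc_estimate = log_joint.mean()                         # MC estimate of E_q[log p_theta(x_i, z)]
mc_estimate.backward()                                 # theta.grad now holds the gradient estimate
print(theta.grad)
```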
The gradient with respect to \(\phi\) is trickier:
\[ \nabla_{\phi}\mathbb{E}_{q_{\phi}(z\vert x_i)}[\log{p_{\theta}(x_i, z)}-\log{q_{\phi}(z\vert x_i)}] \]
since the distribution over which the expectation is taken, \(q_{\phi}\), itself depends on \(\phi\), we cannot simply move the gradient inside the expectation.
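For comparison, the generic workaround is the score-function (REINFORCE) identity, which moves the gradient onto \(\log{q_{\phi}}\); its Monte Carlo estimates are unbiased but typically have very high variance:
\[ \nabla_{\phi}\mathbb{E}_{q_{\phi}(z\vert x_i)}[f(z)] = \mathbb{E}_{q_{\phi}(z\vert x_i)}\left[f(z)\nabla_{\phi}\log{q_{\phi}(z\vert x_i)}\right] \]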
The key contribution of Kingma and Welling, 2014 is the introduction of a low-variance estimator for this gradient, the SGVB (Stochastic Gradient Variational Bayes) estimator, based on the reparametrization trick.
The reparametrization trick expresses \(z\sim q_{\phi}(z\vert x_i)\) as a deterministic transformation \(z=g_{\phi}(\epsilon,x_i)\) of an auxiliary noise variable \(\epsilon\sim p(\epsilon)\). Its biggest selling point is that we can now write \(\nabla_{\phi}\mathbb{E}_{q_{\phi}(z\vert x_i)}[f(z)]\) for any function \(f(z)\) as
\[ \nabla_{\phi}\mathbb{E}_{q_{\phi}(z\vert x_i)}[f(z)]=\nabla_{\phi}\mathbb{E}_{p(\epsilon)}[f(g_{\phi}(\epsilon,x_i))]=\mathbb{E}_{p(\epsilon)}[\nabla_{\phi}f(g_{\phi}(\epsilon,x_i))]\]
Using Monte Carlo to estimate this expectation, we obtain the SGVB estimator, which has lower variance than other SG-based estimators such as the score function estimator, allowing us to learn more complicated models.
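A minimal PyTorch sketch of the reparametrized gradient for a diagonal-Gaussian \(q_{\phi}\) (here \(\mu\) and \(\log\sigma\) are hypothetical stand-ins for encoder outputs):

```python
import torch

# Reparametrization trick for a diagonal-Gaussian q_phi(z|x):
# z = mu + sigma * eps with eps ~ N(0, I), so gradients flow back into phi = (mu, log_sigma).
mu = torch.tensor([0.3, -0.1], requires_grad=True)
log_sigma = torch.tensor([-0.5, 0.2], requires_grad=True)

eps = torch.randn(100, 2)                    # eps ~ p(eps) = N(0, I); 100 Monte Carlo samples
z = mu + torch.exp(log_sigma) * eps          # z = g_phi(eps, x_i), differentiable in phi

f = (z ** 2).sum(dim=1).mean()               # any differentiable f(z), averaged over the eps samples
f.backward()                                 # mu.grad, log_sigma.grad: SGVB-style gradient estimates
print(mu.grad, log_sigma.grad)
```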
What about discrete latent variables? See van den Oord et al., 2017, with their famous VQ-VAE.
SGVB allows us to estimate the ELBO for a single datapoint, but we need to estimate it for all \(N\). To do this, we use minibatches of size \(M\) (from Kingma and Welling, 2014)
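Concretely, drawing a random minibatch \(X^M=\{x_i\}_{i=1}^M\) from the full dataset, the full-dataset ELBO is estimated by rescaling the minibatch sum:
\[ \mathcal{L}(\phi,\theta;X) \simeq \frac{N}{M}\sum_{i=1}^M\mathcal{L}(\phi,\theta;x_i) \]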
Note: even if \(p_{\theta}(z)\) and \(p_{\theta}(x\vert z)\) are multivariate Gaussian, this doesn't prevent \(p_{\theta}(x)\) from being very complex, because according to Eq. \(\ref{a}\) it's a mixture of an infinite number of Gaussians.
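For instance, with a standard Gaussian prior and a Gaussian decoder whose mean \(\mu_{\theta}(z)\) and variance \(\sigma^2_{\theta}(z)\) are nonlinear functions of \(z\) (names purely illustrative), Eq. \(\ref{a}\) reads
\[ p_{\theta}(x) = \int \mathcal{N}(x;\mu_{\theta}(z),\sigma^2_{\theta}(z)I)\,\mathcal{N}(z;0,I)\,dz \]
i.e. a continuous mixture of Gaussians indexed by \(z\).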
The samples generated by the original VAE on MNIST have the classic, blurry "VAE" look:
More recent results training the Bidirectional-Inference Variational Autoencoder (BIVA) (Maaløe et al., 2019) on CelebA are much better:
The Two-Stage VAE (Dai & Wipf, 2019) does even better:
It still doesn't match the performance of models such as StyleGAN or BigGAN, though. But don't lose hope!
The impressive DeepMind VQ-VAE-2 from Razavi et al., 2019 not only generates images of comparable visual quality to the most advanced GAN models…
…but it also performs better according to the newly introduced Classification Accuracy Score (CAS):
The topic-guided variational autoencoder (TGVAE) (Wang et al., 2019), presented at ACL 2019, is a nice example of a language model built by combining topic modelling with a VAE.