# VAEs
## How to Prepare
Know that [Kingma & Welling 2013](https://arxiv.org/pdf/1312.6114.pdf) exists, but don't spend too much time on it, since we now have better terminology to explain the main ideas.
**If you have 5 minutes...**
* skim Jaan Altosaar's [blog post](https://jaan.io/what-is-variational-autoencoder-vae-tutorial/) comparing the deep learning and probabilistic modeling perspectives on VAEs; in particular:
  * read the paragraph starting with "What about the model parameters?"
  * read the section "Mean-field versus amortized inference"
**If you have only 1 hour...**
- skim Zhang 2018, ["Advances in Variational Inference"](https://arxiv.org/pdf/1711.05597.pdf)
- Section 6 has a definition of *amortized inference* and VAEs
- skim _one_ of the following to learn about the *amortization gap*:
- Cremer 2018, ["Inference Suboptimality in Variational Inference"](https://arxiv.org/abs/1801.03558)
- Krishnan 2017, ["On the challenges of learning with inference networks on sparse, high-dimensional data"](https://arxiv.org/abs/1710.06085)
- Kim et al. 2018, [Semi-Amortized Variational Autoencoders](http://proceedings.mlr.press/v80/kim18e.html)
**If you have 2-3 hours...**
Look at some of the optional references below.
* Marino et al. 2018, [“Iterative Amortized Inference”](https://arxiv.org/pdf/1807.09356.pdf)
* Shu et al. 2018, [“Amortized Inference Regularization”](https://papers.nips.cc/paper/7692-amortized-inference-regularization.pdf)
* Hoffman 2017, ["Learning Deep Latent Gaussian Models with Markov Chain Monte Carlo"](http://proceedings.mlr.press/v70/hoffman17a/hoffman17a.pdf)
* Cremer 2017, ["Reinterpreting Importance-Weighted Autoencoders"](https://arxiv.org/pdf/1704.02916.pdf)
* Shu et al. 2019, [“Training Variational Autoencoders with Buffered Stochastic Variational Inference”](https://arxiv.org/pdf/1902.10294.pdf)
## Key Ideas
* Amortized Variational Inference
* Amortization gap, approximation gap
* problems associated with optimizing variational parameters and model parameters jointly
* reparameterization gradient
* (optional): embedding MC samplers into variational distributions
## Papers
### Kingma & Welling 2013 - Auto-Encoding Variational Bayes
Setup:
- dataset $X = \{x^i\}_{i=1}^N$ with $x^i$ continuous or discrete
- per-datapoint continuous latent variable $z^i$
- Model $p_\theta(x,z) =p_\theta(z)p_\theta(x|z)$
- $\theta$ not considered random
- VI for $z$ and a point estimate for $\theta$ (sometimes called variational EM)
- marginal likelihood $\log p_\theta(x^1,\dots,x^N) = \sum_i \log p_\theta(x^i)$
- $\log p_\theta(x^i) = KL(q_\lambda(z)||p_\theta(z|x^i)) + \mathbb{E}_{q_\lambda(z)}[\log p_\theta (x^i,z) - \log q_\lambda(z)]$ (quick derivation below)
- last term is ELBO($\lambda$,$\theta$); optimize w.r.t. $\lambda,\theta$ jointly (problematic, as the papers below discuss)
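For reference, the identity above follows by expanding the KL to the true posterior and pulling the constant $\log p_\theta(x^i)$ out of the expectation:

$$ KL(q_\lambda(z)||p_\theta(z|x^i)) = \mathbb{E}_{q_\lambda(z)}[\log q_\lambda(z) - \log p_\theta(x^i,z)] + \log p_\theta(x^i) $$

Since the KL term is nonnegative, the ELBO is a lower bound on $\log p_\theta(x^i)$, and it is tight exactly when $q_\lambda(z) = p_\theta(z|x^i)$.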
This paper:
- introduce _recognition model_ $q_\phi(z|x)$ to approximate $p_\theta(z|x)$
- $q_\phi(z|x)$ defined via $\lambda^i = Encoder_\phi(x^i)$ and $z^i \sim q_{\lambda^i}$
- generally called amortization.
- alternative: a separate $\lambda^i$ optimized for each datapoint $x^i$
- idea mentioned in late 90s Jordan paper
Particular well-known VAE from this paper:
- $p_\theta(z) = N(z|0, I)$ (no parameters $\theta$)
- $p_\theta(x|z)$ is e.g. diagonal MVN with params $(\mu,\sigma) = Decoder_\theta(z)$
- $q$ is amortized so $q_\phi(z|x^i)$ is diagonal MVN with $(\mu,\sigma) = Encoder_\phi(x^i)$
- reparameterization gradient used
- doubly stochastic (MC samples for the $q$ expectation and minibatching over the data)
- important: the decoder output for a given $z$ is not a sample of $x$; it parameterizes $p(x|z)$ (see the sketch below)
- subtle interplay among the gradient estimator, the optimization, and the variational family
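A minimal PyTorch sketch of this particular VAE (layer sizes and names are illustrative, not from the paper; the Gaussian likelihood follows the $(\mu,\sigma) = Decoder_\theta(z)$ choice above). Note how the decoder returns parameters of $p_\theta(x|z)$, and the only stochasticity in the forward pass is the reparameterized noise $\epsilon$:

```python
import math
import torch
import torch.nn as nn

class GaussianVAE(nn.Module):
    """Encoder q_phi(z|x) and decoder p_theta(x|z), both diagonal Gaussians; prior N(0, I)."""

    def __init__(self, x_dim=784, z_dim=20, hidden=400):  # sizes are illustrative
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.enc_mu = nn.Linear(hidden, z_dim)
        self.enc_logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU())
        self.dec_mu = nn.Linear(hidden, x_dim)
        self.dec_logvar = nn.Linear(hidden, x_dim)

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)        # (mu, sigma) = Encoder_phi(x)
        eps = torch.randn_like(mu)                             # reparameterization:
        z = mu + torch.exp(0.5 * logvar) * eps                 # z = mu + sigma * eps
        hd = self.dec(z)
        x_mu, x_logvar = self.dec_mu(hd), self.dec_logvar(hd)  # parameters of p(x|z), NOT a sample of x
        return mu, logvar, x_mu, x_logvar

def elbo(x, mu, logvar, x_mu, x_logvar):
    # single-sample MC estimate of E_q[log p(x|z)] for a diagonal Gaussian likelihood
    log_lik = -0.5 * ((x - x_mu) ** 2 / x_logvar.exp() + x_logvar + math.log(2 * math.pi)).sum(-1)
    # closed-form KL(q(z|x) || N(0, I)) for diagonal Gaussians
    kl = 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).sum(-1)
    return (log_lik - kl).mean()
```

Training maximizes `elbo` (i.e. minimizes its negative) on minibatches of `x`, which is where the "doubly stochastic" point above comes from.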
Aside, gradient estimators:
- Gradient of ELBO($\phi$,$\theta$) w.r.t. $\phi$ defined via expectation
- so it needs an estimator (same in non-amortized case)
- general trick: swap grad and expectation
- More generally: want $\nabla_\phi \mathbb{E}_{z \sim q_\phi}[f(z)]$
- Score estimator:
- $\nabla_\phi \mathbb{E}_{z \sim q_\phi}[f(z)] = \mathbb{E}_{z \sim q_\phi}[f(z) \nabla_{\phi} \log q_\phi(z)]$
- Approximate with $\frac{1}{L} \sum_\ell f(z^\ell) \nabla_{\phi} \log q_\phi (z^\ell)$ with $\{z^\ell\} \sim q_\phi(z)$
- used in RL because it only needs evaluations of $f$ at samples
- used when $f$ is non-differentiable
- high variance; usually needs variance-reduction tricks (e.g. baselines / control variates)
- Reparameterization:
- write sample $z \sim q_\phi$ as deterministic transformation $r()$ of parameters $\phi$ and noise:
- $z = r(\epsilon, \phi)$
- noise $\epsilon$ is drawn from an auxiliary distribution $p$ with no dependence on $\phi$
- Allows gradient to pass through expectation and be approximated by MC
- $\nabla_\phi \mathbb{E}_{z \sim q}[f(z)] = \mathbb{E}_{\epsilon \sim p} [\nabla_\phi f(r(\epsilon,\phi))]$
- Approximate with $\frac{1}{L} \sum_\ell [\nabla_\phi f(r(\epsilon^\ell,\phi))]$ with $\{\epsilon^\ell\} \sim p$
- not always easy to find a reparameterization (see the toy comparison of the two estimators below)
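A toy PyTorch comparison of the two estimators, with $q_\phi = N(\mu, \sigma^2)$ in 1-D and $f(z) = z^2$ (both choices are just for illustration); the exact gradient w.r.t. $\mu$ is $2\mu$:

```python
import torch

mu = torch.tensor(1.0, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)   # sigma = 1
f = lambda z: z ** 2                                 # E_q[f(z)] = mu^2 + sigma^2
L = 10_000                                           # number of MC samples

# --- Score-function estimator: E_q[f(z) * grad log q(z)] ---
q = torch.distributions.Normal(mu, log_sigma.exp())
z = q.sample((L,))                                   # samples carry no gradient
score_obj = (f(z).detach() * q.log_prob(z)).mean()   # grad of this w.r.t. mu is the estimator
g_score = torch.autograd.grad(score_obj, mu)[0]

# --- Reparameterization estimator: z = mu + sigma * eps, eps ~ N(0, 1) ---
eps = torch.randn(L)
z = mu + log_sigma.exp() * eps                       # gradient flows through z
g_reparam = torch.autograd.grad(f(z).mean(), mu)[0]

# both are unbiased estimates of 2*mu = 2.0; the score estimate typically has higher variance
print(g_score.item(), g_reparam.item())
```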
### Krishnan - On the challenges of learning with inference networks on sparse, high-dimensional data 2017
(https://arxiv.org/pdf/1710.06085.pdf)
- typical VAE setup:
- jointly optimize encoder parameters $\phi$ and decoder parameters $\theta$
- encoder does well $\to$ gradients for $\theta$ based on tight lower bound on $\log p(x)$
- encoder does poorly $\to$ bad $\theta$ updates
- SVI (stochastic variational inference, 2013):
- $\lambda^i$ for each $q_{\lambda^i}(z^i)$ optimized (effectively) until convergence before taking a step for $\theta$.
- This paper: mix the two
- use $\lambda^i = Encoder(x^i)$ as _warm start_
- optimize ELBO($\lambda^i,\theta$) w.r.t. $\lambda^i$ (effectively) until convergence, $\theta$ fixed
- once $q$ is reliable, differentiate the ELBO w.r.t. $\theta$ and update (sketched below)
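A sketch of that loop, assuming hypothetical interfaces: `model.encode(x)` returns the encoder's $(\mu, \log\sigma^2)$, `elbo_fn(x, mu, logvar, model)` evaluates the ELBO using the given local variational parameters and the model's decoder, and `theta_opt` is an optimizer over decoder parameters only:

```python
import torch

def svi_refine_step(model, x, elbo_fn, theta_opt, inner_steps=20, inner_lr=1e-2):
    """One warm-start-then-refine update (sketch; interfaces above are assumed)."""
    # 1. amortized warm start: lambda^i = Encoder_phi(x^i)
    mu, logvar = model.encode(x)
    mu = mu.detach().requires_grad_(True)        # detach so the encoder is untouched here
    logvar = logvar.detach().requires_grad_(True)

    # 2. refine lambda^i by gradient ascent on ELBO(lambda^i, theta), theta held fixed
    local_opt = torch.optim.SGD([mu, logvar], lr=inner_lr)
    for _ in range(inner_steps):
        local_opt.zero_grad()
        (-elbo_fn(x, mu, logvar, model)).backward()   # decoder grads accumulate but are zeroed below
        local_opt.step()

    # 3. with the (now more reliable) q, take a gradient step on theta
    theta_opt.zero_grad()
    (-elbo_fn(x, mu.detach(), logvar.detach(), model)).backward()
    theta_opt.step()
```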
### Cremer 2018, "Inference Suboptimality in VAEs"
(https://arxiv.org/abs/1801.03558)
- builds on Krishnan work just above
- studies where the gap between the log-likelihood and the ELBO comes from
- decomposes it into the _approximation gap_ (the variational family is not expressive enough to contain $p(z|x)$) and the _amortization gap_ (the encoder's output $\lambda^i$ is worse than the best variational parameters within the family)
- requiring the variational parameters to be a parametric function of the input may be too restrictive
- if amortization isn't the bottleneck, then even when the variational distribution is not expressive enough, the generative model can adapt to compensate
### Kim - Semi-Amortized Variational Autoencoders 2018
- also concerned with the _amortization gap_
- the gap propagates: a suboptimal $q$ leads to learning suboptimal decoder parameters $\theta$, as argued in Krishnan/Cremer
- backprops _through_ the SVI steps of the Krishnan algorithm!
- needs the total derivative to backprop through the SVI updates and uses Hessian-vector product tricks to make this tractable (a naively unrolled version is sketched below)
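A sketch of the semi-amortized idea with the inner SVI loop naively unrolled, so autograd computes the total derivative directly (same hypothetical `model.encode` / `elbo_fn` interfaces as above); the paper's Hessian-vector-product tricks replace this naive unrolling to keep memory and compute manageable:

```python
import torch

def savae_loss(model, x, elbo_fn, inner_steps=5, inner_lr=1e-2):
    """Semi-amortized loss (sketch): the inner refinement stays in the graph,
    so calling .backward() on the result gives encoder AND decoder the
    total-derivative gradient through the SVI steps."""
    mu, logvar = model.encode(x)                 # warm start, still connected to phi
    for _ in range(inner_steps):
        elbo = elbo_fn(x, mu, logvar, model)
        # create_graph=True keeps each inner update differentiable w.r.t. phi and theta
        g_mu, g_logvar = torch.autograd.grad(elbo, (mu, logvar), create_graph=True)
        mu = mu + inner_lr * g_mu                # gradient *ascent* on the ELBO
        logvar = logvar + inner_lr * g_logvar
    return -elbo_fn(x, mu, logvar, model)        # minimize this with the usual optimizer
```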
### Shu et al. NeurIPS 2018, "Amortized Inference Regularization"
(https://papers.nips.cc/paper/7692-amortized-inference-regularization.pdf)
## Below is very optional!
### Cremer - Reinterpreting Importance-Weighted Autoencoders (ICLR 2017 workshop)
(https://arxiv.org/pdf/1704.02916.pdf)
- a reinterpretation of Importance Weighted Autoencoders by Burda, 2016 (https://arxiv.org/pdf/1509.00519.pdf)
- original view: tighten the bound on $p(x)$ by importance sampling with multiple $z$ samples when estimating the expectation over $q$ in the ELBO
- alternate view: the importance sampler actually induces a more expressive $q$ and the standard ELBO bound is used
- VAE by Kingma/Welling optimizes this lower bound on $p(x)$: $$ \log p(x) \geq \mathbb{E}_{z \sim q(z|x)} \Bigg [ \log \frac{p(x,z)}{q(z|x)} \Bigg ] $$
- IWAE (Burda 2016) instead optimizes: $$\log p(x) \geq \mathbb{E}_{z_1,...z_K \sim q(z|x)} \Bigg[ \log \Big[ \frac{1}{K} \sum_k \frac{p(x,z_k)}{q(z_k|x)} \Big] \Bigg]$$
- observation: this induces distribution $\tilde{q}(z|x,z_{2:K})$ for $z=z_1$: $$ \tilde{q}(z|x,z_{2:K}) = \frac{ p(x,z) / q(z|x)} {\frac{1}{K} \sum_{j=1}^K \frac{p(x,z_j)}{q(z_j|x)} } q(z|x) = \frac { p(x,z) } { \frac{1}{K} \Big ( \frac{p(x,z)}{q(z|x)} + \sum_{j=2}^K \frac{p(x,z_j)}{q(z_j|x)} \Big ) } $$
- for $K=1$, $\tilde{q} = q$
- for $K\geq2$, $\tilde{q}(z|x,z_{2:K})$ depends on the true $p(z|x)$ through the importance weights
- $\mathbb{E}_{z_2,...z_K}[\tilde{q}(z|x,z_{2:K})] \to p(z|x)$ pointwise as $K \to \infty$
- Conclusion: IWAE can be interpreted as creating a more powerful $q$ by embedding an MC sampler (self-normalized importance sampling); see the sketch below
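A sketch of the IWAE objective itself, with assumed (hypothetical) helpers for sampling from $q$ and evaluating $\log p_\theta(x,z)$ and $\log q_\phi(z|x)$; the reinterpretation above says optimizing this is the same as optimizing a standard ELBO with the richer, sampler-induced $\tilde{q}$:

```python
import math
import torch

def iwae_bound(x, model, K=16):
    """log p(x) >= E[ log (1/K) sum_k p(x, z_k) / q(z_k|x) ], with z_k ~ q(z|x).

    Assumed helpers (not a real library API):
      model.sample_q(x, K)  -> z of shape (K, batch, z_dim)
      model.log_joint(x, z) -> log p(x, z),  shape (K, batch)
      model.log_q(x, z)     -> log q(z | x), shape (K, batch)
    """
    z = model.sample_q(x, K)
    log_w = model.log_joint(x, z) - model.log_q(x, z)        # log importance weights
    # log of the average weight, computed stably via logsumexp
    return (torch.logsumexp(log_w, dim=0) - math.log(K)).mean()
```

The self-normalized weights $w_k / \sum_j w_j$ are exactly the factor that turns $q$ into $\tilde{q}$; resampling one of the $z_k$ in proportion to them (sampling-importance-resampling) gives increasingly accurate posterior samples as $K$ grows.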
### Hoffman - Learning Deep Latent Gaussian Models with Markov Chain Monte Carlo (ICML 2017)
(http://proceedings.mlr.press/v70/hoffman17a/hoffman17a.pdf)
- Same idea as the Cremer reinterpretation of IWAE above
- embeds HMC sampler into variational approximation
- to tighten the lower bound
- improves $\theta$ updates
- variational parameters $\lambda^i = Encoder_\phi(x^i)$
- $z_0 \sim q_{\lambda^i}$
- HMC sampler initialized with $z_0$
- last sample $z_M$ from HMC used in ELBO computation
- again, this induces a distribution over $z_M$
- "MCMC’s great advantage is that it allows us to trade computation for accuracy without limit"
- also presents the problem of "variational pruning"
- one way to decrease KL($q(z|x)$ || $p(z|x)$ ) is to make $p(x|z)$ depend less on some dimensions of $z$.
- bad because we specified the model to have a certain latent dimensionality
- the capacity of shallow models depends heavily on the latent dimension (fewer mixture components -> weaker model)
- but the capacity of deep models depends on both the latent dimension and the decoder's complexity
- even with $z \in \mathbb{R}$, a deep model can approximate smooth densities on $\mathbb{R}^D$
- Interesting, but unclear to me how the proposed algorithm above specifically deals with pruning, rather than dealing with more general amortization/approximation gap
- They do present a cool check for the usefulness of latent dimensions though: the eigenvalues of the Jacobian of the likelihood parameters w.r.t. $z$ (sketched below)
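A sketch of that diagnostic (the decoder interface is assumed): compute the Jacobian of the likelihood parameters w.r.t. $z$ at a point and inspect its spectrum; latent directions with near-zero singular values (equivalently, near-zero eigenvalues of $J^\top J$) are effectively ignored by the decoder.

```python
import torch

def latent_usage(decoder, z):
    """decoder: callable mapping z of shape (z_dim,) to likelihood parameters of
    shape (param_dim,), a hypothetical interface. Returns one value per latent
    direction; tiny values indicate pruned / unused directions of z."""
    J = torch.autograd.functional.jacobian(decoder, z)   # shape (param_dim, z_dim)
    return torch.linalg.svdvals(J)
```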
### Marino et al. 2018, ["Iterative Amortized Inference"](https://arxiv.org/pdf/1807.09356.pdf)
### Shu et al. 2019, ["Training Variational Autoencoders with Buffered Stochastic Variational Inference"](https://arxiv.org/pdf/1902.10294.pdf)