# VAEs

## How to Prepare

Know that [Kingma & Welling 2013](https://arxiv.org/pdf/1312.6114.pdf) exists, but don't spend too much time on it, since we now have better terminology to explain the main ideas.

**If you have 5 minutes...**

* skim Jaan Altosaar's [blog post](https://jaan.io/what-is-variational-autoencoder-vae-tutorial/) comparing the deep learning vs. probabilistic modeling perspectives on VAEs
  * read the paragraph starting with "What about the model parameters?"
  * read the section "Mean-field versus amortized inference"

**If you have only 1 hour...**

- skim Zhang 2018, ["Advances in Variational Inference"](https://arxiv.org/pdf/1711.05597.pdf)
  - Section 6 has a definition of *amortized inference* and VAEs
- skim _one_ of the following to learn about the *amortization gap*:
  - Cremer 2018, ["Inference Suboptimality in Variational Autoencoders"](https://arxiv.org/abs/1801.03558)
  - Krishnan 2017, ["On the challenges of learning with inference networks on sparse, high-dimensional data"](https://arxiv.org/abs/1710.06085)
  - Kim et al. 2018, ["Semi-Amortized Variational Autoencoders"](http://proceedings.mlr.press/v80/kim18e.html)

**If you have 2-3 hours...**

Look at some of the optional references below.

* Marino et al. 2018, ["Iterative Amortized Inference"](https://arxiv.org/pdf/1807.09356.pdf)
* Shu et al. 2018, ["Amortized Inference Regularization"](https://papers.nips.cc/paper/7692-amortized-inference-regularization.pdf)
* Hoffman 2017, ["Learning Deep Latent Gaussian Models with Markov Chain Monte Carlo"](http://proceedings.mlr.press/v70/hoffman17a/hoffman17a.pdf)
* Cremer 2017, ["Reinterpreting Importance-Weighted Autoencoders"](https://arxiv.org/pdf/1704.02916.pdf)
* Shu et al. 2019, ["Training Variational Autoencoders with Buffered Stochastic Variational Inference"](https://arxiv.org/pdf/1902.10294.pdf)

## Key Ideas

* Amortized variational inference
* Amortization gap, approximation gap
* Problems associated with optimizing variational parameters and model parameters jointly
* Reparameterization gradient
* (optional) Embedding MC samplers into variational distributions

## Papers

### Auto-Encoding Variational Bayes (will shorten this later)

Setup:

- dataset $X = \{x^i\}_{i=1}^N$ with $x^i$ continuous or discrete
- per-datapoint continuous latent variable $z^i$
- model $p_\theta(x,z) = p_\theta(z)\,p_\theta(x|z)$
  - $\theta$ is not treated as random
  - VI for $z$ and MAP for $\theta$ (sometimes called variational EM)
- marginal likelihood $\log p_\theta(x^1, \ldots, x^N) = \sum_i \log p_\theta(x^i)$
- $\log p_\theta(x^i) = KL(q_\lambda(z)\,\|\,p_\theta(z|x^i)) + \mathbb{E}_{q_\lambda(z)}[\log p_\theta(x^i, z) - \log q_\lambda(z)]$
  - the last term is ELBO($\lambda$, $\theta$); optimize w.r.t. $\lambda, \theta$ jointly (which causes the problems discussed below)

This paper:

- introduces a _recognition model_ $q_\phi(z|x)$ to approximate $p_\theta(z|x)$
- $q_\phi(z|x)$ is defined via $\lambda^i = Encoder_\phi(x^i)$ and $z^i \sim q_{\lambda^i}$
  - generally called amortization (see the sketch below)
- alternative: $\lambda^i$ optimized separately for each data point $x^i$
  - the idea appears already in a late-90s Jordan paper
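To make the distinction concrete, here is a minimal sketch (PyTorch; the layer sizes, variable names, and stand-in data are illustrative assumptions, not taken from the paper) of amortized variational parameters produced by a shared encoder versus free per-datapoint parameters of the kind classic SVI would optimize directly.

```python
# Minimal sketch (PyTorch); shapes and layer sizes are arbitrary illustrative choices.
import torch
import torch.nn as nn

N, x_dim, z_dim = 1000, 784, 20
X = torch.randn(N, x_dim)  # stand-in dataset

# Amortized inference: one shared encoder; lambda^i = Encoder_phi(x^i).
encoder = nn.Sequential(nn.Linear(x_dim, 400), nn.ReLU(), nn.Linear(400, 2 * z_dim))
mu_amortized, log_sigma_amortized = encoder(X).chunk(2, dim=-1)  # each (N, z_dim)

# Non-amortized alternative: a free (mu^i, log_sigma^i) pair per datapoint,
# each optimized directly against its own ELBO (no encoder network involved).
mu_free = nn.Parameter(torch.zeros(N, z_dim))
log_sigma_free = nn.Parameter(torch.zeros(N, z_dim))
```

The amortized version prices inference for a new $x$ at one encoder forward pass; the free-parameter version requires a separate optimization per datapoint, which is exactly the trade-off the papers below revisit.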
The particular well-known VAE from this paper:

- $p_\theta(z) = N(z \mid 0, I)$ (no parameters $\theta$)
- $p_\theta(x|z)$ is e.g. a diagonal MVN with parameters $(\mu, \sigma) = Decoder_\theta(z)$
- $q$ is amortized, so $q_\phi(z|x^i)$ is a diagonal MVN with $(\mu, \sigma) = Encoder_\phi(x^i)$
- the reparameterization gradient is used
  - doubly stochastic (over the $q$ integral and over the data)
- important: the decoder output for a given $z$ is not a sample of $x$; rather, it parameterizes $p(x|z)$
- there are subtle relationships among estimators, optimization, and the variational family

Aside, gradient estimators:

- the gradient of ELBO($\phi$, $\theta$) w.r.t. $\phi$ is defined via an expectation
  - so it needs an estimator (same in the non-amortized case)
  - general trick: swap gradient and expectation
- more generally, we want $\nabla_\phi \mathbb{E}_{z \sim q_\phi}[f(z)]$ (the two estimators below are compared numerically in the sketch after this list)
- Score estimator:
  - $\nabla_\phi \mathbb{E}_{z \sim q_\phi}[f(z)] = \mathbb{E}_{z \sim q_\phi}[f(z) \nabla_{\phi} \log q_\phi(z)]$
  - approximate with $\frac{1}{L} \sum_\ell f(z^\ell) \nabla_{\phi} \log q_\phi(z^\ell)$ with $\{z^\ell\} \sim q_\phi(z)$
  - used in RL because it only needs samples of $f$
  - used when $f$ is non-differentiable
  - high variance; needs tricks
- Reparameterization:
  - write a sample $z \sim q_\phi$ as a deterministic transformation $r(\cdot)$ of the parameters $\phi$ and noise:
    - $z = r(\epsilon, \phi)$
    - the noise $\epsilon$ is drawn from an auxiliary distribution $p$ with no dependence on $\phi$
  - allows the gradient to pass through the expectation and be approximated by MC
    - $\nabla_\phi \mathbb{E}_{z \sim q_\phi}[f(z)] = \mathbb{E}_{\epsilon \sim p}[\nabla_\phi f(r(\epsilon, \phi))]$
    - approximate with $\frac{1}{L} \sum_\ell \nabla_\phi f(r(\epsilon^\ell, \phi))$ with $\{\epsilon^\ell\} \sim p$
  - not always easy to find a reparameterization
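The difference between the two estimators is easy to check numerically. Below is a sketch (NumPy; the toy choices $q_\phi = N(\phi, 1)$ and $f(z) = z^2$ are mine, not from the notes above) where the true gradient is known in closed form: $\mathbb{E}_{z \sim N(\phi,1)}[z^2] = \phi^2 + 1$, so the true gradient is $2\phi$.

```python
# Toy comparison of the two gradient estimators (NumPy).
# q_phi = N(phi, 1) and f(z) = z^2 are illustrative choices, so the true
# objective is E[z^2] = phi^2 + 1 and the true gradient is 2 * phi.
import numpy as np

rng = np.random.default_rng(0)
phi, L = 1.5, 10_000

# Score-function estimator: average of f(z) * d/dphi log q_phi(z) = z^2 * (z - phi).
z = rng.normal(phi, 1.0, size=L)
score_samples = z**2 * (z - phi)

# Reparameterization estimator: z = phi + eps, so d/dphi f(z) = 2 * (phi + eps).
eps = rng.normal(0.0, 1.0, size=L)
reparam_samples = 2.0 * (phi + eps)

print(f"true gradient            = {2 * phi:.3f}")
print(f"score-function estimate  = {score_samples.mean():.3f} (per-sample var ~ {score_samples.var():.1f})")
print(f"reparameterization est.  = {reparam_samples.mean():.3f} (per-sample var ~ {reparam_samples.var():.1f})")
```

Both estimators are unbiased; the per-sample variance printed at the end is typically far larger for the score estimator, which is why it "needs tricks" (baselines, control variates) in practice.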
### Krishnan 2017, "On the challenges of learning with inference networks on sparse, high-dimensional data" (https://arxiv.org/pdf/1710.06085.pdf)

- typical VAE setup:
  - jointly optimize encoder parameters $\phi$ and decoder parameters $\theta$
  - encoder does well $\to$ gradients for $\theta$ are based on a tight lower bound on $\log p(x)$
  - encoder does poorly $\to$ bad $\theta$ updates
- SVI (2013):
  - $\lambda^i$ for each $q_{\lambda^i}(z^i)$ is optimized (effectively) until convergence before taking a step for $\theta$
- this paper: mix the two
  - use $\lambda^i = Encoder_\phi(x^i)$ as a _warm start_
  - optimize ELBO($\lambda^i$, $\theta$) w.r.t. $\lambda^i$ (effectively) until convergence, with $\theta$ fixed
  - once $q$ is reliable, differentiate the ELBO w.r.t. $\theta$ and update

### Cremer 2018, "Inference Suboptimality in Variational Autoencoders" (https://arxiv.org/abs/1801.03558)

- builds on the Krishnan work just above
- studies the _amortization gap_:
  - the overall inference gap between the log-likelihood and the ELBO splits into an _approximation gap_ (how well the best member of the variational family can match $p(z|x)$) and an _amortization gap_ (how much is lost by forcing $\lambda^i$ to be the output of a shared encoder)
  - the paper tries to isolate the effect of amortization from the effect of the expressiveness of the approximating distribution
  - requiring the variational parameters to be a parametric function of the input may be too strict a requirement
- if amortization isn't a problem, then even if the variational distribution is not expressive enough, the model can accommodate it

### Kim et al. 2018, "Semi-Amortized Variational Autoencoders" (http://proceedings.mlr.press/v80/kim18e.html)

- also concerned with the _amortization gap_
- the gap propagates into learning suboptimal decoder parameters $\theta$, based on a suboptimal $q$, as argued in Krishnan/Cremer
- backprops _through_ the SVI steps of the Krishnan algorithm!
- needs the total derivative to backprop through the SVI updates, and uses Hessian-vector product tricks to do this

### Shu et al. NeurIPS 2018, "Amortized Inference Regularization" (https://papers.nips.cc/paper/7692-amortized-inference-regularization.pdf)

## Below is very optional!

### Cremer 2017, "Reinterpreting Importance-Weighted Autoencoders" (ICLR 2017 workshop) (https://arxiv.org/pdf/1704.02916.pdf)

- a reinterpretation of Importance Weighted Autoencoders by Burda 2016 (https://arxiv.org/pdf/1509.00519.pdf)
- original view: tighten the bound on $p(x)$ by using importance sampling for the $z$ samples used to estimate the $q$ integral in the ELBO
- alternate view: the importance sampler actually induces a more expressive $q$, and the standard ELBO bound is used
- the VAE of Kingma/Welling optimizes this lower bound on $\log p(x)$:
$$ \log p(x) \geq \mathbb{E}_{z \sim q(z|x)} \Bigg[ \log \frac{p(x,z)}{q(z|x)} \Bigg] $$
- IWAE (Burda 2016) instead optimizes (see the numerical sketch at the end of these notes):
$$ \log p(x) \geq \mathbb{E}_{z_1,\ldots,z_K \sim q(z|x)} \Bigg[ \log \Big[ \frac{1}{K} \sum_{k=1}^K \frac{p(x,z_k)}{q(z_k|x)} \Big] \Bigg] $$
- observation: this induces a distribution $\tilde{q}(z|x,z_{2:K})$ for $z = z_1$:
$$ \tilde{q}(z|x,z_{2:K}) = \frac{p(x,z)/q(z|x)}{\frac{1}{K} \sum_{j=1}^K \frac{p(x,z_j)}{q(z_j|x)}} \, q(z|x) = \frac{p(x,z)}{\frac{1}{K} \Big( \frac{p(x,z)}{q(z|x)} + \sum_{j=2}^K \frac{p(x,z_j)}{q(z_j|x)} \Big)} $$
- for $K=1$, $\tilde{q}(z|x) = q(z|x)$
- for $K \geq 2$, $\tilde{q}(z|x,z_{2:K})$ depends on the true $p(z|x)$ through the weights
- $\mathbb{E}_{z_2,\ldots,z_K}[\tilde{q}(z|x,z_{2:K})] \to p(z|x)$ pointwise as $K \to \infty$
- conclusion: IWAE can be interpreted as creating a more powerful $q$ by embedding an MC sampler (self-normalized importance sampling) inside it

### Hoffman 2017, "Learning Deep Latent Gaussian Models with Markov Chain Monte Carlo" (ICML 2017) (http://proceedings.mlr.press/v70/hoffman17a/hoffman17a.pdf)

- same idea as the Cremer reinterpretation of IWAE: embed an HMC sampler into the variational approximation
  - to tighten the lower bound
  - improves $\theta$ updates
- variational parameters $\lambda^i = Encoder_\phi(x^i)$
  - $z_0 \sim q_{\lambda^i}$
  - the HMC sampler is initialized with $z_0$
  - the last sample $z_M$ from HMC is used in the ELBO computation
  - again, this induces a distribution over $z_M$
- "MCMC's great advantage is that it allows us to trade computation for accuracy without limit"
- also presents the problem of "variational pruning"
  - one way to decrease KL($q(z|x)\,\|\,p(z|x)$) is to make $p(x|z)$ depend less on some dimensions of $z$
  - bad because we specified the model to have a certain latent dimension
  - shallow models depend heavily on the latent dimension (fewer mixture components $\to$ weaker model)
  - but the power of deep models depends on both the latent dimension and decoder complexity
    - even with $z \in \mathbb{R}$, a deep model can approximate smooth densities on $\mathbb{R}^D$
- interesting, but it is unclear to me how the proposed algorithm specifically deals with pruning, rather than with the more general amortization/approximation gap
- they do present a nice check for the usefulness of latent dimensions, though: eigenvalues of the Jacobian of the likelihood parameters w.r.t. $z$

### Marino et al. 2018, ["Iterative Amortized Inference"](https://arxiv.org/pdf/1807.09356.pdf)

### Shu et al. 2019, ["Training Variational Autoencoders with Buffered Stochastic Variational Inference"](https://arxiv.org/pdf/1902.10294.pdf)
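As a closing illustration of the IWAE bound from the Cremer/Burda section above, here is a small numerical sketch (NumPy; the toy model $z \sim N(0,1)$, $x|z \sim N(z,1)$ and the deliberately crude proposal $q(z|x) = N(0,1)$ are my choices, not from any of the papers) in which $\log p(x)$ is available in closed form, so one can watch the bound tighten as $K$ grows.

```python
# Numerical sketch of the IWAE bound (NumPy). Toy model (illustrative choice):
# z ~ N(0,1), x|z ~ N(z,1), so log p(x) = log N(x; 0, 2) in closed form.
# Proposal q(z|x) = N(0,1) is deliberately crude so the K=1 bound (the ELBO) is loose.
import numpy as np

def log_normal(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

rng = np.random.default_rng(0)
x, n_runs = 1.7, 2000  # observed datapoint; MC repetitions of the outer expectation

def iwae_bound(K):
    z = rng.normal(0.0, 1.0, size=(n_runs, K))     # z_1..z_K ~ q(z|x) = N(0,1)
    log_w = (log_normal(z, 0.0, 1.0)                # log p(z)
             + log_normal(x, z, 1.0)                # + log p(x|z)
             - log_normal(z, 0.0, 1.0))             # - log q(z|x)
    # log [ (1/K) sum_k w_k ] computed with a log-sum-exp shift for stability
    m = log_w.max(axis=1, keepdims=True)
    log_mean_w = (m + np.log(np.exp(log_w - m).mean(axis=1, keepdims=True))).squeeze(1)
    return log_mean_w.mean()

print(f"log p(x)         = {log_normal(x, 0.0, 2.0):.4f}")
for K in (1, 10, 100):
    print(f"IWAE bound, K={K:3d}: {iwae_bound(K):.4f}")  # K=1 is the standard ELBO
```

With $K=1$ this is the standard ELBO; increasing $K$ moves the estimate toward $\log p(x)$, which is the "more expressive $\tilde{q}$" effect described in the Cremer reinterpretation above.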