# Laplace VAE
## Autoencoder formulation
[(Modified from this formulation)](http://users.umiacs.umd.edu/~xyang35/files/understanding-variational-lower.pdf)
A plain autoencoder learns an encoder/decoder pair by minimizing the reconstruction error between the input $x$ and its reconstruction $\hat{x}$:
$$
\min ||\hat{x} - x||_2^2
$$
#### Problem Formulation
Assume $x$ are observations (data) generated by an underlying latent factor $z$ which we don't get to observe. We want to learn the distribution over $z$ using our data. In other words, we want to infer the posterior $p(z|x)$.
By Bayes' rule we can write it as:
$$
p(z|x) = \frac{p(z,x)}{p(x)}
$$
#### Intractable evidence
However, $p(x) = \int_{z}p(x,z)dz$ which is intractable for $z \in \mathbb{R}^n$.
#### Estimating using VI
Instead, we use a tool called variational inference, in which we learn a function $q(z|x)$ to approximate $p(z|x)$. Variational inference does this by framing this problem as optimization where we minimize the KL-divergence $D_{KL}(q||p)$.
$$
D_{KL}(q(z|x) || p(z|x)) = \int_{z}q(z|x) \log \frac{q(z|x)}{p(z|x)} dz
$$
#### VAE loss function derivation
We can derive the VAE formulation by simplifying the above.
\begin{align}
D_{KL}(q(z|x) || p(z|x)) &= \int_{z}q(z|x) \log \frac{q(z|x)}{p(z|x)} dz\\
&= -\mathbb{E}_{z \sim q(z|x)} \left[ \log(p(z|x)) - \log(q(z|x)) \right]\\
&= -\mathbb{E}_{z \sim q(z|x)} \left[ \log\left(\frac{p(x|z)p(z)}{p(x)}\right) - \log(q(z|x)) \right]\\
&= -\mathbb{E}_{z \sim q(z|x)} \left[ \log(p(x|z)) + \log(p(z)) - \log(p(x)) - \log(q(z|x)) \right]\\
&= -\mathbb{E}_{z \sim q(z|x)} \left[ \log(p(x|z)) + \log\left(\frac{p(z)}{q(z|x)}\right) \right] + \log(p(x))\\
D_{KL}(q(z|x) || p(z|x)) - \log(p(x)) &= -\mathbb{E}_{z \sim q(z|x)} \log(p(x|z)) + \mathbb{E}_{z \sim q(z|x)} \log\left(\frac{q(z|x)}{p(z)}\right)
\end{align}
Since $\log(p(x))$ does not depend on $q$, minimizing the right-hand side (the VAE loss, i.e. the negative ELBO) also minimizes the KL divergence between $q(z|x)$ and the true posterior.
---
### Interpreting the VAE loss
The first term is the reconstruction loss.
$$
-\mathbb{E}_{z \sim q(z|x)} \log(p(x|z))
$$
Practically this means we use an encoder $f(x)$ to parametrize a distribution over $z$ via $\mu, \sigma = f(x)$, i.e. $q(z|x) = \mathcal{N}(\mu(x), \sigma(x)^2)$.

From $q$, we sample $\tilde{z} \sim q(z|x)$.

That's the $\mathbb{E}_{z \sim q(z|x)}$ part of this term. We pass $\tilde{z}$ through the decoder to get $\hat{x}$ (the predicted image, i.e. the parameters of $p(x|\tilde{z})$) and evaluate $\log p(x|\tilde{z})$. By minimizing this term, we encourage the reconstruction $\hat{x}$ to be close to $x$.

The second term
$$
\mathbb{E}_{z \sim q(z|x)} \log(\frac{q(z|x)}{p(z)})
$$
can be further understood as:
\begin{align}
\mathbb{E}_{z \sim q(z|x)} \log\left(\frac{q(z|x)}{p(z)}\right) &= \mathbb{E}_{z \sim q(z|x)} \log(q(z|x)) - \mathbb{E}_{z \sim q(z|x)}\log(p(z))\\
&= -H[q] - \mathbb{E}_{z \sim q(z|x)}\log(p(z))
\end{align}
This whole term is exactly $D_{KL}(q(z|x) || p(z))$. The first piece, $-H[q]$, is the negative entropy of $q$: minimizing it maximizes the entropy (and hence the variance) of $q$ (ie: making the ball wider).

When we generated $\tilde{z}$ we had to sample from $q$. With a Gaussian posterior $q(z|x) = \mathcal{N}(\mu(x), \sigma(x)^2)$, we sample by using the following property: $\tilde{z} = \mu(x) + \sigma(x)\cdot \epsilon$ where $\epsilon \sim \mathcal{N}(0, 1)$. This is called the reparameterization trick.
Thus we can view the prior $p(z) = \mathcal{N}(0, 1)$ as just another distribution over the $z$ space.
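As a minimal sketch in PyTorch (the tensor names `mu` and `log_sigma` are assumptions, not part of the original notes):

```python
import torch

def reparameterize(mu: torch.Tensor, log_sigma: torch.Tensor) -> torch.Tensor:
    """Sample z ~ N(mu, sigma^2) while keeping gradients w.r.t. mu and sigma."""
    eps = torch.randn_like(mu)         # eps ~ N(0, 1), needs no gradient
    return mu + log_sigma.exp() * eps  # z = mu + sigma * eps
```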

And thus the other piece, $-\mathbb{E}_{z \sim q(z|x)}\log(p(z))$, is minimized by maximizing the probability of $\tilde{z}$ under the prior $p(z)$.

This means we end up pushing each $q(z|x)$ to be as wide as possible while staying in regions where the prior has high density.
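Putting both terms together, here is a minimal sketch of the resulting loss in PyTorch. The `encoder`/`decoder` modules, the unit-variance Gaussian decoder, and the standard normal prior are assumptions made for illustration:

```python
import torch
from torch.distributions import Normal, kl_divergence

def vae_loss(x, encoder, decoder):
    # Encoder parametrizes q(z|x) = N(mu, sigma^2).
    mu, log_sigma = encoder(x)
    q = Normal(mu, log_sigma.exp())
    z = q.rsample()  # reparameterized sample, so gradients flow through mu and sigma

    # Reconstruction term: -E_q[log p(x|z)], here with a unit-variance Gaussian decoder.
    x_hat = decoder(z)
    recon = -Normal(x_hat, 1.0).log_prob(x).sum(dim=-1)

    # Second term: E_q[log q(z|x) - log p(z)] = KL(q(z|x) || p(z)) with p(z) = N(0, 1).
    prior = Normal(torch.zeros_like(mu), torch.ones_like(mu))
    kl = kl_divergence(q, prior).sum(dim=-1)

    return (recon + kl).mean()
```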

| Inference posterior $q(z \mid x)$ | Learned prior $p(z)$ |
|---|---|
| Gaussian | Gaussian |
| Gaussian | Laplace |
| - | Laplace |
| Laplace | Laplace |
| Masked code + Gaussian | Laplace |
### TODO: why we want to consider these options
---
## Laplace Distribution
### PDF
$$
\frac{1}{2b} \exp \left( - \frac{|x - \mu|}{b} \right)
$$
### CDF
if $x \leq \mu$
$$
\frac{1}{2} \exp \left(\frac{x - \mu}{b} \right)
$$
if $x \geq \mu$
$$
1 - \frac{1}{2} \exp \left(-\frac{x - \mu}{b} \right)
$$
### Reparameterization
PyTorch appears to support this out of the box (`torch.distributions.Laplace` implements `rsample`). More generally, the $\frac{x-\mu}{b}$ form shows the Laplace is a location-scale family, so the reparameterization trick also works here: $\tilde{z} = \mu + b \cdot \epsilon$ with $\epsilon \sim \text{Laplace}(0, 1)$.
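A small sketch of both routes, assuming PyTorch's `torch.distributions.Laplace`; the manual version draws standard Laplace noise via the inverse CDF of a uniform:

```python
import torch
from torch.distributions import Laplace

mu = torch.zeros(4, requires_grad=True)
b = torch.ones(4, requires_grad=True)

# Built-in reparameterized sampling.
z = Laplace(mu, b).rsample()

# Manual reparameterization: eps ~ Laplace(0, 1) via the inverse CDF of u ~ Uniform(-1/2, 1/2).
u = torch.rand_like(mu) - 0.5
eps = -u.sign() * torch.log1p(-2 * u.abs())
z_manual = mu + b * eps
```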
### Monte Carlo Estimate
Sometimes we don't have a tractable way of calculating an expectation such as:
$$
\mathbb{E}_{x \sim f}\left[g(x)\right] = \int g(x) f(x)dx
$$
In this case, we can approximate it with samples $x_i \sim f$:
$$
\int g(x) f(x)dx \approx \frac{1}{N} \sum_{i=1}^{N} g(x_i)
$$
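As a quick illustration (the choice of $g(x) = |x|$ and a standard Laplace is an arbitrary example): the exact value of $\mathbb{E}[|x|]$ under $\text{Laplace}(0, b)$ is $b$, so the estimate should be close to 1.

```python
import torch
from torch.distributions import Laplace

# Monte Carlo estimate of E[g(x)] with g(x) = |x| and x ~ Laplace(0, 1).
dist = Laplace(0.0, 1.0)
x = dist.sample((100_000,))
print(x.abs().mean())  # approximately 1.0
```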
### Monte carlo estimate of KL-divergence
The KL divergence is:
$$
D_{KL} = \sum p(x)\left[\log p(x) - \log q(x)\right]
$$
This is the expectation, under $p$, of the log ratio
$$
D_{KL} = \mathbb{E}_{x \sim p} \left[\log p(x) - \log q(x)\right]
$$
Which means we can use the Monte Carlo estimate to approximate it with samples $x_i \sim p$:
$$
\mathbb{E}_{x \sim p} \left[\log p(x) - \log q(x)\right] \approx \frac{1}{N}\sum_{i=1}^{N} \left[\log p(x_i) - \log q(x_i)\right]
$$
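A small sketch with two hypothetical Gaussians, comparing the Monte Carlo estimate against the closed form PyTorch provides for this pair:

```python
import torch
from torch.distributions import Normal, kl_divergence

# Monte Carlo estimate of D_KL(p || q) using samples x_i ~ p.
p = Normal(0.0, 1.0)
q = Normal(1.0, 2.0)

x = p.sample((100_000,))
kl_mc = (p.log_prob(x) - q.log_prob(x)).mean()

print(kl_mc)                # Monte Carlo estimate
print(kl_divergence(p, q))  # closed form, for comparison
```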
## Evaluating performance
How do we know whether one model approximates $p(x)$ (the distribution of the data) better than another?
In short, the literature evaluates using two metrics [[1]](https://arxiv.org/pdf/1511.01844.pdf):
1. Inception score.
2. Negative log-likelihood per pixel (in nats).
### Nats
### Inception score
## Scratch work
### Marginal likelihood given the data
Caveats:
1. Pixels are integers but models use densities. Therefore a good practice is to add real-valued noise to the integer pixels to dequantize the data [1]. (This makes it possible to compare log-likelihoods between a discrete and a continuous model.)
The following formula converts a pixel intensity (0-255) to a probability mass by integrating a logistic mixture over the bin $[x - 0.5, x + 0.5]$:
$$
P(x \mid \pi, \mu, s) = \sum_{i=1}^{K} \pi_i \left[\sigma \left(\frac{x+0.5-\mu_i}{s_i}\right) - \sigma \left(\frac{x-0.5-\mu_i}{s_i}\right) \right]
$$
**Note**: Looks like a softmax works best? [2]. I.e. a 256-way softmax mapping to the correct pixel probability. (nats = negative log-likelihood, the measure used to evaluate.)


**Note**: Then PixelCNN++ added a better way? [3]: a discretized logistic mixture likelihood on pixels (see the sketch after this list).

2. High likelihood can still lead to bad samples.
3. Likelihood and visual appearance are largely independent.
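A minimal sketch of the discretized logistic likelihood for a single mixture component (mixture weights and the PixelCNN++ edge cases at 0 and 255 are omitted; all names here are assumptions):

```python
import torch

def discretized_logistic_log_prob(x, mu, s):
    """Log-probability of an integer pixel x in [0, 255] under a logistic(mu, s)
    integrated over the bin [x - 0.5, x + 0.5]."""
    cdf_plus = torch.sigmoid((x + 0.5 - mu) / s)
    cdf_minus = torch.sigmoid((x - 0.5 - mu) / s)
    return torch.log((cdf_plus - cdf_minus).clamp(min=1e-12))
```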
### Importance sampling
Another way to estimate the marginal likelihood is through importance sampling.
Recall that we want to estimate:
$$
p(x) = \int_z p(x|z)p(z)dz
$$
Where $p(x)$ is the distribution of our data.
We can use importance sampling as follows:
\begin{align}
p(x) &= \int_z p_{\theta}(x|z)p(z)dz\\
&= \mathbb{E}_{z \sim p(z)}\left[p_{\theta}(x|z)\right]\\
&= \int_z q_{\phi}(z|x) \frac{p_{\theta}(x|z)\,p(z)}{q_{\phi}(z|x)}dz\\
&= \mathbb{E}_{z \sim q_{\phi}(z|x)}\left[\frac{p_{\theta}(x|z)\,p(z)}{q_{\phi}(z|x)} \right]
\end{align}
Where:
- $p(x)$ is what we want to estimate (intractable to compute directly),
- $p(z)$ is the nominal distribution,
- $q_{\phi}(z|x)$ is the proposal distribution,
- $f(z) = p_{\theta}(x|z)$ is the function whose expectation we take (the decoder likelihood).

The optimal proposal $q_{\phi}(z|x)$ is proportional to the nominal pdf multiplied by the function, i.e. $p(z) \cdot p_{\theta}(x|z)$, which is proportional to the true posterior $p(z|x)$.
Concretely, this means that we can evaluate the likelihood as follows:
1. Sample $z_i \sim q_{\phi}(z|x)$.
2. Decode each sample to get the parameters of $p_{\theta}(x|z_i)$ (i.e. $\hat{x}_i$).
3. Turn the pixels into a density/mass using the (discretized) logistic distributions.
4. Average the importance weights $\frac{p_{\theta}(x|z_i)\,p(z_i)}{q_{\phi}(z_i|x)}$ and take the log to get the log-likelihood.
5. Divide by the number of pixels to get nats per pixel.
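A sketch of this estimator in PyTorch, working in log space with `logsumexp` for numerical stability. The `encoder`/`decoder` modules and the Gaussian choices for the posterior, prior, and decoder are assumptions for illustration:

```python
import math
import torch
from torch.distributions import Normal

def log_px_importance_sampling(x, encoder, decoder, num_samples=64):
    """Estimate log p(x) ~ logsumexp_i[log p(x|z_i) + log p(z_i) - log q(z_i|x)] - log N."""
    mu, log_sigma = encoder(x)
    q = Normal(mu, log_sigma.exp())
    prior = Normal(torch.zeros_like(mu), torch.ones_like(mu))

    log_w = []
    for _ in range(num_samples):
        z = q.rsample()
        x_hat = decoder(z)
        log_px_z = Normal(x_hat, 1.0).log_prob(x).sum(dim=-1)  # decoder likelihood
        log_pz = prior.log_prob(z).sum(dim=-1)
        log_qz = q.log_prob(z).sum(dim=-1)
        log_w.append(log_px_z + log_pz - log_qz)

    log_w = torch.stack(log_w, dim=0)  # shape: (num_samples, batch)
    return torch.logsumexp(log_w, dim=0) - math.log(num_samples)
```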
### Self-normalized importance sampling.
### Bits per pixel
$$
-\log_2 p(x) / \text{pixels}
$$
Equivalently, nats per pixel divided by $\ln 2$.

### The reconstruction error.
1. Can look at nearest neighbors to see quality of sample [1].
### Parzen window estimate [1]
If the log-likelihood is unavailable we can use this:
1. Generate samples from the model.
2. Construct a tractable model (a density estimator such as a Gaussian kernel) from those samples.
3. Evaluate the test log-likelihood under this model as a proxy for the true model likelihood.
The problem is that when the dimension is high, the estimates are not even close to the true log-likelihood.
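A minimal sketch using scipy's Gaussian kernel density estimator as the tractable stand-in; the sample arrays here are placeholders:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Placeholder "model samples" and "test data": rows are samples, columns are dimensions.
model_samples = np.random.randn(1000, 10)
test_data = np.random.randn(100, 10)

# gaussian_kde expects arrays of shape (num_dims, num_samples).
kde = gaussian_kde(model_samples.T)
proxy_log_likelihood = kde.logpdf(test_data.T).mean()
print(proxy_log_likelihood)
```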
### Idea 1: ResNet classification score
Use a pretrained ResNet to "classify" an image from model A vs. the same image from model B. The better model is the one with the more accurate classification score, or the score closest to what the ResNet gets for that class.
## Key goals
1. Learn a distribution over natural images.
2. Tractably compute likelihood of new images.
3. Generate new images.
---
## References
[[1]](https://arxiv.org/pdf/1511.01844.pdf) Theis, Lucas, Aäron van den Oord, and Matthias Bethge. "A note on the evaluation of generative models." arXiv preprint arXiv:1511.01844 (2015).
[[2]](https://arxiv.org/pdf/1601.06759.pdf) van den Oord, Aäron, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel Recurrent Neural Networks." arXiv preprint arXiv:1601.06759 (2016).
[[3]](https://arxiv.org/pdf/1701.05517.pdf) Salimans, Tim, et al. "PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications." arXiv preprint arXiv:1701.05517 (2017).
[[4]](https://arxiv.org/pdf/1310.1757.pdf) Uria, Benigno, Iain Murray, and Hugo Larochelle. "A Deep and Tractable Density Estimator." arXiv preprint arXiv:1310.1757 (2013).
[[5]](https://statweb.stanford.edu/~owen/mc/Ch-var-is.pdf) Owen, Art B. "Importance Sampling." Chapter in Monte Carlo theory, methods and examples (2013).