# Laplace VAE
## Autoencoder formulation
[(Modified from this formulation)](http://users.umiacs.umd.edu/~xyang35/files/understanding-variational-lower.pdf)
A plain autoencoder learns an encoder/decoder pair by minimizing the reconstruction error between the input $x$ and its reconstruction $\hat{x}$:
$$
\min ||\hat{x} - x||_2^2
$$
#### Problem Formulation
Assume $x$ are observations (data) generated by an underlying latent factor $z$ which we don't get to observe. We want to learn the distribution over $z$ using our data. In other words, we want to infer the posterior $p(z|x)$.
By Bayes' rule we can write it as:
$$
p(z|x) = \frac{p(z,x)}{p(x)}
$$
#### Intractable evidence
However, $p(x) = \int_{z}p(x,z)dz$ which is intractable for $z \in \mathbb{R}^n$.
#### Estimating using VI
Instead, we use a tool called variational inference, in which we learn a function $q(z|x)$ to approximate $p(z|x)$. Variational inference does this by framing this problem as optimization where we minimize the KL-divergence $D_{KL}(q||p)$.
$$
D_{KL}(q(z|x) || p(z|x)) = \int_{z}q(z|x) \log \frac{q(z|x)}{p(z|x)} dz
$$
#### VAE loss function derivation
We can derive the VAE formulation by simplifying the above.
\begin{align}
D_{KL}(q(z|x) || p(z|x)) &= \int_{z}q(z|x) \log \frac{q(z|x)}{p(z|x)} dz\\
&= -\mathbb{E}_{z \sim q(z|x)} \left[ \log(p(z|x)) - \log(q(z|x)) \right]\\
&= -\mathbb{E}_{z \sim q(z|x)} \left[ \log\left(\frac{p(x|z)p(z)}{p(x)}\right) - \log(q(z|x)) \right]\\
&= -\mathbb{E}_{z \sim q(z|x)} \left[ \log(p(x|z)) + \log(p(z)) - \log(p(x)) - \log(q(z|x)) \right]\\
&= -\mathbb{E}_{z \sim q(z|x)} \left[ \log(p(x|z)) + \log\left(\frac{p(z)}{q(z|x)}\right) \right] + \log(p(x))\\
D_{KL}(q(z|x) || p(z|x)) - \log(p(x)) &= -\mathbb{E}_{z \sim q(z|x)} \log(p(x|z)) + \mathbb{E}_{z \sim q(z|x)} \log\left(\frac{q(z|x)}{p(z)}\right)
\end{align}
Since $\log(p(x))$ does not depend on $q$, minimizing the right-hand side (the VAE loss, i.e. the negative ELBO) also minimizes the KL divergence between $q(z|x)$ and the true posterior.
---
### Interpreting the VAE loss
The first term is the reconstruction loss.
$$
-\mathbb{E}_{z \sim q(z|x)} \log(p(x|z))
$$
Practically this means we use an encoder $f(x)$ to parametrize a distribution over $z$ via $\mu, \sigma = f(x)$, i.e. $q(z|x) = \mathcal{N}(\mu(x), \sigma(x)^2)$.

From $q$, we sample $\tilde{z} \sim q(z|x)$.

That's the $\mathbb{E}_{z \sim q(z|x)}$ part of this term. We pass $\tilde{z}$ through the decoder to get $\hat{x}$ (the predicted image, i.e. the parameters of $p(x|\tilde{z})$) and evaluate $\log p(x|\tilde{z})$. By minimizing this term, we encourage the reconstruction $\hat{x}$ to be close to $x$.

The second term
$$
\mathbb{E}_{z \sim q(z|x)} \log(\frac{q(z|x)}{p(z)})
$$
can be further understood as:
\begin{align}
\mathbb{E}_{z \sim q(z|x)} \log\left(\frac{q(z|x)}{p(z)}\right) &= \mathbb{E}_{z \sim q(z|x)} \log(q(z|x)) - \mathbb{E}_{z \sim q(z|x)}\log(p(z))\\
&= -H[q] - \mathbb{E}_{z \sim q(z|x)}\log(p(z))
\end{align}
This whole term is exactly $D_{KL}(q(z|x) || p(z))$. The first piece, $-H[q]$, is the negative entropy of $q$: minimizing it maximizes the entropy (and hence the variance) of $q$ (ie: making the ball wider).

When we generated $\tilde{z}$ we had to sample from $q$. With a Gaussian posterior $q(z|x) = \mathcal{N}(\mu(x), \sigma(x)^2)$, we sample by using the following property: $\tilde{z} = \mu(x) + \sigma(x)\cdot \epsilon$ where $\epsilon \sim \mathcal{N}(0, 1)$. This is called the reparameterization trick.
Thus we can view the prior $p(z) = \mathcal{N}(0, 1)$ as just another distribution over the $z$ space.
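As a minimal sketch in PyTorch (the tensor names `mu` and `log_sigma` are assumptions, not part of the original notes):

```python
import torch

def reparameterize(mu: torch.Tensor, log_sigma: torch.Tensor) -> torch.Tensor:
    """Sample z ~ N(mu, sigma^2) while keeping gradients w.r.t. mu and sigma."""
    eps = torch.randn_like(mu)         # eps ~ N(0, 1), needs no gradient
    return mu + log_sigma.exp() * eps  # z = mu + sigma * eps
```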

And thus the other piece, $-\mathbb{E}_{z \sim q(z|x)}\log(p(z))$, is minimized by maximizing the probability of $\tilde{z}$ under the prior $p(z)$.

This means we end up pushing each $q(z|x)$ to be as wide as possible while staying in regions where the prior has high density.
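Putting both terms together, here is a minimal sketch of the resulting loss in PyTorch. The `encoder`/`decoder` modules, the unit-variance Gaussian decoder, and the standard normal prior are assumptions made for illustration:

```python
import torch
from torch.distributions import Normal, kl_divergence

def vae_loss(x, encoder, decoder):
    # Encoder parametrizes q(z|x) = N(mu, sigma^2).
    mu, log_sigma = encoder(x)
    q = Normal(mu, log_sigma.exp())
    z = q.rsample()  # reparameterized sample, so gradients flow through mu and sigma

    # Reconstruction term: -E_q[log p(x|z)], here with a unit-variance Gaussian decoder.
    x_hat = decoder(z)
    recon = -Normal(x_hat, 1.0).log_prob(x).sum(dim=-1)

    # Second term: E_q[log q(z|x) - log p(z)] = KL(q(z|x) || p(z)) with p(z) = N(0, 1).
    prior = Normal(torch.zeros_like(mu), torch.ones_like(mu))
    kl = kl_divergence(q, prior).sum(dim=-1)

    return (recon + kl).mean()
```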

| Inference posterior $q(z \mid x)$ | Learned prior $p(z)$ |
|---|---|
| Gaussian | Gaussian |
| Gaussian | Laplace |
| - | Laplace |
| Laplace | Laplace |
| Masked code + Gaussian | Laplace |
### TODO: why we want to consider these options
---
## Laplace Distribution
### PDF
$$
\frac{1}{2b} \exp \left( - \frac{|x - \mu|}{b} \right)
$$
### CDF
if $x \leq \mu$
$$
\frac{1}{2} \exp \left(\frac{x - \mu}{b} \right)
$$
if $x \geq \mu$
$$
1 - \frac{1}{2} \exp \left(-\frac{x - \mu}{b} \right)
$$
### Reparameterization
PyTorch appears to support this out of the box (`torch.distributions.Laplace` implements `rsample`). More generally, the $\frac{x-\mu}{b}$ form shows the Laplace is a location-scale family, so the reparameterization trick also works here: $\tilde{z} = \mu + b \cdot \epsilon$ with $\epsilon \sim \text{Laplace}(0, 1)$.
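A small sketch of both routes, assuming PyTorch's `torch.distributions.Laplace`; the manual version draws standard Laplace noise via the inverse CDF of a uniform:

```python
import torch
from torch.distributions import Laplace

mu = torch.zeros(4, requires_grad=True)
b = torch.ones(4, requires_grad=True)

# Built-in reparameterized sampling.
z = Laplace(mu, b).rsample()

# Manual reparameterization: eps ~ Laplace(0, 1) via the inverse CDF of u ~ Uniform(-1/2, 1/2).
u = torch.rand_like(mu) - 0.5
eps = -u.sign() * torch.log1p(-2 * u.abs())
z_manual = mu + b * eps
```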
### Monte Carlo Estimate
Sometimes we don't have a tractable way of calculating an expectation such as:
$$
\mathbb{E}_{x \sim f}\left[g(x)\right] = \int g(x) f(x)dx
$$
In this case, we can approximate it with samples $x_i \sim f$:
$$
\int g(x) f(x)dx \approx \frac{1}{N} \sum_{i=1}^{N} g(x_i)
$$
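As a quick illustration (the choice of $g(x) = |x|$ and a standard Laplace is an arbitrary example): the exact value of $\mathbb{E}[|x|]$ under $\text{Laplace}(0, b)$ is $b$, so the estimate should be close to 1.

```python
import torch
from torch.distributions import Laplace

# Monte Carlo estimate of E[g(x)] with g(x) = |x| and x ~ Laplace(0, 1).
dist = Laplace(0.0, 1.0)
x = dist.sample((100_000,))
print(x.abs().mean())  # approximately 1.0
```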
### Monte carlo estimate of KL-divergence
The KL divergence is:
$$
D_{KL} = \sum p(x)\left[\log p(x) - \log q(x)\right]
$$
This is the expectation, under $p$, of the log ratio
$$
D_{KL} = \mathbb{E}_{x \sim p} \left[\log p(x) - \log q(x)\right]
$$
Which means we can use the Monte Carlo estimate to approximate it with samples $x_i \sim p$:
$$
\mathbb{E}_{x \sim p} \left[\log p(x) - \log q(x)\right] \approx \frac{1}{N}\sum_{i=1}^{N} \left[\log p(x_i) - \log q(x_i)\right]
$$
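A small sketch with two hypothetical Gaussians, comparing the Monte Carlo estimate against the closed form PyTorch provides for this pair:

```python
import torch
from torch.distributions import Normal, kl_divergence

# Monte Carlo estimate of D_KL(p || q) using samples x_i ~ p.
p = Normal(0.0, 1.0)
q = Normal(1.0, 2.0)

x = p.sample((100_000,))
kl_mc = (p.log_prob(x) - q.log_prob(x)).mean()

print(kl_mc)                # Monte Carlo estimate
print(kl_divergence(p, q))  # closed form, for comparison
```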
## Evaluating performance
How do we know whether one model approximates $p(x)$ (the distribution of the data) better than another?
In short, the literature evaluates using two metrics [[1]](https://arxiv.org/pdf/1511.01844.pdf):
1. Inception score.
2. Negative log-likelihood per pixel (in nats).
### Nats
### Inception score
## Scratch work
### Marginal likelihood given the data
Caveats:
1. Pixels are integers but models use densities. Therefore a good practice is to add real-valued noise to the integer pixels to dequantize the data [1]. (This makes it possible to compare log-likelihoods between a discrete and a continuous model.)
The following formula converts a pixel intensity (0-255) to a probability mass by integrating a logistic mixture over the bin $[x - 0.5, x + 0.5]$:
$$
P(x \mid \pi, \mu, s) = \sum_{i=1}^{K} \pi_i \left[\sigma \left(\frac{x+0.5-\mu_i}{s_i}\right) - \sigma \left(\frac{x-0.5-\mu_i}{s_i}\right) \right]
$$
**Note**: Looks like a softmax works best? [2]. I.e. a 256-way softmax mapping to the correct pixel probability. (nats = negative log-likelihood, the measure used to evaluate.)


**Note**: Then PixelCNN++ added a better way? [3]: a discretized logistic mixture likelihood on pixels (see the sketch after this list).

2. High likelihood can still lead to bad samples.
3. Likelihood and visual appearance are largely independent.
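A minimal sketch of the discretized logistic likelihood for a single mixture component (mixture weights and the PixelCNN++ edge cases at 0 and 255 are omitted; all names here are assumptions):

```python
import torch

def discretized_logistic_log_prob(x, mu, s):
    """Log-probability of an integer pixel x in [0, 255] under a logistic(mu, s)
    integrated over the bin [x - 0.5, x + 0.5]."""
    cdf_plus = torch.sigmoid((x + 0.5 - mu) / s)
    cdf_minus = torch.sigmoid((x - 0.5 - mu) / s)
    return torch.log((cdf_plus - cdf_minus).clamp(min=1e-12))
```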
### Importance sampling
Another way to estimate the marginal likelihood is through importance sampling.
Recall that we want to estimate:
$$
p(x) = \int_z p(x|z)p(z)dz
$$
Where $p(x)$ is the distribution of our data.
We can use importance sampling as follows:
\begin{align}
p(x) &= \int_z p_{\theta}(x|z)p(z)dz\\
&= \mathbb{E}_{z \sim p(z)}\left[p_{\theta}(x|z)\right]\\
&= \int_z q_{\phi}(z|x) \frac{p_{\theta}(x|z)\,p(z)}{q_{\phi}(z|x)}dz\\
&= \mathbb{E}_{z \sim q_{\phi}(z|x)}\left[\frac{p_{\theta}(x|z)\,p(z)}{q_{\phi}(z|x)} \right]
\end{align}
Where:
- $p(x)$ is what we want to estimate (intractable to compute directly),
- $p(z)$ is the nominal distribution,
- $q_{\phi}(z|x)$ is the proposal distribution,
- $f(z) = p_{\theta}(x|z)$ is the function whose expectation we take (the decoder likelihood).

The optimal proposal $q_{\phi}(z|x)$ is proportional to the nominal pdf multiplied by the function, i.e. $p(z) \cdot p_{\theta}(x|z)$, which is proportional to the true posterior $p(z|x)$.
Concretely, this means that we can evaluate the likelihood as follows:
1. Sample $z_i \sim q_{\phi}(z|x)$.
2. Decode each sample to get the parameters of $p_{\theta}(x|z_i)$ (i.e. $\hat{x}_i$).
3. Turn the pixels into a density/mass using the (discretized) logistic distributions.
4. Average the importance weights $\frac{p_{\theta}(x|z_i)\,p(z_i)}{q_{\phi}(z_i|x)}$ and take the log to get the log-likelihood.
5. Divide by the number of pixels to get nats per pixel.
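A sketch of this estimator in PyTorch, working in log space with `logsumexp` for numerical stability. The `encoder`/`decoder` modules and the Gaussian choices for the posterior, prior, and decoder are assumptions for illustration:

```python
import math
import torch
from torch.distributions import Normal

def log_px_importance_sampling(x, encoder, decoder, num_samples=64):
    """Estimate log p(x) ~ logsumexp_i[log p(x|z_i) + log p(z_i) - log q(z_i|x)] - log N."""
    mu, log_sigma = encoder(x)
    q = Normal(mu, log_sigma.exp())
    prior = Normal(torch.zeros_like(mu), torch.ones_like(mu))

    log_w = []
    for _ in range(num_samples):
        z = q.rsample()
        x_hat = decoder(z)
        log_px_z = Normal(x_hat, 1.0).log_prob(x).sum(dim=-1)  # decoder likelihood
        log_pz = prior.log_prob(z).sum(dim=-1)
        log_qz = q.log_prob(z).sum(dim=-1)
        log_w.append(log_px_z + log_pz - log_qz)

    log_w = torch.stack(log_w, dim=0)  # shape: (num_samples, batch)
    return torch.logsumexp(log_w, dim=0) - math.log(num_samples)
```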
### Self-normalized importance sampling.
### Bits per pixel
$$
-\log_2 p(x) / \text{pixels}
$$
Equivalently, nats per pixel divided by $\ln 2$.

### The reconstruction error.
1. Can look at nearest neighbors to see quality of sample [1].
### Parzen window estimate [1]
If the log-likelihood is unavailable we can use this:
1. Generate samples from the model.
2. Construct a tractable model (a density estimator such as a Gaussian kernel) from those samples.
3. Evaluate the test log-likelihood under this model as a proxy for the true model likelihood.
The problem is that when the dimension is high, the estimates are not even close to the true log-likelihood.
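A minimal sketch using scipy's Gaussian kernel density estimator as the tractable stand-in; the sample arrays here are placeholders:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Placeholder "model samples" and "test data": rows are samples, columns are dimensions.
model_samples = np.random.randn(1000, 10)
test_data = np.random.randn(100, 10)

# gaussian_kde expects arrays of shape (num_dims, num_samples).
kde = gaussian_kde(model_samples.T)
proxy_log_likelihood = kde.logpdf(test_data.T).mean()
print(proxy_log_likelihood)
```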
### Idea 1: ResNet classification score
Use a pretrained ResNet to "classify" an image from model A vs. the same image from model B. The better model is the one with the more accurate classification score, or the score closest to what the ResNet gets for that class.
## Key goals
1. Learn a distribution over natural images.
2. Tractably compute likelihood of new images.
3. Generate new images.
---
## References
[[1]](https://arxiv.org/pdf/1511.01844.pdf) Theis, Lucas, Aäron van den Oord, and Matthias Bethge. "A note on the evaluation of generative models." arXiv preprint arXiv:1511.01844 (2015).
[[2]](https://arxiv.org/pdf/1601.06759.pdf) van den Oord, Aäron, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel Recurrent Neural Networks." arXiv preprint arXiv:1601.06759 (2016).
[[3]](https://arxiv.org/pdf/1701.05517.pdf) Salimans, Tim, et al. "PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications." arXiv preprint arXiv:1701.05517 (2017).
[[4]](https://arxiv.org/pdf/1310.1757.pdf) Uria, Benigno, Iain Murray, and Hugo Larochelle. "A Deep and Tractable Density Estimator." arXiv preprint arXiv:1310.1757 (2013).
[[5]](https://statweb.stanford.edu/~owen/mc/Ch-var-is.pdf) Owen, Art B. "Importance Sampling." Chapter in Monte Carlo theory, methods and examples (2013).