# Variational Information Bottleneck, $\beta$-Variational Autoencoders
We first define a few terms.
## KL Divergence
$$D_{KL}(p(x) || q(x)) = E_{x \sim p(x)} [\log{\dfrac{p(x)}{q(x)}}]$$
*Intuition*
1. $D_{KL}(p(x) || q(x))$ is a measure of the information gain needed to move from $q(x)$ to $p(x)$.
2. If the two distributions are identical, $D_{KL}$ is $0$.
3. $D_{KL} \geq 0$.
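As a quick numerical check, here is a minimal numpy sketch of $D_{KL}$ for discrete distributions (the function and the example vectors are illustrative):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0, by the convention 0 log 0 = 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # positive (intuition 3)
print(kl_divergence(p, p))  # 0.0 for identical distributions (intuition 2)
```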
## Entropy
$X$ is a discrete random variable.
$$H(X) = -\sum_{k=1}^K p(X=k)\log{p(X=k)} = -E_{X}[\log{p(X)}]$$
*Intuition*
1. $H(X) = \log{K} - D_{KL}(p(X) \,||\, \text{Unif}(K))$: entropy is a constant minus the $D_{KL}$ from the uniform distribution.
2. $H(X)$ attains maximum value when $p(X)$ is uniform.
3. Entropy of a discrete random variable is non-negative.
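A small numpy sketch verifying intuitions (1) and (2), with an arbitrary example distribution:

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_k p_k log p_k for a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p = np.array([0.7, 0.2, 0.1])
K = len(p)
u = np.full(K, 1.0 / K)  # uniform distribution over K outcomes

# Intuition (1): H(p) = log K - D_KL(p || uniform)
kl_to_uniform = np.sum(p * np.log(p / u))
print(np.isclose(entropy(p), np.log(K) - kl_to_uniform))  # True

# Intuition (2): the uniform distribution attains the maximum, log K
print(np.isclose(entropy(u), np.log(K)))  # True
print(entropy(p) <= entropy(u))           # True
```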
## Differential Entropy
$X$ is a continuous random variable.
$$h(X) = -\int_{X} p(x) \log{p(x)} dx$$
*Intuition*
1. Measure of the uncertainty of a continuous random variable. The differential entropy of the uniform distribution on $[0, 1]$ is $0$.
2. Unlike discrete entropy, the differential entropy of a continuous random variable can be negative.
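A worked example for the uniform distribution makes both points concrete: for $X \sim \text{Unif}(a, b)$,
$$h(X) = -\int_a^b \dfrac{1}{b-a} \log{\dfrac{1}{b-a}} \, dx = \log{(b-a)}$$
so $h(X) = 0$ only when the support has unit length, and e.g. $X \sim \text{Unif}(0, \tfrac{1}{2})$ has $h(X) = \log{\tfrac{1}{2}} < 0$.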
## Mutual Information
$$I(X;Y) = D_{KL}(p(x, y) || p(x)p(y))$$
$$I(X;Y) = H(X) - H(X | Y) = H(Y) - H(Y | X) = H(X) + H(Y) - H(X, Y)$$
_Intuition_
Mutual information is the information gain needed to move from a model that treats $X$ and $Y$ as independent, $p(x)p(y)$, to the true joint density $p(x, y)$.
$MI$ is the reduction in uncertainty about $X$ after observing $Y$; by symmetry, it is also the reduction in uncertainty about $Y$ after observing $X$.
### Properties
1. Data processing inequality: if $X \to Y \to Z$ forms a Markov chain, then $I(X;Z) \leq I(X;Y)$.
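A minimal numpy sketch computing $I(X;Y)$ from a small (illustrative) joint table, cross-checking the two expressions above:

```python
import numpy as np

# Joint distribution p(x, y); rows index x, columns index y.
p_xy = np.array([[0.3, 0.1],
                 [0.1, 0.5]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# I(X;Y) = D_KL(p(x, y) || p(x)p(y))
mi = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))

# Cross-check: I(X;Y) = H(X) + H(Y) - H(X, Y)
H = lambda p: -np.sum(p * np.log(p))
print(np.isclose(mi, H(p_x) + H(p_y) - H(p_xy)))  # True
```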
## Variational Bounds on Mutual Information
### Upper bound
Assumption: $p(x, y)$ is intractable, but we can evaluate $p(y|x)$ and sample from $p(x)$; we introduce a tractable variational marginal $q(y)$.
$$I(x;y) = E_{p(x, y)}\left[\log{\dfrac{p(x, y)}{p(x)p(y)}}\right] = E_{p(x, y)}\left[\log{\dfrac{p(y | x)}{p(y)}}\right]$$
$$I(x;y) = E_{p(x, y)}\left[\log{\dfrac{p(y | x)}{q(y)}}\right] - D_{KL}(p(y) || q(y))$$
$$I(x;y) \leq E_{p(x, y)}\left[\log{\dfrac{p(y | x)}{q(y)}}\right]$$
$$I(x;y) \leq E_{p(x)}\left[ E_{p(y|x)}\left[ \log{\dfrac{p(y | x)}{q(y)}} \right] \right]$$
$$I(x;y) \leq E_{p(x)}\left[ D_{KL}(p(y|x) \,||\, q(y)) \right]$$
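A minimal numpy sketch of this upper bound on a Gaussian toy problem where the true MI is known in closed form (the setup is illustrative): with $x \sim N(0, 1)$ and $y | x \sim N(x, \sigma^2)$, the true marginal is $p(y) = N(0, 1 + \sigma^2)$ and $I(x;y) = \frac{1}{2}\log{(1 + 1/\sigma^2)}$. The bound is tight when $q(y)$ equals the true marginal.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 0.5
true_mi = 0.5 * np.log(1 + 1 / sigma2)

def gauss_kl(mu1, var1, mu2, var2):
    """KL(N(mu1, var1) || N(mu2, var2)) for univariate Gaussians."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1)

# Monte Carlo estimate of E_{p(x)}[ KL(p(y|x) || q(y)) ] with q(y) = N(0, q_var).
x = rng.standard_normal(100_000)
for q_var in [1.0, 2.0, 1 + sigma2]:  # the last choice is the true marginal
    bound = gauss_kl(x, sigma2, 0.0, q_var).mean()
    print(f"q_var={q_var:.2f}: bound={bound:.3f} >= I={true_mi:.3f}")
```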
### Lower bound
Assumption: $p(y | x)$ is intractable, but we can sample from $p(x, y)$; we introduce a tractable variational approximation $q(y | x)$. Since $D_{KL}(p(y | x) \,||\, q(y | x)) \geq 0$,
$$I(x;y) = E_{p(x, y)}\left[\log{\dfrac{p(y | x)}{p(y)}}\right] \geq E_{p(x, y)}\left[\log{q(y | x)}\right] + H(Y)$$
This is the Barber-Agakov lower bound; the same argument yields the bound on $I(Z; Y)$ used in the DVIB section below.
## The Information Bottleneck
Add a stochastic bottleneck $z$ between $x$ and $y$ to prevent overfitting and improve robustness.
Representation $z$ is sufficient for task $y$ if
$$I(z; y) = I(x; y) \tag{1}$$
Since $z$ is computed from $x$ (Markov chain $y \to x \to z$), the data processing inequality gives $I(z; y) \leq I(x; y)$, so sufficiency means attaining equality. Representation $z$ is minimal sufficient if no other $z$ satisfying $(1)$ with lower $I(z; x)$ exists.
$$\min_{p(z|x)} \; \beta I(z;x) - I(z;y), \quad \beta > 0$$
$\beta$ captures the tradeoff between sufficiency and minimality.
<!-- ## Information bottleneck through Variational glass
This is a summary of the paper https://arxiv.org/pdf/1912.00830.pdf
### Supervised models
Training data distribution: $p(c, x)$
Training data: $(x_m, c_m)_{m=1}^{N}$, $x \in R^N$, $c \in \mathcal{M} = {1, 2, 3 \dots M_c}$
$$\min_{\phi: I(Z, C) \geq I_c} I_{\phi}(X; Z)$$
where $q_{\phi}(z|x)$ is a probabilistic mapping with parameters $\phi$. We want to preserve the information necessary for classification using $\phi: I(Z, C) \geq I_c$, and find minimal sufficient statistic $z$ by minimizing $I_{\phi}(X; Z)$. Lagrangian equation for the above objective function
$$L^{s}(\phi) = I_{\phi}(X; Z) - \beta I(Z, C) $$
$$\hat{\phi} = \underset{\phi}{\text{argmin}}\; L^{s}(\phi)$$ -->
<!-- ### Why is Vanilla IB intractable ?
Vanilla IB Objective $\min \beta I(z;x) - I(z;y)$ -->
## $\beta$-VAE
We add an additional constraint to our original $VAE$ objective: $\max_{\theta, \phi} E_{q_{\phi}(z | x)}[\log{p_{\theta}(x | z)}]$ subject to $D_{KL}(q_{\phi}(z | x) \,||\, p(z)) < \epsilon$. Writing the Lagrangian with multiplier $\beta$ (a code sketch of the resulting loss follows the table below):
$$\max_{\theta, \phi} \left\{ E_{q_{\phi}(z | x)}[\log{p_{\theta}(x | z)}] - \beta \left( D_{KL}(q_{\phi}(z | x) \,||\, p(z)) - \epsilon \right) \right\}$$
Dropping the constant $\epsilon$ term,
$$\max_{\theta, \phi} \left\{ E_{q_{\phi}(z | x)}[\log{p_{\theta}(x | z)}] - \beta D_{KL}(q_{\phi}(z | x) \,||\, p(z)) \right\}$$
1. $\beta$ captures the trade-off between latent channel capacity and independence constraints on the one hand, and reconstruction accuracy on the other.
2. $\beta > 1$ forces the model to learn more efficient representations by placing a stronger constraint on the latent bottleneck.
3. Learning disentangled representations of the generative factors is useful for downstream tasks (and perhaps also for achieving good compression?). A disentangled representation is one where a change in a single generative factor corresponds to a change in a single latent variable.
4. How do we measure the degree of disentanglement?
| $\beta$ | Description |
| ----------- | ----------- |
| $\beta = 1$ | VAE Original |
| $\beta > 1$ | VAE forced to learn more efficient representations |
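A minimal PyTorch-style sketch of the final objective above, assuming a Gaussian encoder $q_{\phi}(z | x)$, a standard normal prior $p(z)$, and a Bernoulli decoder; `encoder` and `decoder` are illustrative modules, not a fixed API:

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, encoder, decoder, beta):
    """Negative beta-VAE objective: reconstruction loss + beta * KL(q(z|x) || N(0, I))."""
    mu, logvar = encoder(x)                   # q(z|x) = N(mu, diag(exp(logvar)))
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)      # reparameterization trick
    logits = decoder(z)                       # Bernoulli logits over pixels
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum") / x.size(0)
    # Closed-form KL between the diagonal Gaussian posterior and the N(0, I) prior
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    return recon + beta * kl                  # beta = 1 recovers the vanilla VAE
```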
## Deep Variational Information Bottleneck
1. Vanilla IB Objective is intractable.
2. Adds variational bounds to the terms in the Vanilla IB objective, parametrized by neural networks.
$$\max{I(Z; Y) - \beta I(Z; X)}$$
$$I(Z;Y) = \int p(z, y) \log{\dfrac{p(z, y)}{p(z) p(y)}} dy dz = \int p(z, y) \log{\dfrac{p(y | z)}{p(y)}} dy dz$$
We use $q(y | z)$ as a tractable variational approximation to $p(y | z)$. Since $D_{KL}(p(y | z) \,||\, q(y | z)) \geq 0$ implies $\int p(y | z) \log{p(y | z)} \, dy \geq \int p(y | z) \log{q(y | z)} \, dy$, we get,
$$I(Z;Y) \geq \int p(y, z) \log{q(y | z)} dy dz - \int p(y) \log{p(y)} dy$$
$$I(Z;Y) \geq \int p(y, z) \log{q(y | z)} dy dz + H(Y)$$
Since $H(Y) \geq 0$ for a discrete label $Y$ (and it is constant with respect to our optimization), we can drop it:
$$I(Z;Y) \geq \int p(y, z) \log{q(y | z)} \, dy \, dz$$
Writing $p(y, z) = \int p(x)\, p(y | x)\, p(z | x) \, dx$ (the Markov assumption $Y \leftrightarrow X \leftrightarrow Z$),
$$I(Z;Y) \geq \int p(x)\, p(y | x)\, p(z | x) \log{q(y | z)} \, dx \, dy \, dz$$
Using $r(z)$ as a variational approximation for $p(z)$,
$$I(Z; X) = \int p(z, x) \log{\dfrac{p(z, x)}{p(z) p(x)}} dx dz = \int p(z, x) \log{\dfrac{p(z | x)}{p(z)}} dx dz$$
$$I(Z; X) = \int p(z, x) \log{p(z | x)} dx dz - H(Z)$$
Since $D_{KL}(p(z) \,||\, r(z)) \geq 0$ implies $\int p(z) \log{p(z)} \, dz \geq \int p(z) \log{r(z)} \, dz$,
$$I(Z; X) \leq \int p(x)\, p(z | x) \log{\dfrac{p(z | x)}{r(z)}} \, dx \, dz$$
Combining the two bounds,
$$I(Z; Y) - \beta I(Z; X) \geq \int p(x)\, p(y | x)\, p(z | x) \log{q(y | z)} \, dx \, dy \, dz - \beta \int p(x)\, p(z | x) \log{\dfrac{p(z | x)}{r(z)}} \, dx \, dz$$
Approximating the joint distribution $p(x, y)$ with Dirac deltas at the training samples, $p(x, y) \approx \frac{1}{N}\sum_{n=1}^{N} \delta_{x_n}(x)\, \delta_{y_n}(y)$, we get,
$$I(Z; Y) - \beta I(Z; X) \geq \dfrac{1}{N} \sum_{n=1}^{N} \left[ \int p(z | x_n) \log{q(y_n | z)} \, dz - \beta \int p(z | x_n) \log{\dfrac{p(z | x_n)}{r(z)}} \, dz \right]$$
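A minimal PyTorch-style sketch of this minibatch objective, assuming a Gaussian encoder for $p(z | x)$, $r(z) = N(0, I)$, a single sample of $z$, and a softmax classifier for $q(y | z)$; the module names are illustrative:

```python
import torch
import torch.nn.functional as F

def dvib_loss(x, y, encoder, classifier, beta):
    """Negative DVIB bound for one minibatch (lower loss = tighter bound)."""
    mu, logvar = encoder(x)                   # p(z|x) = N(mu, diag(exp(logvar)))
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)      # one Monte Carlo sample of z
    # -E[log q(y_n | z)]: the classification term bounding I(Z;Y) from below
    ce = F.cross_entropy(classifier(z), y)
    # Closed-form KL(p(z|x_n) || r(z)): the term bounding I(Z;X) from above
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    return ce + beta * kl
```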
## Questions we want to answer!
$\beta$-VAE and the Information Bottleneck are two schemes for learning compressed representations, and both use variational bounds for optimization. We want to compare the two schemes to see where one is better than the other.
Our priors:
(i) The representations will differ because the objectives differ: DVIB constrains the MI between the representation and the data, whereas $\beta$-VAE constrains the latent posterior directly.
(ii) Representations from $DVIB$ are less entangled than representations from $\beta$-VAE (because the paper mentions disentanglement sooo many times! :sweat_smile:)
## Experiment Design
Dataset: MNIST
1. $DVIB$: Check how $I(X; Z)$ and $I(Y;Z)$ vary with training iterations
2. $DVIB$: Check how $I(X; Z)$ and $I(Y;Z)$ vary with $\beta$
3. $\beta$-VAE: Train with varying $\beta$
4. Visualize representations in 2D.
5. Check disentanglement of representations. Disentanglement metric with varying $\beta$.
6. Check downstream task performance.
7. $DVIB$: Visualize MNIST in a 2D latent space with varying $\beta$.
