# Information Theory
Most of this stuff is a quick summary of Amari [^amari],
which is strongly recommended as an introductory text.
We only deal with discrete probabilities here,
but the same ideas apply to continuous probabilities;
see Bishop [^bishop].
[^amari]: [Amari, 情報理論](https://www.amazon.co.jp/dp/4480093583)
[^bishop]: [Bishop, Pattern Recognition and Machine Learning](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf)
## Deriving the function form
How would you quantify information?
One approach might be to define information as
something that reduces uncertainty.
Such a formulation of information $f$ should
follow a relatively intuitive law:
the information gained from observing two
independent events $x$ and $y$ that occur with
probabilities $p(x)$ and $p(y)$ should be
the sum of the information gained from observing them separately.
Since the events are independent, $p(x, y) = p(x)p(y)$,
so, writing $f$ as a function of an event's probability,
the law becomes
$$
f(p(x)\,p(y)) = f(p(x)) + f(p(y)),
$$
i.e. $f(ab) = f(a) + f(b)$ for arbitrary arguments.
It turns out that if we assume $f(\cdot)$ is continuous
and differentiable, only one family of functions
adheres to this law. To see this, note that for any $x$ and $\epsilon$,
$$
\begin{align}
f(x + \epsilon x)
&= f((1 + \epsilon) x) \\
&= f(1 + \epsilon) + f(x)
\end{align}
$$
The derivative of this function can then be written as follows (note that the derivative is not defined at $x = 0$):
$$
\begin{align}
\frac{df}{dx}
&= \lim_{ \epsilon \to 0 } \frac{f(x + \epsilon x) - f(x)}{\epsilon x} \\
&= \lim_{ \epsilon \to 0 } \frac{f(1 + \epsilon)}{\epsilon x}
\end{align}
$$
Letting $c = \lim_{ \epsilon \to 0 } f(1 + \epsilon) / \epsilon$, we get:
$$
f'(x) = \frac{c}{x}
$$
By integrating both sides we get the following form for $f$:
$$
f(x) = c \log x + d
$$
where $d$ is the integration constant.
Now, if we require that the information of an event
that always happens is zero, i.e. $f(1) = 0$, we get $d = 0$.
Thus $f$ must be a multiple of $\log(\cdot)$.
Since probabilities lie in $[0, 1]$ and we want the information
gained from an observation to be nonnegative (and larger for rarer
events), the constant $c$ must be negative; taking $c = -1$,
the information gained from observing an event $x$ occurring with
probability $p(x)$ is $- \log p(x)$.
For reasons beyond the scope of this discussion,
the base of the $\log$ used here is generally two,
and the units of information are referred to as *bits*.
The intuition here is that the more unlikely an event is
(i.e. the lower its probability), the more information we gain
by observing it.
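As a quick sanity check, here is a small Python sketch of this quantity (the helper name `self_information` is mine, not from the text):

```python
import math

def self_information(p: float) -> float:
    """Information (in bits) gained by observing an event of probability p."""
    return -math.log2(p)

# A fair coin flip carries exactly one bit of information.
print(self_information(0.5))   # 1.0
# Rarer events carry more information.
print(self_information(0.25))  # 2.0
# Near-certain events carry almost none.
print(self_information(0.99))  # ~0.0145
```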
## Entropy
Now that we know the amount of information we gain by
observing a given event, we might be inclined to know
the average amount of information we can gain by
observing **some** event.
Consider a random variable $X$ that follows a Bernoulli
distribution with parameter $p$. The information gained is:
- when $X = 1$, $- \log \Pr(X = 1) = - \log p$
- when $X = 0$, $- \log \Pr(X = 0) = - \log (1 - p)$
The average amount of information we can gain from
observing the outcome of $X$ is the expectation over
the information gained: $-p \log p - (1 - p) \log (1 - p)$.
More generally, the expected value of the information gained
from observing a random variable $X$ can be written as follows:
$$
H[X] = - \sum_x p(x) \log p(x)
$$
$H[X]$ is called *entropy* and has various connections to
the concept with the same name appearing in thermodynamics and
statistical mechanics.
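A minimal numeric check of this definition (assuming the distribution is given as a plain list of probabilities):

```python
import math

def entropy(probs) -> float:
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Bernoulli(0.5): maximal uncertainty for a binary variable -> 1 bit.
print(entropy([0.5, 0.5]))  # 1.0
# Bernoulli(0.9): more predictable -> less information per observation.
print(entropy([0.9, 0.1]))  # ~0.469
```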
Joint and conditional distributions have entropy analogues,
joint entropy and conditional entropy:
$$
\begin{align}
H[X | Y] &= - \sum_{x,y} p(x, y) \log p(x | y) \\
H[X, Y]
&= H[X | Y] + H[Y] \\
&= H[Y | X] + H[X]
\end{align}
$$
$H[X | Y]$ corresponds to the amount of information gained when
observing $X$ when one already knows $Y$. Intuitively,
if $X$ and $Y$ are related in some way, knowing $Y$
should reduce the amount of information gained when observing
$X$, so $H[X | Y] \le H[X]$, with equality holding when
$X$ and $Y$ are independent.
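To make the chain rule concrete, here is a small sketch that computes these quantities from a made-up $2 \times 2$ joint table (the numbers are mine) and verifies $H[X, Y] = H[X|Y] + H[Y]$:

```python
import math
import numpy as np

# Made-up joint distribution p(x, y); rows index x, columns index y.
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
p_x = p_xy.sum(axis=1)  # marginal p(x)
p_y = p_xy.sum(axis=0)  # marginal p(y)

def H(probs):
    """Entropy in bits of an array of probabilities."""
    probs = np.asarray(probs).ravel()
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Conditional entropy H[X|Y] = -sum_{x,y} p(x, y) log p(x|y)
H_x_given_y = -sum(
    p_xy[i, j] * math.log2(p_xy[i, j] / p_y[j])
    for i in range(2) for j in range(2) if p_xy[i, j] > 0
)

print(H(p_xy))                # joint entropy H[X, Y] (~1.846 bits)
print(H_x_given_y + H(p_y))   # chain rule: matches H[X, Y]
print(H_x_given_y <= H(p_x))  # conditioning never increases entropy -> True
```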
## Mutual Information
Following from the previous discussion regarding conditional
entropy, we might be interested in how much observing $Y$
reduces our uncertainty about $X$:
$$
I[X, Y] = H[X] - H[X | Y]
$$
This is called the *mutual information* between $X$ and $Y$,
and is, perhaps surprisingly, symmetric (i.e. $I[X, Y] = I[Y, X]$).
Mutual information can be calculated in the following manner:
$$
\begin{align}
I[X, Y]
&= H[X] - H[X | Y] \\
&= H[X] + H[Y] - H[X, Y] \\
&= \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}
\end{align}
$$
Incidentally, this is equal to the KL divergence between
$p(x, y)$ and $p(x)p(y)$:
$$
I[X, Y] = D_{KL}(p(x, y) || p(x)p(y))
$$
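Continuing with the same made-up joint table as above, a quick numeric check that both routes give the same value:

```python
import numpy as np

# Made-up joint distribution p(x, y); rows index x, columns index y.
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)  # column vector p(x)
p_y = p_xy.sum(axis=0, keepdims=True)  # row vector p(y)

def H(p):
    """Entropy in bits."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# I[X, Y] = H[X] + H[Y] - H[X, Y]
mi_from_entropies = H(p_x) + H(p_y) - H(p_xy)

# I[X, Y] = D_KL(p(x, y) || p(x) p(y))
mi_from_kl = np.sum(p_xy * np.log2(p_xy / (p_x * p_y)))

print(mi_from_entropies, mi_from_kl)  # both ~0.125 bits
```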
# [Belghazi et al., Mutual Information Neural Estimation](https://arxiv.org/abs/1801.04062)
> mutual information captures non-linear
> statistical dependencies between variables, and thus can act
> as a measure of true dependence (Kinney & Atwal, 2014)
Being able to calculate mutual information would be nice,
but because of the integration involved, it is generally
intractable to compute exactly unless we're dealing with
discrete variables or special cases with known
probability distributions.
So if the exact value is hard to get, what about
estimation? Non-parametric approaches seem to be
the standard, but they have scalability problems;
in particular, they do not work well for high-dimensional
variables.
So we don't have a nice estimator that we can use.
What about estimating a lower bound and using it
as a proxy for mutual information? Devising a nice
loss that's similar to adversarial learning is the
main idea of this paper. It works well, too!
## Bounding MI from below
Let's move on to deriving the loss.
KL divergence has a dual representation of the following
form (Donsker & Varadhan, 1983):
$$
D_{KL}(\mathbb{P} || \mathbb{Q}) =
\sup_{T: \Omega \to \mathbb{R}} \mathbb{E}_\mathbb{P}[T] - \log (\mathbb{E}_\mathbb{Q}[e^T])
$$
where the supremum is taken over all $T$ for which both
expectations on the right-hand side are finite. If $T$ is instead
restricted to some class of functions $\mathcal{F}$ satisfying the
integrability constraints, the equality becomes a lower bound:
$$
D_{KL}(\mathbb{P} || \mathbb{Q}) \ge
\sup_{T \in \mathcal{F}} \mathbb{E}_\mathbb{P}[T] - \log (\mathbb{E}_\mathbb{Q}[e^T])
$$
Since mutual information is the KL divergence between the joint distribution and the product of the marginals, applying this bound gives:
$$
I(X, Z) =
D_{KL}(\mathbb{P}_{XZ} || \mathbb{P}_X \otimes \mathbb{P}_Z) \ge
\sup_{T \in \mathcal{F}} \mathbb{E}_{\mathbb{P}_{XZ}}[T] - \log (\mathbb{E}_{\mathbb{P}_X \otimes \mathbb{P}_Z}[e^T])
$$
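A quick numerical illustration of the Donsker-Varadhan bound itself (a small made-up discrete example with arbitrary, non-optimized $T$; this toy check is mine, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up discrete distributions P and Q over five states.
P = np.array([0.10, 0.30, 0.20, 0.25, 0.15])
Q = np.array([0.30, 0.10, 0.20, 0.15, 0.25])

kl = np.sum(P * np.log(P / Q))  # true KL divergence (in nats)

# The bound holds for *any* T; a few random ones never exceed the true KL.
for _ in range(5):
    T = rng.normal(size=5)
    dv = np.sum(P * T) - np.log(np.sum(Q * np.exp(T)))
    assert dv <= kl + 1e-12

# The bound is tight at the optimal T* = log(P / Q) (up to a constant).
T_star = np.log(P / Q)
dv_opt = np.sum(P * T_star) - np.log(np.sum(Q * np.exp(T_star)))
print(kl, dv_opt)  # identical
```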
The paper proposes the *neural information measure*:
$$
I_\Theta(X, Z) =
\sup_{\theta \in \Theta} \mathbb{E}_{\mathbb{P}_{XZ}}[T_\theta] - \log (\mathbb{E}_{\mathbb{P}_X \otimes \mathbb{P}_Z}[e^{T_\theta}])
$$
where $T_\theta$ is a neural network with parameters $\theta \in \Theta$,
giving the bound $I(X, Z) \ge I_\Theta(X, Z)$. We then estimate this
bound by optimizing $T_\theta$ via SGD. This can be done by estimating
the expectations from samples: pairs drawn from the training set give
samples from $\mathbb{P}_{XZ}$, and shuffling the pairing (e.g.
permuting the $z$ samples within a batch) gives samples from
$\mathbb{P}_X \otimes \mathbb{P}_Z$.
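A minimal PyTorch sketch of this training loop (the toy correlated-Gaussian data, network size, and optimizer settings are my own assumptions, and I ignore the gradient bias-correction the paper discusses):

```python
import math
import torch
import torch.nn as nn

# Toy data: (x, z) jointly Gaussian with correlation rho, so the true
# mutual information is -0.5 * log(1 - rho^2) nats (~0.51 for rho = 0.8).
rho, n = 0.8, 10_000
x = torch.randn(n, 1)
z = rho * x + (1 - rho ** 2) ** 0.5 * torch.randn(n, 1)

# Statistics network T_theta: maps (x, z) pairs to a scalar score.
T = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(T.parameters(), lr=1e-3)

for step in range(2000):
    idx = torch.randperm(n)[:256]
    x_b, z_b = x[idx], z[idx]
    z_shuffled = z_b[torch.randperm(len(z_b))]  # break pairing -> samples from P_X ⊗ P_Z

    t_joint = T(torch.cat([x_b, z_b], dim=1))
    t_marginal = T(torch.cat([x_b, z_shuffled], dim=1))

    # Donsker-Varadhan lower bound: E_P[T] - log E_Q[e^T]; maximize it.
    mi_lb = t_joint.mean() - (
        torch.logsumexp(t_marginal, dim=0).squeeze() - math.log(len(t_marginal))
    )
    opt.zero_grad()
    (-mi_lb).backward()
    opt.step()

print(mi_lb.item())  # should approach roughly 0.5 nats
```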

## Nice properties of MINE
The paper proves some theoretical properties about MINE:
- Consistency of the approximation of neural information measure
- Convergence of MINE to the neural information measure
These together show that MINE is strongly consistent (i.e. MINE can
approximate mutual information with arbitrary accuracy), cool.
The authors show some empirical comparisons with the non-parametric
estimators mentioned before, with MINE giving tighter estimates.
Another result shows MINE successfully demonstrates
an interesting property of mutual information:
if a random variable $Y$ is defined
as $Y = f(X) + \sigma \odot \epsilon$ where $f$ is a deterministic
transfomation and $\epsilon$ is random noise, $I(X, Y)$ should
only depend on $\sigma \odot \epsilon$, even if $f$ is non-linear.

## Applying MINE
What's nice about MINE is that it's fully differentiable, so it
can be integrated as an objective into other neural networks.
The authors try applying MINE to two domains:
improving GANs and inference for bi-directional adversarial models.
### Improving GANs
Training GANs is notoriously difficult, and one problem that
comes up is mode collapse: the generator stops generating
samples that correspond to some modes of the target distribution.
Chen et al. [^infogan] concatenate latent variables $c$ to the
input of the generator and introduce a loss to increase
$I(c, G(z, c))$ in order to learn interpretable representations.
They do this via what is known as Variational Information
Maximization, introducing a joint objective that minimizes
a variational upper bound on the conditional entropy (remember,
$I(c, G(z, c)) = H(c) - H(c|G(z,c))$).
[^infogan]: [InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets](https://arxiv.org/abs/1606.03657)
The authors hypothesize that adding an objective that
maximizes mutual information should improve mode representation
and prevent mode collapse, and by using MINE in place of the
variational bound they get some good results.
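Schematically (my own paraphrase of the idea, not the paper's exact objective), the generator loss gains a mutual-information term weighted by some coefficient $\beta$:
$$
\mathcal{L}_G = \mathcal{L}_{\text{GAN}} - \beta \, I_\Theta\big(c; G(z, c)\big)
$$
so the generator is pushed to keep the codes $c$ recoverable from its samples, with $I_\Theta$ estimated by MINE as above.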


# [Hjelm et al., Learning Deep Representations by Mutual Information Estimation and Maximization](https://arxiv.org/abs/1808.06670)
[slides](https://aisc.ai.science/static/slides/20190411_KaranGrewal.pdf)