# Information Theory
Most of this stuff is a quick summary of Amari [^amari],
which is strongly recommended as an introductory text.
We only deal with discrete probabilities here,
but the same ideas apply to continuous probabilities;
see Bishop [^bishop].
[^amari]: [Amari, 情報理論](https://www.amazon.co.jp/dp/4480093583)
[^bishop]: [Bishop, Pattern Recognition and Machine Learning](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf)
## Deriving the function form
How would you quantify information?
One approach might be to define information as
something that reduces uncertainty.
Such a formulation of information $f$ should
follow a relatively intuitive law:
the information gained from observing two
independent events $x$ and $y$ that occur with
probabilities $p(x)$ and $p(y)$ should be
the sum of the information gained from observing them separately.
Since the events are independent, $p(x, y) = p(x)p(y)$,
so, writing $f$ as a function of an event's probability,
the law becomes
$$
f(p(x)\,p(y)) = f(p(x)) + f(p(y)),
$$
i.e. $f(ab) = f(a) + f(b)$ for arbitrary arguments.
It turns out that if we assume $f(\cdot)$ is continuous
and differentiable, only one family of functions
adheres to this law. To see this, note that for any $x$ and $\epsilon$,
$$
\begin{align}
f(x + \epsilon x)
&= f((1 + \epsilon) x) \\
&= f(1 + \epsilon) + f(x)
\end{align}
$$
The derivative of this function can then be written as follows (note that the derivative is not defined at $x = 0$):
$$
\begin{align}
\frac{df}{dx}
&= \lim_{ \epsilon \to 0 } \frac{f(x + \epsilon x) - f(x)}{\epsilon x} \\
&= \lim_{ \epsilon \to 0 } \frac{f(1 + \epsilon)}{\epsilon x}
\end{align}
$$
Letting $c = \lim_{ \epsilon \to 0 } f(1 + \epsilon) / \epsilon$, we get:
$$
f'(x) = \frac{c}{x}
$$
By integrating both sides we get the following form for $f$:
$$
f(x) = c \log x + d
$$
where $d$ is the integration constant.
Now, if we require that the information of an event
that always happens is zero, i.e. $f(1) = 0$, we get $d = 0$.
Thus $f$ must be a multiple of $\log(\cdot)$.
Since probabilities lie in $[0, 1]$ and we want the information
gained from an observation to be nonnegative (and larger for rarer
events), the constant $c$ must be negative; taking $c = -1$,
the information gained from observing an event $x$ occurring with
probability $p(x)$ is $- \log p(x)$.
For reasons beyond the scope of this discussion,
the base of the $\log$ used here is generally two,
and the units of information are referred to as *bits*.
The intuition here is that the more unlikely an event is
(i.e. the lower its probability), the more information we gain
by observing it.
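As a quick sanity check, here is a small Python sketch of this quantity (the helper name `self_information` is mine, not from the text):

```python
import math

def self_information(p: float) -> float:
    """Information (in bits) gained by observing an event of probability p."""
    return -math.log2(p)

# A fair coin flip carries exactly one bit of information.
print(self_information(0.5))   # 1.0
# Rarer events carry more information.
print(self_information(0.25))  # 2.0
# Near-certain events carry almost none.
print(self_information(0.99))  # ~0.0145
```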
## Entropy
Now that we know the amount of information we gain by
observing a given event, we might be inclined to know
the average amount of information we can gain by
observing **some** event.
Consider a random variable $X$ that follows a Bernoulli
distribution with parameter $p$. The information gained is:
- when $X = 1$, $- \log \Pr(X = 1) = - \log p$
- when $X = 0$, $- \log \Pr(X = 0) = - \log (1 - p)$
The average amount of information we can gain from
observing the outcome of $X$ is the expectation over
the information gained: $-p \log p - (1 - p) \log (1 - p)$.
More generally, the expected value of the information gained
from observing a random variable $X$ can be written as follows:
$$
H[X] = - \sum_x p(x) \log p(x)
$$
$H[X]$ is called *entropy* and has various connections to
the concept with the same name appearing in thermodynamics and
statistical mechanics.
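A minimal numeric check of this definition (assuming the distribution is given as a plain list of probabilities):

```python
import math

def entropy(probs) -> float:
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Bernoulli(0.5): maximal uncertainty for a binary variable -> 1 bit.
print(entropy([0.5, 0.5]))  # 1.0
# Bernoulli(0.9): more predictable -> less information per observation.
print(entropy([0.9, 0.1]))  # ~0.469
```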
Joint and conditional distributions have entropy analogues,
joint entropy and conditional entropy:
$$
\begin{align}
H[X | Y] &= - \sum_{x,y} p(x, y) \log p(x | y) \\
H[X, Y]
&= H[X | Y] + H[Y] \\
&= H[Y | X] + H[X]
\end{align}
$$
$H[X | Y]$ corresponds to the amount of information gained when
observing $X$ when one already knows $Y$. Intuitively,
if $X$ and $Y$ are related in some way, knowing $Y$
should reduce the amount of information gained when observing
$X$, so $H[X | Y] \le H[X]$, with equality holding when
$X$ and $Y$ are independent.
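To make the chain rule concrete, here is a small sketch that computes these quantities from a made-up $2 \times 2$ joint table (the numbers are mine) and verifies $H[X, Y] = H[X|Y] + H[Y]$:

```python
import math
import numpy as np

# Made-up joint distribution p(x, y); rows index x, columns index y.
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
p_x = p_xy.sum(axis=1)  # marginal p(x)
p_y = p_xy.sum(axis=0)  # marginal p(y)

def H(probs):
    """Entropy in bits of an array of probabilities."""
    probs = np.asarray(probs).ravel()
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Conditional entropy H[X|Y] = -sum_{x,y} p(x, y) log p(x|y)
H_x_given_y = -sum(
    p_xy[i, j] * math.log2(p_xy[i, j] / p_y[j])
    for i in range(2) for j in range(2) if p_xy[i, j] > 0
)

print(H(p_xy))                # joint entropy H[X, Y] (~1.846 bits)
print(H_x_given_y + H(p_y))   # chain rule: matches H[X, Y]
print(H_x_given_y <= H(p_x))  # conditioning never increases entropy -> True
```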
## Mutual Information
Following from the previous discussion regarding conditional
entropy, we might be interested in how much observing $Y$
reduces our uncertainty about $X$:
$$
I[X, Y] = H[X] - H[X | Y]
$$
This is called the *mutual information* between $X$ and $Y$,
and is, perhaps surprisingly, symmetric (i.e. $I[X, Y] = I[Y, X]$).
Mutual information can be calculated in the following manner:
$$
\begin{align}
I[X, Y]
&= H[X] - H[X | Y] \\
&= H[X] + H[Y] - H[X, Y] \\
&= \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}
\end{align}
$$
Incidentally, this is equal to the KL divergence between
$p(x, y)$ and $p(x)p(y)$:
$$
I[X, Y] = D_{KL}(p(x, y) || p(x)p(y))
$$
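Continuing with the same made-up joint table as above, a quick numeric check that both routes give the same value:

```python
import numpy as np

# Made-up joint distribution p(x, y); rows index x, columns index y.
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)  # column vector p(x)
p_y = p_xy.sum(axis=0, keepdims=True)  # row vector p(y)

def H(p):
    """Entropy in bits."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# I[X, Y] = H[X] + H[Y] - H[X, Y]
mi_from_entropies = H(p_x) + H(p_y) - H(p_xy)

# I[X, Y] = D_KL(p(x, y) || p(x) p(y))
mi_from_kl = np.sum(p_xy * np.log2(p_xy / (p_x * p_y)))

print(mi_from_entropies, mi_from_kl)  # both ~0.125 bits
```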
# [Belghazi et al., Mutual Information Neural Estimation](https://arxiv.org/abs/1801.04062)
> mutual information captures non-linear
> statistical dependencies between variables, and thus can act
> as a measure of true dependence (Kinney & Atwal, 2014)
Being able to calculate mutual information would be nice,
but because of the integration involved, it is generally
intractable to compute exactly unless we're dealing with
discrete variables or special cases with known
probability distributions.
So if the exact value is hard to get, what about
estimation? Non-parametric approaches seem to be
the standard, but they have scalability problems;
in particular, they do not work well for high-dimensional
variables.
So we don't have a nice estimator that we can use.
What about estimating a lower bound and using it
as a proxy for mutual information? Devising a nice
loss that's similar to adversarial learning is the
main idea of this paper. It works well, too!
## Bounding MI from below
Let's move on to deriving the loss.
KL divergence has a dual representation of the following
form (Donsker & Varadhan, 1983):
$$
D_{KL}(\mathbb{P} || \mathbb{Q}) =
\sup_{T: \Omega \to \mathbb{R}} \mathbb{E}_\mathbb{P}[T] - \log (\mathbb{E}_\mathbb{Q}[e^T])
$$
where the supremum is taken over all $T$ for which both
expectations on the right-hand side are finite. If $T$ is instead
restricted to some class of functions $\mathcal{F}$ satisfying the
integrability constraints, the equality becomes a lower bound:
$$
D_{KL}(\mathbb{P} || \mathbb{Q}) \ge
\sup_{T \in \mathcal{F}} \mathbb{E}_\mathbb{P}[T] - \log (\mathbb{E}_\mathbb{Q}[e^T])
$$
Since mutual information is the KL divergence between the joint distribution and the product of the marginals, applying this bound gives:
$$
I(X, Z) =
D_{KL}(\mathbb{P}_{XZ} || \mathbb{P}_X \otimes \mathbb{P}_Z) \ge
\sup_{T \in \mathcal{F}} \mathbb{E}_{\mathbb{P}_{XZ}}[T] - \log (\mathbb{E}_{\mathbb{P}_X \otimes \mathbb{P}_Z}[e^T])
$$
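A quick numerical illustration of the Donsker-Varadhan bound itself (a small made-up discrete example with arbitrary, non-optimized $T$; this toy check is mine, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up discrete distributions P and Q over five states.
P = np.array([0.10, 0.30, 0.20, 0.25, 0.15])
Q = np.array([0.30, 0.10, 0.20, 0.15, 0.25])

kl = np.sum(P * np.log(P / Q))  # true KL divergence (in nats)

# The bound holds for *any* T; a few random ones never exceed the true KL.
for _ in range(5):
    T = rng.normal(size=5)
    dv = np.sum(P * T) - np.log(np.sum(Q * np.exp(T)))
    assert dv <= kl + 1e-12

# The bound is tight at the optimal T* = log(P / Q) (up to a constant).
T_star = np.log(P / Q)
dv_opt = np.sum(P * T_star) - np.log(np.sum(Q * np.exp(T_star)))
print(kl, dv_opt)  # identical
```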
The paper proposes the *neural information measure*:
$$
I_\Theta(X, Z) =
\sup_{\theta \in \Theta} \mathbb{E}_{\mathbb{P}_{XZ}}[T_\theta] - \log (\mathbb{E}_{\mathbb{P}_X \otimes \mathbb{P}_Z}[e^{T_\theta}])
$$
where $T_\theta$ is a neural network with parameters $\theta \in \Theta$,
giving the bound $I(X, Z) \ge I_\Theta(X, Z)$. We then estimate this
bound by optimizing $T_\theta$ via SGD. This can be done by estimating
the expectations from samples: pairs drawn from the training set give
samples from $\mathbb{P}_{XZ}$, and shuffling the pairing (e.g.
permuting the $z$ samples within a batch) gives samples from
$\mathbb{P}_X \otimes \mathbb{P}_Z$.
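A minimal PyTorch sketch of this training loop (the toy correlated-Gaussian data, network size, and optimizer settings are my own assumptions, and I ignore the gradient bias-correction the paper discusses):

```python
import math
import torch
import torch.nn as nn

# Toy data: (x, z) jointly Gaussian with correlation rho, so the true
# mutual information is -0.5 * log(1 - rho^2) nats (~0.51 for rho = 0.8).
rho, n = 0.8, 10_000
x = torch.randn(n, 1)
z = rho * x + (1 - rho ** 2) ** 0.5 * torch.randn(n, 1)

# Statistics network T_theta: maps (x, z) pairs to a scalar score.
T = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(T.parameters(), lr=1e-3)

for step in range(2000):
    idx = torch.randperm(n)[:256]
    x_b, z_b = x[idx], z[idx]
    z_shuffled = z_b[torch.randperm(len(z_b))]  # break pairing -> samples from P_X ⊗ P_Z

    t_joint = T(torch.cat([x_b, z_b], dim=1))
    t_marginal = T(torch.cat([x_b, z_shuffled], dim=1))

    # Donsker-Varadhan lower bound: E_P[T] - log E_Q[e^T]; maximize it.
    mi_lb = t_joint.mean() - (
        torch.logsumexp(t_marginal, dim=0).squeeze() - math.log(len(t_marginal))
    )
    opt.zero_grad()
    (-mi_lb).backward()
    opt.step()

print(mi_lb.item())  # should approach roughly 0.5 nats
```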

## Nice properties of MINE
The paper proves some theoretical properties about MINE:
- Consistency of the approximation of neural information measure
- Convergence of MINE to the neural information measure
These together show that MINE is strongly consistent (i.e. MINE can
approximate mutual information with arbitrary accuracy), cool.
The authors show some empirical comparisons with the non-parametric
estimators mentioned before, with MINE giving tighter estimates.
Another result shows MINE successfully demonstrates
an interesting property of mutual information:
if a random variable $Y$ is defined
as $Y = f(X) + \sigma \odot \epsilon$ where $f$ is a deterministic
transfomation and $\epsilon$ is random noise, $I(X, Y)$ should
only depend on $\sigma \odot \epsilon$, even if $f$ is non-linear.

## Applying MINE
What's nice about MINE is that it's fully differentiable, so it
can be integrated as an objective into other neural networks.
The authors try applying MINE to two domains:
improving GANs and inference for bi-directional adversarial models.
### Improving GANs
Training GANs is notoriously difficult, and one problem that
comes up is mode collapse: the generator stops generating
samples that correspond to some modes of the target distribution.
Chen et al. [^infogan] concatenate latent variables $c$ to the
input of the generator and introduce a loss to increase
$I(c, G(z, c))$ in order to learn interpretable representations.
They do this via what is known as Variational Information
Maximization, introducing a joint objective that minimizes
a variational upper bound on the conditional entropy (remember,
$I(c, G(z, c)) = H(c) - H(c|G(z,c))$).
[^infogan]: [InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets](https://arxiv.org/abs/1606.03657)
The authors hypothesize that adding an objective that
maximizes mutual information should improve mode representation
and prevent mode collapse, and by using MINE in place of the
variational bound they get some good results.
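Schematically (my own paraphrase of the idea, not the paper's exact objective), the generator loss gains a mutual-information term weighted by some coefficient $\beta$:
$$
\mathcal{L}_G = \mathcal{L}_{\text{GAN}} - \beta \, I_\Theta\big(c; G(z, c)\big)
$$
so the generator is pushed to keep the codes $c$ recoverable from its samples, with $I_\Theta$ estimated by MINE as above.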


# [Hjelm et al., Learning Deep Representations by Mutual Information Estimation and Maximization](https://arxiv.org/abs/1808.06670)
[slides](https://aisc.ai.science/static/slides/20190411_KaranGrewal.pdf)