---
title: Jensen-Shannon and the Mutual Information
---

## Jensen-Shannon and the Mutual Information

The Jensen-Shannon divergence between $P$ and $Q$ is defined as

$$
\operatorname{JS}[P,Q] = \frac{1}{2}\left(\operatorname{KL}\left[P\middle\|\frac{P+Q}{2}\right] + \operatorname{KL}\left[Q\middle\|\frac{P+Q}{2}\right]\right)
$$

This has a neat information-theoretic interpretation. Consider two urns of balls, one in which the colour of a ball is distributed like $P$ and one in which it is distributed like $Q$. Consider a fair coinflip $Y$ such that $\mathbb{P}[Y=1] = \mathbb{P}[Y=0] = \frac{1}{2}$. Now we draw a random variable $X$ conditioned on the coinflip $Y$ such that $P_{X\vert Y=1} = P$ and $P_{X\vert Y=0} = Q$. In other words, we toss a coin ($Y$) and if it comes up heads we draw a ball from the $P$-urn, otherwise we draw a ball from the $Q$-urn. $X$ is the colour of the ball we drew.

Now consider the mutual information between the coinflip $Y$ and the colour of the ball $X$. The mutual information can be written in many ways; one of them is as the average KL divergence between the conditional distributions $P_{X\vert Y=y}$ and the marginal $P_X$:

$$
\mathbb{I}[X, Y] = \mathbb{E}_{y\sim P_Y} \operatorname{KL}\left[P_{X\vert Y=y}\middle\|P_{X}\right]
$$

Now, using the facts that $P_{X\vert Y=1} = P$, $P_{X\vert Y=0}=Q$, and $\mathbb{P}[Y=1] = \mathbb{P}[Y=0]=\frac{1}{2}$, we can see that

$$
P_X = \sum_{y\in \{0, 1\}} \mathbb{P}[Y=y]P_{X\vert Y=y} = \frac{P + Q}{2}.
$$

And therefore

\begin{align}
\mathbb{I}[X, Y] &= \mathbb{E}_{y\sim P_Y} \operatorname{KL}\left[P_{X\vert Y=y}\middle\|P_{X}\right] \\
&= \sum_{y\in \{0, 1\}} \mathbb{P}[Y=y] \operatorname{KL}\left[P_{X\vert Y=y}\middle\|P_{X}\right]\\
&= \frac{1}{2}\operatorname{KL}\left[P\middle\|\frac{P+Q}{2}\right] + \frac{1}{2}\operatorname{KL}\left[Q\middle\|\frac{P+Q}{2}\right]\\
&= \operatorname{JS}[P,Q]
\end{align}

So this provides an intuitive interpretation of the JS divergence: we are given a sample $X$, and we don't know whether it was drawn from $P$ or from $Q$. Without seeing the data, our best guess is $50$-$50$. After seeing the data, we'll be able to guess better: the uncertainty in $Y$ is reduced. The mutual information measures the average reduction in entropy (or uncertainty) in one variable after observing the other, which can be seen in this alternative expression for the mutual information:

$$
\mathbb{I}[X, Y] = \mathbb{H}[Y] - \mathbb{H}[Y\vert X]
$$

The entropy and conditional entropy are both non-negative for discrete variables like $Y$. Since $Y$ is a fair coinflip, its entropy is maximal among binary random variables: $\mathbb{H}[Y] = \log 2$. Hence, the JS divergence is upper bounded by $\log 2$.

If $P$ and $Q$ are completely different (imagine the two urns containing balls of completely different colours), we'll be able to tell exactly which urn the ball came from: we'll either be $100\%$ certain that $Y=0$ or $100\%$ certain that $Y=1$. Conditioned on $X$, the entropy of $Y$ is then $0$, and the mutual information takes its maximal value, $\log 2$. It's only when $P$ and $Q$ have overlapping support (i.e. there is at least one colour that both urns contain) that the JS divergence is anything other than $\log 2$.
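To make the identity concrete, here is a quick numerical sanity check (a minimal numpy sketch; the two discrete distributions are made-up examples). It computes $\operatorname{JS}[P,Q]$ directly from the definition and again via $\mathbb{I}[X,Y] = \mathbb{H}[Y] - \mathbb{H}[Y\vert X]$ for the urn-and-coinflip construction, and confirms that disjoint supports give exactly $\log 2$.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (natural log)."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    """Jensen-Shannon divergence, directly from the definition."""
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_via_mutual_information(p, q):
    """JS as I[X, Y] = H[Y] - H[Y|X] for the urn-and-coinflip construction."""
    m = (p + q) / 2                       # marginal P_X = (P + Q) / 2
    h_y = np.log(2)                       # entropy of the fair coinflip Y
    h_y_given_x = 0.0
    for px, mx in zip(p, m):
        if mx == 0:
            continue
        post = 0.5 * px / mx              # posterior P(Y=1 | X=x) by Bayes' rule
        for r in (post, 1 - post):
            if r > 0:
                h_y_given_x -= mx * r * np.log(r)
    return h_y - h_y_given_x

# Overlapping supports: JS < log 2, and the two computations agree.
p = np.array([0.5, 0.4, 0.1, 0.0])
q = np.array([0.0, 0.1, 0.4, 0.5])
print(js(p, q), js_via_mutual_information(p, q))

# Disjoint supports: JS saturates at log 2 ≈ 0.693.
p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])
print(js(p, q), np.log(2))
```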
### Connection to GANs

This interpretation of the JS divergence as a mutual information also explains how GANs minimise something related to the JS divergence.

In the GAN setting, instead of balls drawn from two urns, we are shown images: each one is drawn either from the real data distribution $P$ or from the generator's distribution $Q$. The task of the discriminator is to predict, for each example, the coinflip $Y$ that decided which distribution it was drawn from. The cross-entropy loss (or log loss) of any discriminator $d$ that tries to predict $Y$ given $X$ provides an upper bound on the conditional entropy:

$$
\mathbb{H}[Y\vert X] \leq \mathbb{E}_{x,y\sim P_{X,Y}} -\log d(y\vert x)
$$

Since $\operatorname{JS}[P,Q] = \mathbb{H}[Y] - \mathbb{H}[Y\vert X] = \log 2 - \mathbb{H}[Y\vert X]$, this upper bound on the conditional entropy means that, in effect, the discriminator provides a lower bound on the JS divergence (a numerical illustration of this bound appears below). Thus, we can interpret GAN training as follows:

* the discriminator tries to approximate the JS divergence better by tightening the bound,
* the generator tries to minimize the lower bound on the JS divergence provided by the discriminator.

## Minimizing a lower bound

One can see how the GAN algorithm is, in fact, a very stupid algorithm, for two reasons:

1. When the supports of $P$ and $Q$ are non-overlapping, the **JS divergence is in fact constant** at $\log 2$ and doesn't locally depend on the generator's parameters. So the generator minimizes a lower bound to a constant function. If the lower bound were perfect, the generator would get no useful gradient information, which is the problem Martin and Anna talked about. For the generator to get any useful gradient signal, we rely on the discriminator being too weak and the lower bound on the JS being too loose. Thus, GANs kind of work by accident, not for a very principled reason. Convergence of vanilla GANs relies on a careful balancing act in how much we train the discriminator.
2. Even if the JS divergence weren't constant, **minimising a lower bound** is still, generally, a poor idea. If you have a loss function you'd like to minimize, finding an upper bound and minimizing that is always a good idea: by decreasing the upper bound you either decrease the original objective function, or you make the bound tighter, and both of these are good things. However, if you minimize a lower bound to an objective function, decreasing the lower bound can mean that you decreased the objective function, or that you made the bound less tight. The second option is not great: you may converge to a part of the parameter space where the original objective function is nowhere near minimal, but where the gap between the lower bound and the objective function is largest.

The two ideas we discussed in the last two weeks are meant to solve the first problem:

1. Adding meaningful noise to the sample $X$ makes the JS divergence better behaved. Even if $P$ and $Q$ are non-overlapping, once we add noise, the distributions of the noisy samples will overlap. This makes the JS divergence non-constant, which may therefore improve the algorithm (see the sketch illustrating this after the list). Here's [a writeup](https://www.inference.vc/instance-noise-a-trick-for-stabilising-gan-training/) on this.
2. We can also replace the JS divergence entirely with the Wasserstein distance, which remains meaningful even when $P$ and $Q$ don't overlap. The dual formulation suggests an algorithm very similar to GANs, but in which the discriminator is regularised to stay Lipschitz-continuous. One way to think about why this helps is that we cripple the discriminator so that it cannot make perfect predictions of $Y$ even when $P$ and $Q$ are obviously separable: the strong smoothness constraints on $d$ prevent it from learning perfect discrimination even when perfect discrimination is possible.
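To see the discriminator's lower bound at work, here is a small numerical check (a minimal numpy sketch; the two Gaussians and the particular logistic discriminators are arbitrary choices for illustration). For two overlapping 1-D distributions $P$ and $Q$ discretized on a grid, it evaluates $\log 2 - \mathbb{E}[-\log d(y\vert x)]$ for discriminators of increasing quality: every discriminator yields a lower bound on the true JS divergence, and the bound is tight only for the Bayes-optimal discriminator $d(1\vert x) = \mathbb{P}[Y=1\vert X=x]$.

```python
import numpy as np

# Discretize two overlapping 1-D Gaussians P and Q on a grid.
x = np.linspace(-10, 10, 4001)
p = np.exp(-(x - 0.0) ** 2 / 2); p /= p.sum()
q = np.exp(-(x - 2.0) ** 2 / 2); q /= q.sum()
m = (p + q) / 2                              # marginal P_X of the coinflip construction

def kl(a, b):
    mask = a > 0
    return np.sum(a[mask] * np.log(a[mask] / b[mask]))

js_true = 0.5 * kl(p, m) + 0.5 * kl(q, m)    # JS[P, Q] from the definition

def js_lower_bound(d1):
    """log 2 minus the expected cross-entropy of a discriminator.

    d1(x) is the discriminator's predicted probability that x came from P,
    i.e. d(Y=1 | x)."""
    d1 = np.clip(d1, 1e-12, 1 - 1e-12)
    post1 = 0.5 * p / m                      # true posterior P(Y=1 | X=x)
    # E[-log d(Y | X)] = sum_x m(x) * cross-entropy of d(. | x) under the true posterior
    ce = np.sum(m * (-post1 * np.log(d1) - (1 - post1) * np.log(1 - d1)))
    return np.log(2) - ce

sigmoid = lambda z: 1 / (1 + np.exp(-z))
print("true JS             :", js_true)
print("blind d, always 0.5 :", js_lower_bound(np.full_like(x, 0.5)))       # bound is 0
print("weak discriminator  :", js_lower_bound(sigmoid(-0.3 * (x - 1.0))))  # loose bound
print("better discriminator:", js_lower_bound(sigmoid(-1.0 * (x - 1.0))))  # tighter bound
print("Bayes-optimal d     :", js_lower_bound(0.5 * p / m))                # equals true JS
```

In real GAN training the expectation is of course estimated from samples and $d$ is a neural network trained by gradient descent, but the bound works the same way.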
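And to illustrate the first of the two ideas above, the sketch below (again numpy; the grid and noise level are arbitrary choices) compares the JS divergence between two point masses at $0$ and $\theta$, which is exactly $\log 2$ for every $\theta \neq 0$, with the JS divergence after adding Gaussian instance noise to both, which decreases smoothly as $\theta \to 0$ and therefore gives the generator something to follow.

```python
import numpy as np

def js_discrete(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    m = (p + q) / 2
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

x = np.linspace(-5, 5, 2001)     # grid over the real line
sigma = 0.5                      # instance-noise standard deviation (arbitrary choice)

def point_mass(mean):
    d = np.zeros_like(x)
    d[np.argmin(np.abs(x - mean))] = 1.0
    return d

def noisy(mean):                 # point mass convolved with N(0, sigma^2)
    d = np.exp(-(x - mean) ** 2 / (2 * sigma ** 2))
    return d / d.sum()

for theta in [3.0, 2.0, 1.0, 0.5, 0.1]:
    clean = js_discrete(point_mass(0.0), point_mass(theta))   # always log 2 ≈ 0.693
    noised = js_discrete(noisy(0.0), noisy(theta))            # responds to theta
    print(f"theta={theta:4.1f}   JS clean={clean:.3f}   JS with noise={noised:.3f}")
```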
The two ideas are not entirely unrelated. In fact, one can show that instance noise can be seen as a way of regularizing the discriminator's gradients, which connects the instance-noise idea to the gradient-penalty idea behind Lipschitz regularization. Here's [a writeup](https://www.inference.vc/from-instance-noise-to-gradient-regularisation/) of this connection.

Neither solution provides a workaround to the problem of minimizing a lower bound, which remains a problem in theory. If the family of discriminators cannot model certain relationships, the generator can exploit this and converge to a solution that doesn't minimize the Wasserstein distance, but instead minimizes the lower bound by developing artifacts that the critic cannot detect.
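For concreteness, here is what a discriminator gradient regularizer typically looks like in code. This is a generic PyTorch-style sketch that penalizes the squared norm of $\nabla_x d(x)$ at the training samples; it is one common form of gradient penalty, not necessarily the exact regularizer derived in the writeup linked above, and the `discriminator` and `coeff` names (and the usage comment) are placeholders.

```python
import torch

def gradient_norm_penalty(discriminator, x, coeff=10.0):
    """Penalize E[ ||grad_x d(x)||^2 ]: a simple way of keeping the discriminator
    smooth (approximately Lipschitz) around the samples it is trained on."""
    x = x.detach().requires_grad_(True)
    scores = discriminator(x)                       # raw scores/logits for each sample
    grads = torch.autograd.grad(
        outputs=scores.sum(), inputs=x, create_graph=True)[0]
    # per-sample squared gradient norm, averaged over the batch
    return coeff * grads.reshape(x.size(0), -1).pow(2).sum(dim=1).mean()

# Hypothetical usage inside the discriminator's update step:
#   d_loss = bce(discriminator(x_real), ones) + bce(discriminator(x_fake), zeros)
#   d_loss = d_loss + gradient_norm_penalty(discriminator, torch.cat([x_real, x_fake]))
#   d_loss.backward()
```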