# On Mutual Information Maximization for Representation Learning
**Michael Tschannen, Josip Djolonga, Paul K. Rubenstein, Sylvain Gelly, Mario Lucic**
[ICLR 2020](https://openreview.net/forum?id=rkxoh24FPH)
---
## Information Theory Basics
### Entropy:
- Entropy $H(X)$ of r.v. X: $H(X) = \mathbb{E}_{p_X}\left[\log\frac{1}{p_X(\mathbf{x})}\right]$
<!---
- The amount of uncertainty in the values taken by $X$
- Defined as $H(X) = \mathbb{E}_{p_X}\left[\log\frac{1}{p_X(\mathbf{x})}\right]$
--->
- Conditional Entropy $H(X\mid Y)$ of r.v. $X$ given r.v. $Y$: $H(X\mid Y) = \mathbb{E}_{p_{X,Y}}\left[\log\frac{1}{p_{X|Y}(\mathbf{x}\mid\mathbf{y})}\right]$
<!---
- The amount of residual uncertainty in $X$ if $Y$ is known
- Defined as $H(X\mid Y) = \mathbb{E}_{p_{X,Y}}\left[\log\frac{1}{p_{X|Y}(\mathbf{x}\mid\mathbf{y})}\right]$
--->
### Kullback-Leibler (KL) Divergence:
- KL Divergence $D_{KL}(p||q)$ between $p, q$ on the same support: $D_{KL}\left(p || q\right)=\mathbb{E}_{p}\left[\log \frac{p(\mathbf{x})}{q(\mathbf{x})}\right]$
- $D_{KL}(p||q)\geq 0\ (\forall\, p, q)$
- However, not a true distance function as it is not symmetric: $D_{KL}(p||q) \ne D_{KL}(q|| p)$
<!---
- Measures how much a distribution $q$ differs from another distribution $p$
- Defined as $D_{KL}\left(p || q\right)=\mathbb{E}_{p}\left[\log \frac{p(\mathbf{x})}{q(\mathbf{x})}\right]$
- $D_{KL}(p||q)\geq 0\ (\forall\, p, q)$; i.e., always non-negative for any distributions $p, q$
- However, not a true distance function as it is not symmetric: $D_{KL}(p||q) \ne D_{KL}(q|| p)$
--->
### Mutual Information:
- Mutual information $I(X; Y)$ between r.v. $X$ and $Y$: $I(X; Y) = \mathbb{E}_{p_{X, Y}}\left[\log\frac{p_{X, Y}(\mathbf{x}, \mathbf{y})}{p_X(\mathbf{x})\cdot p_Y(\mathbf{y})}\right]$
- $I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = I(Y; X)$
- <img src=https://i.imgur.com/a25RE0I.png width="800">
- $I(X; Y) = D_{KL}\left(p_{X, Y}|| p_X\cdot p_Y\right)$ and thus, $I(X; Y)\geq 0$
<!---
- Quantifies the information contained in $X$ about $Y$ (or vice versa)
- Reduction in the uncertainty of $X$ given $Y$ (or vice versa)
- Equals the KL-divergence between the joint distribution $p_{X, Y}$ and the product of marginals $p_X\cdot p_Y$ and consequently, always non-negative
- $I(X; Y) = \mathbb{E}_{p_{X, Y}}\left[\log\frac{p(\mathbf{x}, \mathbf{y})}{p(\mathbf{x})\cdot p(\mathbf{y})}\right] = H(X) - H(X|Y) = H(Y) - H(Y|X) = I(Y; X)$
- <img src=https://i.imgur.com/a25RE0I.png width="800">
- We also have $I(X; Y) = D_{KL}\left(p_{X, Y}|| p_X\cdot p_Y\right)$ and thus, $I(X; Y)\geq 0$
--->
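As a quick sanity check on these definitions, here is a small worked example (not from the papers; the toy joint distribution is made up) computing $H(X)$, $H(X\mid Y)$, and $I(X; Y)$ for a $2\times 2$ discrete joint in Python:
```python
import numpy as np

# A made-up 2x2 joint distribution p(x, y); rows index x, columns index y.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)   # marginal p(x)
p_y = p_xy.sum(axis=0)   # marginal p(y)

H_x  = -np.sum(p_x * np.log(p_x))                 # H(X)
H_xy = -np.sum(p_xy * np.log(p_xy))               # joint entropy H(X, Y)
H_x_given_y = H_xy + np.sum(p_y * np.log(p_y))    # H(X|Y) = H(X, Y) - H(Y)

I_xy = H_x - H_x_given_y                          # I(X; Y) = H(X) - H(X|Y)
# Equivalently, I(X; Y) = D_KL(p_XY || p_X * p_Y):
kl = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))
print(I_xy, kl)   # both ~0.19 nats
```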
### Properties of MI:
#### Theorem 1:
For smooth invertible functions $f_1, f_2$ of random variables $X, Y$ respectively, we have
$$I(X; Y) = I(f_1(X); f_2(Y)),$$
<!---
i.e., MI is preserved under smooth invertible transformations.
--->
#### Theorem 2 [Data Processing Inequality]:
For random variables $X, Y, Z$ satisfying the Markov chain $X \rightarrow Y \rightarrow Z$, $$I(X; Y)\geq I(X; Z)\ \text{and}\ I(Y; Z)\geq I(X; Z)$$ In particular, for any function $f$ and random variables $X, Y$, $$I(X; Y)\geq I(X; f(Y))$$ Thus, processing random variables through functions cannot increase their MI.
#### Theorem 3:
For a random variable $X$ with views $X_1, X_2$ (functions of $X$) and functions $g_1, g_2$, we have: $$I(X; g_1(X_1), g_2(X_2)) \geq I(g_1(X_1); g_2(X_2))$$ Thus, the MI between the two features is a lower bound on the MI between the data $X$ and the combined features $(g_1(X_1), g_2(X_2))$.
## Estimation of MI
### Nguyen-Wainwright-Jordan (NWJ) Bound [(Nguyen et al., 2008)](https://arxiv.org/pdf/0809.0853.pdf)
<!---
- A black-box bound for KL-divergence, which can be used for $I(X; Y)$ estimation
--->
#### Theorem
For any real-valued function $f: \mathcal{X}\rightarrow \mathbb{R}$, and distributions $p, q$ on $\mathcal{X}$, $$ D_{KL}\left( p||q \right)\geq \mathbb{E}_p\left[ f(\mathbf{x}) \right] - e^{-1}\cdot\mathbb{E}_q\left[e^{f(\mathbf{x})}\right]$$In particular, for data $X\in\mathcal{X}$ and its representations $Z\in\mathcal{Z}$, and any function $f:\mathcal{X}\times\mathcal{Z}\rightarrow \mathbb{R}$, $$I(X; Z) \geq\mathbb{E}_{p_{X, Z}}\left[ f(\mathbf{x}, \mathbf{z}) \right] - e^{-1}\cdot\mathbb{E}_{p_X\cdot p_Z}\left[e^{f(\mathbf{x}, \mathbf{z})}\right] $$
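A minimal Monte-Carlo sketch of this bound on $I(X; Z)$ (my own illustration, not code from the papers), assuming a batched critic `f` and paired samples; samples from $p_X\cdot p_Z$ are approximated by shuffling $\mathbf{z}$ within the batch:
```python
import math
import torch

def nwj_lower_bound(f, x, z):
    """Monte-Carlo NWJ estimate of I(X; Z) from a paired batch (x, z).

    f: critic mapping two [B, ...] batches to a [B] tensor of scores f(x_i, z_i).
    """
    joint_term = f(x, z).mean()                         # E_{p_{X,Z}}[f(x, z)]
    z_shuffled = z[torch.randperm(z.shape[0])]          # approximate samples from p_X * p_Z
    marginal_term = torch.exp(f(x, z_shuffled)).mean()  # E_{p_X p_Z}[e^{f(x, z)}]
    return joint_term - marginal_term / math.e          # subtract the e^{-1} * E[...] term
```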
### Donsker-Varadhan (DV) Bound [(Donsker and Varadhan, 1983)](https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.3160280102)
- A black-box bound for KL-divergence that is __tighter than the NWJ bound__
- Used in Mutual Information Neural Estimation (MINE) [(Belghazi et al., 2018)](https://arxiv.org/pdf/1801.04062.pdf) as well
#### Theorem
For any real-valued function $f: \mathcal{X}\rightarrow \mathbb{R}$, and distributions $p, q$ on $\mathcal{X}$, $$ D_{KL}\left( p||q \right)\geq \mathbb{E}_{p}\left[f(\mathbf{x})\right] - \log \mathbb{E}_{q}\left[e^{f(\mathbf{x})}\right]$$In particular, for data $X\in\mathcal{X}$ and its representations $Z\in\mathcal{Z}$, and any function $f:\mathcal{X}\times\mathcal{Z}\rightarrow \mathbb{R}$, $$I(X; Z) \geq\mathbb{E}_{p_{X, Z}}\left[ f(\mathbf{x}, \mathbf{z}) \right] - \log\mathbb{E}_{p_X\cdot p_Z}\left[e^{f(\mathbf{x}, \mathbf{z})}\right]$$
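The same sketch adapted to the DV bound (again my own illustration, not the papers' code); the second term is now a log of an expectation, computed with `logsumexp` for numerical stability:
```python
import math
import torch

def dv_lower_bound(f, x, z):
    """Monte-Carlo Donsker-Varadhan estimate of I(X; Z) from a paired batch (x, z)."""
    joint_term = f(x, z).mean()                    # E_{p_{X,Z}}[f(x, z)]
    z_shuffled = z[torch.randperm(z.shape[0])]     # approximate samples from p_X * p_Z
    scores = f(x, z_shuffled)
    # log E_{p_X p_Z}[e^f] estimated as log( (1/B) * sum_i e^{scores_i} )
    log_mean_exp = torch.logsumexp(scores, dim=0) - math.log(scores.shape[0])
    return joint_term - log_mean_exp
```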
### InfoNCE Bound [(van den Oord et al., 2018)](https://arxiv.org/pdf/1807.03748.pdf)
<!---
- A consequence of the NWJ bound with multiple (i.i.d.) samples (Proof in Appendix-D)
- Recently, this objective (or slight variants of it) have been extensively used
--->
#### Theorem
For a real-valued function $f:\mathcal{X}\times\mathcal{Z}\rightarrow \mathbb{R}$ and i.i.d. samples $(X_i, Z_i)_{i = 1}^K$ from the joint distribution $p_{X, Z}$, we have: $$I(X; Z) \geq I_{\mathrm{NCE}}(X; Z) = \mathbb{E}\left[\frac{1}{K}\sum\nolimits_{i = 1}^K \log\frac{e^{f(\mathbf{x}_i, \mathbf{z}_i)}}{\frac{1}{K}\sum\nolimits_{j = 1}^K e^{f(\mathbf{x}_i, \mathbf{z}_j)}}\right]$$Here, the expectation is over the distribution of the i.i.d. samples $(X_i, Z_i)_{i = 1}^K$.
* The variational function $f$ involved in these bounds is called the __Critic__
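A compact sketch of $I_{\mathrm{NCE}}$ (my own, not the papers' code), assuming the critic has already been evaluated on all $K\times K$ pairs of a batch so that `scores[i, j]` $= f(\mathbf{x}_i, \mathbf{z}_j)$; note the estimate can never exceed $\log K$:
```python
import math
import torch
import torch.nn.functional as F

def infonce_lower_bound(scores):
    """I_NCE from a [K, K] score matrix with scores[i, j] = f(x_i, z_j).

    The diagonal holds the positive (paired) scores; each row is normalized
    over all K entries, so the bound is capped at log K.
    """
    K = scores.shape[0]
    labels = torch.arange(K)
    # cross_entropy gives the mean of -log softmax(scores[i])[i] over rows i
    return math.log(K) - F.cross_entropy(scores, labels)
```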
## ... But Why Consider MI?
- Recent advances in representation learning techniques that utilize MI maximization
- Deep InfoMax, CPC, MINE, ...
- __The InfoMax Formulation__ [(Linsker, 1988)](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.183.1750&rep=rep1&type=pdf)$$\text{Objective:}\ \max\nolimits_{g\in \mathcal{G}} I(X; g(X)) \\ \text{Select the representations $g(X)$ of the data $X$ that maximize their MI $I(X; g(X))$}$$
- Slight modification to enable __low-dimensional, tractable, end-to-end MI estimation__$$\text{Objective:}\ \max\nolimits_{g_1\in \mathcal{G}_1, g_2\in\mathcal{G}_2} I_{\mathrm{EST}}(g_1(X_1); g_2(X_2))$$
- $I_{\mathrm{EST}}$ is a sample-based estimator of the MI
- $X_1, X_2$ are two __views__ of the data, which can be selected flexibly as per the approach
- $\mathcal{G}_1, \mathcal{G}_2$ are the __encoder families__, which can be selected for appropriate structure prior
    - By Theorem 3, we are indeed optimizing a lower bound on $I(X; g_1(X_1), g_2(X_2))$
- This optimization is in latent space, which is usually lower-dimensional
- __Gives a unifying framework for a large set of recent SSL approaches__
- In __contrastive approaches__, we optimize some lower bound of the mutual information
- InfoMax is hiding in supervised tasks:
- Consider supervised learning with inputs $X$ and targets $Y$ sampled from $p_{X, Y}$
- Suppose we do not model the distribution over the targets $Y$ (e.g., assume fixed)
    - $I(X; Y) = H(Y) + \mathbb{E}_{p_{X, Y}}\left[\log p_{Y\mid X}(\mathbf{y}\mid \mathbf{x})\right] \geq \mathrm{const.} + \mathbb{E}_{p_{X, Y}}\left[\log p_\theta(\mathbf{y}\mid \mathbf{x})\right]$ (see the derivation below)
- Thus, __SSL with pretext tasks is also covered under InfoMax formulation__!
- The idea of such a unifying framework for NLP tasks is explored in [(Kong et al., 2020)](https://arxiv.org/pdf/1910.08350.pdf)
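Spelling out the inequality in the supervised-learning bullet above (the standard variational argument; the only added step is the nonnegativity of a KL term):
$$
\begin{aligned}
I(X; Y) &= H(Y) - H(Y\mid X) = H(Y) + \mathbb{E}_{p_{X, Y}}\left[\log p_{Y\mid X}(\mathbf{y}\mid \mathbf{x})\right]\\
        &= H(Y) + \mathbb{E}_{p_{X, Y}}\left[\log p_\theta(\mathbf{y}\mid \mathbf{x})\right] + \mathbb{E}_{p_X}\left[D_{KL}\left(p_{Y\mid X}(\cdot\mid\mathbf{x})\,\|\, p_\theta(\cdot\mid\mathbf{x})\right)\right]\\
        &\geq \mathrm{const.} + \mathbb{E}_{p_{X, Y}}\left[\log p_\theta(\mathbf{y}\mid \mathbf{x})\right],
\end{aligned}
$$
so maximizing the conditional log-likelihood of any model $p_\theta$ maximizes a lower bound on $I(X; Y)$ once $H(Y)$ is treated as a constant.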
<!---
- Pretext tasks involve creating _components_/_views_ $X_1, X_2$ of the data and fitting a model to predict $X_2$ from $X_1$ in a supervised manner! Thus, we can say that$$\text{Objective of SSL:}\ \max\nolimits_{p_\theta(\mathbf{x}_2\mid\mathbf{x}_1)\in \mathcal{P}} I(X_1; X_2)$$
--->
#### Since all this is good news, what could possibly go wrong!?
---
# Formal Limitations on the Measurement of Mutual Information
**David McAllester** and **Karl Stratos**
[AISTATS 2020](http://proceedings.mlr.press/v108/mcallester20a/mcallester20a.pdf)
<!---
- The paper we are presenting demonstrates an __empirical__ account of the problems with MI
- However, it is important to have a __theoretical__ account of the same!
--->
- This paper demonstrates that black-box high-confidence estimation of entropy, KL-divergence, and consequently mutual information, requires __exponentially many samples__
## KL-Divergence Estimation has Inherent Limitations!
#### Theorem
For a multi-set $S$, a distribution $p_X$, and confidence parameter $\delta$, let $B(p_X, S, \delta)$ be a distribution-free high-confidence lower bound to $D_{KL}(p_X|| q_X)$ which is computed with full knowledge of $p_X$ and samples from $q_X$$$\color{blue}{B(p_X, S, \delta) \leq D_{KL}(p_X|| q_X)}\ \text{with probability at least $1 - \delta$ over sample $S\sim q_X^N$}$$ Then, for any such bound, for $N \geq 2$, for any $q_X$ we have:$$\color{red}{B(p_X, S, \delta) \leq \ln N}\ \text{with probability at least $1 - 4\cdot \delta$ over sample $S\sim q_X^N$}$$
## Entropy Estimation has Inherent Limitations!
#### Theorem
Let $p_X$ be the data distribution and $\delta$ a confidence parameter. For the type $T$ of a sample $S\sim p_X^N$, let $B(T, \delta)$ be a distribution-free high-confidence lower bound to $H_{p_X}(X)$: $$\color{blue}{B(T, \delta)\leq H_{p_X}(X)}\ \text{with probability at least $1 - \delta$ over sample $S\sim p_X^N$}$$Then, for any such bound, for $N \geq 50, k\geq 2$, and for any $p_X$, we have:$$\color{red}{B(T, \delta) \leq \ln\left(2 k N^2\right)}\ \text{with probability at least $1 - \delta - \frac{1.01}{k}$ over sample $S\sim p_X^N$}$$
- For discrete/discretized variables $X, Y$, we have: $I(X; Y) = H(X) - H(X|Y) \leq H(X)$
#### Corollary
Let $B$ map an $N$-sized i.i.d. sample $(\mathbf{x}_i, \mathbf{y}_i)_{i = 1}^N$ from $p_{X, Y}$ to a real number such that:$$\color{blue}{B\left((\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_N, \mathbf{y}_N)\right) \leq I_{p_{X, Y}}(X; Y)}\ \text{with probability $0.99$ or more}$$Then, for any distribution $q_{X, Y}$, given $N\geq 50$ i.i.d. samples $(\mathbf{x}_i^\prime, \mathbf{y}_i^\prime)_{i = 1}^N$ from $q_{X, Y}$, we have:$$\color{red}{B\left((\mathbf{x}_1^\prime, \mathbf{y}_1^\prime), \ldots, (\mathbf{x}_N^\prime, \mathbf{y}_N^\prime)\right) \leq 2\cdot\ln N + 5}\ \text{with probability at least 0.96}$$
## Lower Bounds May Have Bias/Variance Issues!
- Follow-up work on the limitations of variational estimators of MI [(Song & Ermon, 2019)](https://arxiv.org/pdf/1910.06222.pdf)
### The Case of NWJ Estimator: $\color{red}{\text{Variance Exponential in MI}}$
- Estimator: $D_{KL}\left( p||q \right)\geq \mathbb{E}_p\left[f(\mathbf{x}) \right] - e^{-1}\cdot\mathbb{E}_q\left[e^{f(\mathbf{x})}\right]$; Optimal critic: $f^\ast(\mathbf{x}) = 1 + \log\frac{p(\mathbf{x})}{q(\mathbf{x})}$
- The Monte-Carlo version of the bound with i.i.d. samples $\mathbf{x}_{1:N}^p\sim p$ and $\mathbf{x}_{1:M}^q \sim q$ and the optimal critic $f^\ast$ has high variance:$$\text{var}\left(\frac{1}{M}\sum\nolimits_{j = 1}^M\frac{p(\mathbf{x}_j^q)}{q(\mathbf{x}_j^q)} \right) = \frac{1}{M}\text{var}_q\left(\frac{p(\mathbf{x})}{q(\mathbf{x})}\right) \geq \frac{\color{red}{e^{D_{KL}(p||q)}} - 1}{M}$$
- In particular, for $p = p_{X, Y}$ and $q = p_X\cdot p_Y$, $M$ must be exponential in $I(X; Y)$ to control the variance!
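A tiny simulation (my own, with made-up Gaussians) illustrating the blow-up: for $p = \mathcal{N}(\mu, 1)$ and $q = \mathcal{N}(0, 1)$ we have $D_{KL}(p||q) = \mu^2/2$, and the empirical variance of the importance weights $p(\mathbf{x})/q(\mathbf{x})$ under $q$ quickly dwarfs the $e^{D_{KL}(p||q)} - 1$ lower bound:
```python
import numpy as np

rng = np.random.default_rng(0)
M = 100_000

for mu in [1.0, 2.0, 3.0]:
    kl = mu**2 / 2                       # D_KL(N(mu,1) || N(0,1))
    x = rng.normal(0.0, 1.0, size=M)     # samples from q
    w = np.exp(mu * x - mu**2 / 2)       # importance weights p(x)/q(x)
    print(f"KL={kl:4.1f}  var(w)={w.var():12.1f}  e^KL - 1={np.exp(kl) - 1:8.1f}")
```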
### The Case of DV Estimator: $\color{red}{\text{Bias Exponential in MI}}$
- Estimator $D_{KL}\left( p||q \right)\geq \mathbb{E}_{p}\left[f(\mathbf{x})\right] - \log \mathbb{E}_{q}\left[e^{f(\mathbf{x})}\right]$; Optimal critic $f^\ast(\mathbf{x}) = \log\frac{p(\mathbf{x})}{q(\mathbf{x})}$
- The Monte-Carlo version of the bound is __biased__, with the bias scaling as $\mathcal{O}\left(e^{D_{KL}(p||q)}\right)$
- In particular, the Monte-Carlo version of the DV bound has a bias exponential in the MI!
---

> _"While we feel that information theory is indeed a valueable tool in providing fundamental insights [...], it is certainly no panacea [...]"_
>
>\- Claude E. Shannon, IRE Transactions on Information Theory, 1956
---
# Biases in Approximate Information Maximization
## Empirical method
#### Questions and hypotheses
1. InfoMax $\stackrel{?}{\implies}$ "good" representations
2. Does max $I_\text{est}$ preserve invertibility?
3. Larger critics for $I_\text{est}$ $\stackrel{?}{\implies}$ better performance
4. If $I_\text{est}$ is large and constant, then what matters?
* architecture vs estimator
#### Experimental Setup
| Components | Default Setting |
| -------------------------------------- |:----------------------------------------------:|
| Dataset | **MNIST** |
| Positives | top-half, $X_1$ and bottom-half, $X_2$ |
| Encoder ($\mathbf{g}_1, \mathbf{g}_2$) | MLP |
| Critic ($f$) | **Bilinear** $f(u, v) \triangleq u^\top W v$ |
| Downstream Task | **Linear** Classification using $\mathbf{g}_1$ |
| Optimizer | Adam for pre-training, SAGA for classification |
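A rough sketch of this default setup (my own reading of the table; the MLP sizes, latent dimension, and data loading are assumptions, and the training loop with $I_\text{est}$ is omitted):
```python
import torch
import torch.nn as nn

class BilinearCritic(nn.Module):
    """f(u, v) = u^T W v, evaluated for all pairs in a batch."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(dim, dim))

    def forward(self, u, v):
        return u @ self.W @ v.t()   # [B, B] score matrix, scores[i, j] = u_i^T W v_j

def split_views(images):
    """Top half -> X1, bottom half -> X2, for MNIST batches of shape [B, 1, 28, 28]."""
    x1 = images[:, :, :14, :].flatten(1)
    x2 = images[:, :, 14:, :].flatten(1)
    return x1, x2

latent_dim = 64   # assumed
g1 = nn.Sequential(nn.Linear(14 * 28, 256), nn.ReLU(), nn.Linear(256, latent_dim))
g2 = nn.Sequential(nn.Linear(14 * 28, 256), nn.ReLU(), nn.Linear(256, latent_dim))
critic = BilinearCritic(latent_dim)
# Pre-training maximizes I_est(g1(X1); g2(X2)) via the critic's score matrix;
# the downstream task is linear classification on g1(X1).
```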
### A1: InfoMax $\not\Rightarrow$ "good" representations
| Component | Experimental Setting |
| --------- |:--------------------------------------------------------------------------------------------------------------:|
| Encoder | [RealNVP](https://openreview.net/forum?id=HkpbnH9lx) (Dinh et al., ICLR 2017) (invertible $\Rightarrow$ maximal & constant MI) |
| Critic | No restrictions |

Observe: **$I_\text{est}$ increases, accuracy improves**
Hint: there exist encoders with maximal MI but worse downstream accuracy, e.g., with the following construction:
| Component | Experimental Setting |
| --------- |:-------------------------------:|
| Encoder | Minimizes KL(uniform \| model) |
| Critic | Minimizes NLL wrt Ground Labels |

---
**Think about Normalizing Flows**\*: invertible but (possibly) complex geometries
Using a _[Neural Autoregressive Flow](http://proceedings.mlr.press/v80/huang18d.html)_ (Huang et al., ICML 2018)
| Source distribution | Target distribution |
| ------------------- | ------------------- |
|  |  |
> _"The information content alone is not sufficient to guarantee a **useful geometry** in the representation space."_
> \- from section 3.1
\* Logo of the [_ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models_](https://invertibleworkshop.github.io/)
---
### A2: Yes, but with a tendency toward less smooth maps
| Component | Experimental Setting |
| --------- |:----------------------------------------------------------:|
| Encoder | MLP with skip-connections + init to id |
Accuracy increases

**But** the Jacobian has a larger condition number
<img src=https://i.imgur.com/cVYoUbP.png width="400">
Observe:
<!-- - Mathematically invertible: non-singular Jacobians $\implies$ locally invertible (inverse function theorem), -->
- the learned map becomes numerically hard to invert
However:
- May still be invertible **on** the data manifold (if the Jacobian is non-singular everywhere on the support)
- Local invariances of a lower-dimensional support $\implies$ a Jacobian with eigenvalues close to 0
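One way to probe this numerically (my own sketch, not the paper's code): compute the encoder's Jacobian at a data point with `torch.autograd.functional.jacobian` and take the ratio of its extreme singular values; the toy encoder and dimensions below are assumptions:
```python
import torch
from torch.autograd.functional import jacobian

def jacobian_condition_number(encoder, x):
    """Condition number of the encoder's Jacobian at a single 1-D input x."""
    J = jacobian(encoder, x)            # shape [out_dim, in_dim]
    s = torch.linalg.svdvals(J)         # singular values in descending order
    return (s[0] / s[-1]).item()        # large ratio => numerically hard to invert

# Toy square encoder (an identity-like init would keep this close to 1 at the start)
enc = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.Tanh(), torch.nn.Linear(32, 32))
print(jacobian_condition_number(enc, torch.randn(32)))
```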
---
### A3: Larger critics $\not\Rightarrow$ "better" representations
> _"A higher capacity critic should allow for a tighter lower-bound on MI"_
| Component | Experimental Setting |
| --------- |:----------------------------------------------------------------------------------------------------------:|
| Dataset | {MNIST, CIFAR-10} |
| Critic | {**bilinear**: $u^\top W v$, **separable**: $\text{MLP}_1(u)^\top \text{MLP}_2(v)$, $\text{MLP}([u, v])$} |
**MNIST**

**CIFAR-10**
<img src=https://i.imgur.com/XavmtFt.png width="400">
Observe:
$\text{MLP}([u, v])$ attains a larger $I_\text{est}$, but not a good feature geometry
**vs**
**bilinear** and **separable** critics utilize a dot product (feature alignment)
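For reference, hedged sketches of the three critic families compared above (hidden sizes and dimensions are assumptions); all three return a $[B, B]$ score matrix:
```python
import torch
import torch.nn as nn

dim, hidden = 64, 128   # assumed sizes

# bilinear: f(u, v) = u^T W v
W = nn.Parameter(0.01 * torch.randn(dim, dim))
def bilinear(u, v):
    return u @ W @ v.t()

# separable: f(u, v) = MLP1(u)^T MLP2(v)  -- still a dot product in an embedding space
mlp1 = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
mlp2 = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
def separable(u, v):
    return mlp1(u) @ mlp2(v).t()

# concatenation: f(u, v) = MLP([u, v])  -- no dot-product structure
mlp_cat = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
def concat_critic(u, v):
    B = u.shape[0]
    pairs = torch.cat([u.unsqueeze(1).expand(B, B, dim),
                       v.unsqueeze(0).expand(B, B, dim)], dim=-1)
    return mlp_cat(pairs).squeeze(-1)
```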
### A4: Architecture matters more than estimator
1. Maximize $I_\text{est}$ until some bound $t$
2. Ablate over architectures and $t$

Observe:
* $I_\text{est}$ bound reached after 7k iters

* CNN better than MLP in SSL
* (Still, higher $t$ $\implies$ better performance) $\implies$ the objectives do capture something useful (perhaps not MI-related)
---
## Concluding Remarks
* Connection to Metric Learning
* Triplet Loss
* (semi-)hard positive mining / curriculum learning??
* MI matters, but it is not the only thing that matters
* "Holistic view"
> _"[...]explicitly take into account the function families one uses[...]"_
* Alternative measures of information?
> _"[...]explicitly account for the modelling power and computational constraints of the observer[...]"_
[A Theory of Usable Information under Computational Constraints](https://openreview.net/forum?id=r1eBeyHFDH)
> Optimal Transport?
> [Wasserstein Dependency Measure for Representation Learning](https://papers.nips.cc/paper/9692-wasserstein-dependency-measure-for-representation-learning)
* Beyond linear evaluation protocol
* _"[...]break away from the goal of estimating MI[...]"_
---
## Thoughts & Discussion
### Don't We Already Know that CNNs Perform Better than MLPs?
- Is this paper saying something along the lines of "CNNs are better than MLPs"?
### The Encoder-Critic Boundary
- Representations that are best for downstream tasks are usually observed to be at the “middle” of the architecture used for SSL task
- The representations of the shallower layers are too simple
- The representations of the final layers are pretext-task specific
<!--- Consider a NN with modules A, B, C, D which outputs features $f(X)$ from data $X$ and is learned with some SSL objective-->
- <img src=https://i.imgur.com/hKMzfVh.png width="800">
- The encoder $g$ and the critic $f$ get trained together: $F = f\circ g$
- Different choices of $f, g$ result in the same $F$ but different downstream features
- <img src=https://i.imgur.com/BZErSNa.png width="800">
- $P_1, P_2$, and $P_3$ all learn the same modules A, B, C, D but the features $g_2(X)$, and not $g_1(X)$ or $g_3(X)$, are the best for downstream task.
- Joint training of $f, g$ blurs the boundary between the encoder and the critic
- $P_1:$ Powerful critic $+$ less powerful encoder $\implies$ the encoder expands towards the critic
- $P_3$: Powerful encoder $+$ less powerful critic $\implies$ the critic expands towards the encoder
- In practice, when we select $g, f$, how do we know whether we end up in the $P_1$, $P_2$, or $P_3$ regime?
---
## Design Considerations
* InfoMax is necessary, but not sufficient
* "good" representation $\implies$ InfoMax
* InfoMax explicitly avoids mode collapse
* InfoMax + <key factor\> $\implies$ "good" representations ?
* Better variational families??
* Constraints on the latent space (like $\mathcal{S}^{m-1}$ / $L_2$ normalization)
* Add regularization???
* Find an objective that implies both InfoMax and the key factor
* Make better positive joint distributions: $p_\text{pos}(x, x^+)$
* This also determines the negative distribution (joint vs product of marginals)
* It defines the transformations the representation should be invariant to
<!--- (3. $I(g_1(X_1); g_2(X_2))$: Is this a very poor lower bound on $I(X_1; X_2)$? Maybe optimizing a better lower bound would be helpful) --->
---
# Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere
**Tongzhou Wang** and **Phillip Isola**
[ICML 2020](https://arxiv.org/abs/2005.10242)
## $L_2$ normalization trending in representation learning
### Geometry on feature space: **hypersphere**
1. improved **learning stability**
Without normalization, the softmax can be made arbitrarily sharp by scaling the feature norms
2. sufficiently well-clustered $\implies$ **linear separability**

## Desired properties on the hypersphere
### Uniformity $\stackrel{?}{\iff}$ InfoMax

Remember?

### Alignment $\stackrel{?}{\iff}$ Invariance to "noise"

## Relation to contrastive framework
### Intuition (section 4, 4.1)
**Positives** for alignment
**Negatives** for uniformity
$$
\begin{align}
&\mathcal{L}_\text{contrastive}(f; \tau, M) \triangleq \\
&\mathop{\mathbb{E}}_{\substack{(x,y) \sim p_\text{pos} \\ \{x_i^-\}_{i=1}^M \sim p_\text{data}^{\otimes M}}} \left[ - \log \frac{e^{f(x)^\top f(y) / \tau}}{e^{f(x)^\top f(y) / \tau} + \sum_i e^{f(x_i^-)^\top f(y) / \tau}} \right]
\end{align}
$$
### Formally (section 4.2)
$$
\begin{align}
\lim_{M \to \infty} &\mathcal{L}_\text{contrastive}(f; \tau, M) - \log M = &&\\
&- \frac{1}{\tau} \mathop{\mathbb{E}}_{(x,y) \sim p_\text{pos}} f(x)^\top f(y) &
\mathcal{L}_\text{align}(f;\alpha) \triangleq& \quad \mathop{\mathbb{E}}_{(x,y) \sim p_\text{pos}} \left\|f(x) - f(y) \right\|_2^\alpha \\
&+ \mathop{\mathbb{E}}_{x \sim p_\text{data}} \log \mathop{\mathbb{E}}_{x^- \sim p_\text{data}} e^{\frac{f(x)^\top f(x^-)}{\tau}} &
\mathcal{L}_\text{uniform}(f; t) \triangleq & \quad\log \mathop{\mathbb{E}}_{(x,y) \sim p_\text{data}^{\otimes 2}} e^{-t \, \|f(x) - f(y)\|_2^2}
\end{align}
$$
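A minimal PyTorch sketch of $\mathcal{L}_\text{align}$ and $\mathcal{L}_\text{uniform}$ as defined above (my rendering of the formulas), assuming $L_2$-normalized feature batches of shape $[B, D]$:
```python
import torch

def align_loss(x, y, alpha=2):
    # E_{(x,y)~p_pos} ||f(x) - f(y)||_2^alpha over a batch of positive pairs
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniform_loss(x, t=2):
    # log E_{(x,y)~p_data} exp(-t ||f(x) - f(y)||_2^2), via all pairwise distances in the batch
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```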
#### Notes on InfoMax
$I(f(x);f(y)) = H(f(x)) - H(f(x)|f(y))$
* max uniformity on hypersphere $\iff$ max $H(f(x))$
* max alignment $\implies$ min $H(f(x) | f(y))$
## Experiments
### CIFAR-10 projected on $\mathcal{S}^1$
| Random Init | Supervised Classification | SSL Contrastive |
|:------------------------------------:|:------------------------------------:|:------------------------------------:|
|  |  |  |
| Val Acc: $12.71\%$ | Val Acc: $57.19\%$ | Val Acc: $28.60\%$ |
### Contrastive vs Direct Optimization
**STL-10** (CNN+BN)
<img src=https://i.imgur.com/CSrISKV.png width="400">
**BookCorpus** (bi-GRU)
<img src=https://i.imgur.com/oeFRaiY.png width="400">
**ImageNet-100** (MoCo-based ResNet50)
<img src=https://i.imgur.com/HdPe33E.png width="400">
### Ablation
<img src=https://i.imgur.com/hnkIG4G.png width="480">