# On Mutual Information Maximization for Representation Learning
**Michael Tschannen, Josip Djolonga, Paul K. Rubenstein, Sylvain Gelly, Mario Lucic**
[ICLR 2020](https://openreview.net/forum?id=rkxoh24FPH)
---
## Information Theory Basics
### Entropy:
- Entropy $H(X)$ of r.v. X: $H(X) = \mathbb{E}_{p_X}\left[\log\frac{1}{p_X(\mathbf{x})}\right]$
<!---
- The amount of uncertainty in the values taken by $X$
- Defined as $H(X) = \mathbb{E}_{p_X}\left[\log\frac{1}{p_X(\mathbf{x})}\right]$
--->
- Conditional Entropy $H(X\mid Y)$ of r.v. $X$ given r.v. $Y$: $H(X\mid Y) = \mathbb{E}_{p_{X,Y}}\left[\log\frac{1}{p_{X|Y}(\mathbf{x}\mid\mathbf{y})}\right]$
<!---
- The amount of residual uncertainty in $X$ if $Y$ is known
- Defined as $H(X\mid Y) = \mathbb{E}_{p_{X,Y}}\left[\log\frac{1}{p_{X|Y}(\mathbf{x}\mid\mathbf{y})}\right]$
--->
### Kullback-Leibler (KL) Divergence:
- KL Divergence $D_{KL}(p||q)$ between $p, q$ on the same support: $D_{KL}\left(p || q\right)=\mathbb{E}_{p}\left[\log \frac{p(\mathbf{x})}{q(\mathbf{x})}\right]$
- $D_{KL}(p||q)\geq 0\ (\forall\, p, q)$
- However, not a true distance function as it is not symmetric: $D_{KL}(p||q) \ne D_{KL}(q|| p)$
<!---
- Measures how much a distribution $q$ differs from another distribution $p$
- Defined as $D_{KL}\left(p || q\right)=\mathbb{E}_{p}\left[\log \frac{p(\mathbf{x})}{q(\mathbf{x})}\right]$
- $D_{KL}(p||q)\geq 0\ (\forall\, p, q)$; i.e., always non-negative for any distributions $p, q$
- However, not a true distance function as it is not symmetric: $D_{KL}(p||q) \ne D_{KL}(q|| p)$
--->
### Mutual Information:
- Mutual information $I(X; Y)$ between r.v. $X$ and $Y$: $I(X; Y) = \mathbb{E}_{p_{X, Y}}\left[\log\frac{p_{X, Y}(\mathbf{x}, \mathbf{y})}{p_X(\mathbf{x})\cdot p_Y(\mathbf{y})}\right]$
- $I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = I(Y; X)$
- <img src=https://i.imgur.com/a25RE0I.png width="800">
- $I(X; Y) = D_{KL}\left(p_{X, Y}|| p_X\cdot p_Y\right)$ and thus, $I(X; Y)\geq 0$
<!---
- Quantifies the information contained in $X$ about $Y$ (or vice versa)
- Reduction in the uncertainty of $X$ given $Y$ (or vice versa)
- Equals the KL-divergence between the joint distribution $p_{X, Y}$ and the product of marginals $p_X\cdot p_Y$ and consequently, always non-negative
- $I(X; Y) = \mathbb{E}_{p_{X, Y}}\left[\log\frac{p(\mathbf{x}, \mathbf{y})}{p(\mathbf{x})\cdot p(\mathbf{y})}\right] = H(X) - H(X|Y) = H(Y) - H(Y|X) = I(Y; X)$
- <img src=https://i.imgur.com/a25RE0I.png width="800">
- We also have $I(X; Y) = D_{KL}\left(p_{X, Y}|| p_X\cdot p_Y\right)$ and thus, $I(X; Y)\geq 0$
--->
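As a quick sanity check on these definitions, here is a small worked example (not from the papers; the toy joint distribution is made up) computing $H(X)$, $H(X\mid Y)$, and $I(X; Y)$ for a $2\times 2$ discrete joint in Python:
```python
import numpy as np

# A made-up 2x2 joint distribution p(x, y); rows index x, columns index y.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)   # marginal p(x)
p_y = p_xy.sum(axis=0)   # marginal p(y)

H_x  = -np.sum(p_x * np.log(p_x))                 # H(X)
H_xy = -np.sum(p_xy * np.log(p_xy))               # joint entropy H(X, Y)
H_x_given_y = H_xy + np.sum(p_y * np.log(p_y))    # H(X|Y) = H(X, Y) - H(Y)

I_xy = H_x - H_x_given_y                          # I(X; Y) = H(X) - H(X|Y)
# Equivalently, I(X; Y) = D_KL(p_XY || p_X * p_Y):
kl = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))
print(I_xy, kl)   # both ~0.19 nats
```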
### Properties of MI:
#### Theorem 1:
For smooth invertible functions $f_1, f_2$ of random variables $X, Y$ respectively, we have
$$I(X; Y) = I(f_1(X); f_2(Y)),$$
<!---
i.e., MI is preserved under smooth invertible transformations.
--->
#### Theorem 2 [Data Processing Inequality]:
For random variables $X, Y, Z$ satisfying the Markov chain $X \rightarrow Y \rightarrow Z$, $$I(X; Y)\geq I(X; Z)\ \text{and}\ I(Y; Z)\geq I(X; Z)$$ In particular, for any function $f$ and random variables $X, Y$, $$I(X; Y)\geq I(X; f(Y))$$ Thus, processing random variables through functions cannot increase their MI.
#### Theorem 3:
For a random variable $X$ with views $X_1, X_2$ (functions of $X$) and functions $g_1, g_2$, we have: $$I(X; g_1(X_1), g_2(X_2)) \geq I(g_1(X_1); g_2(X_2))$$ Thus, the MI between the two features is a lower bound on the MI between the data $X$ and the combined features $(g_1(X_1), g_2(X_2))$.
## Estimation of MI
### Nguyen-Wainwright-Jordan (NWJ) Bound [(Nguyen et al., 2008)](https://arxiv.org/pdf/0809.0853.pdf)
<!---
- A black-box bound for KL-divergence, which can be used for $I(X; Y)$ estimation
--->
#### Theorem
For any real-valued function $f: \mathcal{X}\rightarrow \mathbb{R}$, and distributions $p, q$ on $\mathcal{X}$, $$ D_{KL}\left( p||q \right)\geq \mathbb{E}_p\left[ f(\mathbf{x}) \right] - e^{-1}\cdot\mathbb{E}_q\left[e^{f(\mathbf{x})}\right]$$In particular, for data $X\in\mathcal{X}$ and its representations $Z\in\mathcal{Z}$, and any function $f:\mathcal{X}\times\mathcal{Z}\rightarrow \mathbb{R}$, $$I(X; Z) \geq\mathbb{E}_{p_{X, Z}}\left[ f(\mathbf{x}, \mathbf{z}) \right] - e^{-1}\cdot\mathbb{E}_{p_X\cdot p_Z}\left[e^{f(\mathbf{x}, \mathbf{z})}\right] $$
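A minimal Monte-Carlo sketch of this bound on $I(X; Z)$ (my own illustration, not code from the papers), assuming a batched critic `f` and paired samples; samples from $p_X\cdot p_Z$ are approximated by shuffling $\mathbf{z}$ within the batch:
```python
import math
import torch

def nwj_lower_bound(f, x, z):
    """Monte-Carlo NWJ estimate of I(X; Z) from a paired batch (x, z).

    f: critic mapping two [B, ...] batches to a [B] tensor of scores f(x_i, z_i).
    """
    joint_term = f(x, z).mean()                         # E_{p_{X,Z}}[f(x, z)]
    z_shuffled = z[torch.randperm(z.shape[0])]          # approximate samples from p_X * p_Z
    marginal_term = torch.exp(f(x, z_shuffled)).mean()  # E_{p_X p_Z}[e^{f(x, z)}]
    return joint_term - marginal_term / math.e          # subtract the e^{-1} * E[...] term
```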
### Donsker-Varadhan (DV) Bound [(Donsker and Varadhan, 1983)](https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.3160280102)
- A black-box bound for KL-divergence that is __tighter than the NWJ bound__
- Used in Mutual Information Neural Estimation (MINE) [(Belghazi et al., 2018)](https://arxiv.org/pdf/1801.04062.pdf) as well
#### Theorem
For any real-valued function $f: \mathcal{X}\rightarrow \mathbb{R}$, and distributions $p, q$ on $\mathcal{X}$, $$ D_{KL}\left( p||q \right)\geq \mathbb{E}_{p}\left[f(\mathbf{x})\right] - \log \mathbb{E}_{q}\left[e^{f(\mathbf{x})}\right]$$In particular, for data $X\in\mathcal{X}$ and its representations $Z\in\mathcal{Z}$, and any function $f:\mathcal{X}\times\mathcal{Z}\rightarrow \mathbb{R}$, $$I(X; Z) \geq\mathbb{E}_{p_{X, Z}}\left[ f(\mathbf{x}, \mathbf{z}) \right] - \log\mathbb{E}_{p_X\cdot p_Z}\left[e^{f(\mathbf{x}, \mathbf{z})}\right]$$
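The same sketch adapted to the DV bound (again my own illustration, not the papers' code); the second term is now a log of an expectation, computed with `logsumexp` for numerical stability:
```python
import math
import torch

def dv_lower_bound(f, x, z):
    """Monte-Carlo Donsker-Varadhan estimate of I(X; Z) from a paired batch (x, z)."""
    joint_term = f(x, z).mean()                    # E_{p_{X,Z}}[f(x, z)]
    z_shuffled = z[torch.randperm(z.shape[0])]     # approximate samples from p_X * p_Z
    scores = f(x, z_shuffled)
    # log E_{p_X p_Z}[e^f] estimated as log( (1/B) * sum_i e^{scores_i} )
    log_mean_exp = torch.logsumexp(scores, dim=0) - math.log(scores.shape[0])
    return joint_term - log_mean_exp
```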
### InfoNCE Bound [(van den Oord et al., 2018)](https://arxiv.org/pdf/1807.03748.pdf)
<!---
- A consequence of the NWJ bound with multiple (i.i.d.) samples (Proof in Appendix-D)
- Recently, this objective (or slight variants of it) have been extensively used
--->
#### Theorem
For a real-valued function $f:\mathcal{X}\times\mathcal{Z}\rightarrow \mathbb{R}$ and i.i.d. samples $(X_i, Z_i)_{i = 1}^K$ from the joint distribution $p_{X, Z}$, we have: $$I(X; Z) \geq I_{\mathrm{NCE}}(X; Z) = \mathbb{E}\left[\frac{1}{K}\sum\nolimits_{i = 1}^K \log\frac{e^{f(\mathbf{x}_i, \mathbf{z}_i)}}{\frac{1}{K}\sum\nolimits_{j = 1}^K e^{f(\mathbf{x}_i, \mathbf{z}_j)}}\right]$$Here, the expectation is over the distribution of the i.i.d. samples $(X_i, Z_i)_{i = 1}^K$.
* The variational function $f$ involved in these bounds is called the __Critic__
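A compact sketch of $I_{\mathrm{NCE}}$ (my own, not the papers' code), assuming the critic has already been evaluated on all $K\times K$ pairs of a batch so that `scores[i, j]` $= f(\mathbf{x}_i, \mathbf{z}_j)$; note the estimate can never exceed $\log K$:
```python
import math
import torch
import torch.nn.functional as F

def infonce_lower_bound(scores):
    """I_NCE from a [K, K] score matrix with scores[i, j] = f(x_i, z_j).

    The diagonal holds the positive (paired) scores; each row is normalized
    over all K entries, so the bound is capped at log K.
    """
    K = scores.shape[0]
    labels = torch.arange(K)
    # cross_entropy gives the mean of -log softmax(scores[i])[i] over rows i
    return math.log(K) - F.cross_entropy(scores, labels)
```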
## ... But Why Consider MI?
- Recent advances in representation learning techniques that utilize MI maximization
- Deep InfoMax, CPC, MINE, ...
- __The InfoMax Formulation__ [(Linsker, 1988)](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.183.1750&rep=rep1&type=pdf)$$\text{Objective:}\ \max\nolimits_{g\in \mathcal{G}} I(X; g(X)) \\ \text{Select the representations $g(X)$ of the data $X$ that maximize their MI $I(X; g(X))$}$$
- Slight modification to enable __low-dimensional, tractable, end-to-end MI estimation__$$\text{Objective:}\ \max\nolimits_{g_1\in \mathcal{G}_1, g_2\in\mathcal{G}_2} I_{\mathrm{EST}}(g_1(X_1); g_2(X_2))$$
- $I_{\mathrm{EST}}$ is a sample-based estimator of the MI
- $X_1, X_2$ are two __views__ of the data, which can be selected flexibly as per the approach
- $\mathcal{G}_1, \mathcal{G}_2$ are the __encoder families__, which can be selected for appropriate structure prior
    - By Theorem 3, we are indeed optimizing a lower bound on $I(X; g_1(X_1), g_2(X_2))$
- This optimization is in latent space, which is usually lower-dimensional
- __Gives a unifying framework for a large set of recent SSL approaches__
- In __contrastive approaches__, we optimize some lower bound of the mutual information
- InfoMax is hiding in supervised tasks:
- Consider supervised learning with inputs $X$ and targets $Y$ sampled from $p_{X, Y}$
- Suppose we do not model the distribution over the targets $Y$ (e.g., assume fixed)
    - $I(X; Y) = H(Y) + \mathbb{E}_{p_{X, Y}}\left[\log p_{Y\mid X}(\mathbf{y}\mid \mathbf{x})\right] \geq \mathrm{const.} + \mathbb{E}_{p_{X, Y}}\left[\log p_\theta(\mathbf{y}\mid \mathbf{x})\right]$ (see the derivation below)
- Thus, __SSL with pretext tasks is also covered under InfoMax formulation__!
- The idea of such a unifying framework for NLP tasks is explored in [(Kong et al., 2020)](https://arxiv.org/pdf/1910.08350.pdf)
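Spelling out the inequality in the supervised-learning bullet above (the standard variational argument; the only added step is the nonnegativity of a KL term):
$$
\begin{aligned}
I(X; Y) &= H(Y) - H(Y\mid X) = H(Y) + \mathbb{E}_{p_{X, Y}}\left[\log p_{Y\mid X}(\mathbf{y}\mid \mathbf{x})\right]\\
        &= H(Y) + \mathbb{E}_{p_{X, Y}}\left[\log p_\theta(\mathbf{y}\mid \mathbf{x})\right] + \mathbb{E}_{p_X}\left[D_{KL}\left(p_{Y\mid X}(\cdot\mid\mathbf{x})\,\|\, p_\theta(\cdot\mid\mathbf{x})\right)\right]\\
        &\geq \mathrm{const.} + \mathbb{E}_{p_{X, Y}}\left[\log p_\theta(\mathbf{y}\mid \mathbf{x})\right],
\end{aligned}
$$
so maximizing the conditional log-likelihood of any model $p_\theta$ maximizes a lower bound on $I(X; Y)$ once $H(Y)$ is treated as a constant.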
<!---
- Pretext tasks involve creating _components_/_views_ $X_1, X_2$ of the data and fitting a model to predict $X_2$ from $X_1$ in a supervised manner! Thus, we can say that$$\text{Objective of SSL:}\ \max\nolimits_{p_\theta(\mathbf{x}_2\mid\mathbf{x}_1)\in \mathcal{P}} I(X_1; X_2)$$
--->
#### Since all this is good news, what could possibly go wrong!?
---
# Formal Limitations on the Measurement of Mutual Information
**David McAllester** and **Karl Stratos**
[AISTATS 2020](http://proceedings.mlr.press/v108/mcallester20a/mcallester20a.pdf)
<!---
- The paper we are presenting demonstrates an __empirical__ account of the problems with MI
- However, it is important to have a __theoretical__ account of the same!
--->
- This paper demonstrates that black-box high-confidence estimation of entropy, KL-divergence, and consequently mutual information, requires __exponentially many samples__
## KL-Divergence Estimation has Inherent Limitations!
#### Theorem
For a multi-set $S$, a distribution $p_X$, and confidence parameter $\delta$, let $B(p_X, S, \delta)$ be a distribution-free high-confidence lower bound to $D_{KL}(p_X|| q_X)$ which is computed with full knowledge of $p_X$ and samples from $q_X$$$\color{blue}{B(p_X, S, \delta) \leq D_{KL}(p_X|| q_X)}\ \text{with probability at least $1 - \delta$ over sample $S\sim q_X^N$}$$ Then, for any such bound, for $N \geq 2$, for any $q_X$ we have:$$\color{red}{B(p_X, S, \delta) \leq \ln N}\ \text{with probability at least $1 - 4\cdot \delta$ over sample $S\sim q_X^N$}$$
## Entropy Estimation has Inherent Limitations!
#### Theorem
Let $p_X$ be the data distribution and $\delta$ a confidence parameter. For the type $T$ of a sample $S\sim p_X^N$, let $B(T, \delta)$ be a distribution-free high-confidence lower bound to $H_{p_X}(X)$: $$\color{blue}{B(T, \delta)\leq H_{p_X}(X)}\ \text{with probability at least $1 - \delta$ over sample $S\sim p_X^N$}$$Then, for any such bound, for $N \geq 50, k\geq 2$, and for any $p_X$, we have:$$\color{red}{B(T, \delta) \leq \ln\left(2 k N^2\right)}\ \text{with probability at least $1 - \delta - \frac{1.01}{k}$ over sample $S\sim p_X^N$}$$
- For discrete/discretized variables $X, Y$, we have: $I(X; Y) = H(X) - H(X|Y) \leq H(X)$
#### Corollary
Let $B$ map an $N$-sized i.i.d. sample $(\mathbf{x}_i, \mathbf{y}_i)_{i = 1}^N$ from $p_{X, Y}$ to a real number such that:$$\color{blue}{B\left((\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_N, \mathbf{y}_N)\right) \leq I_{p_{X, Y}}(X; Y)}\ \text{with probability $0.99$ or more}$$Then, for any distribution $q_{X, Y}$, given $N\geq 50$ i.i.d. samples $(\mathbf{x}_i^\prime, \mathbf{y}_i^\prime)_{i = 1}^N$ from $q_{X, Y}$, we have:$$\color{red}{B\left((\mathbf{x}_1^\prime, \mathbf{y}_1^\prime), \ldots, (\mathbf{x}_N^\prime, \mathbf{y}_N^\prime)\right) \leq 2\cdot\ln N + 5}\ \text{with probability at least 0.96}$$
## Lower Bounds May Have Bias/Variance Issues!
- Follow-up work on the limitations of variational estimators of MI [(Song & Ermon, 2019)](https://arxiv.org/pdf/1910.06222.pdf)
### The Case of NWJ Estimator: $\color{red}{\text{Variance Exponential in MI}}$
- Estimator: $D_{KL}\left( p||q \right)\geq \mathbb{E}_p\left[f(\mathbf{x}) \right] - e^{-1}\cdot\mathbb{E}_q\left[e^{f(\mathbf{x})}\right]$; Optimal critic: $f^\ast(\mathbf{x}) = 1 + \log\frac{p(\mathbf{x})}{q(\mathbf{x})}$
- The Monte-Carlo version of the bound with i.i.d. samples $\mathbf{x}_{1:N}^p\sim p$ and $\mathbf{x}_{1:M}^q \sim q$ and the optimal critic $f^\ast$ has high variance:$$\text{var}\left(\frac{1}{M}\sum\nolimits_{j = 1}^M\frac{p(\mathbf{x}_j^q)}{q(\mathbf{x}_j^q)} \right) = \frac{1}{M}\text{var}_q\left(\frac{p(\mathbf{x})}{q(\mathbf{x})}\right) \geq \frac{\color{red}{e^{D_{KL}(p||q)}} - 1}{M}$$
- In particular, for $p = p_{X, Y}$ and $q = p_X\cdot p_Y$, $M$ must be exponential in $I(X; Y)$ to control the variance!
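A tiny simulation (my own, with made-up Gaussians) illustrating the blow-up: for $p = \mathcal{N}(\mu, 1)$ and $q = \mathcal{N}(0, 1)$ we have $D_{KL}(p||q) = \mu^2/2$, and the empirical variance of the importance weights $p(\mathbf{x})/q(\mathbf{x})$ under $q$ quickly dwarfs the $e^{D_{KL}(p||q)} - 1$ lower bound:
```python
import numpy as np

rng = np.random.default_rng(0)
M = 100_000

for mu in [1.0, 2.0, 3.0]:
    kl = mu**2 / 2                       # D_KL(N(mu,1) || N(0,1))
    x = rng.normal(0.0, 1.0, size=M)     # samples from q
    w = np.exp(mu * x - mu**2 / 2)       # importance weights p(x)/q(x)
    print(f"KL={kl:4.1f}  var(w)={w.var():12.1f}  e^KL - 1={np.exp(kl) - 1:8.1f}")
```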
### The Case of DV Estimator: $\color{red}{\text{Bias Exponential in MI}}$
- Estimator $D_{KL}\left( p||q \right)\geq \mathbb{E}_{p}\left[f(\mathbf{x})\right] - \log \mathbb{E}_{q}\left[e^{f(\mathbf{x})}\right]$; Optimal critic $f^\ast(\mathbf{x}) = \log\frac{p(\mathbf{x})}{q(\mathbf{x})}$
- The Monte-Carlo version of the bound is __biased__, with the bias scaling as $\mathcal{O}\left(e^{D_{KL}(p||q)}\right)$
- In particular, the Monte-Carlo version of the DV bound has a bias exponential in the MI!
---

> _"While we feel that information theory is indeed a valueable tool in providing fundamental insights [...], it is certainly no panacea [...]"_
>
>\- Claude E. Shannon, IRE Transactions on Information Theory, 1956
---
# Biases in Approximate Information Maximization
## Empirical method
#### Questions and hypotheses
1. InfoMax $\stackrel{?}{\implies}$ "good" representations
2. Does max $I_\text{est}$ preserve invertibility?
3. Larger critics for $I_\text{est}$ $\stackrel{?}{\implies}$ better performance
4. If $I_\text{est}$ is large and constant, then what matters?
* architecture vs estimator
#### Experimental Setup
| Components | Default Setting |
| -------------------------------------- |:----------------------------------------------:|
| Dataset | **MNIST** |
| Positives | top-half, $X_1$ and bottom-half, $X_2$ |
| Encoder ($\mathbf{g}_1, \mathbf{g}_2$) | MLP |
| Critic ($f$) | **Bilinear** $f(u, v) \triangleq u^\top W v$ |
| Downstream Task | **Linear** Classification using $\mathbf{g}_1$ |
| Optimizer | Adam for pre-training, SAGA for classification |
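A rough sketch of this default setup (my own reading of the table; the MLP sizes, latent dimension, and data loading are assumptions, and the training loop with $I_\text{est}$ is omitted):
```python
import torch
import torch.nn as nn

class BilinearCritic(nn.Module):
    """f(u, v) = u^T W v, evaluated for all pairs in a batch."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(dim, dim))

    def forward(self, u, v):
        return u @ self.W @ v.t()   # [B, B] score matrix, scores[i, j] = u_i^T W v_j

def split_views(images):
    """Top half -> X1, bottom half -> X2, for MNIST batches of shape [B, 1, 28, 28]."""
    x1 = images[:, :, :14, :].flatten(1)
    x2 = images[:, :, 14:, :].flatten(1)
    return x1, x2

latent_dim = 64   # assumed
g1 = nn.Sequential(nn.Linear(14 * 28, 256), nn.ReLU(), nn.Linear(256, latent_dim))
g2 = nn.Sequential(nn.Linear(14 * 28, 256), nn.ReLU(), nn.Linear(256, latent_dim))
critic = BilinearCritic(latent_dim)
# Pre-training maximizes I_est(g1(X1); g2(X2)) via the critic's score matrix;
# the downstream task is linear classification on g1(X1).
```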
### A1: InfoMax $\not\Rightarrow$ "good" representations
| Component | Experimental Setting |
| --------- |:--------------------------------------------------------------------------------------------------------------:|
| Encoder | [RealNVP](https://openreview.net/forum?id=HkpbnH9lx) (Dinh et al., ICLR 2017) (invertible $\Rightarrow$ maximal & constant MI) |
| Critic | No restrictions |

Observe: **$I_\text{est}$ increases, accuracy improves**
Hint: there exist encoders with maximal MI but worse downstream accuracy, e.g., with the following construction:
| Component | Experimental Setting |
| --------- |:-------------------------------:|
| Encoder | Minimizes KL(uniform \| model) |
| Critic | Minimizes NLL wrt Ground Labels |

---
**Think about Normalizing Flows**\*: invertible but (possibly) complex geometries
Using a _[Neural Autoregressive Flow](http://proceedings.mlr.press/v80/huang18d.html)_ (Huang et al., ICML 2018)
| Source distribution | Target distribution |
| ------------------- | ------------------- |
|  |  |
> _"The information content alone is not sufficient to guarantee a **useful geometry** in the representation space."_
> \- from section 3.1
\* Logo of the [_ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models_](https://invertibleworkshop.github.io/)
---
### A2: Yes, but with a tendency toward less smooth maps
| Component | Experimental Setting |
| --------- |:----------------------------------------------------------:|
| Encoder | MLP with skip-connections + init to id |
Accuracy increases

**But** the Jacobian has a larger condition number
<img src=https://i.imgur.com/cVYoUbP.png width="400">
Observe:
<!-- - Mathematically invertible: non-singular Jacobians $\implies$ locally invertible (inverse function theorem), -->
- the learned map becomes numerically hard to invert
However:
- May still be invertible **on** the data manifold (if the Jacobian is non-singular everywhere on the support)
- Local invariances of a lower-dimensional support $\implies$ a Jacobian with eigenvalues close to 0
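One way to probe this numerically (my own sketch, not the paper's code): compute the encoder's Jacobian at a data point with `torch.autograd.functional.jacobian` and take the ratio of its extreme singular values; the toy encoder and dimensions below are assumptions:
```python
import torch
from torch.autograd.functional import jacobian

def jacobian_condition_number(encoder, x):
    """Condition number of the encoder's Jacobian at a single 1-D input x."""
    J = jacobian(encoder, x)            # shape [out_dim, in_dim]
    s = torch.linalg.svdvals(J)         # singular values in descending order
    return (s[0] / s[-1]).item()        # large ratio => numerically hard to invert

# Toy square encoder (an identity-like init would keep this close to 1 at the start)
enc = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.Tanh(), torch.nn.Linear(32, 32))
print(jacobian_condition_number(enc, torch.randn(32)))
```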
---
### A3: Larger critics $\not\Rightarrow$ "better" representations
> _"A higher capacity critic should allow for a tighter lower-bound on MI"_
| Component | Experimental Setting |
| --------- |:----------------------------------------------------------------------------------------------------------:|
| Dataset | {MNIST, CIFAR-10} |
| Critic | {**bilinear**: $u^\top W v$, **separable**: $\text{MLP}_1(u)^\top \text{MLP}_2(v)$, $\text{MLP}([u, v])$} |
**MNIST**

**CIFAR-10**
<img src=https://i.imgur.com/XavmtFt.png width="400">
Observe:
$\text{MLP}([u, v])$ attains a larger $I_\text{est}$, but not a good feature geometry
**vs**
**bilinear** and **separable** critics utilize a dot product (feature alignment)
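For reference, hedged sketches of the three critic families compared above (hidden sizes and dimensions are assumptions); all three return a $[B, B]$ score matrix:
```python
import torch
import torch.nn as nn

dim, hidden = 64, 128   # assumed sizes

# bilinear: f(u, v) = u^T W v
W = nn.Parameter(0.01 * torch.randn(dim, dim))
def bilinear(u, v):
    return u @ W @ v.t()

# separable: f(u, v) = MLP1(u)^T MLP2(v)  -- still a dot product in an embedding space
mlp1 = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
mlp2 = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
def separable(u, v):
    return mlp1(u) @ mlp2(v).t()

# concatenation: f(u, v) = MLP([u, v])  -- no dot-product structure
mlp_cat = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
def concat_critic(u, v):
    B = u.shape[0]
    pairs = torch.cat([u.unsqueeze(1).expand(B, B, dim),
                       v.unsqueeze(0).expand(B, B, dim)], dim=-1)
    return mlp_cat(pairs).squeeze(-1)
```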
### A4: Architecture matters more than estimator
1. Maximize $I_\text{est}$ until some bound $t$
2. Ablate over architectures and $t$

Observe:
* $I_\text{est}$ bound reached after 7k iters

* CNN better than MLP in SSL
* (Still, higher $t$ $\implies$ better performance) $\implies$ the objectives do capture something useful (perhaps not MI-related)
---
## Concluding Remarks
* Connection to Metric Learning
* Triplet Loss
* (semi-)hard positive mining / curriculum learning??
* MI matters, but it is not the only thing that matters
* "Holistic view"
> _"[...]explicitly take into account the function families one uses[...]"_
* Alternative measures of information?
> _"[...]explicitly account for the modelling power and computational constraints of the observer[...]"_
[A Theory of Usable Information under Computational Constraints](https://openreview.net/forum?id=r1eBeyHFDH)
> Optimal Transport?
> [Wasserstein Dependency Measure for Representation Learning](https://papers.nips.cc/paper/9692-wasserstein-dependency-measure-for-representation-learning)
* Beyond linear evaluation protocol
* _"[...]break away from the goal of estimating MI[...]"_
---
## Thoughts & Discussion
### Don't We Already Know that CNNs Perform Better than MLPs?
- Is this paper saying something along the lines of "CNNs are better than MLPs"?
### The Encoder-Critic Boundary
- Representations that are best for downstream tasks are usually observed to be at the “middle” of the architecture used for SSL task
- The representations of the shallower layers are too simple
- The representations of the final layers are pretext-task specific
<!--- Consider a NN with modules A, B, C, D which outputs features $f(X)$ from data $X$ and is learned with some SSL objective-->
- <img src=https://i.imgur.com/hKMzfVh.png width="800">
- The encoder $g$ and the critic $f$ get trained together: $F = f\circ g$
- Different choices of $f, g$ result in the same $F$ but different downstream features
- <img src=https://i.imgur.com/BZErSNa.png width="800">
- $P_1, P_2$, and $P_3$ all learn the same modules A, B, C, D but the features $g_2(X)$, and not $g_1(X)$ or $g_3(X)$, are the best for downstream task.
- Joint training of $f, g$ blurs the boundary between the encoder and the critic
- $P_1:$ Powerful critic $+$ less powerful encoder $\implies$ the encoder expands towards the critic
- $P_3$: Powerful encoder $+$ less powerful critic $\implies$ the critic expands towards the encoder
- In practice, when we select $g, f$, how do we know whether we end up in the $P_1$, $P_2$, or $P_3$ regime?
---
## Design Considerations
* InfoMax is necessary, but not sufficient
* "good" representation $\implies$ InfoMax
* InfoMax explicitly avoids mode collapse
* InfoMax + <key factor\> $\implies$ "good" representations ?
* Better variational families??
* Constraints on the latent space (like $\mathcal{S}^{m-1}$ / $L_2$ normalization)
* Add regularization???
* Find an objective that implies both InfoMax and the key factor
* Make better positive joint distributions: $p_\text{pos}(x, x^+)$
* This also determines the negative distribution (joint vs product of marginals)
* It defines the transformations the representation should be invariant to
<!--- (3. $I(g_1(X_1); g_2(X_2))$: Is this a very poor lower bound on $I(X_1; X_2)$? Maybe optimizing a better lower bound would be helpful) --->
---
# Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere
**Tongzhou Wang** and **Phillip Isola**
[ICML 2020](https://arxiv.org/abs/2005.10242)
## $L_2$ normalization trending in representation learning
### Geometry on feature space: **hypersphere**
1. improved **learning stability**
Without normalization, the softmax can be made arbitrarily sharp by scaling the feature norms
2. sufficiently well-clustered $\implies$ **linear separability**

## Desired properties on the hypersphere
### Uniformity $\stackrel{?}{\iff}$ InfoMax

Remember?

### Alignment $\stackrel{?}{\iff}$ Invariance to "noise"

## Relation to contrastive framework
### Intuition (section 4, 4.1)
**Positives** for alignment
**Negatives** for uniformity
$$
\begin{align}
&\mathcal{L}_\text{contrastive}(f; \tau, M) \triangleq \\
&\mathop{\mathbb{E}}_{\substack{(x,y) \sim p_\text{pos} \\ \{x_i^-\}_{i=1}^M \sim p_\text{data}^{\otimes M}}} \left[ - \log \frac{e^{f(x)^\top f(y) / \tau}}{e^{f(x)^\top f(y) / \tau} + \sum_i e^{f(x_i^-)^\top f(y) / \tau}} \right]
\end{align}
$$
### Formally (section 4.2)
$$
\begin{align}
\lim_{M \to \infty} &\mathcal{L}_\text{contrastive}(f; \tau, M) - \log M = &&\\
&- \frac{1}{\tau} \mathop{\mathbb{E}}_{(x,y) \sim p_\text{pos}} f(x)^\top f(y) &
\mathcal{L}_\text{align}(f;\alpha) \triangleq& \quad \mathop{\mathbb{E}}_{(x,y) \sim p_\text{pos}} \left\|f(x) - f(y) \right\|_2^\alpha \\
&+ \mathop{\mathbb{E}}_{x \sim p_\text{data}} \log \mathop{\mathbb{E}}_{x^- \sim p_\text{data}} e^{\frac{f(x)^\top f(x^-)}{\tau}} &
\mathcal{L}_\text{uniform}(f; t) \triangleq & \quad\log \mathop{\mathbb{E}}_{(x,y) \sim p_\text{data}^{\otimes 2}} e^{-t \, \|f(x) - f(y)\|_2^2}
\end{align}
$$
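A minimal PyTorch sketch of $\mathcal{L}_\text{align}$ and $\mathcal{L}_\text{uniform}$ as defined above (my rendering of the formulas), assuming $L_2$-normalized feature batches of shape $[B, D]$:
```python
import torch

def align_loss(x, y, alpha=2):
    # E_{(x,y)~p_pos} ||f(x) - f(y)||_2^alpha over a batch of positive pairs
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniform_loss(x, t=2):
    # log E_{(x,y)~p_data} exp(-t ||f(x) - f(y)||_2^2), via all pairwise distances in the batch
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```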
#### Notes on InfoMax
$I(f(x);f(y)) = H(f(x)) - H(f(x)|f(y))$
* max uniformity on hypersphere $\iff$ max $H(f(x))$
* max alignment $\implies$ min $H(f(x) | f(y))$
## Experiments
### CIFAR-10 projected on $\mathcal{S}^1$
| Random Init | Supervised Classification | SSL Contrastive |
|:------------------------------------:|:------------------------------------:|:------------------------------------:|
|  |  |  |
| Val Acc: $12.71\%$ | Val Acc: $57.19\%$ | Val Acc: $28.60\%$ |
### Contrastive vs Direct Optimization
**STL-10** (CNN+BN)
<img src=https://i.imgur.com/CSrISKV.png width="400">
**BookCorpus** (bi-GRU)
<img src=https://i.imgur.com/oeFRaiY.png width="400">
**ImageNet-100** (MoCo-based ResNet50)
<img src=https://i.imgur.com/HdPe33E.png width="400">
### Ablation
<img src=https://i.imgur.com/hnkIG4G.png width="480">