# Representation and Transfer Learning
### Ferenc Huszár (fh277)
<!-- Put the link to this slide here so people can follow -->
slides: https://hackmd.io/@fhuszar/r1HxvooMd
---
### Unsupervised learning
- observations $x_1, x_2, \ldots$
- drawn i.i.d. from some $p_\mathcal{D}$
- can we learn something from this?
---
### Unsupervised learning goals
- can we learn something from this?
- a model of data distribution $p_\theta(x) \approx p_{\mathcal{D}}(x)$
- compression
- data reconstruction
- sampling/generation
- a representation $z=g_\theta(x)$ or $q_{\theta}(z\vert x)$
- downstream classification task
- data visualisation
---
### UL as distribution modeling
- defines goal as modeling $p_\theta(x)\approx p_\mathcal{D}(x)$
- $\theta$: parameters
- maximum likelihood estimation:
$$
\theta^{ML} = \operatorname{argmax}_\theta \sum_{x_i \in \mathcal{D}} \log p_\theta(x_i)
$$
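---
### Maximum likelihood: toy sketch
A minimal sketch of this objective, assuming a 1-D Gaussian model with $\theta = (\mu, \log\sigma)$; the toy data and optimiser settings are arbitrary choices.
```python
# Toy maximum likelihood: theta = (mu, log_sigma), fitted by gradient
# descent on the negative of  sum_i log p_theta(x_i).
import torch

x = torch.randn(1000) * 2.0 + 3.0                # toy data from N(3, 2^2)
mu = torch.zeros(1, requires_grad=True)          # theta: mean
log_sigma = torch.zeros(1, requires_grad=True)   # theta: log std (keeps sigma > 0)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(500):
    dist = torch.distributions.Normal(mu, log_sigma.exp())
    nll = -dist.log_prob(x).sum()                # negative log-likelihood
    opt.zero_grad()
    nll.backward()
    opt.step()

print(mu.item(), log_sigma.exp().item())         # recovers roughly 3.0 and 2.0
```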
---
### Deep learning for modelling distributions
* auto-regressive models (e.g. RNNs; sketch on the next slide)
* $p_{\theta}(x_{1:T}) = \prod_{t=1}^T p_\theta(x_t\vert x_{1:t-1})$
* implicit distributions (e.g. GANs)
  * $x = g_\theta(z), \quad z\sim \mathcal{N}(0, I)$
* flow models (e.g. RealNVP)
* like above but $g_\theta(z)$ invertible
* latent variable models (LVMs, e.g. VAE)
* $p_\theta(x) = \int p_\theta(x, z) dz$
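---
### Auto-regressive factorisation: sketch
A sketch of the chain-rule factorisation $p_\theta(x_{1:T})=\prod_t p_\theta(x_t\vert x_{1:t-1})$ with a small GRU; the vocabulary size, hidden size and toy sequence are illustrative assumptions (the first symbol $x_1$ is treated as given).
```python
# Sketch: log p_theta(x_{1:T}) = sum_t log p_theta(x_t | x_{1:t-1})
# with a small GRU over a vocabulary of V symbols (all sizes arbitrary).
import torch, torch.nn as nn, torch.nn.functional as F

V, T, H = 32, 10, 64                    # vocab size, sequence length, hidden size
embed = nn.Embedding(V, H)
rnn = nn.GRU(H, H, batch_first=True)
head = nn.Linear(H, V)                  # logits over the next symbol

x = torch.randint(0, V, (1, T))         # one toy sequence x_{1:T}
h, _ = rnn(embed(x[:, :-1]))            # summarise each prefix x_{1:t-1}
log_probs = F.log_softmax(head(h), dim=-1)
token_lp = log_probs.gather(-1, x[:, 1:].unsqueeze(-1)).squeeze(-1)
log_px = token_lp.sum()                 # log p_theta(x_{2:T} | x_1)
```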
---
### Latent variable models
$$
p_\theta(x) = \int p_\theta(x, z) dz
$$
---
### Latent variable models
$$
p_\theta(x) = \int p_\theta(x\vert z) p_\theta(z) dz
$$
---
### Motivation 1
#### "it makes sense"
* describes data in terms of a generative process
* e.g. object properties, locations
* learnt $z$ often interpretable
* causal reasoning often needs latent variables
---
### Motivation 2
#### manifold assumption
* high-dimensional data
* doesn't occupy all the space
* concentrated along low-dimensional manifold
* $z \approx$ intrinsic coordinates within the manifold
---
### Motivation 3
#### from simple to complicated
$$
p_\theta(x) = \int p_\theta(x\vert z) p_\theta(z) dz
$$
---
### Motivation 3
#### from simple to complicated
$$
\underbrace{p_\theta(x)}_\text{complicated} = \int \underbrace{p_\theta(x\vert z) }_\text{simple}\underbrace{p_\theta(z)}_\text{simple} dz
$$
---
### Motivation 3
#### from simple to complicated
$$
\underbrace{p_\theta(x)}_\text{complicated} = \int \underbrace{\mathcal{N}\left(x; \mu_\theta(z), \operatorname{diag}(\sigma_\theta(z)) \right)}_\text{simple}\underbrace{\mathcal{N}(z; 0, I)}_\text{simple} dz
$$
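---
### Estimating $p_\theta(x)$: sketch
A sketch of estimating the intractable integral by simple Monte Carlo over $z\sim\mathcal{N}(0,I)$; `mu_theta` and `sigma_theta` below are stand-in decoder functions, not a trained model.
```python
# Sketch: Monte Carlo estimate of
#   p_theta(x) = integral of N(x; mu_theta(z), diag(sigma_theta(z))) N(z; 0, I) dz
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def mu_theta(z):                       # stand-in decoder mean (not a trained model)
    return np.tanh(z @ np.ones((2, 3)))

def sigma_theta(z):                    # stand-in decoder std
    return 0.5 * np.ones((z.shape[0], 3))

x = np.array([0.1, -0.2, 0.3])         # one observation in R^3
S = 10_000
z = np.random.randn(S, 2)              # z_s ~ N(0, I), latent dim 2
log_px_given_z = norm.logpdf(x, loc=mu_theta(z), scale=sigma_theta(z)).sum(axis=1)
log_px = logsumexp(log_px_given_z) - np.log(S)   # estimate of log p_theta(x)
```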
---
### Motivation 4
#### variational learning
* evaluating $p_\theta(x)$ is hard
* learning is hard
* evaluating $p_\theta(z\vert x)$ is hard
* inference is hard
* variational framework:
* approximate learning
* approximate inference
---
### Variational autoencoder
* Decoder: $p_\theta(x\vert z) = \mathcal{N}(\mu_\theta(z), \sigma_n I)$
* Encoder: $q_\psi(z\vert x) = \mathcal{N}(\mu_\psi(x), \operatorname{diag}(\sigma_\psi(x)))$
* Prior: $p_\theta(z)=\mathcal{N}(0, I)$
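---
### VAE components: sketch
A minimal PyTorch sketch of these three components; the layer sizes, latent dimension and noise level $\sigma_n$ are arbitrary assumptions.
```python
# Sketch of the three components (all sizes arbitrary).
import torch, torch.nn as nn
from torch.distributions import Normal

D, Z = 784, 16                                   # data dim, latent dim
enc = nn.Sequential(nn.Linear(D, 256), nn.ReLU(),
                    nn.Linear(256, 2 * Z))       # -> (mu_psi(x), log sigma_psi(x))
dec = nn.Sequential(nn.Linear(Z, 256), nn.ReLU(),
                    nn.Linear(256, D))           # -> mu_theta(z)
prior = Normal(torch.zeros(Z), torch.ones(Z))    # p(z) = N(0, I)

x = torch.rand(8, D)                             # a toy batch
mu, log_sigma = enc(x).chunk(2, dim=-1)
q = Normal(mu, log_sigma.exp())                  # encoder q_psi(z | x)
z = q.rsample()                                  # reparameterised sample
p_x_given_z = Normal(dec(z), 0.1)                # decoder p_theta(x | z), sigma_n = 0.1
```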
---
### Variational learning
$$
\theta^\text{ML} = \operatorname{argmax}_\theta \sum_{x_i \in \mathcal{D}} \log p_\theta(x_i)
$$
---
### Variational learning
$$
\mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \log p_\theta(x_i) - \operatorname{KL}[q_\psi(z\vert x_i) \| p_\theta(z\vert x_i)]
$$
---
### Variational learning
$$
\mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \log p_\theta(x_i) + \mathbb{E}_{z\sim q_\psi} \log \frac{p_\theta(z\vert x_i)}{q_\psi(z\vert x_i)}
$$
---
### Variational learning
$$
\mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \mathbb{E}_{z\sim q_\psi} \log \frac{p_\theta(z\vert x_i) p_\theta(x_i)}{q_\psi(z\vert x_i)}
$$
---
### Variational learning
$$
\mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \mathbb{E}_{z\sim q_\psi} \log \frac{p_\theta(z, x_i)}{q_\psi(z\vert x_i)}
$$
---
### Variational learning
$$
\mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \mathbb{E}_{z\sim q_\psi(z\vert x_i)} \log p_\theta(x_i\vert z) - \operatorname{KL}[q_\psi(z\vert x_i)\| p_\theta(z)]
$$
---
### Variational learning
$$
\mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \underbrace{\mathbb{E}_{z\sim q_\psi(z\vert x_i)} \log p_\theta(x_i\vert z)}_\text{reconstruction} - \operatorname{KL}[q_\psi(z\vert x_i)\| p_\theta(z)]
$$
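---
### ELBO in code: sketch
A sketch of the ELBO for one batch: a single reparameterised sample for the reconstruction term and a closed-form Gaussian KL for the second term; the encoder/decoder are the same arbitrary stand-ins as on the earlier sketch, redefined here so the snippet runs on its own.
```python
# Sketch of the ELBO for one batch: single-sample reconstruction term,
# closed-form Gaussian KL (arbitrary stand-in networks).
import torch, torch.nn as nn
from torch.distributions import Normal, kl_divergence

D, Z = 784, 16
enc = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, 2 * Z))
dec = nn.Sequential(nn.Linear(Z, 256), nn.ReLU(), nn.Linear(256, D))
prior = Normal(torch.zeros(Z), torch.ones(Z))

x = torch.rand(8, D)
mu, log_sigma = enc(x).chunk(2, dim=-1)
q = Normal(mu, log_sigma.exp())                  # q_psi(z | x_i)
z = q.rsample()                                  # one reparameterised sample per x_i
recon = Normal(dec(z), 0.1).log_prob(x).sum(-1)  # ~ E_q log p_theta(x_i | z)
kl = kl_divergence(q, prior).sum(-1)             # KL[q_psi(z | x_i) || p_theta(z)]
elbo = (recon - kl).mean()                       # maximise this (minimise -elbo)
```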
---
![](https://i.imgur.com/rmgzOHJ.png =600x)
[(Kingma and Welling, 2019)](https://arxiv.org/abs/1906.02691) Variational Autoencoder
---
### Variational autoencoder: interpretable $z$
![](https://i.imgur.com/kDgc74S.png)
---
### Discussion of max likelihood
* trained so that $p_\theta(x)$ matches data
* evaluated by how useful $p_\theta(z\vert x)$ is
* there is a mismatch
---
#### Representation learning vs max likelihood
![](https://i.imgur.com/SPx9AoA.png)
---
#### Representation learning vs max likelihood
![](https://i.imgur.com/EqHhQVh.png)
---
#### Representation learning vs max likelihood
![](https://i.imgur.com/L0n5kSI.png)
---
#### Representation learning vs max likelihood
![](https://i.imgur.com/wuAdSbB.png)
---
#### Representation learning vs max likelihood
![](https://i.imgur.com/DwGlp8k.png)
---
#### Representation learning vs max likelihood
![](https://i.imgur.com/yuoEcbt.png)
---
### Discussion of max likelihood
* max likelihood may not produce good representations
* Why do variational methods find good representations?
* Are there alternative principles?
---
## Self-supervised learning
---
## Basic idea
* turn the unsupervised problem into a supervised one
* turn datapoints $x_i$ into input-output pairs
* this is called the auxiliary or pretext task
* learn to solve the auxiliary task
* transfer the learned representation to the downstream task
---
### Example: jigsaw puzzles
![](https://i.imgur.com/VtCWtrq.jpg)
[(Noroozi and Favaro, 2016)](https://arxiv.org/abs/1603.09246)
---
### Data-efficiency in downstream task
![](https://i.imgur.com/bX3BzNx.png =570x)
[(Hénaff et al, 2020)](https://proceedings.icml.cc/static/paper_files/icml/2020/3694-Paper.pdf)
---
### Linearity in downstream task
![](https://i.imgur.com/4YiDM38.png =570x)
[(Chen et al, 2020)](https://arxiv.org/abs/2002.05709)
---
### Several self-supervised methods
* auto-encoding
* denoising auto-encoding
* pseudo-likelihood
* instance classification
* contrastive learning
* masked language models
---
### Example: instance classification
* pick random data index $i$
* randomly transform image $x_i$: $T(x_i)$
* auxiliary task: guess the data index $i$ from the transformed input $T(x_i)$
* difficulty: $N$-way classification ($N$ = number of training images)
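---
### Instance classification: sketch
A sketch of the pretext objective: every training image is its own class, and the network must recover the index $i$ from an augmented view; the augmentation `T` and the backbone below are placeholders.
```python
# Sketch: instance classification, an N-way problem with one class per image.
import torch, torch.nn as nn, torch.nn.functional as F

N, D = 1000, 128                              # number of training images, feature size
data = torch.rand(N, 3, 32, 32)               # stand-in image dataset
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, D), nn.ReLU())
classifier = nn.Linear(D, N)                  # one logit per training instance

def T(x):                                     # placeholder random "augmentation"
    return x + 0.1 * torch.randn_like(x)

i = torch.randint(0, N, (64,))                # a batch of random instance indices
logits = classifier(backbone(T(data[i])))
loss = F.cross_entropy(logits, i)             # guess which instance each view came from
```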
---
### Example: contrastive learning
* pick random $y$
* if $y=1$ pick two random images $x_1$, $x_2$
* if $y=0$ use same image twice $x_1=x_2$
* aux task: predict $y$ from $f_\theta(T_1(x_1)), f_\theta(T_2(x_2))$
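---
### Contrastive learning: sketch
A sketch of the binary task above, keeping the slide's labelling ($y=1$: two different images, $y=0$: two views of the same image); the encoder, augmentation and distance-based classifier are illustrative choices.
```python
# Sketch: binary contrastive task; y = 1 -> two different images,
# y = 0 -> two augmented views of the same image.
import torch, torch.nn as nn, torch.nn.functional as F

N = 1000
data = torch.rand(N, 3, 32, 32)                              # stand-in dataset
f = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))  # encoder f_theta
w = nn.Parameter(torch.tensor(1.0))                          # scale of the distance-based classifier
b = nn.Parameter(torch.tensor(0.0))

def T(x):                                                    # placeholder random augmentation
    return x + 0.1 * torch.randn_like(x)

y = torch.randint(0, 2, (64,)).float()
i = torch.randint(0, N, (64,))
j = torch.where(y.bool(), torch.randint(0, N, (64,)), i)     # different image iff y = 1
z1, z2 = f(T(data[i])), f(T(data[j]))
dist = (z1 - z2).pow(2).sum(-1)                              # far apart should mean y = 1
loss = F.binary_cross_entropy_with_logits(w * dist + b, y)
```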
---
### Example: Masked Language Models
![](https://i.imgur.com/HeRgOXp.png)
<small>image credit: ([Lample and Conneau, 2019](https://arxiv.org/pdf/1901.07291.pdf))</small>
---
### BERT
![](https://i.imgur.com/zeA1Mix.jpg)
---
## Questions?
---
## Why should any of this work?
---
#### Predicting What you Already Know Helps: Provable Self-Supervised Learning
[(Lee et al, 2020)](https://arxiv.org/abs/2008.01064)
---
### Provable Self-Supervised Learning
Assumptions:
* observable $X$ decomposes into $X_1, X_2$
* pretext: only given $(X_1, X_2)$ pairs
* downstream: we will want to predict $Y$
* $X_1 \perp \!\!\! \perp X_2 \vert Y, Z$
* (+1 additional strong assumption)
---
### Provable Self-Supervised Learning
![](https://i.imgur.com/8SomFq9.png)
$X_1 \perp \!\!\! \perp X_2 \vert Y, Z$
---
### Provable Self-Supervised Learning
![](https://i.imgur.com/K754eB2.png)
---
### Provable Self-Supervised Learning
![](https://i.imgur.com/SuCKgJq.png)
$$
X_1 \perp \!\!\! \perp X_2 \vert Y, Z
$$
---
### Provable Self-Supervised Learning
![](https://i.imgur.com/SuCKgJq.png)
$$
👀 \perp \!\!\! \perp 👄 \vert \text{age}, \text{gender}, \text{ethnicity}
$$
---
### Provable Self-Supervised Learning
If $X_1 \perp \!\!\! \perp X_2 \vert Y$, then
$$
\mathbb{E}[X_2 \vert X_1 = x_1] = \sum_k \mathbb{E}[X_2\vert Y=k]\, \mathbb{P}[Y=k\vert X_1 = x_1]
$$
---
### Provable Self-Supervised Learning
\begin{align}
&\mathbb{E}[X_2 \vert X_1=x_1] = \\
&\left[\begin{matrix} \mathbb{E}[X_2\vert Y=1], \ldots, \mathbb{E}[X_2\vert Y=k]\end{matrix}\right] \left[\begin{matrix} \mathbb{P}[Y=1\vert X_1=x_1]\\ \vdots \\ \mathbb{P}[Y=k\vert X_1=x_1]\end{matrix}\right]
\end{align}
---
### Provable Self-Supervised Learning
\begin{align}
&\mathbb{E}[X_2 \vert X_1=x_1] = \\
&\underbrace{\left[\begin{matrix} \mathbb{E}[X_2\vert Y=1], \ldots, \mathbb{E}[X_2\vert Y=k]\end{matrix}\right]}_\mathbf{A}\left[\begin{matrix} \mathbb{P}[Y=1\vert X_1=x_1]\\ \vdots \\ \mathbb{P}[Y=k\vert X_1=x_1]\end{matrix}\right]
\end{align}
---
### Provable Self-Supervised Learning
$$
\mathbb{E}[X_2 \vert X_1=x_1] = \mathbf{A}\left[\begin{matrix} \mathbb{P}[Y=1\vert X_1=x_1]\\ \vdots \\ \mathbb{P}[Y=k\vert X_1=x_1]\end{matrix}\right]
$$
---
### Provable Self-Supervised Learning
$$
\mathbf{A}^\dagger \mathbb{E}[X_2 \vert X_1=x_1] = \left[\begin{matrix} \mathbb{P}[Y=1\vert X_1=x_1]\\ \vdots \\ \mathbb{P}[Y=k\vert X_1=x_1]\end{matrix}\right]
$$
---
### Provable Self-Supervised Learning
$$
\mathbf{A}^\dagger \underbrace{\mathbb{E}[X_2 \vert X_1=x_1]}_\text{pretext task} = \underbrace{\left[\begin{matrix} \mathbb{P}[Y=1\vert X_1=x_1]\\ \vdots \\ \mathbb{P}[Y=k\vert X_1=x_1]\end{matrix}\right]}_\text{downstream task}
$$
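---
### Pseudo-inverse recovery: numerical check
A small synthetic check with discrete $Y$ and $X_1$ and a vector-valued $X_2$ whose conditional mean depends only on $Y$: the pretext regression target is built under the conditional-independence assumption, and the pseudo-inverse of $\mathbf{A}$ recovers $\mathbb{P}[Y\vert x_1]$ exactly.
```python
# Numerical check: E[X_2 | X_1 = x_1] = A @ P[Y | X_1 = x_1], and the
# pseudo-inverse of A recovers the downstream posterior (synthetic example).
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 5                                        # classes, dim of X_2 (arbitrary)
pY = np.array([0.2, 0.5, 0.3])                     # P[Y = k]
pX1_given_Y = rng.dirichlet(np.ones(4), size=K)    # P[X_1 = x_1 | Y = k], 4 values of X_1
A = rng.normal(size=(D, K))                        # columns: E[X_2 | Y = k]

joint = pX1_given_Y * pY[:, None]                  # P[Y = k, X_1 = x_1], shape (K, 4)
pY_given_X1 = joint / joint.sum(axis=0, keepdims=True)   # Bayes' rule, one column per x_1

E_X2_given_X1 = A @ pY_given_X1                    # pretext regression target, shape (D, 4)
recovered = np.linalg.pinv(A) @ E_X2_given_X1      # needs A to have full column rank
assert np.allclose(recovered, pY_given_X1)
```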
---
### Provable self-supervised learning summary
* under assumptions of conditional independence
* (and that the matrix $\mathbf{A}$ has full column rank)
* $\mathbb{P}[Y\vert x_1]$ is a linear function of $\mathbb{E}[X_2\vert x_1]$
* all we need is a linear model on top of $\mathbb{E}[X_2\vert x_1]$
* note: the truly optimal predictor would use $\mathbb{P}[Y\vert x_1, x_2]$
---
# Recap
{"metaMigratedAt":"2023-06-15T20:30:26.163Z","metaMigratedFrom":"YAML","title":"DeepNN Lecture 12 Slides","breaks":true,"description":"Lecture slides on unsupervised representation learning, transfer learning, self-supervised learning","contributors":"[{\"id\":\"e558be3b-4a2d-4524-8a66-38ec9fea8715\",\"add\":14862,\"del\":4818}]"}