# Representation and Transfer Learning

### Ferenc Huszár (fh277)

<!-- Put the link to this slide here so people can follow -->
slides: https://hackmd.io/@fhuszar/r1HxvooMd

---

### Unsupervised learning

- observations $x_1, x_2, \ldots$
- drawn i.i.d. from some $p_\mathcal{D}$
- can we learn something from this?

---

### Unsupervised learning goals

- can we learn something from this?
- a model of the data distribution $p_\theta(x) \approx p_{\mathcal{D}}(x)$
    - compression
    - data reconstruction
    - sampling/generation
- a representation $z = g_\theta(x)$ or $q_{\theta}(z\vert x)$
    - downstream classification tasks
    - data visualisation

---

### UL as distribution modelling

- defines the goal as modelling $p_\theta(x) \approx p_\mathcal{D}(x)$
- $\theta$: parameters
- maximum likelihood estimation:

$$
\theta^{ML} = \operatorname{argmax}_\theta \sum_{x_i \in \mathcal{D}} \log p_\theta(x_i)
$$

---

### Deep learning for modelling distributions

* auto-regressive models (e.g. RNNs)
    * $p_{\theta}(x_{1:T}) = \prod_{t=1}^T p_\theta(x_t\vert x_{1:t-1})$
* implicit distributions (e.g. GANs)
    * $x = g_\theta(z),\ z \sim \mathcal{N}(0, I)$
* flow models (e.g. RealNVP)
    * like above, but $g_\theta(z)$ is invertible
* latent variable models (LVMs, e.g. VAE)
    * $p_\theta(x) = \int p_\theta(x, z) dz$

---

### Latent variable models

$$
p_\theta(x) = \int p_\theta(x, z) dz
$$

---

### Latent variable models

$$
p_\theta(x) = \int p_\theta(x\vert z) p_\theta(z) dz
$$

---

### Motivation 1
#### "it makes sense"

* describes data in terms of a generative process
    * e.g. object properties, locations
* the learnt $z$ is often interpretable
* causal reasoning often needs latent variables

---

### Motivation 2
#### manifold assumption

* high-dimensional data
* doesn't occupy all of the space
* concentrated along a low-dimensional manifold
* $z \approx$ intrinsic coordinates within the manifold

---

### Motivation 3
#### from simple to complicated

$$
p_\theta(x) = \int p_\theta(x\vert z) p_\theta(z) dz
$$

---

### Motivation 3
#### from simple to complicated

$$
\underbrace{p_\theta(x)}_\text{complicated} = \int \underbrace{p_\theta(x\vert z)}_\text{simple} \underbrace{p_\theta(z)}_\text{simple} dz
$$

---

### Motivation 3
#### from simple to complicated

$$
\underbrace{p_\theta(x)}_\text{complicated} = \int \underbrace{\mathcal{N}\left(x; \mu_\theta(z), \operatorname{diag}(\sigma_\theta(z)) \right)}_\text{simple} \underbrace{\mathcal{N}(z; 0, I)}_\text{simple} dz
$$

---

### Motivation 4
#### variational learning

* evaluating $p_\theta(x)$ is hard
    * learning is hard
* evaluating $p_\theta(z\vert x)$ is hard
    * inference is hard
* variational framework:
    * approximate learning
    * approximate inference

---

### Variational autoencoder

* Decoder: $p_\theta(x\vert z) = \mathcal{N}(\mu_\theta(z), \sigma_n I)$
* Encoder: $q_\psi(z\vert x) = \mathcal{N}(\mu_\psi(x), \sigma_\psi(x))$
* Prior: $p_\theta(z) = \mathcal{N}(0, I)$

---

### Variational learning

$$
\theta^\text{ML} = \operatorname{argmax}_\theta \sum_{x_i \in \mathcal{D}} \log p_\theta(x_i)
$$

---

### Variational learning

$$
\mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \log p_\theta(x_i) - \operatorname{KL}[q_\psi(z\vert x_i) \| p_\theta(z\vert x_i)]
$$

---

### Variational learning

$$
\mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \log p_\theta(x_i) + \mathbb{E}_{z\sim q_\psi} \log \frac{p_\theta(z\vert x_i)}{q_\psi(z\vert x_i)}
$$

---

### Variational learning

$$
\mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \mathbb{E}_{z\sim q_\psi} \log \frac{p_\theta(z\vert x_i) p_\theta(x_i)}{q_\psi(z\vert x_i)}
$$

---

### Variational learning

$$
\mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \mathbb{E}_{z\sim q_\psi} \log \frac{p_\theta(z, x_i)}{q_\psi(z\vert x_i)}
$$

---

### Variational learning

$$
\mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \mathbb{E}_{z\sim q_\psi(z\vert x_i)} \log p_\theta(x_i\vert z) - \operatorname{KL}[q_\psi(z\vert x_i) \| p_\theta(z)]
$$

---

### Variational learning

$$
\mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \underbrace{\mathbb{E}_{z\sim q_\psi(z\vert x_i)} \log p_\theta(x_i\vert z)}_\text{reconstruction} - \operatorname{KL}[q_\psi(z\vert x_i) \| p_\theta(z)]
$$
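---

### ELBO as code (a sketch)

The final form of the objective (reconstruction minus KL) maps almost line by line onto code. Below is a minimal PyTorch sketch with a Gaussian encoder and decoder and a single-sample Monte Carlo estimate; the layer sizes, the fixed observation noise `sigma_n` and the fake batch are illustrative assumptions, not part of the lecture.

```python
# Minimal single-sample Monte Carlo estimate of the (negative) ELBO for a Gaussian VAE:
#   ELBO(x) = E_{z ~ q_psi(z|x)} [ log p_theta(x|z) ] - KL[ q_psi(z|x) || N(0, I) ]
# Sizes and architectures are illustrative only.
import torch

x_dim, z_dim, h_dim = 784, 16, 256

encoder = torch.nn.Sequential(torch.nn.Linear(x_dim, h_dim), torch.nn.ReLU(),
                              torch.nn.Linear(h_dim, 2 * z_dim))   # -> (mu, log sigma^2)
decoder = torch.nn.Sequential(torch.nn.Linear(z_dim, h_dim), torch.nn.ReLU(),
                              torch.nn.Linear(h_dim, x_dim))       # -> mean of p_theta(x|z)

def negative_elbo(x, sigma_n=1.0):
    mu, logvar = encoder(x).chunk(2, dim=-1)                  # q_psi(z|x) = N(mu, diag(sigma^2))
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterisation trick
    x_mean = decoder(z)                                       # p_theta(x|z) = N(x_mean, sigma_n^2 I)
    # reconstruction term: log N(x; x_mean, sigma_n^2 I), up to an additive constant
    reconstruction = -((x - x_mean) ** 2).sum(dim=-1) / (2 * sigma_n ** 2)
    # KL[ N(mu, diag(sigma^2)) || N(0, I) ] in closed form
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(dim=-1)
    return (kl - reconstruction).mean()                       # minimise the negative ELBO

x = torch.rand(32, x_dim)      # fake batch of flattened images
loss = negative_elbo(x)
loss.backward()                # gradients for both encoder (psi) and decoder (theta)
```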
---

![](https://i.imgur.com/rmgzOHJ.png =600x)

[(Kingma and Welling, 2019)](https://arxiv.org/abs/1906.02691) Variational Autoencoder

---

### Variational encoder: interpretable $z$

![](https://i.imgur.com/kDgc74S.png)

---

### Discussion of max likelihood

* trained so that $p_\theta(x)$ matches the data
* evaluated by how useful $p_\theta(z\vert x)$ is
* there is a mismatch

---

#### Representation learning vs max likelihood

![](https://i.imgur.com/SPx9AoA.png)

---

#### Representation learning vs max likelihood

![](https://i.imgur.com/EqHhQVh.png)

---

#### Representation learning vs max likelihood

![](https://i.imgur.com/L0n5kSI.png)

---

#### Representation learning vs max likelihood

![](https://i.imgur.com/wuAdSbB.png)

---

#### Representation learning vs max likelihood

![](https://i.imgur.com/DwGlp8k.png)

---

#### Representation learning vs max likelihood

![](https://i.imgur.com/yuoEcbt.png)

---

### Discussion of max likelihood

* max likelihood may not produce good representations
* Why do variational methods find good representations?
* Are there alternative principles?

---

## Self-supervised learning

---

## Basic idea

* turn an unsupervised problem into a supervised one
* turn datapoints $x_i$ into input-output pairs
    * called an auxiliary or pretext task
* learn to solve the auxiliary task
* transfer the representation learned to the downstream task

---

### Example: jigsaw puzzles

![](https://i.imgur.com/VtCWtrq.jpg)

[(Noroozi and Favaro, 2016)](https://arxiv.org/abs/1603.09246)

---

### Data-efficiency in downstream task

![](https://i.imgur.com/bX3BzNx.png =570x)

[(Hénaff et al, 2020)](https://proceedings.icml.cc/static/paper_files/icml/2020/3694-Paper.pdf)

---

### Linearity in downstream task

![](https://i.imgur.com/4YiDM38.png =570x)

[(Chen et al, 2020)](https://arxiv.org/abs/2002.05709)

---

### Several self-supervised methods

* auto-encoding
* denoising auto-encoding
* pseudo-likelihood
* instance classification
* contrastive learning
* masked language models

---

### Example: instance classification

* pick a random data index $i$
* randomly transform the image $x_i$: $T(x_i)$
* auxiliary task: guess the data index $i$ from the transformed input $T(x_i)$
* difficulty: $N$-way classification

---

### Example: contrastive learning

* pick a random $y$
* if $y=1$ pick two random images $x_1$, $x_2$
* if $y=0$ use the same image twice, $x_1 = x_2$
* aux task: predict $y$ from $f_\theta(T_1(x_1)), f_\theta(T_2(x_2))$
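---

### Contrastive pretext task as code (a sketch)

The recipe on the previous slide fits in a few lines. The PyTorch sketch below uses a toy encoder, a noise-based stand-in for the augmentations $T_1, T_2$, and a fake image batch; it illustrates the shape of the pretext task rather than any particular published method.

```python
# A minimal sketch of the pairwise contrastive task from the previous slide.
# The encoder, the transformations and the batch are stand-ins (assumptions).
import torch
import torch.nn.functional as F

def random_transform(x):
    # stand-in for a data augmentation T(x): add a little noise
    return x + 0.1 * torch.randn_like(x)

encoder = torch.nn.Sequential(          # f_theta: maps images to feature vectors
    torch.nn.Flatten(),
    torch.nn.Linear(28 * 28, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 32),
)

def contrastive_batch(images):
    """Build (pair, y) examples: y=0 -> same image twice, y=1 -> two different images."""
    n = images.shape[0]
    y = torch.randint(0, 2, (n,))
    idx_other = torch.randint(0, n, (n,))
    x2 = torch.where(y.view(-1, 1, 1, 1).bool(), images[idx_other], images)
    return random_transform(images), random_transform(x2), y.float()

def contrastive_loss(images):
    x1, x2, y = contrastive_batch(images)
    z1, z2 = encoder(x1), encoder(x2)
    # predict y from the similarity of the two representations
    logits = -F.cosine_similarity(z1, z2)       # similar pair -> low logit -> y = 0
    return F.binary_cross_entropy_with_logits(logits, y)

images = torch.randn(16, 1, 28, 28)             # fake batch of images
loss = contrastive_loss(images)
loss.backward()                                 # gradients flow into the encoder
```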
---

### Example: Masked Language Models

![](https://i.imgur.com/HeRgOXp.png)

<small>image credit: ([Lample and Conneau, 2019](https://arxiv.org/pdf/1901.07291.pdf))</small>

---

### BERT

![](https://i.imgur.com/zeA1Mix.jpg)

---

## Questions?

---

## Why should any of this work?

---

#### Predicting What You Already Know Helps: Provable Self-Supervised Learning

[(Lee et al, 2020)](https://arxiv.org/abs/2008.01064)

---

### Provable Self-Supervised Learning

Assumptions:

* the observable $X$ decomposes into $X_1, X_2$
* pretext: we are only given $(X_1, X_2)$ pairs
* downstream: we will want to predict $Y$
* $X_1 \perp \!\!\! \perp X_2 \vert Y, Z$
* (+1 additional strong assumption)

---

### Provable Self-Supervised Learning

![](https://i.imgur.com/8SomFq9.png)

$X_1 \perp \!\!\! \perp X_2 \vert Y, Z$

---

### Provable Self-Supervised Learning

![](https://i.imgur.com/K754eB2.png)

---

### Provable Self-Supervised Learning

![](https://i.imgur.com/SuCKgJq.png)

$$
X_1 \perp \!\!\! \perp X_2 \vert Y, Z
$$

---

### Provable Self-Supervised Learning

![](https://i.imgur.com/SuCKgJq.png)

$$
👀 \perp \!\!\! \perp 👄 \vert \text{age}, \text{gender}, \text{ethnicity}
$$

---

### Provable Self-Supervised Learning

If $X_1 \perp \!\!\! \perp X_2 \vert Y$, then

$$
\mathbb{E}[X_2 \vert X_1 = x_1] = \sum_k \mathbb{E}[X_2\vert Y=k]\, \mathbb{P}[Y=k\vert X_1 = x_1]
$$

---

### Provable Self-Supervised Learning

\begin{align}
&\mathbb{E}[X_2 \vert X_1=x_1] = \\
&\left[\begin{matrix} \mathbb{E}[X_2\vert Y=1], \ldots, \mathbb{E}[X_2\vert Y=k]\end{matrix}\right] \left[\begin{matrix} \mathbb{P}[Y=1\vert X_1=x_1]\\ \vdots \\ \mathbb{P}[Y=k\vert X_1=x_1]\end{matrix}\right]
\end{align}

---

### Provable Self-Supervised Learning

\begin{align}
&\mathbb{E}[X_2 \vert X_1=x_1] = \\
&\underbrace{\left[\begin{matrix} \mathbb{E}[X_2\vert Y=1], \ldots, \mathbb{E}[X_2\vert Y=k]\end{matrix}\right]}_\mathbf{A}\left[\begin{matrix} \mathbb{P}[Y=1\vert X_1=x_1]\\ \vdots \\ \mathbb{P}[Y=k\vert X_1=x_1]\end{matrix}\right]
\end{align}

---

### Provable Self-Supervised Learning

$$
\mathbb{E}[X_2 \vert X_1=x_1] = \mathbf{A}\left[\begin{matrix} \mathbb{P}[Y=1\vert X_1=x_1]\\ \vdots \\ \mathbb{P}[Y=k\vert X_1=x_1]\end{matrix}\right]
$$

---

### Provable Self-Supervised Learning

$$
\mathbf{A}^\dagger \mathbb{E}[X_2 \vert X_1=x_1] = \left[\begin{matrix} \mathbb{P}[Y=1\vert X_1=x_1]\\ \vdots \\ \mathbb{P}[Y=k\vert X_1=x_1]\end{matrix}\right]
$$

---

### Provable Self-Supervised Learning

$$
\mathbf{A}^\dagger \underbrace{\mathbb{E}[X_2 \vert X_1=x_1]}_\text{pretext task} = \underbrace{\left[\begin{matrix} \mathbb{P}[Y=1\vert X_1=x_1]\\ \vdots \\ \mathbb{P}[Y=k\vert X_1=x_1]\end{matrix}\right]}_\text{downstream task}
$$

---

### Provable self-supervised learning summary

* under the conditional-independence assumption
* (and assuming the matrix $\mathbf{A}$ is full rank)
* $\mathbb{P}[Y\vert x_1]$ lies in the linear span of $\mathbb{E}[X_2\vert x_1]$
* all we need is a linear model on top of $\mathbb{E}[X_2\vert x_1]$
* note: $\mathbb{P}[Y\vert x_1, x_2]$ would be the really optimal predictor

---

# Recap
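---

### Appendix: the pseudo-inverse identity on toy data (a sketch)

A small NumPy simulation of the identity $\mathbf{A}^\dagger\, \mathbb{E}[X_2\vert X_1=x_1] = \left(\mathbb{P}[Y=k\vert X_1=x_1]\right)_k$. The discrete toy model, the sizes and the constants below are all made up for illustration; the recovered posterior matches the empirical one up to sampling noise.

```python
# Toy numerical check of  A^+ E[X2 | X1 = x1] = ( P[Y=k | X1 = x1] )_k
# under X1 ⊥ X2 | Y. The model, sizes and constants are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, K, M, D = 200_000, 3, 5, 4            # samples, classes, |X1| values, dim(X2)

pi = np.array([0.2, 0.5, 0.3])           # prior P[Y=k]
P_x1_given_y = rng.dirichlet(np.ones(M), size=K)   # K x M table P[X1=j | Y=k]
mu = rng.normal(size=(K, D))             # E[X2 | Y=k], the columns of A

# sample (Y, X1, X2) with X1 and X2 conditionally independent given Y
y = rng.choice(K, size=n, p=pi)
cdf = np.cumsum(P_x1_given_y, axis=1)
x1 = (rng.random(n)[:, None] > cdf[y]).sum(axis=1).clip(max=M - 1)
x2 = mu[y] + 0.1 * rng.normal(size=(n, D))

# estimate the pretext-task target E[X2 | X1 = j] and the matrix A from data
E_x2_given_x1 = np.stack([x2[x1 == j].mean(axis=0) for j in range(M)])   # M x D
A = np.stack([x2[y == k].mean(axis=0) for k in range(K)]).T              # D x K

# recover P[Y | X1 = j] via the pseudo-inverse, compare with empirical frequencies
recovered = (np.linalg.pinv(A) @ E_x2_given_x1.T).T                      # M x K
empirical = np.stack([np.bincount(y[x1 == j], minlength=K) / (x1 == j).sum()
                      for j in range(M)])
print(np.abs(recovered - empirical).max())   # small, up to sampling noise
```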
{"metaMigratedAt":"2023-06-15T20:30:26.163Z","metaMigratedFrom":"YAML","title":"DeepNN Lecture 12 Slides","breaks":true,"description":"Lecture slides on unsupervised representation learning, transfer learning, self-supervised learning","contributors":"[{\"id\":\"e558be3b-4a2d-4524-8a66-38ec9fea8715\",\"add\":14862,\"del\":4818}]"}