Representation and Transfer Learning
Ferenc Huszár (fh277)
slides: https://hackmd.io/@fhuszar/r1HxvooMd
Unsupervised learning
Unsupervised learning goals
UL as distribution modeling
\[ \theta^{ML} = \operatorname{argmax}_\theta \sum_{x_i \in \mathcal{D}} \log p_\theta(x_i) \]
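A minimal sketch of this objective in code (illustrative only; the toy Gaussian model and optimiser settings are assumptions, not from the slides): maximum likelihood fits \(\theta\) by gradient ascent on the summed log-density of the data.

```python
# Minimal sketch (assumed toy model): maximum likelihood estimation of a
# univariate Gaussian's parameters by gradient ascent on sum_i log p_theta(x_i).
import torch

data = torch.randn(1000) * 2.0 + 3.0           # toy dataset D = {x_i}
mu = torch.zeros(1, requires_grad=True)        # theta = (mu, log_sigma)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(500):
    dist = torch.distributions.Normal(mu, log_sigma.exp())
    loss = -dist.log_prob(data).sum()          # negative sum_i log p_theta(x_i)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(mu.item(), log_sigma.exp().item())       # should approach roughly 3.0 and 2.0
```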
Deep learning for modelling distributions
Latent variable models
\[ p_\theta(x) = \int p_\theta(x, z) dz \]
Latent variable models
\[ p_\theta(x) = \int p_\theta(x\vert z) p_\theta(z) dz \]
Motivation 1
"it makes sense"
Motivation 2
manifold assumption
Motivation 3
from simple to complicated
\[ p_\theta(x) = \int p_\theta(x\vert z) p_\theta(z) dz \]
Motivation 3
from simple to complicated
\[ \underbrace{p_\theta(x)}_\text{complicated} = \int \underbrace{p_\theta(x\vert z) }_\text{simple}\underbrace{p_\theta(z)}_\text{simple} dz \]
Motivation 3
from simple to complicated
\[ \underbrace{p_\theta(x)}_\text{complicated} = \int \underbrace{\mathcal{N}\left(x; \mu_\theta(z), \operatorname{diag}(\sigma_\theta(z)) \right)}_\text{simple}\underbrace{\mathcal{N}(z; 0, I)}_\text{simple} dz \]
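A minimal sketch of how this integral can be approximated (the two-layer decoder network for \(\mu_\theta, \sigma_\theta\) is a made-up stand-in): sample \(z\) from the simple prior \(\mathcal{N}(0, I)\), evaluate the simple Gaussian likelihood, and average by Monte Carlo.

```python
# Minimal sketch (assumed architecture): Monte Carlo estimate of
# p_theta(x) = E_{z ~ N(0, I)} [ N(x; mu_theta(z), diag(sigma_theta(z))) ]
import torch
import torch.nn as nn

latent_dim, data_dim = 2, 5

# made-up decoder producing mu_theta(z) and log sigma_theta(z)
decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.Tanh(),
                        nn.Linear(32, 2 * data_dim))

def log_p_x(x, num_samples=1000):
    z = torch.randn(num_samples, latent_dim)               # z ~ p(z) = N(0, I)
    mu, log_sigma = decoder(z).chunk(2, dim=-1)
    px_given_z = torch.distributions.Normal(mu, log_sigma.exp())
    log_lik = px_given_z.log_prob(x).sum(-1)               # log N(x; mu_theta(z), diag sigma_theta(z))
    # log of the sample average of exp(log_lik): Monte Carlo estimate of log p(x)
    return torch.logsumexp(log_lik, dim=0) - torch.log(torch.tensor(float(num_samples)))

x = torch.randn(data_dim)
print(log_p_x(x))
```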
Motivation 4
variational learning
Variational autoencoder
Variational learning
\[ \theta^\text{ML} = \operatorname{argmax}_\theta \sum_{x_i \in \mathcal{D}} \log p_\theta(x_i) \]
Variational learning
\[ \mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \log p_\theta(x_i) - \operatorname{KL}[q_\psi(z\vert x_i) \| p_\theta(z\vert x_i)] \]
Variational learning
\[ \mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \log p_\theta(x_i) + \mathbb{E}_{z\sim q_\psi} \log \frac{p_\theta(z\vert x_i)}{q_\psi(z\vert x_i)} \]
Variational learning
\[ \mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \mathbb{E}_{z\sim q_\psi} \log \frac{p_\theta(z\vert x_i) p_\theta(x_i)}{q_\psi(z\vert x_i)} \]
Variational learning
\[ \mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \mathbb{E}_{z\sim q_\psi} \log \frac{p_\theta(z, x_i)}{q_\psi(z\vert x_i)} \]
Variational learning
\[ \mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \mathbb{E}_{z\sim q_\psi(z\vert x_i)} \log p_\theta(x_i\vert z) - \operatorname{KL}[q_\psi(z\vert x_i) \| p_\theta(z)] \]
Variational learning
\[ \mathcal{L}(\theta, \psi) = \sum_{x_i \in \mathcal{D}} \underbrace{\mathbb{E}_{z\sim q_\psi(z\vert x_i)} \log p_\theta(x_i\vert z)}_\text{reconstruction} - \operatorname{KL}[q_\psi(z\vert x_i) \| p_\theta(z)] \]
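A minimal PyTorch sketch of this final form (the encoder/decoder MLPs and Bernoulli likelihood are assumptions for illustration, not the lecture's architecture): the encoder amortises \(q_\psi(z\vert x)\) as a diagonal Gaussian, a reparameterised sample gives the reconstruction term, and the KL against the standard-normal prior is available in closed form.

```python
# Minimal sketch (assumed architecture): per-example ELBO
#   E_{z~q_psi(z|x)} log p_theta(x|z) - KL[q_psi(z|x) || p_theta(z)]
import torch
import torch.nn as nn

data_dim, latent_dim = 784, 16
encoder = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU(),
                        nn.Linear(256, 2 * latent_dim))    # -> mu_psi, log sigma_psi
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, data_dim))           # -> Bernoulli logits

def elbo(x):
    mu, log_sigma = encoder(x).chunk(2, dim=-1)
    q = torch.distributions.Normal(mu, log_sigma.exp())     # q_psi(z|x)
    z = q.rsample()                                          # reparameterised sample
    logits = decoder(z)
    # reconstruction term: single-sample estimate of E_q log p_theta(x|z)
    recon = torch.distributions.Bernoulli(logits=logits).log_prob(x).sum(-1)
    # KL[q_psi(z|x) || N(0, I)] in closed form
    prior = torch.distributions.Normal(torch.zeros_like(mu), torch.ones_like(mu))
    kl = torch.distributions.kl_divergence(q, prior).sum(-1)
    return (recon - kl).mean()

x = torch.rand(32, data_dim).round()     # toy batch of binary "images"
loss = -elbo(x)                          # maximise the ELBO by minimising -ELBO
```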
(Kingma and Welling, 2019) Variational Autoencoder
Variational autoencoder: interpretable \(z\)
Discussion of max likelihood
Representation learning vs max likelihood
Self-supervised learning
Basic idea
Example: jigsaw puzzles
(Noroozi and Favaro, 2016)
Data-efficiency in downstream task
(Hénaff et al, 2020)
Linearity in downstream task
(Chen et al, 2020)
Several self-supervised methods
Example: instance classification
Example: contrastive learning
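A minimal sketch of a contrastive (InfoNCE-style) objective in the spirit of SimCLR (Chen et al, 2020); the encoder and augmentations are placeholders here. Two augmented views of the same image form a positive pair, and every other image in the batch acts as a negative.

```python
# Minimal sketch (assumed encoder/augmentations): InfoNCE-style contrastive loss.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two augmented views of the same images."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature       # cosine similarity of every pair of views
    labels = torch.arange(z1.size(0))        # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

# toy usage with random "embeddings" standing in for encoder(augment(x))
z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
loss = info_nce(z1, z2)
```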
Example: Masked Language Models
image credit: (Lample and Conneau, 2019)
BERT
Questions?
Why should any of this work?
Predicting What you Already Know Helps: Provable Self-Supervised Learning
(Lee et al, 2020)
Provable Self-Supervised Learning
Assumptions:
Provable Self-Supervised Learning
\(X_1 \perp \!\!\! \perp X_2 \vert Y, Z\)
Provable Self-Supervised Learning
\[ X_1 \perp \!\!\! \perp X_2 \vert Y, Z \]
Provable Self-Supervised Learning
\[ 👀 \perp \!\!\! \perp 👄 \vert \text{age}, \text{gender}, \text{ethnicity} \]
Provable Self-Supervised Learning
If \(X_1 \perp \!\!\! \perp X_2 \vert Y\), then
\[ \mathbb{E}[X_2 \vert X_1 = x_1] = \sum_k \mathbb{E}[X_2\vert Y=k] \mathbb{P}[Y=k\vert X_1 = x_1] \]
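A tiny numerical sketch of this identity (the discrete joint distribution below is made up for illustration): when \(X_1 \perp\!\!\!\perp X_2 \vert Y\), the pretext target \(\mathbb{E}[X_2 \vert X_1 = x_1]\) is a mixture of the per-class means \(\mathbb{E}[X_2 \vert Y = k]\), weighted by \(\mathbb{P}[Y = k \vert X_1 = x_1]\).

```python
# Minimal sketch (made-up joint distribution): numerically verify
#   E[X2 | X1=x1] = sum_k E[X2 | Y=k] P[Y=k | X1=x1]   when X1 is independent of X2 given Y.
import numpy as np

rng = np.random.default_rng(0)
K, n1, vals2 = 3, 4, np.array([0.0, 1.0, 2.0])            # |Y|, |X1|, support of X2

p_y = rng.dirichlet(np.ones(K))                            # P[Y=k]
p_x1_given_y = rng.dirichlet(np.ones(n1), size=K)          # rows: P[X1 | Y=k]
p_x2_given_y = rng.dirichlet(np.ones(len(vals2)), size=K)  # rows: P[X2 | Y=k]

# joint P[Y, X1, X2] built so that X1 and X2 are independent given Y
joint = p_y[:, None, None] * p_x1_given_y[:, :, None] * p_x2_given_y[:, None, :]

x1 = 2
p_y_x1 = joint[:, x1, :].sum(-1)                           # P[Y=k, X1=x1]
p_y_given_x1 = p_y_x1 / p_y_x1.sum()                       # P[Y=k | X1=x1]

lhs = (joint[:, x1, :].sum(0) / joint[:, x1, :].sum()) @ vals2  # E[X2 | X1=x1]
rhs = p_y_given_x1 @ (p_x2_given_y @ vals2)                # sum_k E[X2|Y=k] P[Y=k|X1=x1]
assert np.isclose(lhs, rhs)
```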
Provable Self-Supervised Learning
\begin{align} &\mathbb{E}[X_2 \vert X_1=x_1] = \\ &\left[\begin{matrix} \mathbb{E}[X_2\vert Y=1], \ldots, \mathbb{E}[X_2\vert Y=k]\end{matrix}\right] \left[\begin{matrix} \mathbb{P}[Y=1\vert X_1=x_1]\\ \vdots \\ \mathbb{P}[Y=k\vert X_1=x_1]\end{matrix}\right] \end{align}
Provable Self-Supervised Learning
\begin{align} &\mathbb{E}[X_2 \vert X_1=x_1] = \\ &\underbrace{\left[\begin{matrix} \mathbb{E}[X_2\vert Y=1], \ldots, \mathbb{E}[X_2\vert Y=k]\end{matrix}\right]}_\mathbf{A}\left[\begin{matrix} \mathbb{P}[Y=1\vert X_1=x_1]\\ \vdots \\ \mathbb{P}[Y=k\vert X_1=x_1]\end{matrix}\right] \end{align}
Provable Self-Supervised Learning
\[ \mathbb{E}[X_2 \vert X_1=x_1] = \mathbf{A}\left[\begin{matrix} \mathbb{P}[Y=1\vert X_1=x_1]\\ \vdots \\ \mathbb{P}[Y=k\vert X_1=x_1]\end{matrix}\right] \]
Provable Self-Supervised Learning
\[ \mathbf{A}^\dagger \mathbb{E}[X_2 \vert X_1=x_1] = \left[\begin{matrix} \mathbb{P}[Y=1\vert X_1=x_1]\\ \vdots \\ \mathbb{P}[Y=k\vert X_1=x_1]\end{matrix}\right] \]
Provable Self-Supervised Learning
\[ \mathbf{A}^\dagger \underbrace{\mathbb{E}[X_2 \vert X_1=x_1]}_\text{pretext task} = \underbrace{\left[\begin{matrix} \mathbb{P}[Y=1\vert X_1=x_1]\\ \vdots \\ \mathbb{P}[Y=k\vert X_1=x_1]\end{matrix}\right]}_\text{downstream task} \]
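Continuing the same idea in code (all quantities below are made up for illustration): stacking the class-conditional means \(\mathbb{E}[X_2 \vert Y=k]\) as the columns of \(\mathbf{A}\), the pseudo-inverse \(\mathbf{A}^\dagger\) applied to the pretext prediction \(\mathbb{E}[X_2 \vert X_1 = x_1]\) recovers the downstream posterior \(\mathbb{P}[Y \vert X_1 = x_1]\), provided \(\mathbf{A}\) has full column rank.

```python
# Minimal sketch (made-up quantities): recovering the downstream posterior
# P[Y | X1=x1] from the pretext prediction E[X2 | X1=x1] via the pseudo-inverse of A.
import numpy as np

rng = np.random.default_rng(1)
K, d2 = 3, 5                                    # number of classes, dimension of X2

A = rng.normal(size=(d2, K))                    # columns: E[X2 | Y=k], assumed known
p_y_given_x1 = rng.dirichlet(np.ones(K))        # true (unknown) P[Y=k | X1=x1]

# under X1 independent of X2 given Y, the pretext target is a mixture of A's columns
pretext_prediction = A @ p_y_given_x1           # E[X2 | X1=x1]

recovered = np.linalg.pinv(A) @ pretext_prediction
assert np.allclose(recovered, p_y_given_x1)     # exact: A has full column rank
```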
Provable self-supervised learning summary
Recap