Meta Learning - ML 2019

---
tags: Machine Learning - Hung Yi Lee
---
Meta Learning - ML 2019
===
[TOC]

# [MAML (1/9)](https://www.youtube.com/watch?v=EkAqYbpCYAc&list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4&index=33&t=257s)
## Introduction of Meta Learning
- Life-long Learning = one model for all the tasks
- Meta Learning = how to learn a new model

### Machine Learning
![](https://i.imgur.com/bhSDOkq.png)

### Meta Learning
![](https://i.imgur.com/X7FMwbJ.png)

- Machine learning can be described as the ability to find a function $f$ from data
    - The input of $f$ is a (single) data point and the output is a prediction
- Meta learning can be described as the ability to find a function $F$ from data, where $F$ in turn finds the function $f$ described above
    - The input of $F$ is a dataset and the output is a function $f$ (possibly the (hyper?)parameters of an NN)

# [MAML (2/9)](https://www.youtube.com/watch?v=9k4ND-xjcgM&list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4&index=33)
1. Define a set of learning algorithms
    - Let the machine design parts of the learning algorithm itself, e.g. the network structure, the initialization parameters, how to optimize once the gradient is obtained, the activation function, and so on.
2. Define the goodness of a function $F$
    - $L(F) = \sum\limits_{n=1}^N l^n$, i.e. the sum of the **test loss** over all $N$ tasks

# [MAML (3/9)](https://www.youtube.com/watch?v=PznN0w7dYc0&list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4&index=34)
- In meta learning, the training data inside a task is called the **support set** and the testing data is called the **query set**
- If training each task took a day or longer, meta-learning research would be hard to carry out, so each task is usually assumed to have very little training data. This is why meta learning is so often tied to **few-shot learning**

# [MAML (4/9)](https://www.youtube.com/watch?v=knaAdp5uWRg&list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4&index=35)
Omniglot - Few-shot Classification
- 1623 characters
- Each has 20 examples
- **N-ways K-shot classification: in every training and test task, each task has N classes and each class has K examples**
- To do meta learning on this dataset, first split the classes into training classes and testing classes. Then, for each training task, randomly sample N classes and, for each class, sample K examples. Testing tasks are built the same way.

# [MAML (5/9)](https://www.youtube.com/watch?v=vUwOA3SNb_E&list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4&index=36)
## Techniques Today
- MAML
    - Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks.
ICML 2017
- Reptile
    - On First-Order Meta-Learning Algorithms. arXiv 2018
- Matching network
- Prototypical network

## MAML
Learn the best **initialization parameters**

![](https://i.imgur.com/x4zYD3R.png)

- All tasks share the same initialization, so the model structure must be the same for all tasks
- $\phi$: the (learned) initialization parameters
- $\hat\theta^n$: the parameters of the task-$n$ model after training from the initialization $\phi$

### Difference between MAML and Model Pre-training
Omitted.

### In practice
- Only one step of gradient descent is considered, because:
    - It is fast
    - If the initialization is really good, a single update can already give a good result
    - At testing time the parameters can still be updated many times
    - Few-shot learning has very little data, so too many updates may overfit

# [MAML (6/9)](https://www.youtube.com/watch?v=dV-Crj8hsJM&list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4&index=37)
Experiments on Omniglot & Mini-ImageNet

# [MAML (7/9)](https://www.youtube.com/watch?v=mxqzGwP_Qys&list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4&index=38)
## Math
### Gradient descent for meta learning and gradient descent for a task
![](https://i.imgur.com/fiz0B6Y.png)

### Loss of meta learning
![](https://i.imgur.com/98QalMd.png)

### Using the chain rule to compute the partial derivative of the post-training loss with respect to each initial parameter $\phi_i$
![](https://i.imgur.com/pgvPPxq.png)

Every initial parameter $\phi_i$ affects every trained parameter $\hat\theta_j$, which in turn affects the final loss function, so

$\dfrac{\partial l(\hat\theta)}{\partial\phi_i} = \sum\limits_j \dfrac{\partial l(\hat\theta)}{\partial \hat\theta_j}\dfrac{\partial\hat\theta_j}{\partial\phi_i}$

- The troublesome part is computing $\dfrac{\partial\hat\theta_j}{\partial\phi_i}$
- To compute $\dfrac{\partial\hat\theta_j}{\partial\phi_i}$, first note that $\hat\theta_j = \phi_j - \epsilon\dfrac{\partial l(\phi)}{\partial \phi_j}$
    - $\epsilon$ is the learning rate used when training each task
    - When $i\neq j$, $\dfrac{\partial\hat\theta_j}{\partial\phi_i} = -\epsilon\dfrac{\partial^2 l(\phi)}{\partial\phi_i\partial\phi_j}$
    - When $i = j$, $\dfrac{\partial\hat\theta_j}{\partial\phi_i} = 1 -\epsilon\dfrac{\partial^2 l(\phi)}{\partial\phi_i\partial\phi_j}$

However, computing second derivatives is expensive, so the MAML paper uses a first-order approximation that simply drops the second-derivative terms (WTF?), which gives
- When $i\neq j$, $\dfrac{\partial\hat\theta_j}{\partial\phi_i}\approx 0$
- When $i = j$, $\dfrac{\partial\hat\theta_j}{\partial\phi_i}\approx 1$
- 
***But why can the second derivative be approximated by 0???***

Substituting back into $\dfrac{\partial l(\hat\theta)}{\partial\phi_i} = \sum\limits_j \dfrac{\partial l(\hat\theta)}{\partial \hat\theta_j}\dfrac{\partial\hat\theta_j}{\partial\phi_i}$ gives $\dfrac{\partial l(\hat\theta)}{\partial\phi_i} \approx \dfrac{\partial l(\hat\theta)}{\partial\hat\theta_i}$

![](https://i.imgur.com/diTRQnD.png)

### In the end, gradient descent on $\phi$ actually uses the gradient evaluated at $\hat\theta^n$

# [MAML (8/9)](https://www.youtube.com/watch?v=3z997JhL9Oo&list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4&index=39)
## Real Implementation
![](https://i.imgur.com/QvR1ASn.png)

1. Each mini-batch samples batch_size tasks; with SGD, only one task is sampled

The update direction for $\phi$ is the same as the gradient direction at $\hat\theta$, so it can be viewed as the gradient direction computed for the second update of $\theta$

### Can be applied in practice to machine translation tasks
Paper: arXiv 1808.08437

# [MAML (9/9)](https://www.youtube.com/watch?v=9jJe2AD35P8&list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4&index=40)
## Reptile
- Paper: Reptile: A Scalable Meta-Learning Algorithm.

![](https://i.imgur.com/HyG6dmS.png)

- Reptile is not restricted to a single parameter update; the parameters after training on task $n$ are $\hat\theta^n$
- Look directly at which direction leads from $\phi_0$ to $\hat\theta^n$, and use that direction as the gradient to update $\phi$

### Reptile vs. MAML vs. Pre-train
![](https://i.imgur.com/7ZCprYF.png)

## Crazy Idea
![](https://i.imgur.com/xF3UTgr.png)

# [Gradient Descent as LSTM (1/3)](https://www.youtube.com/watch?v=NjZygLDXxjg&list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4&index=41)
Two papers:
- Optimization as a Model for Few-shot Learning. ICLR 2017
- Learning to learn by gradient descent by gradient descent.
NIPS 2016

## Review of RNN & LSTM
![](https://i.imgur.com/wT3N8nk.png)
![](https://i.imgur.com/ZJoGl4W.png)

# [Gradient Descent as LSTM (2/3)](https://www.youtube.com/watch?v=G_xYYq772NQ&list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4&index=42)
![](https://i.imgur.com/mjwrz52.png)

LSTM is actually very similar to gradient descent
- $\theta^t = \theta^{t-1} - \eta\nabla_\theta l$
- $c^t = z^f\odot c^{t-1} + z^i\odot z$
- When $c^t = \theta^t, c^{t-1} = \theta^{t-1}, z = -\nabla_\theta l; z^f = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}, z^i = \begin{bmatrix} \eta \\ \eta \\ \vdots \\ \eta \end{bmatrix}$, the LSTM update is exactly gradient descent
- So gradient descent can be seen as a simplified LSTM
- If the LSTM learns $z^i$ by itself, it is learning a dynamic learning rate; if it learns $z^f$ by itself, it shrinks the parameters (since the values lie between 0 and 1), so $z^f$ can be seen as playing the role of regularization

![](https://i.imgur.com/BJszAhk.png)

- Here it is specifically assumed that $\nabla_\theta l$ does not depend on $\theta$, so the model can be trained like an ordinary LSTM; otherwise doing gradient descent becomes very cumbersome

# [Gradient Descent as LSTM (3/3)](https://www.youtube.com/watch?v=p0Tn8oZWZbQ&list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4&index=43)
In practice there are millions of parameters, but a million-dimensional LSTM is not feasible, so
1. There is only one LSTM cell
2. All parameters share the same LSTM

![](https://i.imgur.com/qfgYv8G.png)

- The models used for training and testing can be different (***0.0 need to think about this more...***)

![](https://i.imgur.com/3NZlKw3.png)

- The lecturer thinks this approach is quite reasonable, and nobody has done it yet

# [Metric-based (1/3)](https://www.youtube.com/watch?v=yyKaACh_j3M&list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4&index=44)
Already familiar material, so the notes here are brief

## Siamese Network
![](https://i.imgur.com/WzF33m4.png)

- Training and testing are done in one go, e.g. the NN directly outputs whether the testing image and the training image show the same person

# [Metric-based (2/3)](https://www.youtube.com/watch?v=scK2EIT7klw&list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4&index=45)
Papers:
- What kind of distance should we use?
    - SphereFace: Deep Hypersphere Embedding for Face Recognition
    - Additive Margin Softmax for Face Verification
    - ArcFace: Additive Angular Margin Loss for Deep Face Recognition
- Triplet loss
    - Deep Metric Learning using Triplet Network
    - FaceNet: A Unified Embedding for Face Recognition and Clustering

# [Metric-based (3/3)](https://www.youtube.com/watch?v=semSxPP2Yzg&list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4&index=46)
How should 5-ways 1-shot be handled?

## Prototypical Network
Prototypical Networks for Few-shot Learning. NIPS 2017

![](https://i.imgur.com/dEEvG8H.png)

Embed the different images with the same CNN, compute their similarities, turn the similarities into output probabilities with softmax, and finally run gradient descent on the cross-entropy loss
- And how about few-shot? For the testing data, average the embeddings of the different images of the same class; a new image is assigned to the class whose average is closest

## Matching Network
Matching Network is older, does not perform better, and also relies on a memory network, so it is not covered in detail

![](https://i.imgur.com/yjy0v5S.png)

- Each image is processed with a bidirectional LSTM
- The rest is the same as the prototypical network

## Relation Network
![](https://i.imgur.com/6IyzPBW.png)

- The training embedding and the testing embedding are concatenated
- The similarity is learned by an NN, not hand-designed
- The embedding and the similarity are jointly trained

## Few-shot learning for Imaginary Data
![](https://i.imgur.com/tDBtwwk.png)

- Train a generator to imagine several variations of a given person, then feed them to the NN for training
- The generator and the NN can be jointly trained

# [Meta Learning - Train+Test as RNN](https://www.youtube.com/watch?v=ePimv_k-H24&list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4&index=48)
Not looked at in detail yet, but it introduces
- MANN: Memory Augmented Neural Network
- SNAIL: Simple Neural Attentive Meta-Learner

# [Automatically Tuning Hyperparameters](https://www.youtube.com/watch?v=c10nxBcSH14&list=PLJV_el3uVTsOK_ZK5L0Iv_EQoL1JefRL4)
- Grid Search
- Random Search
    - top K results are good enough
    - With N hyperparameter combinations, if random search draws x samples, the probability of hitting at least one of the top K is $1-(1-K/N)^x$ (the probability of missing all of them is $(1-K/N)^x$). With N=1000, getting a result in the top 10 (K=10) with probability above 90% takes only 230 experiments; if the top 100 is good enough, 22 samples suffice.

Model-based Hyperparameter Optimization
- Bayesian approach

![](https://i.imgur.com/ujHohNU.png)

### Recent approaches
#### AutoML Reinforcement Learning
![](https://i.imgur.com/CZLvd0o.png)

#### Learning Rate
**Google - PowerSign**

All the optimizer strategies we see today, including SGD, RMSProp, and Adam, can be viewed as being composed of three operations

![](https://i.imgur.com/0gm1IMV.png)

#### Activation Function
![](https://i.imgur.com/4TKSDop.png)

#### Neural Architecture Search with Reinforcement Learning
![](https://i.imgur.com/t3EP7sh.png)

**Does NAS research take a lot of compute?**

Efficient Neural Architecture Search via Parameter Sharing. arXiv 2018
- Takes less than 16 hours on an Nvidia GTX 1080Ti GPU
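
### Checking the random search estimate
Going back to the random search item in this hyperparameter section, the sample counts quoted there (230 and 22) can be checked with a few lines of Python. This is only a minimal sketch: the function names `hit_probability` and `samples_needed` are made up for illustration, and it assumes each sample is an independent uniform draw from the N combinations (sampling with replacement), which is what the formula $1-(1-K/N)^x$ implies.

```python
import math


def hit_probability(n_configs: int, top_k: int, n_samples: int) -> float:
    """Probability that at least one of n_samples independent uniform
    draws lands in the top_k out of n_configs combinations."""
    miss_once = 1.0 - top_k / n_configs   # one draw misses the top K
    return 1.0 - miss_once ** n_samples   # complement of missing every time


def samples_needed(n_configs: int, top_k: int, target: float) -> int:
    """Smallest number of draws whose hit probability reaches `target`."""
    miss_once = 1.0 - top_k / n_configs
    # Solve (miss_once)^x <= 1 - target for x and round up.
    return math.ceil(math.log(1.0 - target) / math.log(miss_once))


if __name__ == "__main__":
    print(samples_needed(1000, 10, 0.9))    # top 10 out of 1000
    print(samples_needed(1000, 100, 0.9))   # top 100 out of 1000
```

The required number of samples depends only on the ratio K/N and the target probability, not on N itself, which is why random search stays cheap even when the search space is large.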