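# Elastic Weight Consolidation

Github Code: https://github.com/moskomule/ewc.pytorch/tree/master
Github Code2: https://github.com/kuc2477/pytorch-ewc
Lecture notes: https://www.khoury.northeastern.edu/home/hand/teaching/cs7150-summer-2020/Continual_Learning_and_Catastrophic_Forgetting.pdf
Medium with code: https://pub.towardsai.net/overcoming-catastrophic-forgetting-a-simple-guide-to-elastic-weight-consolidation-122d7ac54328
Another paper with the mathematical derivation: https://arxiv.org/pdf/2105.04093.pdf
https://blog.csdn.net/qq_43428929/article/details/125868016

![image](https://hackmd.io/_uploads/SJ3WaIsx0.png)

Laplace approximation??? Fisher information matrix???

## Concept

![image](https://hackmd.io/_uploads/ryLz2JTeC.png)

The blue line is fine-tuning without any constraint: it optimizes directly for the new task, and the result no longer works on the original task.

The green line puts the same constraint on every direction of movement (L2 penalty), so the new task is not learned well.

The update direction therefore has to be controlled elastically, so that the parameters end up somewhere that is good for both Task A and Task B (the red line).

It is a bit like attaching a spring that pulls each parameter back toward where it was, with a different strength for each spring (be imagined as a spring anchoring the parameters to the previous solution...Importantly, the stiffness of this spring should not be the same for all parameters;)

Many configurations of θ will result in the same performance: first assume that Task A and Task B have a common solution. Such a solution should exist (if we train on Task A and Task B together, the solution we get is a common solution).

1. Pick out the weights that matter most for the old task.
2. Building on step 1, rank the weights by importance.
3. While learning the new task, keep the weights from step 2 from changing too much. That is, add a regularization term to the loss function: the more an important weight changes, the larger the penalty.

## Math

![image](https://hackmd.io/_uploads/BJd-GxTlC.png)

Look at the neural network from a probabilistic point of view and apply Bayes' theorem to get the following expression.

![image](https://hackmd.io/_uploads/BJbE8e6eR.png)

My understanding:

$P(\theta|D)$: given some data, the probability of each value the parameters could take (the posterior)
$P(D|\theta)$: given a set of parameters, the probability of getting that data right (the likelihood)
$P(\theta)$: the distribution of the parameters before any data has been seen (the prior)

If the data is split into two parts, A and B, and trained on sequentially:

![image](https://hackmd.io/_uploads/ry_6vg6xA.png)

The term highlighted in yellow can be replaced by $-Loss(\theta)$ (by its meaning, a larger $P(D|\theta)$ is better, while a smaller loss is better, so the two run in opposite directions).

![image](https://hackmd.io/_uploads/SJuKKxpe0.png)

Actually computing $P(\theta | A)$ is intractable. Following the Laplace approximation, it is approximated by a Gaussian distribution whose mean is $\theta^*_A$ and whose diagonal precision (the inverse of the variance) is given by the diagonal of the Fisher information matrix.

![image](https://hackmd.io/_uploads/rJiPYxTgA.png)

![image](https://hackmd.io/_uploads/rklNSZpeA.png)

My understanding: the diagonal of the Fisher information matrix says how certain Task A is about each parameter. It acts as a precision, so a large entry means low uncertainty. When fine-tuning on the new task, a parameter with a large Fisher diagonal entry gets a large penalty for moving away from $\theta^*_A$, so it tends to stay at the value that was good for the original task. Conversely, a parameter with a small Fisher diagonal entry gets a small penalty and can move more freely.

λ is a tunable hyperparameter that sets how strongly parameter movement is restricted.

## Experiments

Train Task A, Task B, and Task C in order. Once a task has finished training, its data is never used again.

Data used in the experiments: the MNIST dataset, where each task applies its own fixed random permutation.

Three approaches are compared: SGD with dropout, L2 regularization, and EWC.

![image](https://hackmd.io/_uploads/HyyubGagC.png)

SGD: catastrophic forgetting
L2: fine-tuning cannot learn the new task at all
EWC: good overall performance

![image](https://hackmd.io/_uploads/rJItfMTlA.png)

The paper also examines whether the network tends to use different regions for different tasks or to share parameters. The finding is that the more the tasks differ, the less the early layers overlap; the more similar the tasks are, the more weights are shared.

![image](https://hackmd.io/_uploads/HJ8fmMTgC.png)

Close to the output, however, the parameters are essentially shared.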
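
The penalty from the math section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code from the linked repositories: the model is a toy logistic regression (so the diagonal Fisher information has the closed form $p(1-p)x_i^2$ instead of being estimated from sampled gradients), and the names `diag_fisher`, `ewc_penalty`, `task_a_inputs`, and the numbers used are all hypothetical. The penalty itself follows the usual EWC form $\frac{\lambda}{2}\sum_i F_i(\theta_i - \theta^*_{A,i})^2$, which is added to the Task B loss.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def diag_fisher(theta, inputs):
    """Diagonal Fisher information of a logistic regression at theta.

    For p = sigmoid(theta . x), the gradient of log p(y|x) is (y - p) * x,
    and E_y[(y - p)^2] = p * (1 - p), so F_i = mean over x of p(1-p) * x_i^2.
    """
    fisher = [0.0] * len(theta)
    for x in inputs:
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for i, xi in enumerate(x):
            fisher[i] += p * (1.0 - p) * xi * xi
    return [f / len(inputs) for f in fisher]


def ewc_penalty(theta, theta_star_a, fisher, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta*_A,i)^2."""
    return 0.5 * lam * sum(
        f * (t - ts) ** 2 for f, t, ts in zip(fisher, theta, theta_star_a)
    )


# Hypothetical Task A: the second feature is always 0, so Task A carries
# no information about the second weight.
task_a_inputs = [[1.0, 0.0], [-1.0, 0.0], [0.5, 0.0]]
theta_star_a = [2.0, 0.0]  # assumed solution found on Task A
fisher = diag_fisher(theta_star_a, task_a_inputs)

# Moving the weight Task A does not care about costs nothing ...
free_move = ewc_penalty([2.0, 1.0], theta_star_a, fisher, lam=10.0)
# ... while moving the weight Task A depends on is penalized.
stiff_move = ewc_penalty([3.0, 0.0], theta_star_a, fisher, lam=10.0)
print(fisher, free_move, stiff_move)
```

During training on Task B, the total loss would be the Task B loss plus `ewc_penalty(...)`. Each weight gets its own "spring stiffness" `fisher[i]`, which is what separates EWC from a plain L2 penalty with the same strength in every direction.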