Overcoming catastrophic forgetting in neural networks

---
title: Overcoming catastrophic forgetting in neural networks # presentation title
tags: LLL # presentation tags
slideOptions: # presentation settings
  progress: true
  slideNumber: true
  overview: true
  theme: solarized # color theme
  transition: 'fade' # slide transition animation
  spotlight:
    enabled: true
  # parallaxBackgroundImage: 'https://s3.amazonaws.com/hakim-static/reveal-js/reveal-parallax-1.jpg'
---

# 7/5 Paper #7

---

[toc]

## Overcoming catastrophic forgetting in neural networks

---

## Background

Published in the journal PNAS.

----

## Authors

James Kirkpatrick<sup>a</sup>, Razvan Pascanu<sup>a</sup>, Neil Rabinowitz<sup>a</sup>, Joel Veness<sup>a</sup>, Guillaume Desjardins<sup>a</sup>, Andrei A. Rusu<sup>a</sup>, Kieran Milan<sup>a</sup>, John Quan<sup>a</sup>, Tiago Ramalho<sup>a</sup>, Agnieszka Grabska-Barwinska<sup>a</sup>, Demis Hassabis<sup>a</sup>, Claudia Clopath<sup>b</sup>, Dharshan Kumaran<sup>a</sup>, and Raia Hadsell<sup>a</sup>

<sup>a</sup> DeepMind, London, N1C 4AG, United Kingdom
<sup>b</sup> Bioengineering Department, Imperial College London, SW7 2AZ, London, United Kingdom

----

## First author

![](https://i.imgur.com/KXhnqnp.png)

## Second author

![](https://i.imgur.com/iFRoH9y.png)

---

## Abstract

The paper addresses catastrophic forgetting. When a model learns tasks sequentially, its performance on the first task often drops, sometimes collapsing entirely, once it has finished learning the second task. This phenomenon is called catastrophic forgetting. The paper proposes a new method called EWC (**E**lastic **W**eight **C**onsolidation) and tests it on MNIST.

---

## Overview

![](https://i.imgur.com/T3BPswE.png)

----

Some of the model's weights are important for performance on task A. If we leave those weights largely unchanged and instead update the weights that matter less when training on the next task, the model can be trained to do well on both tasks.

The main change is an extra regularization term in the loss function. It identifies which weights must be protected and which weights can move without much effect on the loss, so that optimization can find a better position: one that fits the second task while preserving performance on the first.

---

## Function

----

Bayes' rule:

![](https://i.imgur.com/G6GAFGz.png)

![](https://i.imgur.com/7zwC3Rl.png)

----

![](https://i.imgur.com/bgisAqk.png)

$F$ is the Fisher information matrix. Each diagonal entry can be thought of as a number that measures how important a weight is to the previous task. For the mathematical derivation of the Fisher information matrix, see [知乎](https://www.cnblogs.com/chason95/articles/9892560.html).

Roughly, the approach is to look at the second-derivative term. When optimizing for the new task, the regularization term determines which parameters must not change much and which can change more without hurting performance on the first task.

----

$\log p(\theta|D_A)$ is intractable, so we approximate the posterior as a Gaussian distribution with mean given by $\theta_A^*$ and a diagonal precision given by the diagonal of the Fisher information matrix $F$.

----

$F$ has three properties:

1. It is equivalent to the second derivative of the loss near a minimum.
2. It can be computed from first-order derivatives alone and is thus easy to calculate even for large models.
3. It is guaranteed to be positive semidefinite.

----

## Experiment

![](https://i.imgur.com/j3n7z9P.png)
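
----

The regularization term described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a logistic-regression model for task A, uses the empirical diagonal Fisher (squared per-example gradients of the log-likelihood, evaluated with the observed labels), and writes the penalty in the form $\sum_i \frac{\lambda}{2} F_i (\theta_i - \theta^*_{A,i})^2$. The function names, the toy data, and the value of `lam` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def diagonal_fisher(theta, X, y):
    """Empirical diagonal Fisher for logistic regression (an assumed toy model).

    Uses only first-order derivatives: the mean of the squared per-example
    gradients of log p(y | x, theta).
    """
    p = sigmoid(X @ theta)
    per_example_grads = (y - p)[:, None] * X  # d/dtheta log p(y | x, theta)
    return np.mean(per_example_grads ** 2, axis=0)


def ewc_penalty(theta, theta_star_a, fisher_diag, lam):
    """Quadratic penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_A,i)^2."""
    return 0.5 * lam * np.sum(fisher_diag * (theta - theta_star_a) ** 2)


def ewc_penalty_grad(theta, theta_star_a, fisher_diag, lam):
    """Gradient of the penalty, to be added to the task-B loss gradient."""
    return lam * fisher_diag * (theta - theta_star_a)


# Toy task A (hypothetical data): the third feature is tiny, so the
# likelihood is almost insensitive to its weight.
rng = np.random.default_rng(0)
X_a = rng.normal(size=(200, 3))
X_a[:, 2] *= 0.01
theta_star_a = np.array([1.0, -1.0, 0.5])  # stands in for the task-A optimum
y_a = (rng.random(200) < sigmoid(X_a @ theta_star_a)).astype(float)

fisher = diagonal_fisher(theta_star_a, X_a, y_a)
lam = 10.0  # hypothetical regularization strength

# Moving an important weight costs far more than moving an unimportant one.
move_w0 = theta_star_a + np.array([1.0, 0.0, 0.0])
move_w2 = theta_star_a + np.array([0.0, 0.0, 1.0])
print("diagonal Fisher:", fisher)
print("penalty for moving weight 0:", ewc_penalty(move_w0, theta_star_a, fisher, lam))
print("penalty for moving weight 2:", ewc_penalty(move_w2, theta_star_a, fisher, lam))
```

When training on task B, the total loss in this sketch would be the task-B loss plus `ewc_penalty(...)`, so weights with a large Fisher value stay close to $\theta_A^*$ while weights with a small Fisher value are free to move.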