---
title: Deep Generative Replay # presentation title
tags: LLL # presentation tags
slideOptions: # presentation settings
  progress: true
  slideNumber: true
  overview: true
  theme: solarized # color theme
  transition: 'fade' # slide transition animation
  spotlight:
    enabled: true
  # parallaxBackgroundImage: 'https://s3.amazonaws.com/hakim-static/reveal-js/reveal-parallax-1.jpg'
---

# 9/4 Paper #13

## Continual Learning with Deep Generative Replay

---

[TOC]

---

### Abstract

----

> Catastrophic forgetting is a crucial problem in sequential learning. Replaying all the data from previous tasks alleviates it, but storing the full datasets requires large memory and is difficult to use in real-world applications. The authors propose a novel framework that needs only two cooperative models (a "generator" and a "solver") to sample and interleave previous data and thereby avoid catastrophic forgetting.

---

### Motivation

----

> Inspired by the hippocampus acting as a short-term memory system in the primate brain.

> The Complementary Learning Systems (CLS) theory illustrates the significance of dual memory systems involving the hippocampus and the neocortex.

> The hippocampal system rapidly encodes recent experiences, and the memory trace is reactivated during sleep or conscious and unconscious recall.

> The memory is consolidated in the neocortex through activation synchronized with multiple replays of the encoded experience.

---

### Complementary Learning Systems (CLS)

* The previous data are no longer accessible.
* Only pseudo-inputs and pseudo-targets produced by a memory network can be fed into the task network.
* The model must maintain old input-output patterns without accessing the real data.

---

### GAN

----

$$
\underset{\mathcal{G}}{\min}\, \underset{\mathcal{D}}{\max}\, V(\mathcal{D},\mathcal{G}) = \\
\mathop{\mathbb{E}}_{x \sim p_{data}(x)}\big[\log \mathcal{D}(x)\big] + \mathop{\mathbb{E}}_{z \sim p_z(z)}\big[\log(1 - \mathcal{D}(\mathcal{G}(z)))\big]
$$

---

### Loss Definition

----

$$
\mathcal{L}_{train}(\theta_{i}) = r\, \mathbb{E}_{(x,y) \sim \mathcal{D}_i} \big[ \mathcal{L}(\mathcal{S}(x;\theta_i), y) \big] \\
+ (1-r)\, \mathbb{E}_{x_{fake} \sim \mathcal{G}_{i-1}} \big[ \mathcal{L}(\mathcal{S}(x_{fake}; \theta_i), \mathcal{S}(x_{fake}; \theta_{i-1})) \big]
$$

$$
\mathcal{L}_{test}(\theta_{i}) = r\, \mathbb{E}_{(x,y) \sim \mathcal{D}_i} \big[ \mathcal{L}(\mathcal{S}(x;\theta_i), y) \big] \\
+ (1-r)\, \mathbb{E}_{(x_{old},\, y_{old}) \sim \mathcal{D}_{i-1}} \big[ \mathcal{L}(\mathcal{S}(x_{old}; \theta_i), y_{old}) \big]
$$

---

### Generative Replay

----

Sequence $T = (T_1, T_2, \ldots, T_N)$ of $N$ tasks.

**Definition 1**: Task $T_i$ has training examples $(x_i, y_i)$.

**Scholar $H$**: learns a new task and teaches its knowledge to other networks. $H = \langle G, S \rangle$

$G$ (generator): produces real-like samples.

$S$ (solver): a task-solving model parameterized by $\theta$ that minimizes the sum of losses over all tasks: $\mathbb{E}_{(x,y) \sim D}\big[L(S(x;\theta), y)\big]$

---
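### Sketch: Training with Generative Replay

A minimal sketch of one solver update under generative replay, assuming a PyTorch-style setup. `solver`, `solver_prev`, `generator_prev`, `r`, and `z_dim` are illustrative names for the current solver $S_i$, the frozen previous scholar $\langle G_{i-1}, S_{i-1} \rangle$, the replay ratio, and the noise dimension; they are not identifiers from the paper's code.

```python
import torch
import torch.nn.functional as F

def replay_train_step(solver, solver_prev, generator_prev,
                      x_new, y_new, optimizer, r=0.5, z_dim=100):
    """One update of S_i on L_train = r * L_new + (1 - r) * L_replay."""
    optimizer.zero_grad()

    # First term: loss on real data (x, y) ~ D_i from the current task.
    loss_new = F.cross_entropy(solver(x_new), y_new)

    # Second term: pseudo-inputs x_fake ~ G_{i-1}, with targets given by
    # the frozen previous solver S_{i-1} (no real old data is touched).
    # The paper's loss matches S_i's output to S_{i-1}'s; hard
    # pseudo-labels are used here as a simplification.
    with torch.no_grad():
        z = torch.randn(x_new.size(0), z_dim)
        x_fake = generator_prev(z)
        y_fake = solver_prev(x_fake).argmax(dim=1)
    loss_old = F.cross_entropy(solver(x_fake), y_fake)

    loss = r * loss_new + (1 - r) * loss_old
    loss.backward()
    optimizer.step()
    return loss.item()
```

The new generator $G_i$ is trained analogously on a mix of real current-task samples and replayed samples from $G_{i-1}$, so the whole scholar can be handed to the next task without storing any past data.

---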
### Architecture

![](https://i.imgur.com/fQD5LsL.png)

---

### Preliminary Experiment

![](https://i.imgur.com/orLfDUD.png)

* MNIST classification task.
* The first scholar model learns from real data; each subsequent scholar trains only on samples from the previous scholar.
* This shows that a scholar can be trained from scratch using only the previously trained scholar, without losing test accuracy.

---

### Experiments

----

Generator: WGAN-GP

* GR: generative replay
* ER: exact replay of real past data
* Noise: replay samples drawn from a distribution that does not resemble the real data
* None: no replay method used

----

Three experiment setups:

1. Sequentially train on independent tasks (different pixel permutations of MNIST inputs).
![](https://i.imgur.com/fQJHkqa.png)

----

2. Sequentially train on different domains (MNIST ↔ SVHN).
![](https://i.imgur.com/hvjuh0g.png)

* Samples drawn from the generator:
![](https://i.imgur.com/0ozYSQL.png)

* Performance compared to LwF:
![](https://i.imgur.com/tzkQJvH.png)

----

3. Learning new classes. Setup: disjoint subsets of MNIST; each task exposes only two classes at a time.
![](https://i.imgur.com/MnXNMi6.png)

----

* Classes sampled from the generator:
![](https://i.imgur.com/e4g1eUm.png)

---

### Conclusion & Discussion

----

Drawback of LwF: performance depends on how relevant the tasks are to each other, and the training time for one task grows linearly with the number of former tasks.

----

Drawback of EWC: it compromises between two tasks, settling on a solution that is suboptimal for both even when a solution good for both may exist.

----

Generative replay maintains the former knowledge through input-output pairs produced by the saved networks.

Drawback: performance depends on the quality of the generator.

---
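### Appendix: Disjoint MNIST Split

For experiment 3 above, a minimal sketch of how the disjoint two-class MNIST tasks could be built, assuming torchvision is available; `disjoint_mnist_tasks` and its parameters are illustrative names, not from the paper's code.

```python
from torch.utils.data import Subset
from torchvision import datasets, transforms

def disjoint_mnist_tasks(root="./data", classes_per_task=2):
    """Split MNIST into disjoint tasks ([0,1], [2,3], ..., [8,9]),
    so each task exposes only two classes, as in experiment 3."""
    mnist = datasets.MNIST(root, train=True, download=True,
                           transform=transforms.ToTensor())
    tasks = []
    for start in range(0, 10, classes_per_task):
        keep = set(range(start, start + classes_per_task))
        idx = [i for i, y in enumerate(mnist.targets) if int(y) in keep]
        tasks.append(Subset(mnist, idx))
    return tasks  # five two-class datasets, presented sequentially
```

---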