# 8/20 paper 11
## Overcoming Catastrophic Forgetting for Continual Learning via Model Adaptation
---
## OpenReview scores: 5, 6, 7 (ICLR 2019)
## Authors
Wenpeng Hu, Zhou Lin, Bing Liu, Chongyang Tao, Zhengwei Tao, Dongyan Zhao,
Jinwen Ma, Rui Yan (Peking University)
---
## Abstract
Goal: avoid catastrophic forgetting in continual learning.
New method: Parameter Generation and Model Adaptation (PGMA)
---
## Introduction
----
There are multiple existing methods for dealing with catastrophic forgetting (LwF, DEN, EWC, SI, IMM, GEM, GR).
----
* **LwF** (Learning without Forgetting)
* **DEN** (Dynamically Expandable Network)
Both expand the network architecture when a new task needs to be trained.
----
* **EWC** (Elastic Weight Consolidation)
* **SI** (Synaptic Intelligence)
* **IMM** (Incremental Moment Matching)
* **GEM** (Gradient Episodic Memory)
* **GR** (Generative Replay)
The paper refers to all of the methods above as **JP** (**j**oint **p**arameterization) methods.
----
**JP** (joint parameterization):
A single joint set of parameters $\theta$ is trained on all tasks, assuming it can serve every task well. This has a drawback:
**accuracy deterioration**.
----
**Accuracy deterioration**:
Assume we have two tasks A and B.
The optimal parameter set for A is $\theta_A$, and for B it is $\theta_B$.
A JP method finds a compromise parameter set $\theta_C$
between $\theta_A$ and $\theta_B$.
Since $\theta_C$ is neither $\theta_A$ nor $\theta_B$, both tasks suffer accuracy deterioration (see the worked example below).
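A minimal worked example (my own illustration, not from the paper): if each task has a quadratic loss around its own optimum, minimizing the joint loss lands exactly halfway between the two optima.
$$
L_A(\theta) = (\theta - \theta_A)^2, \qquad L_B(\theta) = (\theta - \theta_B)^2
$$
$$
\frac{d}{d\theta}\big[L_A(\theta) + L_B(\theta)\big] = 0 \;\Rightarrow\; \theta_C = \frac{\theta_A + \theta_B}{2}
$$
Unless $\theta_A = \theta_B$, the compromise $\theta_C$ is suboptimal for both tasks.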
---
## Framework
----
**PGMA**:
The architecture has three components: **S** (Solver), **DG** (Data Generator), and **DPG** (Dynamic Parameter Generator).
----
* **S (Solver)**:
has two parameter sets (see the sketch below):
1. **$\theta_0$**: shared parameters used across all tasks
2. **$p_i$**: generated parameters for the $i$-th task
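A minimal sketch of how a solver layer could combine $\theta_0$ with the generated $p_i$; the additive combination and sizes are my assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SolverLayer(nn.Module):
    """One solver layer: shared weights theta_0 adapted by generated parameters p_i."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        # theta_0: shared parameters, reused across all tasks
        self.theta_0 = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x, p_i):
        # p_i: generated parameters for the current task/sample, same shape as theta_0
        # (additive adaptation is an assumption; the paper's combination rule may differ)
        weight = self.theta_0 + p_i
        return torch.relu(x @ weight.t() + self.bias)
```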
----
**DG (Data Generator)**:
consists of a **DG encoder** ($DG_E$) and a **DG decoder** ($DG_D$).
The decoder's main role is to regenerate data points of earlier tasks so they can be replayed, avoiding catastrophic forgetting; the encoder produces the embedding **z** of an input.
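A minimal encoder/decoder sketch; the layer sizes and the plain autoencoder form are assumptions (the paper's DG may use a different generative model).

```python
import torch.nn as nn

class DataGenerator(nn.Module):
    """DG: encoder DG_E maps an input x to an embedding z; decoder DG_D regenerates
    data points of earlier tasks so they can be replayed during later training."""
    def __init__(self, x_dim=784, z_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(),
                                     nn.Linear(256, z_dim))      # DG_E
        self.decoder = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                     nn.Linear(256, x_dim))      # DG_D

    def forward(self, x):
        z = self.encoder(x)        # embedding z (also consumed by the DPG)
        x_hat = self.decoder(z)    # regenerated example for replay
        return z, x_hat
```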
----
**DPG (Dynamic Parameter Generator)**:
a function $f(\cdot)$ that takes the embedding **z** as input and generates the dynamic parameters used to adapt the solver to each sample.
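A minimal sketch of the DPG and of how the three components could fit together at inference time; the output size and the way $p_i$ is applied are assumptions.

```python
import torch.nn as nn

class DPG(nn.Module):
    """DPG: f(z) -> dynamic parameters p_i, generated per input sample."""
    def __init__(self, z_dim=64, p_dim=128):
        super().__init__()
        # hidden size 1000 follows the training details later in these slides
        self.f = nn.Sequential(nn.Linear(z_dim, 1000), nn.ReLU(),
                               nn.Linear(1000, p_dim))

    def forward(self, z):
        return self.f(z)   # p_i for the current sample

# Assumed end-to-end flow for a single input x:
#   z, _ = data_generator(x)      # embed the input with DG_E
#   p_i  = dpg(z)                 # generate sample-specific parameters
#                                 # (reshaped to match the solver's weight shape)
#   y    = solver_layer(x, p_i)   # solver adapts theta_0 with p_i
```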
---
## Overview


---
## Experiment
----
Two image datasets:
1. MNIST (60,000 / 3,000 / 7,000 for training / validation / testing)
2. CIFAR-10 (50,000 / 3,000 / 7,000 for training / validation / testing)
Shuffled setting: 3 tasks and 5 tasks
----
Two text datasets:
1. DBPedia ontology (crowd-sourced dataset), 14 classes (560,000 / 10,000 / 60,000 for training / validation / testing)
2. THUCNews (50,000 / 5,000 / 10,000 for training / validation / testing)
----
Three experiment settings: 2 tasks (7 classes per task), 3 tasks (5, 5, and 4 classes respectively), and 5 tasks (3, 3, 3, 3, and 2 classes respectively).
----
Disjoint setting: MNIST is split into two tasks, {0,1,2,3,4} and {5,6,7,8,9}; CIFAR-10 is split the same way.
Settings: 2 tasks (5 classes per task) and 5 tasks (2 classes per task); a helper for this split is sketched below.
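A small sketch (my own helper, not from the paper) of splitting a labeled dataset into disjoint class-based tasks, e.g. MNIST into {0–4} and {5–9}:

```python
from collections import defaultdict

def split_into_tasks(samples, class_groups):
    """samples: iterable of (x, label) pairs; class_groups: e.g. [{0,1,2,3,4}, {5,6,7,8,9}]."""
    tasks = defaultdict(list)
    for x, y in samples:
        for task_id, classes in enumerate(class_groups):
            if y in classes:
                tasks[task_id].append((x, y))
                break
    return [tasks[i] for i in range(len(class_groups))]

# Disjoint MNIST, two tasks:
# task_a, task_b = split_into_tasks(mnist_train, [{0, 1, 2, 3, 4}, {5, 6, 7, 8, 9}])
```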
----
Baselines:
1. EWC
2. IMM
3. GR
---
## Training Details
----
Solver: a 3-layer network (combining the generated parameters $p$ and the shared parameters $\theta_0$).
----
DPG (T-Net): the size of each hidden layer is set to 1,000.
---
## Results
----



----

----

----

----

---
<style>
.reveal {
font-size: 24px;
}
img {
width: 50%;
height: auto;
}
div {
resize: both;
}
</style>
---
{"metaMigratedAt":"2023-06-14T23:31:07.846Z","metaMigratedFrom":"YAML","title":"PGMA(Parameter Generator via Model Adaption)","breaks":true,"contributors":"[{\"id\":\"622370bd-3571-44f0-a0b7-c19b051347e1\",\"add\":4510,\"del\":806}]"}