# 11/3 Paper #15

### Continual learning of context-dependent processing in neural networks

[toc]

[Paper link](https://www.nature.com/articles/s42256-019-0080-x.epdf?author_access_token=889XsK8ycbi_0xZ2FpJw1dRgN0jAjWel9jnR3ZoTv0P8gyRnkxk0BV8NCNOCrgcNeUN5GL2D9aBLeyhVvWlMz-ly29hf71ljZ6ZieFK9RStbagxhi6eg7KHglOEgwY1c5v7Tz9xR8OLYDzqw23OWYQ%3D%3D)

---

### Prerequisites

<center>RLS (Recursive Least Squares)</center>

---

### Background

<center> prefrontal cortex </center>

![](https://i.imgur.com/pqgzgGa.png)

Function:

<center>Learn the "rules of the game" and dynamically apply them <br> to map sensory inputs to different actions in a context-dependent way.</center>

---

### Authors

<center>Guanxiong Zeng, Yang Chen, Bo Cui, Shan Yu <br> University of Chinese Academy of Sciences</center>

<center>Chinese Academy of Sciences <br> Nature Machine Intelligence, 2019 (preprint 2018/10)</center>

---

### Abstract

Use **Orthogonal Weight Modification (OWM)** together with an additional module for "**Context-Dependent Processing**" (**CDP**) to solve catastrophic forgetting in sequential learning.

---

### What do we need?

* A sequential learning manner
* A way to change the representation of sensory information to learn new tasks

---

### Overview

Concept:

* A sequential learning manner
* A way to change the representation of sensory information to learn new tasks

----

Tools:

* OWM (Orthogonal Weight Modification): overcomes catastrophic forgetting
* CDP (Context-Dependent Processing): reuses features from a large model and adapts them to other tasks

----

### OWM

Advantage: overcomes catastrophic forgetting without storing any previously seen data.

Disadvantage: updating the projector costs extra time!

----

**Orthogonal Weight Modification**

Idea: catastrophic forgetting happens because the weights learned for previous tasks are modified while training on the current task.

----

![](https://i.imgur.com/l3LrDAl.png)

---

Trivial idea (EWC, MAS, SI):

Why not keep the important weights from drifting too far, so that performance on old tasks is preserved?!

---

**Adv**: easy to implement, and it alleviates forgetting.

**Disadv**: the model ends up with a compromised weight distribution; the tasks cannot all win, and performance on every task suffers.

----

![](https://i.imgur.com/W1XngT8.png)

New idea: what if model weights can only be modified in directions orthogonal to the subspace spanned by all previously learned inputs?!

That is the OWM idea!!!

---

### Status

* ~~Sequential learning manner~~
* Change the representation of sensory information to learn new tasks

---

### CDP

PFC-like module (Context-Dependent Processing)

![](https://i.imgur.com/Dy0uqpd.png)

Specify the task identity to change the representation of sensory information for learning a new task.

----

#### Why is this needed?

The system cannot accomplish context-dependent learning by itself.

![](https://i.imgur.com/1MPlUcr.png)

Function of this module: re-weight ("rotate") the feature and extract the information that matters for the current task, as sketched at the end of this slide.

##### Discussion

A trivial solution to this problem: train a new classifier for every task.

Adv: easy to implement!

Disadv: poor scalability, and it still suffers from forgetting (the feature extractor's weights are shared across classifiers).

![](https://i.imgur.com/RGuBgLO.png)
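To make the "rotate the feature" idea concrete, below is a minimal NumPy sketch of context-dependent processing: a context signal is encoded into a multiplicative gate that re-weights the shared sensory feature before the classifier sees it. The dimensions, the sigmoid gate, and the weights `W_enc` / `W_cls` are hypothetical stand-ins, not the paper's exact encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: feature from a frozen extractor, context embedding, classes.
feat_dim, ctx_dim, n_classes = 512, 16, 10

# Output of a frozen, pretrained feature extractor for one input (stand-in).
feature = rng.standard_normal(feat_dim)

# Context signal encoding the task identity (stand-in for the paper's context input).
context = rng.standard_normal(ctx_dim)

# Encoder maps the context to a multiplicative gate over the feature (sigmoid keeps it in (0, 1)).
W_enc = 0.1 * rng.standard_normal((feat_dim, ctx_dim))
gate = 1.0 / (1.0 + np.exp(-(W_enc @ context)))

# Context-dependent processing: the same sensory feature is re-weighted per task
# before it reaches the classifier that is trained sequentially (e.g. with OWM).
processed = feature * gate

# Shared classifier on top of the processed feature.
W_cls = 0.01 * rng.standard_normal((n_classes, feat_dim))
logits = W_cls @ processed
print("predicted class:", int(logits.argmax()))
```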
---

### Experiments and Results

* Check that OWM is scalable
* Check that the PFC-like module is functional
* Chinese character classification
* Celebrity attribute classification

----

* Check that OWM is scalable (sequential training vs joint training)

| Data Set | Classes | Feature Extractor | Concurrent Training by SGD (%) | Sequential Training by OWM (%) | Sequential Training by SGD (%) |
| --- | --- | --- | --- | --- | --- |
| ImageNet | 1000 | ResNet152 | 78.31 | 75.24 | 4.27 |
| HWDB1.1 | 3755 | ResNet18 | 97.46 | 93.46 | 35.86 |

----

* Check that the PFC-like module is functional

Learn 40 different mappings sequentially with the same sensory inputs: a single classifier learns all 40 context-specific mapping rules, one per task.

Compare this setting with multitask training of 40 separate classifiers.

![](https://i.imgur.com/koPLRWY.png)

---

Detail: Orthogonal Weight Modification (a NumPy sketch of these steps appears at the end of this note):

a). Initialization of parameters: randomly initialize $W_l(0)$ and set $P_l(0) = \dfrac{I_l}{\alpha}$ for $l = 1, \dots, L$.

b). Forward propagate the inputs of the $i^{th}$ batch in the $j^{th}$ task, then back propagate the errors and calculate the weight modification $\Delta W_{l}^{BP}(i,j)$ for $W_l(i,j)$ by the standard BP method.

----

c). Update the weight matrix in each layer by

&nbsp; &nbsp; $W_l(i,j) = W_l(i-1,j) + \kappa(i,j) \Delta W_l^{BP}(i,j)$ if $j = 1$

&nbsp; &nbsp; $W_l(i,j) = W_l(i-1,j) + \kappa(i,j) P_l(j-1) \Delta W_l^{BP}(i,j)$ if $j > 1$

&nbsp; &nbsp; &nbsp; where $\kappa(i, j)$ is a predefined learning rate.

d). Repeat steps b) to c) for all batches of the current task.

----

e). Once the $j^{th}$ task is accomplished, forward propagate the mean of the inputs of each batch $(i = 1, \dots, n_j)$ in the $j^{th}$ task successively.<br> In the end, update $P_l$ for $W_l$ as $P_l(j) = P_l(n_j, j)$,<br> where $P_l(i, j)$ can be calculated iteratively according to:

&nbsp; &nbsp; $P_l(i,j) = P_l(i-1,j) - k_l(i,j)\,x_{l-1}(i,j)^T P_l(i-1,j)$

&nbsp; &nbsp; $k_l(i,j) = P_l(i-1,j)\,x_{l-1}(i,j) \,/\, [1 + x_{l-1}(i,j)^T P_l(i-1,j)\,x_{l-1}(i,j)]$

in which $x_{l-1}(i,j)$ is the output of the $(l-1)^{th}$ layer in response to the mean of the inputs in the $i^{th}$ batch of the $j^{th}$ task, and $P_l(0,j) = P_l(j-1)$.

f). Repeat steps b) to e) for the next task.

----

Extra experimental support:

---

Leftover figures:

![](https://i.imgur.com/PDOXena.png)
![](https://i.imgur.com/EjxAzhA.png)
![](https://i.imgur.com/oJIlMwc.png)
![](https://i.imgur.com/jXch2WK.png)
![](https://i.imgur.com/iYHfr8C.png)
![](https://i.imgur.com/YpRRpDN.png)
![](https://i.imgur.com/bZG2RTh.png)

---
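As a companion to the OWM detail above, here is a minimal NumPy sketch of the update for a single linear layer, following steps a)–f). The toy tasks, dimensions, learning rate, and squared-error loss are hypothetical, and the projector is updated once per sample rather than with per-batch input means, purely for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, out_dim, alpha, lr = 20, 5, 1.0, 0.1

# a). Randomly initialize the weights and set the projector P = I / alpha.
W = 0.1 * rng.standard_normal((out_dim, in_dim))
P = np.eye(in_dim) / alpha

def batch_grad(W, X, Y):
    """b). Squared-error gradient of a linear layer y = W x (standard BP)."""
    return (X @ W.T - Y).T @ X / len(X)

def train_task(W, P, X, Y, first_task, epochs=100):
    """c)-d). Plain gradient steps on the first task; on later tasks, project
    the update so it is orthogonal to the input subspace of previous tasks."""
    for _ in range(epochs):
        g = batch_grad(W, X, Y)
        W = W - lr * (g if first_task else g @ P)
    return W

def update_projector(P, X):
    """e). After a task, fold its inputs into P with the RLS-style recursion
    (one sample at a time here, instead of the per-batch means in the paper)."""
    for x in X:
        x = x.reshape(-1, 1)
        k = (P @ x) / (1.0 + x.T @ P @ x)
        P = P - k @ (x.T @ P)
    return P

# Two toy regression tasks: task 1 only uses the first 10 input dimensions,
# task 2 uses all of them with a different target mapping (hypothetical data).
A1 = rng.standard_normal((out_dim, in_dim))
A2 = rng.standard_normal((out_dim, in_dim))
X1 = np.hstack([rng.standard_normal((200, 10)), np.zeros((200, 10))])
X2 = rng.standard_normal((200, in_dim))
Y1, Y2 = X1 @ A1.T, X2 @ A2.T

W = train_task(W, P, X1, Y1, first_task=True)
P = update_projector(P, X1)
err_before = np.mean((X1 @ W.T - Y1) ** 2)

W = train_task(W, P, X2, Y2, first_task=False)   # f). repeat for the next task
err_after = np.mean((X1 @ W.T - Y1) ** 2)
print(f"task-1 error before / after learning task 2: {err_before:.4f} / {err_after:.4f}")
```

Because the task-2 updates are projected by `P`, the columns of `W` that task 1 relies on barely move, so the task-1 error stays close to its pre-task-2 value.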
{"metaMigratedAt":"2023-06-15T01:21:11.780Z","metaMigratedFrom":"YAML","title":"Continual learning of context-dependent processing in neural networks","breaks":true,"description":"A Method Related Orthogonal Gradient Updates","contributors":"[{\"id\":\"622370bd-3571-44f0-a0b7-c19b051347e1\",\"add\":6896,\"del\":715}]"}
    555 views