# Notes on "FearNet: Brain-Inspired Model For Iincremental Learning" ###### tags: `continual learning` ### Author [Rishika Bhagwatkar](https://https://github.com/rishika2110) ## Introduction * Methods have been developed to mitigate catastrophic forgetting. However, they are not sufficient and perform poorly on larger datasets. * Incorporating the replay mechanisms in incremental learning of a large dataset would require weeks of time. * We want a model that is not only computationally efficient but memory efficient too as the major part of the application of incremental learning is real-time operation on-board embedded platforms that have a limited amount of both compute and memory. * FearNet uses pseudorehearsal for mitigating catastrophic forgetting. It consists of 3 brain-inspired sub-systems: * A memory system for recent memories resembling the hippocampal complex (HC) * A memory system for long-term storage resembling the medial prefrontal cortex (mPFC) * A sub-system that determines which memory system to use for a particular example resembling the basolateral amygdala (BLA) ## Related Work * Fixed Expansion Layer (FEL) * It uses two hidden layers. The second one has connectivity constraints, known as the FEL layer. * The FEL layer is much larger than the first hidden layer. * Large numbers of units are required to work well. * GeppNet * It uses a self-organizing map (SOM) to reorganize the input onto a two-dimensional lattice which serves as long-term memory. * SOM is updated only if the input is new. * It uses a cumulative rehearsal mechanism. * GeppNet+STM * It uses a fixed-size memory buffer to store new examples. * It replaces the oldest example once memory is full. * iCaRL: Incremental Classifier and Representation Learning * It is an incremental class learning framework. * DNN is trained for supervised representation learning * It uses partial replay. The model learns on a mixture of the current study session and $J$ stored examples from previous study sessions. Approximately, it stores $J/T = 20$ examples of each class for CIFAR-100. * It classifies test examples on basis of the mean of its embedding. * However, its performance is heavily influenced by the number of examples stored per class. <!-- ## Mammalian Memory: Neuroscience and Models --> ## The FearNet Model During sleep phases, FearNet uses a generative model to consolidate data from HC to mPFC through pseudorehearsal. ### Dual-Memory Storage * HC estimates the probability that an input feature vector $\mathbf{x}$ belongs to class $k$ as \begin{equation} P_{HC}(C =k|\mathbf{x}) = \frac{\beta_{k'}}{\sum_{k'}\beta_{k'}} \end{equation} where ${\beta_{k'} = (\epsilon+\text{min}_j||\mathbf{x}-\mu_{k,j}||_2)^{-1}}$ wher $\mu_{k,j}$ is the $j’$th stored exemplar in HC for class k $\beta_{k'}$ is zero if HC doen't contain class $k$ * mPFC is an AE which reconstructs its input and models probability $P(C=k|\mathbf{x})$ i.e classification task using softmax. Loss function is sum of both losses i.e \begin{equation} \mathcal{L}_{mPFC} = \mathcal{L}_{reconst}+\mathcal{L}_{class} \end{equation} For reconstruction, the $\mathcal{L}_{reconst}$ is weighted sum of MSE of reconstruction losses from each layer. The weight $\lambda_j$ decreases with the depth of network. * When training is completed during a study session, representations of data are extracted by passing through the encoder, and then a mean $\mu_c$ and covariance matrix $\Sigma_c$ is calculated for class $c$. 
### Pseudorehearsal for memory consolidation

* In the sleep phase, intrinsic replay occurs: pseudoexamples are generated and the recent memories stored in HC are consolidated into mPFC, after which HC is cleared (a code sketch of this step is given at the end of these notes).
* Pseudoexamples for a class $c$ are generated by sampling an embedding from a Gaussian with the stored mean $\mu_c$ and covariance matrix $\Sigma_c$, and passing it through the decoder of the AE.
* mPFC is then trained on a balanced dataset: for each class it already knows, a number of pseudoexamples equal to the average number of examples per class in HC is generated and mixed with the data from HC.

## Network selection using BLA

* The output of BLA is $A(\mathbf{x}) \in (0,1)$, with values near 1 indicating that mPFC should be used to predict the class of a sample.
* BLA is trained on the same data as mPFC.
* Results improved when the choice of memory system also took the outputs of HC and mPFC into account.

## Experiments

FearNet was compared to FEL, GeppNet, GeppNet+STM, and iCaRL on CIFAR-100, CUB-200, and AudioSet.

**Performance Metrics:** After each study session $t$ in which the model learned a new class $k$, the following quantities were computed:
1. Test accuracy on the new class, $\alpha_{new,t}$
2. Accuracy on the base knowledge, $\alpha_{base,t}$
3. Accuracy on all test data seen up to that point, $\alpha_{all,t}$

After all $T$ study sessions are complete, with $\alpha_{offline}$ denoting the accuracy of a model trained offline on all of the data:
1. The model's ability to retain the base knowledge is given by $\Omega_{base} = \frac{1}{T-1}\sum_{t=2}^{T}\frac{\alpha_{base,t}}{\alpha_{offline}}$
2. The model's ability to immediately recall new information is given by $\Omega_{new} = \frac{1}{T-1}\sum_{t=2}^{T}\alpha_{new,t}$
3. The model's ability to perform on all available test data is given by $\Omega_{all} = \frac{1}{T-1}\sum_{t=2}^{T}\frac{\alpha_{all,t}}{\alpha_{offline}}$

## Discussion and Conclusion

* FearNet's memory footprint is comparatively small because it only stores class statistics rather than some or all of the raw training data, which makes it better suited for deployment.
* If a class appears in more than one study session, running class statistics can be computed. However, it may be better to favor the data from the most recent study session because the AE's representation keeps changing.
* FearNet is capable of recalling and consolidating recently learned information while also retaining old information.
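As referenced in the pseudorehearsal section above, here is a minimal NumPy sketch of the sleep-phase consolidation step: embeddings are sampled from each stored class Gaussian, decoded into pseudoexamples, and mixed with HC's recent data to form the balanced set that mPFC is fine-tuned on. The names (`generate_pseudoexamples`, `build_sleep_batch`, `decoder`, `class_stats`) are assumptions for illustration, not from the paper's code.

```python
import numpy as np

def generate_pseudoexamples(decoder, class_stats, n_per_class):
    """Intrinsic replay: sample embeddings from each class's stored Gaussian
    (mu_c, Sigma_c) and run them through the AE decoder to get pseudoexamples."""
    pseudo_x, pseudo_y = [], []
    for c, (mu, sigma) in class_stats.items():
        z = np.random.multivariate_normal(mu, sigma, size=n_per_class)
        pseudo_x.append(decoder(z))                 # decoded pseudoexamples for class c
        pseudo_y.append(np.full(n_per_class, c))
    return np.concatenate(pseudo_x), np.concatenate(pseudo_y)

def build_sleep_batch(decoder, class_stats, hc_x, hc_y):
    """Balanced consolidation set: mix HC's real data with pseudoexamples of the
    classes already stored in mPFC, using HC's average per-class example count."""
    n_per_class = int(round(len(hc_y) / len(np.unique(hc_y))))
    px, py = generate_pseudoexamples(decoder, class_stats, n_per_class)
    return np.concatenate([hc_x, px]), np.concatenate([hc_y, py])
```

mPFC would then be fine-tuned on this mixture and HC emptied, matching the sleep phase described in the notes.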