# Notes on "REMIND Your Neural Network to Prevent Catastrophic Forgetting"
###### tags: `continual learning`
### Author
[Rishika Bhagwatkar](https://github.com/rishika2110)
## Introduction
* Conventional ANNs suffer from catastrophic forgetting during online learning from non-iid data.
* Various algorithms have been developed to mitigate catastrophic forgetting, among which replay variants have been the most successful.
* Lifelong learning in CNNs differs from learning in mammals in two major ways:
    * Firstly, according to the hippocampal indexing theory, the hippocampus stores compressed representations of data from high in the visual processing hierarchy. This differs from replay in existing models, where the replayed data is stored as raw pixels.
    * Secondly, animals undergo streaming learning, i.e., learning from non-iid data one example at a time under resource constraints. In contrast, incremental batch learning, the most common paradigm for incremental learning in CNNs, collects an entire batch of new examples and loops over it until the model has adequately learned it. Moreover, shrinking the batch size and looping over it only once, i.e., moving toward streaming learning, degrades performance.
REMIND, or **R**eplay using **M**emory **Ind**exing, is inspired by the hippocampal indexing theory and biological replay. Instead of raw pixels, it stores compressed hidden representations for replay, using product quantization.
## Related Work
* ExStream
* iCaRL:
* It is an incremental class learning framework.
    * A DNN is trained for supervised representation learning.
    * It uses partial replay: the model learns from a mixture of the current study session and $J$ stored examples from previous study sessions, keeping roughly $J/T = 20$ examples per class for CIFAR-100.
    * It classifies a test example by the nearest class mean of the stored exemplars' embeddings (a minimal sketch of this classification rule appears at the end of this section).
* However, its performance is heavily influenced by the number of examples stored per class.
* End-to-End:
* It uses the CNN output directly for classification rather than predicting mean as iCaRL does.
* Unified:
* It extended End-to-End by using a cosine normalization layer, a new loss constraint, and a margin ranking loss
* BiC:
* It extended End-to-End by training two additional parameters to correct bias in the output layer due to class imbalance.
The above methods store raw pixels for replay and use a distillation loss to mitigate catastrophic forgetting.
Another family of methods mitigates forgetting by expanding the network as more data becomes available. However, such models have larger memory requirements and may not scale to thousand-category datasets.
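As a point of reference for the nearest-class-mean rule mentioned above, here is a minimal NumPy sketch of classifying by the nearest mean of stored exemplar embeddings. It illustrates only the classification rule, not iCaRL's full pipeline (exemplar selection and distillation are omitted).

```python
# Minimal sketch: nearest-class-mean classification over stored exemplar embeddings.
import numpy as np

def nearest_class_mean_predict(test_embeddings, exemplar_embeddings, exemplar_labels):
    """Assign each test embedding to the class whose exemplar-mean is closest in L2 distance.

    test_embeddings:     (n_test, d) array of embedded test examples
    exemplar_embeddings: (n_stored, d) array of embedded stored exemplars
    exemplar_labels:     (n_stored,) integer class labels of the exemplars
    """
    classes = np.unique(exemplar_labels)
    # Mean embedding per class, shape (n_classes, d)
    class_means = np.stack([
        exemplar_embeddings[exemplar_labels == c].mean(axis=0) for c in classes
    ])
    # L2 distance from every test embedding to every class mean
    dists = np.linalg.norm(test_embeddings[:, None, :] - class_means[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]
```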
## REMIND
* REMIND trains a CNN formulated as $y_i = F(G(\mathbf{X}_i))$, where $\mathbf{X}_i$ is the input image and $y_i$ is the corresponding category label. The network is split into two nested functions: $G(\cdot)$, consisting of the first $J$ layers, and $F(\cdot)$, consisting of the last $L$ layers. The parameters of $G(\cdot)$ are kept fixed, while $F(\cdot)$ is trained by REMIND in the streaming paradigm.
* The output of the first section of the network is $\mathbf{Z}_i = G(\mathbf{X}_i)$, with $\mathbf{Z}_i \in \mathbb{R}^{m \times m \times d}$, where $m$ is the spatial dimension of the feature map and $d$ is the number of channels.
* Product quantization (PQ) is used to compress and store $\mathbf{Z}_i$.
* PQ partitions each $d$-dimensional tensor element (one per spatial location) into $s$ sub-vectors, each of size $d/s$.
* $k$-means clustering is then run independently on each of the $s$ sub-vector partitions. The set of learned centroids (codewords) for a partition forms its codebook, so there are $s$ codebooks in total, each learned from the corresponding $d/s$-dimensional sub-vectors at all $m \times m$ spatial locations.
* Each $d$-dimensional tensor element is now represented as an $s$-dimensional vector of integer codes, where each sub-vector is replaced by the index of its nearest centroid. A combined code sketch of the network split and this quantization step follows below.
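The following is a minimal sketch of this pipeline, assuming torchvision's ResNet-18 as the backbone and the faiss library's `ProductQuantizer` for the PQ step. The exact split point, the number of sub-vectors `s`, and the codebook size used here are illustrative placeholders rather than the paper's settings.

```python
import faiss
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet18(weights=None)
layers = list(backbone.children())        # conv1, bn1, relu, maxpool, layer1..layer4, avgpool, fc

# G(.): the early layers, kept frozen (here, everything up through layer3).
G = nn.Sequential(*layers[:7]).eval()
for p in G.parameters():
    p.requires_grad = False

# F(.): the plastic top layers (layer4, pooling, classifier) trained by REMIND.
F = nn.Sequential(layers[7], layers[8], nn.Flatten(), layers[9])

d, s, n_bits = 256, 8, 8                  # channels, sub-vectors per element, 2^8 codes per codebook

with torch.no_grad():
    X = torch.randn(32, 3, 224, 224)      # stand-in for base-initialization images
    Z = G(X)                              # (N, d, m, m) mid-level feature tensors

# Quantize each spatial location's d-dimensional vector with product quantization.
N, _, m, _ = Z.shape
flat = Z.permute(0, 2, 3, 1).reshape(-1, d).numpy().astype("float32")

pq = faiss.ProductQuantizer(d, s, n_bits)  # s codebooks of 2^n_bits centroids, each of dim d/s
pq.train(flat)
codes = pq.compute_codes(flat)             # (N*m*m, s) uint8 codes -- this is what gets stored

# At replay time, decode the codes back into approximate features and feed them to F(.).
Z_hat = torch.from_numpy(pq.decode(codes)).reshape(N, m, m, d).permute(0, 3, 1, 2)
logits = F(Z_hat)                          # (N, 1000) class scores
```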
### Augmentation During Replay
1. The reconstructed (dequantized) tensors are randomly cropped and resized, using bilinear interpolation, back to the original tensor dimensions.
2. REMIND mixes features from multiple classes using manifold mixup (a rough sketch of both augmentations follows this list).
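Below is a rough sketch of these two augmentations applied to reconstructed feature tensors. The crop scale range and the Beta(alpha, alpha) mixup parameter are illustrative assumptions, not the paper's values.

```python
import torch
import torch.nn.functional as nnF

def random_resize_crop_features(z, out_size, scale=(0.75, 1.0)):
    """Take a random square crop of a (N, d, m, m) feature tensor and
    bilinearly resize it back to (N, d, out_size, out_size)."""
    n, d, m, _ = z.shape
    crop = max(1, int(m * torch.empty(1).uniform_(*scale).item()))
    top = torch.randint(0, m - crop + 1, (1,)).item()
    left = torch.randint(0, m - crop + 1, (1,)).item()
    patch = z[:, :, top:top + crop, left:left + crop]
    return nnF.interpolate(patch, size=(out_size, out_size),
                           mode="bilinear", align_corners=False)

def manifold_mixup(z, y_onehot, alpha=0.1):
    """Convexly combine features (and one-hot labels) with a shuffled copy of the batch."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(z.size(0))
    z_mix = lam * z + (1.0 - lam) * z[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return z_mix, y_mix
```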
### Initializing REMIND
1. Base initialization is done offline, during which both $\theta_G$ and $\theta_F$ are trained. After that, only $\theta_F$ remains plastic.
2. All of the examples $\mathbf{X}_i$ from the base initialization are pushed through the model to obtain $\mathbf{Z}_i = G(\mathbf{X}_i)$, and these $\mathbf{Z}_i$ are used to learn the product quantization model, which is kept fixed once learned (a schematic of a single streaming update after base initialization follows).
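Putting the pieces together, here is a schematic of a single streaming update after base initialization, reusing the `G`, `F`, and `pq` objects from the sketch above. The replay batch size, the optimizer settings, and the plain-dictionary replay buffer are illustrative simplifications; REMIND's actual buffer management and replay augmentations are omitted.

```python
import random
import torch
import torch.nn.functional as nnF

memory = {}                                        # example index -> (PQ codes, label)
optimizer = torch.optim.SGD(F.parameters(), lr=0.1, momentum=0.9)

def streaming_update(x_new, y_new, idx, replay_size=4):
    # 1. Encode the new example's mid-level features and keep only its PQ codes.
    with torch.no_grad():
        z_new = G(x_new.unsqueeze(0))              # (1, d, m, m)
    d, m = z_new.size(1), z_new.size(2)
    flat = z_new.permute(0, 2, 3, 1).reshape(-1, d).numpy().astype("float32")
    memory[idx] = (pq.compute_codes(flat), y_new)

    # 2. Reconstruct the new example plus a few stored ones from their codes.
    keys = [idx] + random.sample([k for k in memory if k != idx],
                                 k=min(replay_size, len(memory) - 1))
    feats, labels = [], []
    for k in keys:
        codes_k, y_k = memory[k]
        z_k = torch.from_numpy(pq.decode(codes_k)).reshape(1, m, m, d).permute(0, 3, 1, 2)
        feats.append(z_k)
        labels.append(y_k)
    batch_z, batch_y = torch.cat(feats), torch.tensor(labels)

    # 3. Take a single gradient step on the plastic top layers F(.) only.
    optimizer.zero_grad()
    loss = nnF.cross_entropy(F(batch_z), batch_y)
    loss.backward()
    optimizer.step()
    return loss.item()
```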
## Experiments
### Comparison Models
All models are evaluated in the streaming paradigm:
* REMIND
* Fine-Tuning (No Buffer)
* ExStream
* SLDA
* iCaRL
* Unified
* BiC
* Offline
### Model Configurations
All models, except SLDA, are trained using cross-entropy loss with stochastic gradient descent and momentum.
### Datasets, Data Orderings, & Metrics
* Experiments were conducted on ImageNet and CORe50.
* Four types of data ordering were tested:
* iid
* class iid
* instance
* class instance
* Performance Metric:
$\Omega_{all} = \frac{1}{T}\sum_{t=1}^{T}\frac{\alpha_{t}}{\alpha_{offline,t}}$
    where $T$ is the number of test events, $\alpha_t$ is the model's accuracy at test event $t$, and $\alpha_{offline,t}$ is the accuracy of the optimized offline learner at test event $t$ (a small helper computing this metric follows the list below).
* Top-5 accuracy is reported for ImageNet and top-1 accuracy for CORe50.
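For concreteness, a small helper that computes $\Omega_{all}$ from per-test-event accuracies (the numbers in the usage example are made up):

```python
def omega_all(stream_accuracies, offline_accuracies):
    """Average, over T test events, of the streaming model's accuracy divided by
    the offline model's accuracy at the same event."""
    assert len(stream_accuracies) == len(offline_accuracies)
    ratios = [a / a_off for a, a_off in zip(stream_accuracies, offline_accuracies)]
    return sum(ratios) / len(ratios)

# e.g. a streamer reaching 60%, 55%, 50% against an offline learner at 70%, 68%, 66%
print(omega_all([0.60, 0.55, 0.50], [0.70, 0.68, 0.66]))   # ~0.81
```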
## Discussion and Conclusion