# Notes on "Lifelong Machine Learning with Deep Streaming Linear Discriminant Analysis" ###### tags: `continual learning` #### Author [Rishika Bhagwatkar](https://https://github.com/rishika2110) ## Introduction The output layer of the CNN network is trained using Deep Streaming Linear Discriminant Analysis (SLDA) algorithm for continual learning of the network on image classification task. ## Related Work ### Streaming Learning * Incremental Batch Learning: Agent learns dataset $\mathcal{D}$ broken into multiple batches $\mathcal{D} = \sum_{t=1}^{T} B_t$ each of size $N_t$. At time $t$, it has access to only $B_t$ and loops over it till the agent learns adaquately. Testing is done between batches. * Streaming Learning: Agent learns one example at a time ($N_t = 1$) in a single pass, doesn't loop through entire dataset. Testing can be done at any time. There has been very little progress on the streaming learning paradigm. ### Methods for Incremental Batch Learning * Rehearsal or replay: * Partial replay: Agent stores and mixes a subset of previously seen examples with the current examples and learn on this mixture. * Pseudo replay: A generative model (AE or GAN) is trained to generate pseudo examples from the previously seen data. It is mixed with the current examples and the agent learns from this mixture. It is slow and difficult to optimise a generative model. Incremental batch learning is not appropriate for real-time applications as inference is possible only after a batch is accumulated. ### Methods for Streaming Learning * Methods used for combining DNNs and continual learning: * Gradient Episodic Memory (GEM): Regularisation is used to constraint weight updates on new tasks to prevent loss from increasing. However, task label is necessay during inference, without it, the model's performance will degrade. * ExStream: It updates only the fully connected layers of CNN. Partial rehearsal is used for catastrophic mitigation. However, suffers from memory and compute constraints due to rehearsal mechanisms. * Streaming Linear Discriminant Analysis (SLDA): * Maintains one running mean per class and a shared covariance matrix that can be held fixed, or updated using an online update. * For prediction, it assigns a label given as input to closest gaussian computed using running mean and covariance matrix. * Streaming Quadratic Discriminant Analysis (SQDA): * Assumes that each class is normally distributed and each class has its own covariance. * Storing covariance matrix for each class leads to more memory and compute consumptions. The motivation behind using SLDA is that LDA is equivalent to the softmax classifier in ANNs. ## Deep Streaming LDA * A CNN network is decomposed into 2 networks F(.) and G(.), consisting of last layer and first J layers of CNN (ResNet-18) respectively. \begin{equation} y_t = F(G(\mathbf{X_t})) \end{equation} $\mathbf{X_t}$ is the input image and $y_t$ is the output category * The parameters of G(.) are kept fixed. 
* The output layer computes
\begin{equation}
F(G(\mathbf{X}_t)) = \mathbf{W}\mathbf{z}_t + \mathbf{b}
\end{equation}
where $\mathbf{z}_t = G(\mathbf{X}_t) \in \mathbb{R}^{d}$ is the feature vector, $d$ is its dimensionality, $\mathbf{W} \in \mathbb{R}^{K \times d}$, $K$ is the total number of classes, and $\mathbf{b} \in \mathbb{R}^{K}$.
* Whenever a new datapoint $(\mathbf{z}_t, y)$ arrives, the mean vector of class $y$ is updated and its associated counter is incremented:
\begin{equation}
\mu_{(k=y, t+1)} \leftarrow \frac{c_{(k=y,t)}\,\mu_{(k=y, t)} + \mathbf{z}_t}{c_{(k=y,t)} + 1}
\end{equation}
\begin{equation}
c_{(k=y, t+1)} = c_{(k=y,t)} + 1
\end{equation}
where $\mu_{(k=y, t)} \in \mathbb{R}^{d}$ is the mean vector of class $y$ at time $t$ and $c_{(k=y,t)}$ is the counter associated with class $y$.
* Two variants of SLDA are considered:
    * The covariance matrix is frozen after the base initialisation of $G(\cdot)$.
    * The covariance matrix is updated in the streaming paradigm.
* In the streaming variant, the covariance update has the form $\Sigma_{t+1} = \frac{t\Sigma_{t} + \Delta_t}{t + 1}$, where $\Sigma \in \mathbb{R}^{d \times d}$ is the shared covariance matrix and $\Delta_t$ is computed as
\begin{equation}
\Delta_t = \frac{t(\mathbf{z}_t - \mu_{(k=y,t)})(\mathbf{z}_t - \mu_{(k=y,t)})^T}{t+1}
\end{equation}
* The precision matrix is computed using shrinkage regularisation as $\mathbf{\Lambda} = [(1 - \epsilon)\Sigma + \epsilon\mathbf{I}]^{-1}$, where $\epsilon = 10^{-4}$ is the shrinkage parameter and $\mathbf{I} \in \mathbb{R}^{d \times d}$ is the identity matrix.
* For prediction, the rows $\mathbf{w}_k$ of $\mathbf{W}$ are computed as $\mathbf{w}_k = \mathbf{\Lambda}\mu_k$ and the elements of $\mathbf{b}$ as $b_k = -\frac{1}{2}(\mu_k \cdot \mathbf{\Lambda}\mu_k)$. (A minimal end-to-end sketch of these update and prediction rules is given at the end of these notes.)

## Experiments
The following methods were evaluated on the ImageNet ILSVRC-2012 and CORe50 datasets:
* Deep SLDA - streaming learning method
* Fine-Tuning - streaming learning method
* ExStream - streaming learning method
* iCaRL - incremental batch learning method
* End-to-End - incremental batch learning method
* Offline - offline baseline used for normalisation (upper bound)

**Performance Metric**
\begin{equation}
\Omega_{all} = \frac{1}{T} \sum_{t=1}^{T} \frac{\alpha_t}{\alpha_{offline,t}}
\end{equation}
where $\alpha_t$ is the incremental learner's performance at time $t$ and $\alpha_{offline,t}$ is the performance of an optimised offline model trained on all data seen up to time $t$.

## Conclusion
1. SLDA not only outperforms iCaRL and ExStream but also runs in significantly less time on ImageNet. Additionally, it is extremely memory-efficient, as only the class means and the shared covariance matrix are stored.
2. SLDA was able to transfer from ImageNet to CORe50 without requiring a base initialisation phase, and is therefore well suited to applications where a user already has good feature representations and would like to begin streaming learning on a different dataset immediately.
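**Reference sketch of the SLDA update and prediction rules**

As referenced in the Deep Streaming LDA section, the sketch below puts the running-mean, counter, covariance, and prediction rules together in NumPy. It assumes the streaming-covariance variant; the class and method names (`SLDA`, `fit_one`, `predict`) are illustrative and not taken from the authors' released code.

```python
import numpy as np

class SLDA:
    """Minimal sketch of streaming LDA over fixed features z_t (streaming-covariance variant)."""

    def __init__(self, d, num_classes, eps=1e-4):
        self.mu = np.zeros((num_classes, d))   # one running mean per class
        self.c = np.zeros(num_classes)         # per-class counters
        self.Sigma = np.zeros((d, d))          # shared covariance matrix
        self.t = 0                             # total number of examples seen
        self.eps = eps                         # shrinkage parameter

    def fit_one(self, z, y):
        # Covariance update: Sigma_{t+1} = (t * Sigma_t + Delta_t) / (t + 1),
        # with Delta_t = t * (z - mu_y) (z - mu_y)^T / (t + 1).
        # (In the frozen-covariance variant these two lines are skipped.)
        diff = z - self.mu[y]
        delta = self.t * np.outer(diff, diff) / (self.t + 1)
        self.Sigma = (self.t * self.Sigma + delta) / (self.t + 1)
        # Running-mean and counter update for class y.
        self.mu[y] = (self.c[y] * self.mu[y] + z) / (self.c[y] + 1)
        self.c[y] += 1
        self.t += 1

    def predict(self, z):
        # Precision matrix with shrinkage regularisation.
        d = self.Sigma.shape[0]
        Lam = np.linalg.inv((1 - self.eps) * self.Sigma + self.eps * np.eye(d))
        W = self.mu @ Lam                       # rows w_k = Lambda mu_k (Lambda is symmetric)
        b = -0.5 * np.sum(W * self.mu, axis=1)  # b_k = -1/2 (mu_k . Lambda mu_k)
        return int(np.argmax(W @ z + b))

# Usage (streaming protocol): one example at a time, test whenever needed.
# slda = SLDA(d=512, num_classes=1000)
# for z_t, y_t in stream:
#     slda.fit_one(z_t, y_t)
# y_hat = slda.predict(z_new)
```

In use, features $\mathbf{z}_t$ from the frozen backbone sketched earlier would be passed to `fit_one` one example at a time, and `predict` can be called at any point, matching the streaming protocol described above.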