# Understanding the Role of Training Regimes in Continual Learning

> Authors: Seyed Iman Mirzadeh, Mehrdad Farajtabar, Razvan Pascanu, Hassan Ghasemzadeh
> Conference: NeurIPS 2020

## 🎯 Motivation

+ The authors observe that training hyperparameters such as `learning rate, batch size, ...` and regularization choices such as `weight decay, dropout, ...` (collectively, the *training regime*) can strongly affect how much a continual-learning model forgets. Rather than cataloguing the usual methods for handling **catastrophic forgetting**, they try to explain *why* these methods and hyperparameters can significantly improve continual-learning performance.
+ In particular, the analysis focuses on optimization and the geometry of the loss landscape.

## 🔗 Related work

Existing methods are categorized into 3 groups:
+ Replay-based methods: store knowledge learned from old tasks and rehearse it, known as **experience replay**.
+ Constrained parameter updates: update the weights that matter less for previous tasks while preserving the important ones (regularization-based methods such as EWC, which uses the Fisher Information Matrix).
+ Parameter-isolation methods: each task uses its own dedicated parameters (e.g., PathNet), effectively enabling a task-specific gate.

![](https://i.imgur.com/nkQXixr.png)

## 📌 Main Point

Through their analysis, the authors argue that **the wider the minima of each task are, the less forgetting happens**. (A sketch of one way to probe minima width is given at the end of these notes.)

![](https://i.imgur.com/DCgzD3W.png)

### Optimization setting: learning rate, batch size, optimizer

+ A higher learning rate pushes the optimizer toward wider minima, because large steps overshoot narrow minima and cannot settle in them. However, a learning rate that stays high also makes every weight update large, moving the parameters far from the previous task's solution. The practical strategy is therefore to start with a large learning rate and decay it as training proceeds.
+ The most effective optimizer is plain SGD, and its advantage shows up most clearly in the later tasks. The standard regime for continual learning is thus: learning-rate decay (start large, decay after a certain number of epochs), a small batch size, and SGD with momentum. A minimal sketch of this regime is given at the end of these notes.

### Regularization: weight decay (L<sub>2</sub> regularization) vs. dropout

Dropout is data-dependent, which makes it harder to balance learning on the new task against regularization. L<sub>2</sub> regularization combined with Batch Normalization (which keeps gradients from becoming too large or too small) allows the use of a large learning rate.
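
To make the recommended regime concrete, here is a minimal PyTorch sketch that combines the pieces above: SGD with momentum, a small batch size, L<sub>2</sub> weight decay, dropout, and a learning rate that starts large and decays on a schedule. The toy model, the dummy data, and every hyperparameter value are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy MLP with dropout (the data-dependent regularizer discussed above).
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.25),
    nn.Linear(256, 10),
)

# Dummy data and a small batch size, as recommended for continual learning.
dummy_x = torch.randn(512, 784)
dummy_y = torch.randint(0, 10, (512,))
loader = DataLoader(TensorDataset(dummy_x, dummy_y), batch_size=16, shuffle=True)

# SGD with momentum plus weight decay (L2 regularization).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Start with a large learning rate and decay it after a fixed number of epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()  # learning-rate decay once per epoch
```

In a continual-learning setting, the same loop would be repeated for each task in sequence, typically resetting the learning rate at the start of each task and decaying it again within the task.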
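
The main point above suggests measuring how wide (flat) the minimum reached on each task is. Below is a small PyTorch sketch of one common way to probe this: power iteration on Hessian-vector products to estimate the largest-magnitude eigenvalue of the loss Hessian, where a smaller top eigenvalue indicates a flatter, wider minimum. This is a generic curvature probe written for illustration; the function name and iteration count are my own choices, and it is not claimed to be the paper's measurement code.

```python
import torch

def top_hessian_eigenvalue(loss, params, iters=20):
    """Estimate the largest-magnitude Hessian eigenvalue of `loss` w.r.t.
    `params` via power iteration on Hessian-vector products.

    `loss` must still carry its computation graph (call this before .backward()).
    """
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Start from a random unit vector shaped like the parameters.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / norm for x in v]

    eigenvalue = 0.0
    for _ in range(iters):
        # Hessian-vector product: differentiate (grad . v) w.r.t. the parameters.
        grad_dot_v = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(grad_dot_v, params, retain_graph=True)
        # Rayleigh quotient v^T H v (v has unit norm).
        eigenvalue = sum((h * x).sum() for h, x in zip(hv, v)).item()
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / norm for h in hv]
    return eigenvalue
```

After training on a task, calling `top_hessian_eigenvalue(criterion(model(x), y), list(model.parameters()))` on a batch gives a scalar curvature estimate that can be compared across training regimes: regimes that end in flatter minima should forget less.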