{%hackmd SybccZ6XD %}
###### tags: `paper`
# Learning without Forgetting
Goal
> learning new visual capabilities while maintaining performance on existing ones.
## Related methods
Original model
> A CNN with shared parameters $\theta_s$ and task-specific parameters $\theta_o$ for each previously learned task; learning a new task adds randomly initialized task-specific parameters $\theta_n$.
> 
The meaning of color in the diagram
> In the paper's overview figure, the color/shading of each layer indicates how its parameters are treated by a given method: kept unchanged, fine-tuned, or randomly initialized and trained.
> 
Fine-tuning
> drawback: degrades performance on previously learned tasks because the shared parameters change without new guidance for the original task-specific prediction parameters.
> 
Feature Extraction
> drawback: underperforms on the new task because the shared parameters fail to represent some information that is discriminative for the new task.
> 
Joint Training
> drawback: not applicable if the training data for previously learned tasks is unavailable, and it becomes increasingly cumbersome as more tasks are learned.
> 
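To make the distinction concrete, here is a minimal PyTorch-style sketch of which parameter groups each baseline updates. The modules `backbone`, `old_head`, and `new_head` are illustrative names (not from the paper): the backbone stands in for the shared parameters $\theta_s$, the heads for $\theta_o$ and $\theta_n$.

```python
import torch.nn as nn

# Illustrative modules (hypothetical names, not from the paper).
backbone = nn.Sequential(nn.Conv2d(3, 64, 3), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
old_head = nn.Linear(64, 10)   # previously learned task (theta_o)
new_head = nn.Linear(64, 5)    # new task, randomly initialized (theta_n)

# Feature extraction: theta_s and theta_o frozen, only theta_n is trained.
feature_extraction_params = list(new_head.parameters())

# Fine-tuning: theta_s and theta_n are trained; theta_o is kept but receives no
# guidance while theta_s drifts, which is why old-task performance degrades.
fine_tuning_params = list(backbone.parameters()) + list(new_head.parameters())

# Joint training: everything is trained, but this requires the old task's
# images and labels, which may no longer be available.
joint_training_params = (list(backbone.parameters())
                         + list(old_head.parameters())
                         + list(new_head.parameters()))
```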
## Learning without Forgetting
Learning without Forgetting
> Similar to Joint Training, but does not need the old task’s images and labels.
> 
Algorithm
> 1. Run the original network on the new task's images and record its outputs $y_o$ for each old task.
> 2. Add randomly initialized task-specific parameters $\theta_n$ for the new task.
> 3. Warm-up step: train $\theta_n$ to convergence on the new task while keeping $\theta_s$ and $\theta_o$ frozen.
> 4. Joint-optimize step: train all of $\theta_s$, $\theta_o$, $\theta_n$ to minimize $\lambda_o \mathcal{L}_{old}(y_o, \hat y_o) + \mathcal{L}_{new}(y_n, \hat y_n) + \mathcal{R}(\hat\theta_s, \hat\theta_o, \hat\theta_n)$.
> 
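A rough sketch of this two-stage procedure in PyTorch. Module names, the assumption that `new_loader` yields batches in a fixed order, and all hyperparameters are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def lwf_train(backbone, old_head, new_head, new_loader,
              lambda_o=1.0, T=2.0, warmup_epochs=5, joint_epochs=20, lr=1e-3):
    """Sketch of the LwF procedure (assumes new_loader is not shuffled)."""
    # Step 1: record the old network's responses y_o on the new-task images.
    backbone.eval(); old_head.eval()
    recorded = []
    with torch.no_grad():
        for x, _ in new_loader:
            recorded.append(torch.softmax(old_head(backbone(x)), dim=1))

    # Step 2 (warm-up): train only theta_n; theta_s and theta_o stay frozen.
    for p in list(backbone.parameters()) + list(old_head.parameters()):
        p.requires_grad_(False)
    opt = torch.optim.SGD(new_head.parameters(), lr=lr)
    for _ in range(warmup_epochs):
        for x, y in new_loader:
            loss = F.cross_entropy(new_head(backbone(x)), y)
            opt.zero_grad(); loss.backward(); opt.step()

    # Step 3 (joint optimize): train theta_s, theta_o, theta_n together,
    # using the recorded responses as targets for the old task.
    backbone.train(); old_head.train(); new_head.train()
    for p in list(backbone.parameters()) + list(old_head.parameters()):
        p.requires_grad_(True)
    params = (list(backbone.parameters()) + list(old_head.parameters())
              + list(new_head.parameters()))
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(joint_epochs):
        for (x, y), y_o in zip(new_loader, recorded):
            feats = backbone(x)
            l_new = F.cross_entropy(new_head(feats), y)        # L_new
            # L_old: distillation against recorded responses with temperature T;
            # softmax(logits / T) equals the paper's (p^{1/T} / sum p^{1/T}) scaling.
            log_p = F.log_softmax(old_head(feats) / T, dim=1)  # current, scaled
            q = torch.softmax(torch.log(y_o) / T, dim=1)       # recorded, scaled
            l_old = -(q * log_p).sum(dim=1).mean()
            loss = l_new + lambda_o * l_old
            opt.zero_grad(); loss.backward(); opt.step()
```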
Loss (for new task)
> Multinomial logistic loss (standard cross-entropy):
> $\mathcal{L}_{new}(y_n, \hat y_n) = -y_n \cdot \log \hat y_n$
> $\hat y_n$: softmax output of the network for the new task
> $y_n$: one-hot ground-truth label
> 
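A tiny numeric check of $\mathcal{L}_{new}$ with made-up logits and label, showing it is just the usual cross-entropy:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # made-up network output for one image
y = torch.tensor([0])                        # made-up ground-truth class index
y_hat = torch.softmax(logits, dim=1)         # \hat{y}_n: softmax output

# With one-hot y_n, -y_n . log(\hat{y}_n) reduces to the log-prob of the true class.
l_new_manual = -torch.log(y_hat[0, y[0]])
l_new_builtin = F.cross_entropy(logits, y)   # same value
print(l_new_manual.item(), l_new_builtin.item())
```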
Loss (for old task)
> Knowledge distillation loss: cross-entropy between temperature-scaled probabilities,
> $\mathcal{L}_{old}(y_o, \hat y_o) = -\sum_i y_o'^{(i)} \log \hat y_o'^{(i)}$
> with $y_o'^{(i)} = \dfrac{(y_o^{(i)})^{1/T}}{\sum_j (y_o^{(j)})^{1/T}}$ and $\hat y_o'^{(i)} = \dfrac{(\hat y_o^{(i)})^{1/T}}{\sum_j (\hat y_o^{(j)})^{1/T}}$
> $y_o'$: recorded probabilities of the original network (temperature-scaled)
> $\hat y_o'$: current probabilities of the network being trained (temperature-scaled)
> The paper sets $T = 2$, following the knowledge distillation setup of Hinton et al.
> 
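A minimal sketch of $\mathcal{L}_{old}$, assuming both the recorded and current outputs are already probability vectors; the function name and example numbers are hypothetical.

```python
import torch

def lwf_distillation_loss(y_o, y_o_hat, T=2.0):
    """L_old: cross-entropy between temperature-scaled probability vectors.
    y_o: recorded probabilities from the original network (batch x classes).
    y_o_hat: current probabilities of the same old-task head.
    """
    # Temperature scaling: y'^{(i)} = (y^{(i)})^{1/T} / sum_j (y^{(j)})^{1/T}
    q = y_o.pow(1.0 / T); q = q / q.sum(dim=1, keepdim=True)
    p = y_o_hat.pow(1.0 / T); p = p / p.sum(dim=1, keepdim=True)
    return -(q * p.log()).sum(dim=1).mean()

# Made-up example: two classes, recorded vs. current probabilities.
y_o = torch.tensor([[0.9, 0.1]])
y_o_hat = torch.tensor([[0.7, 0.3]])
print(lwf_distillation_loss(y_o, y_o_hat))
```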
Relationship to joint training
> LwF can be viewed as joint training in which the old task's images and labels are replaced by the new task's images together with the recorded old-task responses. However, the distribution of images from these tasks may be very different, and this substitution may potentially decrease performance. Therefore, joint training's performance may be seen as an upper bound for LwF.
> 
Limitations
- It cannot properly handle domains that change continually along a spectrum (e.g., the old task is classification from a top-down view and the new task is classification from views at unknown angles).
- LwF requires all new-task training data to be available up front, since the old-task responses must be recorded on it before training.
- Its ability to incrementally learn many new tasks is limited: old-task performance gradually drops as more tasks are added.
- The gap between LwF and joint training on both tasks is larger when the VGG architecture is used.