# Overcoming catastrophic forgetting in neural networks
###### tags: `references`
Chinese-language explanations: https://blog.csdn.net/dhaiuda/article/details/103967676
https://zhuanlan.zhihu.com/p/25905742
## Introduction
1. Catastrophic forgetting: knowledge of a previously learned task (task A) is abruptly lost when information relevant to the current task (task B) is incorporated.
2. Catastrophic forgetting occurs when a network is trained sequentially on multiple tasks, because the weights in the network that are important for task A are changed to meet the objectives of task B.
3. Multitask learning: data from all tasks are simultaneously available during training.
4. If tasks are presented sequentially, multitask learning can be used only if the data are recorded by an episodic memory system and replayed to the network during training, which is impractical when learning large numbers of tasks.
5. In neurobiology, mammalian brains can learn multiple tasks sequentially because they can protect previously learned knowledge in neocortical circuits.
6. Continual learning in the neocortex relies on task-specific synaptic consolidation, whereby knowledge is durably encoded by reducing the plasticity of a proportion of synapses, which therefore remain stable over long timescales.
7. Elastic weight consolidation (EWC): slows down learning on certain weights based on how important they are to previously seen tasks

* When a mouse acquires a new skill, a proportion of excitatory synapses are strengthened; this manifests as an increase in the volume of individual dendritic spines of neurons. Critically, these enlarged dendritic spines persist despite the subsequent learning of other tasks, accounting for the retention of performance several months later. When these spines are selectively "erased," the corresponding skill is forgotten.
## Method

* To find the most probable values for model parameters given some data $D$: $\log{p(\theta|D)} = \log{p(D|\theta)} + \log{p(\theta)} - \log{p(D)}$ [1]
* Assume that the data are split into two independent parts, $D_A, D_B$. $D_A$ defines task $A$ and $D_B$ defines task $B$. [1] can be rearranged as
$\log{p(\theta|D)} = \log{\frac{p(\theta,D_A, D_B)}{p(D_A, D_B)}}=\log{\frac{p(\theta,D_B|D_A)p(D_A)}{p(D_B|D_A)p(D_A)}}=\log{\frac{\frac{p(\theta, D_B, D_A)}{p(D_A, \theta)}\quad\frac{p(D_A,\theta)}{p(D_A)}}{p(D_B|D_A)}}=\log{\frac{p(D_B|D_A,\theta)p(\theta|D_A)}{p(D_B|D_A)}}$
since $D_A$ and $D_B$ are independent (both marginally and conditionally on $\theta$), $p(D_B|D_A)=p(D_B)$ and $p(D_B|D_A,\theta)=p(D_B|\theta)$, so
$\log{p(\theta|D)}=\log{p(D_B|\theta)}+\log{p(\theta|D_A)}-\log{p(D_B)}$ [2]
* $\log{p(D_B|\theta)}$ is (the negative of) the loss function for task $B$, so it depends only on task $B$
* $\log{p(\theta|D_A)}$ is the posterior over task $A$; it absorbs all information about task $A$, but it is intractable and must be approximated
* Fisher information matrix
* Fisher information is a way of measuring the amount of information that an observable random variable $X$ carries about an unknown parameter $\theta$ of the distribution of $X$
* score: the first derivative of the log-likelihood, $\frac{\partial}{\partial\theta}\log{f(X;\theta)}$; its expected value is
$E[\frac{\partial}{\partial\theta}\log{f(X;\theta)}|\theta]=E[\frac{1}{f(X;\theta)}\frac{\partial}{\partial\theta}f(X;\theta)|\theta]$
if $\theta$ is exactly the true parameter, the above equals $\int_{}{}\frac{1}{f(x;\theta)}[\frac{\partial}{\partial\theta}f(x;\theta)]f(x;\theta)dx=\frac{\partial}{\partial\theta}\int_{}{}f(x;\theta)dx=\frac{\partial}{\partial\theta}1=0$
a vanishing expected first derivative means the true $\theta$ is a stationary point (a maximizer) of the expected log-likelihood.
* Fisher information: the variance of score $I(\theta)=E[(\frac{\partial}{\partial\theta}\log{f(X;\theta)})^2|\theta]$
* if $\log{f(X;\theta)}$ is twice differentiable, the Fisher information can equivalently be written as
$I(\theta)=-E[\frac{\partial^2}{\partial\theta^2}\log{f(X;\theta)}|\theta]$

* a large (in magnitude) second derivative means the log-likelihood is sharply peaked around the estimate, so it carries more information about $\theta$ and the estimation is more certain (a quick numerical check of these identities follows after this list)

* Matrix form
* when there are $N$ parameters $\Theta = [\theta_1, \theta_2,..., \theta_N]$, then the Fisher information takes the form of an $N\times N$ matrix: $F_{i,j}=E[(\frac{\partial}{\partial\theta_i}\log{f(X;\Theta)})(\frac{\partial}{\partial\theta_j}\log{f(X;\Theta)})|\Theta]=-E[(\frac{\partial^2}{\partial\theta_i\partial\theta_j}\log{f(X;\Theta)})|\Theta]$
* it can be computed from first-order derivatives alone and is thus easy to calculate even for large models
* it is guaranteed to be positive semidefinite.
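
A quick Monte-Carlo sanity check of these identities (a toy example of mine, not from the paper): for a Bernoulli($\theta$) model the Fisher information is known in closed form, $I(\theta)=\frac{1}{\theta(1-\theta)}$, so we can verify that the expected score vanishes and that the variance of the score matches the negative expected second derivative.

```python
import numpy as np

# Bernoulli(theta) model: log f(x; theta) = x*log(theta) + (1-x)*log(1-theta)
# Closed form: I(theta) = 1 / (theta * (1 - theta))
rng = np.random.default_rng(0)
theta = 0.3                                        # true parameter
x = rng.binomial(1, theta, size=1_000_000).astype(float)

score = x / theta - (1 - x) / (1 - theta)          # d/dtheta log f(x; theta)
second = -x / theta**2 - (1 - x) / (1 - theta)**2  # d^2/dtheta^2 log f(x; theta)

print(score.mean())               # ~0: expected score vanishes at the true theta
print((score**2).mean())          # ~4.76: variance of the score
print(-second.mean())             # ~4.76: negative expected second derivative
print(1 / (theta * (1 - theta)))  # 4.7619...: closed-form Fisher information
```
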
* Compute $\log{p(\theta|D_A)}$ in [2]
* $p(\theta|D_A)=\frac{p(D_A|\theta)p(\theta)}{p(D_A)}$; $p(D_A)$ is constant in $\theta$ and the prior $p(\theta)$ is treated as flat, so we only need to compute $p(D_A|\theta)$
* Taylor expansion (to second order) around $\theta=\theta^*_A$:
$\log{p(D_A|\theta)}\approx\log{p(D_A|\theta^*_A)}+\frac{\partial\log{p(D_A|\theta^*_A)}}{\partial\theta}(\theta-\theta^*_A)+\frac{1}{2}\frac{\partial^2\log{p(D_A|\theta^*_A)}}{\partial\theta^2}(\theta-\theta^*_A)^2$
if $\theta^*_A$ maximizes the log-likelihood of task $A$, the first derivative $\frac{\partial\log{p(D_A|\theta^*_A)}}{\partial\theta}=0$ and the second derivative $\frac{\partial^2\log{p(D_A|\theta^*_A)}}{\partial\theta^2}<0$; approximating the negative second derivative by the Fisher information $F$,
$\log{p(D_A|\theta)}\approx\log{p(D_A|\theta^*_A)}-\frac{1}{2}F(\theta-\theta^*_A)^2$ [3]
* Assume the parameters are independent (this heuristic is what justifies keeping only the diagonal of the Fisher matrix): $\log{p(D_A|\theta)}=\log{p(D_A|\theta_1,...,\theta_n)}=\log{\frac{p(D_A,\theta_1,...,\theta_n)}{p(\theta_1,...,\theta_n)}}=\log{\frac{p(D_A, \theta_1)...p(D_A,\theta_n)}{p(\theta_1)...p(\theta_n)}}=\log{p(D_A|\theta_1)...p(D_A|\theta_n)}$
$=\sum_i{\log{p(D_A|\theta_i)}}\approx\sum_i{[\log{p(D_A|\theta_{A,i}^*)}-\frac{1}{2}F_i(\theta_i-\theta_{A,i}^*)^2]}$ [4]
since $p(D_A|\theta^*_{A, i})=p(\theta^*_{A, i}|D_A)\frac{p(D_A)}{p(\theta^*_{A, i})}$,
$\log{p(D_A|\theta)}\approx\sum_i{[\log{p(\theta^*_{A, i}|D_A)}+\log{p(D_A)}-\log{p(\theta^*_{A, i})}-\frac{1}{2}F_i(\theta_i-\theta_{A,i}^*)^2]}$ [5]
* Elastic weight consolidation (EWC): maximizing [2] is equivalent to minimizing its negative, and keeping only the terms that depend on $\theta$, [2] can be rearranged into the loss function $L(\theta)=L_B(\theta)+\sum_i{\frac{\lambda}{2}F_i}(\theta_i-\theta_{A,i}^*)^2$, where $\lambda$ sets how important the old task is relative to the new one (a code sketch follows below)
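
A minimal PyTorch sketch of this loss, assuming a generic `model`, `data_loader`, and `loss_fn` (all names here are hypothetical, not the authors' code). The diagonal Fisher is approximated by the mean squared gradient of the log-likelihood at $\theta^*_A$ (strictly speaking, this is the empirical Fisher):

```python
import torch

def diagonal_fisher(model, data_loader, loss_fn):
    # F_i ~= E[(d/dtheta_i log p(D_A | theta))^2], estimated at theta*_A
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()        # gradient of the negative log-likelihood
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2  # squared first derivatives only
    return {n: f / len(data_loader) for n, f in fisher.items()}

def ewc_penalty(model, fisher, star_params, lam):
    # sum_i (lambda / 2) * F_i * (theta_i - theta*_{A,i})^2
    penalty = 0.0
    for n, p in model.named_parameters():
        penalty = penalty + (fisher[n] * (p - star_params[n]) ** 2).sum()
    return (lam / 2) * penalty

# While training on task B:
#   loss = loss_B + ewc_penalty(model, fisher_A, theta_star_A, lam)
```
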
## Experiments
1. Linear network:
* dataset: random patterns (a generation sketch follows at the end of this item)
a random $n$-dimensional vector $x_t$ whose elements are $\pm\frac{1}{\sqrt{n}}$, with $n=1000$, paired with a random binary output $y_t$
* at time step $i$, only the $i$th pattern is accessible for training the parameters $W$
* $W$ is found by least-squares minimization
* loss function at step $i$ under metric $M_i$: $L_i=\frac{1}{2}(W_ix_i-y_i)^2+\|W_i-W_{i-1}\|_{M_i}$
* Gradient Descent: fixed, uniform metric
* EWC: diagonal of Fisher information matrix
* evaluation: signal-to-noise ratio (SNR)
* signal: average memory strength of a memory (whether the network can still answer pattern $i$ correctly some time after first seeing it)
* noise: variance of the signal

* (figure, bottom panel) the fraction of patterns whose SNR exceeds 1
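
A short sketch of generating this dataset, under my reading of the setup (whether the binary outputs are $\{0,1\}$ or $\pm 1$ is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_patterns = 1000, 200

# each pattern x_t has n entries drawn uniformly from {+1/sqrt(n), -1/sqrt(n)}
X = rng.choice([1.0, -1.0], size=(num_patterns, n)) / np.sqrt(n)
# each target y_t is an independent random binary output
y = rng.choice([1.0, -1.0], size=num_patterns)
```
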
2. Supervised learning: [github](https://github.com/ariseff/overcoming-catastrophic/blob/master/experiment.ipynb)
* Dataset: MNIST
* Model: multilayer fully connected network
* For each task, we generated a fixed, random permutation by which the input pixels of all images would be shuffled

* SGD performs poorly on previous tasks after learning new tasks
* L2 has difficulty learning new tasks, since the L2 regularization constraint protects all weights equally, leaving little spare capacity for learning new tasks
* Fisher overlap:
* Fréchet distance: a measure of the difference between probability distributions
* Denote the Fisher information matrices for tasks A and B as $F_A, F_B$; normalize them to each have unit trace, giving $\hat{F_A}, \hat{F_B}$; then $d(\hat{F_A}, \hat{F_B})=\frac{1}{2}tr(\hat{F_A}+\hat{F_B}-2(\hat{F_A}\hat{F_B})^{1/2})$
* Fisher overlap is defined as $1-d$ (a numerical sketch follows after this list)
* A small overlap means that the two tasks depend on different sets of weights
* A large overlap indicates that weights are being used for both of the two tasks (i.e., EWC enables sharing of representations)
* When the two tasks are more dissimilar from each other, the network begins to allocate separate weights for the two tasks; however, the layers closer to the output are reused for both tasks
* When a network is trained on two tasks that are very similar to each other, the tasks depend on similar sets of weights throughout the whole network
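
A small numerical sketch of the Fisher overlap for *diagonal* Fisher matrices (as EWC uses); in the diagonal case the matrix square root is elementwise, and the overlap reduces to $\sum_i\sqrt{\hat{F}_{A,i}\hat{F}_{B,i}}$, the Bhattacharyya coefficient of the normalized diagonals:

```python
import numpy as np

def fisher_overlap(fa_diag, fb_diag):
    fa = fa_diag / fa_diag.sum()   # normalize to unit trace
    fb = fb_diag / fb_diag.sum()
    # Frechet distance d = 1/2 * tr(FA + FB - 2*(FA FB)^(1/2)), elementwise here
    d = 0.5 * np.sum(fa + fb - 2.0 * np.sqrt(fa * fb))
    return 1.0 - d                 # overlap = sum_i sqrt(fa_i * fb_i)

rng = np.random.default_rng(0)
fa = rng.exponential(size=100)     # toy Fisher diagonals, not real task Fishers
print(fisher_overlap(fa, fa))                         # identical tasks -> 1.0
print(fisher_overlap(fa, rng.exponential(size=100)))  # dissimilar tasks -> < 1.0
```
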
3. Reinforcement learning
* Dataset: 10 games on Atari 2600
* Model: Deep Q Networks (DQN)
* At training time, the agent was exposed to experiences from each game for extended periods of time. The order of presentation of the games was randomized and allowed for returning to the same games several times.
* To infer which task is currently being performed, the authors use the Forget-Me-Not process (an online clustering algorithm) and compare it with directly providing the true task label.

* A single DQN agent with EWC that learns all 10 games does not perform as well as 10 separate DQN agents, each learning a single game.

* Check the quality of the importances estimated by the diagonal Fisher information
* Train an agent on a single game and measure how perturbing the network parameters affects the agent's score.
* Under the same perturbation size, parameters with lower Fisher information (i.e., larger entries in the inverse Fisher) should affect performance less than those with higher Fisher information (parameters with high Fisher information really are the important ones).
* Perturbing in the null space of the Fisher is expected to have no effect on performance. Empirically, however, perturbing the null space degrades performance as much as perturbing along the inverse-Fisher directions.
* So the diagonal Fisher estimate is overconfident about certain parameters being unimportant (parameters with small Fisher information are not necessarily unimportant); since $F$ is singular, the inverse Fisher here is the damped inverse $(F+\lambda I)^{-1}$ (a rough sketch follows below).
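
A rough sketch of this perturbation check, under my reading of the experiment (not the authors' code; `evaluate_score` below is a hypothetical environment-evaluation function):

```python
import numpy as np

def perturb(theta, fisher_diag, eps, lam=1e-8, mode="inv_fisher", rng=None):
    rng = rng or np.random.default_rng()
    direction = rng.standard_normal(theta.shape)   # "uniform" mode: leave as-is
    if mode == "inv_fisher":
        # larger steps where Fisher information is small ("unimportant" parameters)
        direction *= 1.0 / (fisher_diag + lam)     # damped inverse, (F + lam*I)^(-1)
    elif mode == "null_space":
        # only move parameters whose Fisher information is (numerically) zero
        direction = direction * (fisher_diag < 1e-12)
    direction /= np.linalg.norm(direction) + 1e-12 # fix the perturbation size
    return theta + eps * direction

# Sweeping eps and comparing evaluate_score(perturb(theta, F_diag, eps, mode=m))
# for m in {"uniform", "inv_fisher", "null_space"} would reproduce the
# qualitative comparison described above.
```
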