## Message to ACs

Dear Senior Area Chairs and Area Chairs,

Firstly, we would like to express our gratitude to the ACs overseeing the review process of our paper. We received valuable reviews from four reviewers and have diligently written responses to each of them. We would appreciate it if the ACs could encourage the reviewers to provide post-responses to our responses during the discussion period.

In addition, we would like to raise some concerns about the credibility of the reviews provided by **Reviewers WKs6 and KuSE**, as they seem to not have fully understood the main claim of our paper.

1. **WKs6:** The reviewer repeatedly mentions that our negative transfer problem is a problem of **stability**. However, this is not true. In our experiments showing the existence of negative transfer (Section 3), we employed *fine-tuning* for continually learning the tasks, which does **not** put any emphasis on stability whatsoever. Namely, while negative transfer also refers to the phenomenon that some tasks cannot be learned due to the task learned in the past, it does not occur due to the model's effort to maintain stability (i.e., to not forget past tasks). We tried to make this point in our rebuttal as clear as possible, and we would greatly appreciate the AC making sure the reviewer does not misunderstand our main point.
2. **KuSE:** The reviewer mentions that our experiments and analyses involve *questionable assumptions* and are *not convincing* without giving any concrete arguments. Moreover, the reviewer argues that our method is *straightforward* and that we do not provide *further explanation*, again without any concrete evidence. We did present the difference between our method and the previous work, P&C, through an extensive ablation study (Sec 5.3), and showed that the components of our method are all essential. Furthermore, the reviewer mentions that our method is a simple application of transfer learning, but our resetting mechanism exactly removes the transfer learning mechanism in order to address the negative transfer. Again, we tried to make this point in our rebuttal as clear as possible, and we would greatly appreciate the AC making sure the reviewer does not misunderstand our main point.

With the above, we hope the AC can take these points into account when making the final decision on this paper.

## Reviewer j23a (R1)

### Weakness 1: Issues on the groups in two-task experiment

Thanks for your comment. The ideal experiment that we could do is to carry out the two-task experiments for *all* possible pairs among the 24 tasks in Meta-World. However, doing such an experiment becomes extremely expensive -- e.g., on our 4 A6000 GPU server, doing such experiments with 10 different random seeds would take about **94** days (71 days for SAC and 23 days for PPO). To that end, as a proxy, we categorized the 24 tasks into 8 groups, based on the assumption that tasks with similar names would be similar to each other, and carried out two-task experiments for randomly sampled tasks from each group (see the illustrative sketch at the end of this answer). In this case, the experiments take about **10** days (7.5 days for SAC and 2.5 days for PPO). In this way, we could experiment with various two-task pairs and verify that the negative transfer indeed occurs frequently. Note that the purpose of our experiment is to show that the negative transfer phenomenon does occur quite often, rather than exactly identifying which tasks are causing the failure, while saving some computational time.

Moreover, we re-emphasize that our grouping is done in a loose sense (simply by the names of the tasks); hence, even within the same group, the tasks could cause some negative transfer, as is reflected in some diagonal elements of the figures in Figure 3/5. We hope this fact is clear to the reviewer. Furthermore, we also carried out the exhaustive two-task experiments among the 15 individual tasks (i.e., a subset of the 24 tasks), of which the results are provided in the following links. Note that the overall tendency of the two-task experiments among the individual tasks is highly similar to the results in Figure 3. We will also attach this result in the appendix of the final version.

- Figure for 2-task experiment with 16 tasks using SAC: https://icml2024submission2542.github.io/icml2024submission2542/sac_16tasks.pdf
- Figure for 2-task experiment with 16 tasks using PPO: https://icml2024submission2542.github.io/icml2024submission2542/ppo_16tasks.pdf
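To make the sampling protocol above concrete, here is a minimal, purely illustrative sketch of how the group-level ordered pairs could be enumerated with one randomly sampled task per group. The group names and member tasks below are hypothetical placeholders, not the exact 8-group/24-task partition used in Figure 3.

```python
import itertools
import random

# Hypothetical grouping of Meta-World tasks by name prefix (placeholders only;
# the actual experiment uses 8 groups covering 24 tasks, as in Figure 3).
groups = {
    "push":   ["push-v2", "push-wall-v2"],
    "sweep":  ["sweep-v2", "sweep-into-v2"],
    "door":   ["door-open-v2", "door-close-v2"],
    "handle": ["handle-press-v2", "handle-pull-v2"],
}

random.seed(0)
# Sample one representative task per group, then enumerate all ordered group
# pairs (including a group with itself, which yields the diagonal entries
# discussed above).
representatives = {g: random.choice(tasks) for g, tasks in groups.items()}
pairs = [(representatives[g1], representatives[g2])
         for g1, g2 in itertools.product(groups, repeat=2)]

for first_task, second_task in pairs:
    # Each pair is fine-tuned sequentially: train on first_task, then continue
    # training on second_task, and compare the second task's success rate
    # against training it from scratch.
    print(first_task, "->", second_task)
```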
### Weakness 2: Most of the issues seem to be with just one task group (sweep)

In the results for SAC, it may seem that the ``Sweep`` task group is the only major group that causes the negative transfer. However, as can be easily seen from the results of PPO, which is a widely used algorithm in RL, the negative transfer becomes much more prevalent across the groups than for SAC. Furthermore, even in SAC, any significant drop compared to scratch (e.g., ``Push``$\rightarrow$``Push``, ``Handle``$\rightarrow$``Push``, or ``Door``$\rightarrow$``Handle``) stands for the negative transfer.

Regarding the diagonal entries, as mentioned in the reply above, we hope the reviewer is now clear on this setting, as the grouping of the tasks is done loosely by their names. For the results in the diagonal part of Figures 3 and 5, it can be unintuitive that the negative transfer can occur between tasks in the same group. However, since we categorize the tasks solely based on the first 'name' of each task, the characteristics of the tasks can be different, e.g., 'sweep' versus 'sweep-into'. Furthermore, the main objective of the experiments in Figures 3 and 5 is to show the occurrence of the negative transfer problem in a broader way. The figures in the link above only show the results on 15 tasks, and we think that using only 15 tasks is not enough to show the phenomenon extensively.

### Weakness 3: Plotting the differences, not the success rates

Thank you for the constructive comment. We attach the figures plotting the differences. Please refer to the links. (A minimal sketch of how such a difference heatmap can be produced follows below.)

- Revised version of Figure 3, plotting the difference of the success rates between from scratch and fine-tuning: https://icml2024submission2542.github.io/icml2024submission2542/finetuning_transfer_difference.pdf
- Revised version of Figure 5, plotting the difference of the success rates between from scratch and the other baselines: https://icml2024submission2542.github.io/icml2024submission2542/transfer_4methods_difference.pdf
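For clarity, here is a minimal, hypothetical sketch of the kind of difference plot referred to above: a heatmap of (fine-tuned success rate minus from-scratch success rate) over ordered group pairs. The arrays and group labels are placeholders, not our actual results.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder success-rate matrices over 8x8 ordered group pairs
# (rows: first task's group, columns: second task's group).
rng = np.random.default_rng(0)
success_finetune = rng.uniform(0.0, 1.0, size=(8, 8))
success_scratch = rng.uniform(0.0, 1.0, size=(8, 8))

# Negative values indicate negative transfer: fine-tuning from the first
# task hurts the second task relative to training it from scratch.
difference = success_finetune - success_scratch

# Illustrative group labels only; the real labels follow Figure 3.
groups = ["Push", "Sweep", "Door", "Handle", "G5", "G6", "G7", "G8"]

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(difference, cmap="coolwarm", vmin=-1.0, vmax=1.0)
ax.set_xticks(range(8))
ax.set_xticklabels(groups, rotation=45)
ax.set_yticks(range(8))
ax.set_yticklabels(groups)
ax.set_xlabel("Second task group")
ax.set_ylabel("First task group")
fig.colorbar(im, ax=ax, label="Success rate difference (fine-tune - scratch)")
fig.tight_layout()
plt.show()
```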
### Weakness 4: Overall, the general organization can be improved

Thanks for the comment. Although we did our best to present our experimental results, we will try to improve the presentation in the final version -- any specific comments regarding the organization from the reviewer would be greatly appreciated.

### Question 1: How were the hyperparameters tuned for your experiments?

First, in all experiments, we did not search the hyperparameters for ClonEx, since just setting the regularization coefficient to 1 is enough to prevent the forgetting. For R&D, we find that setting the coefficients for BC and knowledge distillation to 1 is enough to achieve remarkable performance in the long-sequence experiment. We think that not needing to search the hyperparameters is one of the advantages of R&D over the other baselines.

For the experiments in Figure 1, all the baselines except for fine-tuning have hyperparameters. In ReDo, we set the threshold parameter $\tau=0$ for SAC, which uses the ReLU activation, and $\tau=-0.995$ for PPO, which uses the tanh activation; we did not search the threshold in any experiment. In InFeR, as the original paper shows the robustness of its hyperparameters, we set the regularization coefficient $\alpha=1$ and the scaling coefficient $\beta=10$, and we also did not search these hyperparameters. In CReLU, since it does not require hyperparameters, we directly replace the ReLU activation with CReLU. Lastly, for Wasserstein Regularization, we set the regularization coefficient to 0.1 and 0.01 for SAC and PPO, respectively, which are chosen from [0.01, 0.1].

For the experiments in Figure 4 (long-sequence experiment), we used the same hyperparameters as in the experiments of Figure 1, except for EWC and P&C. For EWC and P&C, due to the limited time for running the long task-sequence experiment, we searched the hyperparameters through experiments like those in Figure 1; we set the regularization coefficient of EWC and P&C to 1000, which is chosen from [10, 100, 1000]. For the experiments in Figure 5, we also used the same hyperparameters as in the experiments of Figure 1. We will add these details on the hyperparameters in the appendix.

### Question 2: Why not R&D the critic as well?

The rationale behind utilizing a critic in RL is to enhance the training of the actor. As for the offline (continual) learner in R&D, it learns from the online learner and an expert buffer through knowledge distillation rather than the standard RL algorithm, so a critic is unnecessary. Consequently, there is no necessity to apply R&D to the critic.

### Response to the comment

> The critic is still used to train the online actor right?

Yes. The critic is still used for training the online actor.

> Is the critic reset between each task?

Yes. At the task transition phase, we reset both the actor and the critic of the online learner. As mentioned in our rebuttal above (Part IV), we do not distill the critic to the offline learner. (A minimal sketch of this task-transition step is given right after this response block.)

> I think the paper needs more organization (see comments about changes to figures), more analysis, and more domains

We agree that additional analysis and experiments can strengthen our claim. For the analysis, exploring the root cause of the negative transfer phenomenon would definitely be interesting -- we focused on identifying and experimentally addressing negative transfer in the current manuscript, but we will pursue this direction in future work. For the experiments on more domains, we also wanted to pursue this, but due to the prohibitive computational requirements for environments with image modality (e.g., Atari or DMLab), we could not finish the experiments in time. We do conjecture, however, that the negative transfer phenomenon would still arise in those environments as well, and we also leave these experiments as future work.
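To complement the responses above, the following is a minimal, hypothetical sketch of the task-transition step described here: the offline (continual) actor receives the finished task's knowledge by distillation from the online actor, after which the online actor and critic are fully re-initialized. All class, function, and buffer names are placeholders (not the released implementation), the loss shown is a simple stand-in (the actual method uses KL-based distillation and behavioral cloning terms), and the optimizer is assumed to be over the offline actor's parameters.

```python
import torch
import torch.nn as nn


def reinitialize(module: nn.Module) -> None:
    """Fully re-initialize a network's parameters (placeholder scheme)."""
    for layer in module.modules():
        if isinstance(layer, nn.Linear):
            nn.init.orthogonal_(layer.weight)
            nn.init.zeros_(layer.bias)


def on_task_transition(online_actor, online_critic, offline_actor,
                       expert_buffer, distill_steps, optimizer):
    # 1) Distill: the offline actor absorbs the finished task from the
    #    online actor (and, conceptually, from an expert/BC buffer).
    for _ in range(distill_steps):
        states = expert_buffer.sample_states()      # placeholder API
        with torch.no_grad():
            teacher_out = online_actor(states)
        student_out = offline_actor(states)
        loss = nn.functional.mse_loss(student_out, teacher_out)  # stand-in loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # 2) Reset: wipe the online learner completely so that no knowledge from
    #    the previous task can negatively transfer to the next one.
    #    Note that the critic is reset but never distilled.
    reinitialize(online_actor)
    reinitialize(online_critic)
```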
## Reviewer WKs6 (R2)

### Weakness 1: The premise and some of the arguments in the paper are confusing

We would like to clarify the setting that we are considering. Note that our experiments in Section 3 are for the **fine-tuning** method, which is considered to focus solely on plasticity and puts **_no_** emphasis on stability whatsoever (as mentioned in lines 132~135, right column). Our experiments show that, even with such plasticity-focused fine-tuning, a newly arrived task cannot be learned depending on which task comes before it. Note that this phenomenon is **_not_** due to **stability**, which refers to not forgetting the previous tasks, since fine-tuning does not have any mechanism for maintaining stability. Namely, in our experiment in Section 3, after learning the second task (``push wall``), the success rates for both ``push wall`` and ``sweep into`` become poor -- the former due to plasticity loss or negative transfer, and the latter due to catastrophic forgetting (i.e., lack of stability). With this, Section 3.1 discusses that the performance degradation of the second task (``push wall``) cannot be solely explained by the previous literature on plasticity loss; hence, the negative transfer needs to be explicitly taken care of. We hope the setting and our argument become clearer to the reviewer now.

### Weakness 2: The stability is not discussed enough

Since the negative transfer depends on the previous task, it may seem related to stability. However, as mentioned above, the negative transfer phenomenon happens even when there is no emphasis on stability in the learning algorithm. Thus, we believe it is a separate problem, regardless of stability. Moreover, the reviewer mentions that catastrophic forgetting refers to the absence of positive transfer, but this is not exactly true. Catastrophic forgetting refers to completely forgetting the previous task, and positive transfer could occur in that case as well.

### Weakness 3: The baselines don't offer much insight into the experiments

As mentioned above, since fine-tuning does not place any emphasis on stability, we believe the major baselines are the schemes that aim to address the plasticity loss of the fine-tuning method. But, as we observe, they turned out to not sufficiently address the negative transfer problem.

### Weakness 4: The proposed method requires task boundary information

Thanks for the comment. Though we solely focused on settings where the task boundary information is always given, we can also consider settings where the task boundary information is not given or the task boundaries are vague. In those situations, if we can detect the task transition phase using an external task identifier, such as DPM [1], or by detecting plateaus in the loss surface [2], we can apply our method to those settings as well (an illustrative detection heuristic is sketched below). For further extension to more general CL scenarios, we will consider this problem in future work.
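Purely for illustration, here is one possible boundary-detection heuristic: flag a task transition when the short-term average training loss jumps well above a long-term baseline (a new task typically makes the current policy's loss spike). This is an assumption-laden sketch, not the detector of DPM [1] or the plateau criterion of [2]; window sizes and the threshold are arbitrary placeholders.

```python
from collections import deque


def make_boundary_detector(window: int = 100, rel_jump: float = 0.5):
    """Return a callable that is fed one training-loss value per step and
    returns True when a suspected task transition is detected."""
    long_term = deque(maxlen=10 * window)   # slow baseline of recent losses
    short_term = deque(maxlen=window)       # fast estimate of the current loss

    def update(loss_value: float) -> bool:
        long_term.append(loss_value)
        short_term.append(loss_value)
        if len(long_term) < long_term.maxlen:
            return False  # not enough history to form a baseline yet
        baseline = sum(long_term) / len(long_term)
        recent = sum(short_term) / len(short_term)
        return recent > (1.0 + rel_jump) * baseline

    return update


# Usage: call once per training step with the current loss; when it returns
# True, trigger the reset-and-distill transition step described earlier.
detect_boundary = make_boundary_detector()
```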
### Weakness 5: The proposed approach is similar to [3] and [4]

We disagree that our approach is similar to [3]. First, the role of the centroid policy in [3] is to transfer the shared knowledge to task-specific policies. However, the offline actor in R&D does not try to transfer knowledge to the online actor. Second, as we mentioned in the Experiment section, the distillation technique that we use is also used in P&C, but we rather re-initialize the actor to prevent the transfer. Therefore, the distillation to the centroid policy in [3] is rather similar to P&C, not to our method.

In the case of [4], based on the complementary learning systems perspective, the proposed method resets the transient value function periodically to secure the plasticity of the network, and it is supported by the permanent value function, whose goal is to accumulate the baseline knowledge. Therefore, the main motivation behind using the transient and permanent value functions lies in resolving the stability-plasticity dilemma. However, in R&D, the rationale behind applying the reset is quite different: the main objective of resetting the online actor is to prevent the negative transfer, which highly depends on the previously learned knowledge, not on the stability of the learner.

What we want to emphasize is that it is not straightforward to reset the network in a CL scenario. To achieve remarkable performance in CL, it is important to preserve the knowledge of previous tasks; however, resetting the learner rather promotes forgetting, which is counterintuitive. In [4], thanks to the permanent value function, resetting the transient value function does not harm the stability of the permanent value function. In our case, as we discussed in the ablation study, because we should reset all components to prevent the negative transfer, it is much harder to reset the learner while maintaining the knowledge of previous tasks.

In terms of using the method in [4] as a baseline, we think it is hard to apply that method to the Meta-World environment, which consists of multiple continuous control tasks. The authors of [4] only carry out experiments using Q-learning variants with a discrete action space, and extending their method to actor-critic RL algorithms with a continuous action space is outside the scope of our work.

### Weakness 6: Limitations of the proposed approach are not discussed, and only Meta-World experiments are performed

Thank you for the comments on the limitations of our work. We already briefly mentioned our limitations in the Conclusion, but we will clarify them in a separate section.

### Question 1-1: Clarify the comment "effectively transfer the learned knowledge to a new task"

First, we think that the comment "Plasticity is whether the agent can learn a new task" is the same as the comment "effectively transfer the learned knowledge to a new task". The definition of stability is the ability to preserve the knowledge of previous tasks. Therefore, transferring the learned knowledge is much more related to plasticity, not to stability.

### Question 1-2: Isn't changing policy also a reason for shifts in data distribution and hence a loss in plasticity?

Changing the policy may cause plasticity loss in the CL scenario; the updated policy can induce non-stationarity. If plasticity loss occurs, the learning agent cannot learn succeeding tasks due to the depleted capacity. However, in our results in Figure 1, though the agent completely failed to learn the second task, it can learn the third task, which means the capacity was not depleted. Moreover, the indicators of plasticity loss are not consistent with the arguments from their original papers. Therefore, in our case, the contribution of the negative transfer problem to the performance degradation is more impactful than that of the plasticity loss.

### Question 1-3: The comment "negative transfer is just another version of plasticity loss" needs more explanation

In line 140, we said, "`One may argue that the negative transfer occurring in CRL is just another version of plasticity loss`". In this comment, we merely mention the similarity between the two problems, and after this comment, we clarify that the methods for resolving the plasticity loss cannot resolve the negative transfer problem, which means those two problems are actually different. Therefore, we disagree that we strongly argue that "negative transfer is just another version of plasticity loss". If it is confusing, we will revise this comment to make it clearer.

### Question 2: How do behavioral cloning and R&D work if the state-space is different?

If the state-space is different, we can consider two kinds of variations. One is using multiple input heads, one for each task; by leveraging the task id to choose the proper input head, we can selectively train and evaluate each task. The other variation is using a large state input and zero-padding the remaining part if the dimension of the state vector is smaller than the size of the input (a small sketch of this variant is given below). As a result, R&D can also be applied to situations where the state-space is different.
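A minimal sketch of the second (zero-padding) variant mentioned above, written as a hypothetical Gym-style observation wrapper. The class and attribute names are illustrative assumptions (the older 4-tuple `step` API is assumed), not part of our released code.

```python
import numpy as np


class ZeroPadObservation:
    """Pad each task's state vector with zeros up to a shared maximum
    dimension so that one actor network can consume all tasks."""

    def __init__(self, env, max_state_dim: int):
        self.env = env
        self.max_state_dim = max_state_dim

    def _pad(self, state) -> np.ndarray:
        state = np.asarray(state, dtype=np.float32)
        padded = np.zeros(self.max_state_dim, dtype=np.float32)
        padded[: state.shape[0]] = state
        return padded

    def reset(self):
        return self._pad(self.env.reset())

    def step(self, action):
        state, reward, done, info = self.env.step(action)
        return self._pad(state), reward, done, info
```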
### Question 3: How is the offline actor, $\theta_{offline}$, used?

The offline actor $\theta_{offline}$ has two roles. One is obtaining the knowledge of the new task distilled from the online actor $\theta_{online}$. The other is performing the evaluation on the tasks. All of the success rates of R&D in Figure 4 are from the offline actor $\theta_{offline}$.

### Question 4: Can the paper provide more clarity into the definitions of negative transfer and forgetting?

Negative transfer is the performance degradation phenomenon on target tasks when the source and the target tasks are dissimilar. In the CRL scenario, the source task corresponds to the previous tasks, and the target task corresponds to the new tasks. Therefore, in our work, we consider the inability to learn new tasks in the CRL scenario. In contrast, forgetting is the performance degradation on previous tasks as learning progresses. The main difference between negative transfer and forgetting is whether the focus is on the *new* tasks or the *previous* tasks.

For the forgetting measure, we refer to Equation (3) in [5] (restated below in our notation). We think it is proper to measure the forgetting based on the best performance that the learning agent achieves. If we instead used an expectation, then when the performance gradually decreases as learning progresses, the forgetting measure would also decrease; in that case, we could not figure out how much the performance has dropped compared to its peak.
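For concreteness, here is a hedged restatement of the max-based forgetting measure we follow from [5], written in our own notation (which may differ from the exact symbols of Eq. (3) in [5]); $a_{l,j}$ denotes the success rate on task $j$ measured after finishing training on task $l$:

$$
f_j^{k} = \max_{l \in \{1, \dots, k-1\}} a_{l,j} - a_{k,j},
$$

i.e., the gap between the best success rate ever achieved on task $j$ and its success rate after learning task $k$.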
### Question 5: Lines 346-347, column 1, are incomplete

Thank you for pointing out this critical mistake in our manuscript. The original sentence was "`However, when the negative transfer occurs often ('Hard' and 'Random'), R\&D outperforms all the baselines.`", but we accidentally commented out the latter part. We will revise this part in the camera-ready version.

[1] Lee et al., A Neural Dirichlet Process Mixture Model for Task-Free Continual Learning, ICLR, 2020
[2] Aljundi et al., Task-Free Continual Learning, CVPR, 2019
[3] Teh et al., Distral: Robust Multitask Reinforcement Learning, NeurIPS, 2017
[4] Anand and Precup, Prediction and Control in Continual Reinforcement Learning, NeurIPS, 2024
[5] Chaudhry et al., Riemannian Walk for Incremental Learning: Understanding Forgetting and Intransigence, ECCV, 2018

## After Rebuttal

### Weakness

We gratefully thank you for your constructive comment. First, we want to clarify the fine-tuning in our experiments. In all experiments, we retrain all the networks, not only part of them (e.g., retraining only the last layer or some other layers). Usually, in the CRL community, the fine-tuning method corresponds to retraining all layers of the network, and it is well known that fine-tuning the whole network without injecting any stability promotes the plasticity of the network. Therefore, in our case, rather than considering the stability of the network, we think it is more sensible to compare the negative transfer with the plasticity loss.

Second, the method in [3] can only be applied to the multi-task learning setting where the tasks do not arrive sequentially. Since distilling the knowledge to the centroid policy in [3] is highly similar to the distillation procedure in which the active column distills the knowledge to the knowledge base, we think that a continual version of [3] would be highly similar to P&C, and we have already shown that P&C cannot resolve the negative transfer in our setting.

If there are more concerns, feel free to contact us!

Thanks,
Authors

### Response to the comment

> Additional experiment results on `sweep-into --> push-wall --> sweep-into`

As the reviewer has suggested, we have carried out the additional experiment of learning the task sequence `sweep-into --> push-wall --> sweep-into`, of which the figures are given in the following link.

- Figures of results of fine-tuning PPO on the `push-wall --> sweep-into` task sequence (top) and the `sweep-into --> push-wall --> sweep-into` task sequence (bottom): https://icml2024submission2542.github.io/icml2024submission2542/sweep_push_sweep.pdf

Firstly, we sequentially fine-tuned (i.e., fully re-trained the entire parameters, as mentioned in our rebuttal) PPO on the `push-wall --> sweep-into` task sequence (top figure). We can clearly observe that the negative transfer does **not** occur when we learn `sweep-into` after learning `push-wall`. Now, following the reviewer's suggestion, we sequentially fine-tuned PPO on the `sweep-into --> push-wall --> sweep-into` task sequence (bottom figure). As the reviewer has mentioned, if stability also played a critical role in this setting, the knowledge of the first `sweep-into` could be transferred to the second `sweep-into`, and the learning of the second `sweep-into` could happen much more quickly. However, as we can observe from the figure, the performance on the second `sweep-into` rather decreases compared to learning the first `sweep-into`, or the `sweep-into` in the top figure. This phenomenon is similar to the plasticity loss in [Abbas et al.], which occurs when we revisit the same task after learning other tasks. We believe this result further corroborates that stability is clearly **not the main factor** in our results with full fine-tuning.

[Abbas et al.] Loss of Plasticity in Continual Deep Reinforcement Learning, CoLLAs, 2023

> Clarify the difference between P&C and Distral

Thanks for the comment. Since those two approaches are similar in terms of using knowledge distillation, we will make sure to clarify the difference between them in Section 4.2 of the final version of our paper.
## Reviewer McJU (R3)

### Weakness 1: Results on domains other than Meta-World

Thank you for the constructive advice. We also believe that conducting experiments on other domains would further strengthen our contribution. However, given the computation and time constraints, it was very challenging for us to experiment on other environments, particularly those in the visual domain (e.g., Atari or DMLab), under the continual learning setting where multiple tasks need to be learned sequentially. We instead chose to carry out extensive experiments on a wide variety of tasks in the Meta-World environment to convincingly validate our claim. As future work, we will carry out further experiments on other environments as well.

### Question 1: Negative transfer between the same task group

We agree that our results may seem counterintuitive. However, as also mentioned in the rebuttals for [**Weaknesses 1/2 for Reviewer j23a**], our grouping of the tasks was done loosely -- simply by grouping the tasks that share the same first part of the name -- hence the negative transfer can also happen within a group. The main motivation for such grouping is to save computational time for the experiments. Nevertheless, the overall conclusion remains consistent with the exhaustive experiment on the 15 individual tasks. (Please also see the **comments for Reviewer j23a**.)

### Question 2: Negative transfer in image-based tasks (e.g., experiments on Meta-World tasks using images)

Again, due to the computation and time constraints, we could not carry out the visual-domain experiments. In addition, for Meta-World, we can convert the observation from a vector-valued representation to an image representation; unfortunately, doing this induces severe computational overhead for our research. However, we strongly believe that similar issues will arise in image-based tasks as well. This conjecture stems from our belief that the knowledge accumulated in the network through learning previous tasks may negatively impact learning new tasks.

### Question 3: Success rate measured in Figure 4

Thanks for the question, and we apologize for missing some details on the metric. The average success rate at any step reported in Figure 4 stands for the average of the success rates measured over all 8 tasks, including those that have not been learned up to that step. Therefore, in the most ideal scenario, namely, when each new task is learned perfectly and no forgetting occurs, the average success rate would be 1.0 * (number of tasks trained so far / total number of tasks); see the formula below. We apologize for the confusion, and we will add a concrete description of our metric in the final version.
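Written out explicitly, using our own notation for illustration (where $s_j(t)$ is the success rate of task $j$ measured at step $t$ and $T = 8$ is the total number of tasks), the reported metric and its ideal value after $k$ tasks have been learned with no forgetting are:

$$
\bar{S}(t) = \frac{1}{T} \sum_{j=1}^{T} s_j(t),
\qquad
\bar{S}_{\text{ideal}}(t) = \frac{k}{T}.
$$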
### Question 4: Why are both forward and reverse KL used in R&D?

As mentioned in the manuscript, the difference between the two KL divergence terms is due to the fact that they originate from different concepts. We also considered unifying the direction of the two KL terms, but decided to keep the original directions to clarify the distinct origin of each term. Apart from the origins, following the reviewer's suggestion, we additionally carried out experiments varying the direction of the KL in BC and knowledge distillation. In this experiment, let us denote the target distribution as $q$ and the learning distribution as $p_{\theta}$, in which $\theta$ denotes the parameters of the network. Then, the forward and reverse KL divergences are as follows (a small numerical sketch of the two directions is given after this answer):

- Forward KL: $KL(q||p_{\theta})$
- Reverse KL: $KL(p_{\theta}||q)$

The figure below shows the results for the various KL divergence losses for BC and knowledge distillation.

- Figures of the experiment on various KL divergence losses. The success rates are averaged over 2 different random seeds: https://icml2024submission2542.github.io/icml2024submission2542/longseq_kl.pdf

In this figure, we can see that there is no remarkable difference between the 4 variants. Therefore, we think that keeping the original version of our loss does not cause any problem in our algorithm.
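To make the two directions concrete, here is a small, self-contained sketch (not our training code) that evaluates both KL directions between a fixed target Gaussian policy $q$ and a learnable Gaussian $p_\theta$ using `torch.distributions`; note that `kl_divergence(a, b)` computes $KL(a||b)$.

```python
import torch
from torch.distributions import Normal, kl_divergence

# A fixed "target" policy q (e.g., a teacher actor's output for a state)
# and a learnable policy p_theta (e.g., the student/offline actor's output).
q = Normal(loc=torch.tensor([0.5]), scale=torch.tensor([0.2]))
mu = torch.tensor([0.0], requires_grad=True)
log_std = torch.tensor([0.0], requires_grad=True)
p_theta = Normal(loc=mu, scale=log_std.exp())

# Forward KL: KL(q || p_theta) -- penalizes p_theta for missing mass
# where q has mass ("mean-seeking").
forward_kl = kl_divergence(q, p_theta)

# Reverse KL: KL(p_theta || q) -- penalizes p_theta for placing mass
# where q has little ("mode-seeking").
reverse_kl = kl_divergence(p_theta, q)

print(forward_kl.item(), reverse_kl.item())
```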
## Reviewer KuSE (R4)

### Weakness 1: Analysis of negative transfer relies solely on the results of 2-task experiments / The introduction of new hypotheses lacks in-depth discussion

Thanks for the comment. Our initial aim was to verify whether this phenomenon was due to plasticity loss, as previous research has linked the declining learning ability of agents during training to plasticity loss and sought to remedy it. To this end, we conducted the experiments in Section 3.1. However, no significant correlations were found between the indicators identified by previous research as causes of plasticity loss and the observed phenomenon. The results in Figure 1 also indicate that there are instances where the agent successfully learns the third task despite encountering difficulty with the second task. This underscores that the observed phenomenon cannot be solely attributed to limitations in the network's capacity. Therefore, we conjectured that the cause of this phenomenon is not plasticity loss but negative transfer arising from the disparity between the source task and the target task, from a transfer learning perspective.

Indeed, we provide the two-task fine-tuning experiments in Section 3. It could be thought of as the simplest setting for showing the negative transfer, but that simplicity is exactly the point. Namely, even in this simple set-up, we can demonstrate that (1) the negative transfer is not limited to specific cases but occurs quite frequently, and (2) the pattern of negative transfer varies significantly depending on the sequence of tasks. Such findings do generalize to our *long-task experiments* in Section 5; namely, previous solutions for continual RL as well as for plasticity loss cannot fully address the negative transfer issue when continually learning a long sequence of tasks.

In such situations, the simplest and most reliable way to train a new task is to completely remove the knowledge learned from previous tasks. Therefore, we opted to address negative transfer by resetting all parameters of the network to remove that knowledge. Furthermore, in Section 5.4, we once again demonstrated the importance of completely eliminating prior knowledge when learning new tasks by comparing with P&C, a method similar to ours. We hope our comments give further clarification on our analyses and results. Moreover, it would be greatly helpful for us in improving our manuscript if the reviewer could specify which of our assumptions were questionable.

### Weakness 2: R&D is a direct application of transfer learning in CRL

While our method may look simple, we respectfully disagree that R&D is a direct application of transfer learning. In fact, in order to prevent the negative transfer, we re-initialize the weights of the online actor and critic; hence, **no** transfer learning is happening. This may seem to be a limitation of our work; however, as we show in our comparison with P&C (Figure 6, Sec 5.4), when any kind of transfer learning mechanism is in play (e.g., re-using the weights from a previous task or using the adaptor), the negative transfer problem cannot be successfully addressed. Thus, we argue that our contribution is to show that the reset and distillation steps are essential for preventing negative transfer in continual RL. We hope the reviewer can reassess the contribution of our paper after our rebuttal.

### Weakness 3: Reset has been extensively discussed in prior RL papers

As the reviewer has mentioned, reset mechanisms have indeed been discussed in several prior RL works. However, as we have argued throughout our paper, it turns out that only partially resetting the network, as in P&C, ReDo, and DrM [1], cannot fully address the negative transfer issue in the continual RL problem. In fact, completely resetting the weights and then distilling the knowledge to the continual learner, as in our R&D, has not been considered before -- and such steps are indeed essential for addressing the negative transfer, as shown in the ablation study of Section 5.4. We believe this finding is certainly novel. Again, we hope the reviewer can reassess the novelty of our work after our rebuttal, and we are eager to hear your opinion.
### Question 1: BC also performs well, why?

Note that the hardness of the sequences was determined by the existence of negative transfer among the tasks. Hence, when the sequence is easy, the main obstacle to achieving a high average success rate is catastrophic forgetting, which can be handled by ClonEx (BC). Thus, it is not very surprising that ClonEx (BC) also performs well. However, it clearly fails on the Hard sequence, as can be seen in Figure 4.

### Question 2: How would increasing the number of tasks affect R&D?

We think this is a nice variation of our setting that is worth experimenting with. While we have not experimented with it ourselves, studies such as [2] have demonstrated the effectiveness of knowledge distillation in multi-task reinforcement learning. Therefore, we also expect R&D to work well in multi-task situations. We defer this study to future work.

### Question 3: Lacks a discussion on some relevant published works

Thanks for suggesting the related work [1]. At the submission of our manuscript, we were unable to follow up on this work since it was under review for ICLR 2024 and has only recently been accepted. Reviewing the paper, we believe [1] is an application of ReDo (Sokar et al., 2023) that exploits the effectiveness of re-initializing the dormant neurons of the network to promote the exploration ability of an agent. Since the method in [1] is highly similar to ReDo, we expect that it will also suffer from the negative transfer problem like ReDo, as shown in Figure 1. Following the reviewer's suggestion, we will add [1] to our related work in Section 2.

### Question 4: The authors need to discuss more about the novelty and motivation behind the R&D

As we have argued in the above rebuttals to the Weakness comments, we summarize the main contributions of our work one more time as follows:

- We revealed instances in which tasks unexpectedly fail to be learned in CRL, attributing this phenomenon to negative transfer rather than plasticity loss.
- Through comprehensive experiments, we illustrated the frequent occurrence and diverse patterns of negative transfer depending on the task sequence.
- We introduced Reset & Distill (R&D), a simple yet effective algorithm designed to mitigate negative transfer.

The main motivation of our work is to resolve the performance degradation problem when learning novel tasks in the CRL setting. We hypothesize that the main cause of this phenomenon can be one of two problems: one is the plasticity loss, which can arise because of the non-stationarity of the target function, and the other is the negative transfer, which can arise when the source task (previously learned task) and the target task (newly arrived task) are dissimilar. In our experiments in Figure 1, we point out that the performance degradation on the second task (`push-wall`) cannot be explained using the indicators proposed in previous works. Furthermore, since both algorithms (SAC and PPO) are able to learn the third task (`window-close`) smoothly, we think that this phenomenon also cannot be explained by plasticity loss, in which the network capacity is depleted.

[1] Xu et al., DrM: Mastering Visual Reinforcement Learning through Dormant Ratio Minimization, ICLR, 2024
[2] Teh et al., Distral: Robust Multitask Reinforcement Learning, NeurIPS, 2017
