## Global comments for all reviewers
We deeply apologize for the delayed response to your reviews, which regrettably occurred due to personal reasons. Your understanding in this matter would be greatly appreciated.
### Regarding Remark 1
> **Remark 1.** The rationale for our method is that since the reward signals are not strong enough to always adapt the pre-trained actor and critic networks to the current task …
- We are not suggesting that there is something wrong with the reward function itself. In a traditional supervised setting, the effects of negative transfer are not as prominent as in RL. The main difference between the two is that a clear ground-truth target is used in the supervised setting, whereas the target in RL is built from the reward signal. Our hypothesis is therefore that the reward signal in reinforcement learning is not strong enough to rectify an inadequate initialization.
- Although this is what we intended to convey, we agree that the statement above is misleading and will revise it.
### Regarding the interpretation of Table 1
There were concerns raised about the absence of a significant difference between R&D and ClonEx in Table 1. We, however, disagree that the results of R&D are merely on par with the other baselines. Intuitively, the negative transfer measure we defined can be thought of as the approximate fraction of the 8 tasks in which negative transfer occurred. While this is less pronounced in the Easy sequence, in the Hard sequence even the best-performing baseline has a value above 0.5, meaning that more than half of the tasks in the sequence are effectively not learned at all. Our method, on the other hand, has values close to zero regardless of the sequence, and even some negative values (although we believe this is due to chance). As can be seen from Figure 3, our method learns all tasks in the sequence well.
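To illustrate this reading of the measure (only a rough sketch of the intuition above, not the formal definition used in the manuscript; the function name and inputs below are hypothetical):

```python
import numpy as np

def approx_negative_transfer(cl_success, scratch_success):
    """Rough intuition: average per-task drop in success rate relative to training from scratch.

    cl_success[i]     : final success rate of task i when trained inside the sequence
    scratch_success[i]: final success rate of task i when trained from scratch
    When from-scratch training nearly always succeeds, this is close to the fraction of
    tasks that fail inside the sequence; it can be slightly negative if the sequence helps.
    """
    cl = np.asarray(cl_success, dtype=float)
    scratch = np.asarray(scratch_success, dtype=float)
    return float(np.mean(scratch - cl))
```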
### Regarding the difference between the negative transfer and capacity/plasticity loss
- Regarding the results in Figure 1 of our manuscript, we clarify how negative transfer differs from capacity/plasticity loss. According to [1], when the value (or Q-function) network is fine-tuned, it gradually loses its capacity to fit new targets and consequently can no longer fit the reward signals coming from the environment. In Figure 1, the success rate on the second task is 0 throughout training, so under the capacity-loss interpretation, the remaining capacity of the value network should have been depleted and the subsequent tasks should also be hard to learn. Contrary to this expectation, however, the agent does learn the third task. We think this phenomenon cannot be explained by capacity loss: if the capacity were not depleted, the agent should be able to learn the new task, and if it were depleted, it should also fail on later tasks, yet our results fall into neither case. The same applies to the primacy bias [2], which is closely related to capacity loss: if the agent were strongly biased toward the data collected from the first task, it should fail on all subsequent tasks. Since the third task can still be learned with SAC, this phenomenon cannot be explained by the primacy bias either.
- Based on the detailed explanation above, we stress that neither capacity loss nor the primacy bias can explain our findings, and we therefore consider negative transfer to be distinct from these previously studied problems.
### Regarding the ***critical*** typo in the caption of Figure 2
We sincerely apologize for the critical typo. In the caption of Figure 2, we used 10 random seeds for this experiment, not 3. We have revised this part in the current version.
[1] Lyle et al., “Understanding and Preventing Capacity Loss in Reinforcement Learning”, ICLR, 2022.
[2] Nikishin et al., “The Primacy Bias in Deep Reinforcement Learning”, ICML, 2022.
## Comments for Reviewer 274b
### Question 1: Convergence of Q learning
- First, there is a difference between initialization in the tabular setting and in deep RL. In the tabular setting, the outputs of the Q-function are initialized directly, whereas in deep RL, initializing the Q-network means initializing its parameters.
- This difference is even more pronounced for Q-function updates. In the tabular setting, the output of the Q-function is overwritten directly according to the Bellman equation, whereas in deep RL we define a loss function and update the Q-network by gradient descent (see the sketch after this list).
- As far as we know, convergence in deep RL is not as well understood as in tabular settings. Therefore, a direct comparison between tabular RL and deep RL is somewhat inappropriate. For this reason, we did not address the theoretical convergence in deep RL. However, we have shown through extensive experiments that learning in deep RL can be affected by initialization, and it is not simply due to the loss of plasticity or capacity of the network.
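To make the contrast concrete, here is a minimal, generic sketch (not taken from our codebase; `tabular_update` and `deep_td_step` are illustrative names) of the two kinds of updates:

```python
import numpy as np
import torch
import torch.nn as nn

# Tabular Q-learning: the table entry itself is overwritten, so initialization can be undone directly.
def tabular_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Q is a 2D array of shape [num_states, num_actions]
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# Deep Q update: a loss is defined and only the *parameters* move by gradient descent,
# so the effect of initialization persists through the whole optimization trajectory.
def deep_td_step(q_net, target_net, optimizer, batch, gamma=0.99):
    s, a, r, s_next, done = batch
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```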
### Question 2: Question about Remark 1
- Please refer to the global comment for a response to Remark 1.
- We also think that training time is an important point. In many works, such as [1], the number of frames used per task in Meta-World is set to 1M. To rule out the possibility of simply running out of frames as much as possible, we set the number of frames per task to 3M and selected only the tasks that can be learned well from scratch within 3M steps over 10 random seeds. Given that these tasks are effectively trained within 3M steps using well-known RL algorithms such as SAC or PPO, it is difficult for us to regard the frame budget as the issue.
- Regarding acceleration techniques, we initially suspected that the policy trained on the previous task explores poorly, so we tried adding an auxiliary loss that encourages exploration to some extent. However, this did not resolve negative transfer.
- Also, ClonEx [1], which we adopted as a baseline, not only utilizes behavioral cloning to prevent forgetting but also applies several exploration techniques to enhance learning efficiency. As evident from our experimental results, however, these methods did not effectively address negative transfer.
- Thus, in our opinion, currently known acceleration techniques are not designed with negative transfer in mind, so it is unlikely that they can fundamentally solve it.
### Question 3: The proposed algorithm is too straightforward and is highly based on behavioral cloning
The main contribution of our method is to reset the network before learning each task to avoid negative transfer. We then apply knowledge distillation to transfer the acquired knowledge to the continual learner. This may look similar to behavioral cloning, but distillation is only used as a means to an end; what we want to emphasize is that even this simple method can effectively alleviate negative transfer. Also, although our method involves two stages per task, this is not a significant overhead, because distilling the knowledge of the online actor (the learner trained from scratch) into the offline actor (the continual learner) takes very little time compared to training the online actor.
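For concreteness, here is a minimal structural sketch of the two-stage loop described above (the callables such as `train_from_scratch` and `distill` are placeholders to be supplied, not part of our released code):

```python
from typing import Callable, List
import torch

def reset_and_distill(
    tasks: List[object],
    offline_actor: torch.nn.Module,
    make_actor: Callable[[], torch.nn.Module],                         # returns a freshly initialized online actor
    train_from_scratch: Callable[[torch.nn.Module, object], object],   # RL training (e.g., SAC/PPO); returns a buffer
    distill: Callable[[torch.nn.Module, object, List[object]], None],  # KD on current task + BC on previous experts
    build_expert_buffer: Callable[[torch.nn.Module, object], object],  # (state, action) pairs from the trained actor
) -> torch.nn.Module:
    """Two stages per task: (1) reset and learn online from scratch, (2) distill into the offline actor."""
    expert_buffers: List[object] = []
    for task in tasks:
        online_actor = make_actor()                        # reset: random re-initialization avoids negative transfer
        replay_buffer = train_from_scratch(online_actor, task)
        distill(offline_actor, replay_buffer, expert_buffers)
        expert_buffers.append(build_expert_buffer(online_actor, task))
    return offline_actor
```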
### Question 4: What is the difference between the catastrophic negative transfer and plasticity?
We have presented a detailed analysis of the distinction between negative transfer and capacity/plasticity loss in the global comments; please refer to it.
### Question 5: The empirical study on this environment may not generalize to broader environment
We could have drawn even stronger conclusions had we conducted experiments on additional environments. However, Meta-World itself provides a substantial number of tasks, and notably, all tasks share the same state and action spaces (although the distribution of states may vary). We chose to conduct experiments in Meta-World because we believe these characteristics most clearly highlight the issue of negative transfer that we aim to address in continual RL.
### Question 6: Why is CReLU not useful in Figure 4?
Through our experiments, we confirmed that learning a single task becomes challenging when applying CReLU. You can observe this from the results labeled 'From Scratch'. We believe that the observed performance degradation is a result of the difficulty in learning the single tasks. While the exact aspects causing these differences have not been identified, exploring them falls outside the scope of our investigation.
### Question 7: It seems R&D is competitive with ClonEx without significant improvement
The results in Table 1 are the negative transfer and forgetting measures computed from the results shown in Figures 3 and 5, so a total of 10 random seeds were employed. Additional explanations regarding the interpretation of the results are provided in the global comments; please refer to them for further clarification.
### Question 8: The considered baselines are limited and behavioral cloning is not superior to other state-of-the-art continual RL baselines
- Existing continual RL works primarily focus on addressing catastrophic forgetting. To the best of our knowledge, ClonEx [1] is currently one of the best-performing continual RL algorithms, and it effectively addresses the forgetting problem, as shown in our experimental results. Forgetting is minimal when ClonEx is applied, so we did not consider additional forgetting baselines necessary.
- Furthermore, ClonEx considers not only behavioral cloning but also incorporates several exploration techniques to account for forward transfer. However, we deemed this alone insufficient as a baseline for addressing negative transfer. Therefore, we augmented ClonEx with solutions proposed in InFeR and CReLU, addressing capacity/plasticity loss, as additional baselines. While InFeR and CReLU contribute to improving ClonEx's performance to some extent, as observed in Figure 5 and Table 1, they do not provide a fundamental solution.
- As mentioned earlier, the most significant contribution of our method is the reset of parameters before learning each task to avoid negative transfer. Our method does not simply apply behavioral cloning, and it can effectively address both forgetting and negative transfer.
[1] Wołczyk et al., "Disentangling Transfer in Continual Reinforcement Learning", NeurIPS, 2022.
## Comments for Reviewer u4tG
### Weakness 1: The innovation is not enough and other works show that resetting works for continual learning
We appreciate the comment and would like to respond; however, it is difficult to do so without more specific details about which works are meant. As far as we know, ClonEx introduced behavioral cloning (via a KL divergence) to prevent forgetting, but it does not address negative transfer. If there are other works combining parameter resetting with knowledge distillation that we missed, we would greatly appreciate it if you could let us know.
### Weakness 2: Figure 2 and 4 should be compared precisely
We believe you had this question based on the following statement in Section 5.2 of our manuscript.
> The findings reveal that fine-tuning with CReLU and InFeR yields similar success rates when compared to the results presented in Figure 2.
What we wanted to emphasize with Figure 4 is that the solutions designed to address traditional capacity/plasticity loss, such as InFeR and CReLU, do not address negative transfer. For this purpose, we believed it was more important to compare the results with learning a single task from scratch rather than with finetuning (Figure 2). We have recognized that the wording in the manuscript could be potentially misleading, so we will fix it.
### Weakness 3: There is no evidence or proof for Remark 1
For the response to Remark 1, please refer to the global comments.
### Weakness 4: How is the offline actor trained?
First, a randomly initialized network (the online actor) is used to learn each task with a standard RL algorithm (SAC or PPO in our case). While training the online actor, a replay buffer is stored; if an on-policy algorithm such as PPO is used, the collected data must be saved explicitly after training. The states and actions in this stored buffer contain the knowledge of the online actor, and the offline actor is trained by distilling this knowledge (see the sketch below).
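As a rough illustration (a minimal sketch assuming deterministic action targets stored in the buffer; the exact loss in our implementation may differ, and the function name is hypothetical), the distillation step amounts to supervised regression of the offline actor onto the state-action pairs collected by the online actor:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def distill_offline_actor(offline_actor: nn.Module,
                          states: torch.Tensor,    # states collected while training the online actor
                          actions: torch.Tensor,   # actions taken by the online actor on those states
                          epochs: int = 10,
                          lr: float = 3e-4,
                          batch_size: int = 256) -> None:
    """Supervised distillation: make the offline actor imitate the online actor's behavior."""
    optimizer = torch.optim.Adam(offline_actor.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(states, actions), batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for s, a in loader:
            pred = offline_actor(s)                 # assumes the actor maps states to actions
            loss = nn.functional.mse_loss(pred, a)  # imitate the online actor
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```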
### Question 1: What is the beneficial use of R&D compared to just training each task from scratch?
If we train each task with a re-initialized network, it will forget everything learned in the previous tasks, even though it can learn the current task well. Continual learning is not just about effectively learning the current task; it is crucial not to forget the content of previously learned tasks. Simple resetting alone cannot preserve performance on tasks learned earlier, so additional mechanisms are necessary. We solved the forgetting problem by adding behavioral cloning of the previous tasks to the knowledge distillation for the current task.
### Question 2: The results of Table 1 are very close and it is not sufficient to compare just the numbers
Additional explanations about the interpretation of the results in Table 1 have been provided in the global comments; please refer to them for further clarification.
### Question 3: What is exactly the expert buffer?
Let's consider learning tasks A→B sequentially. The policy immediately after learning task A can be seen as an expert for task A. To retain the learned information at this point, behavioral cloning in continual learning involves storing the actions of the expert for states in task A in a buffer. Later, when learning task B, regularization is performed using the state-action pairs in this buffer. In summary, the buffer containing state-action pairs for the previously learned task is referred to as the expert buffer.
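As a small sketch of how such an expert buffer can be combined with the distillation of the current task (names and the simple MSE form are illustrative; the exact objective and weighting follow the manuscript):

```python
import torch
import torch.nn as nn

def kd_plus_bc_loss(offline_actor: nn.Module,
                    cur_states: torch.Tensor, cur_actions: torch.Tensor,        # current task (from the online actor)
                    expert_states: torch.Tensor, expert_actions: torch.Tensor,  # expert buffer (previous tasks)
                    bc_weight: float = 1.0) -> torch.Tensor:
    """Distill the current task while behaviorally cloning the experts of previous tasks."""
    kd = nn.functional.mse_loss(offline_actor(cur_states), cur_actions)          # knowledge distillation
    bc = nn.functional.mse_loss(offline_actor(expert_states), expert_actions)    # behavioral cloning
    return kd + bc_weight * bc
```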
### Question 4: In the baselines, it would be worth comparing with just resetting the agent or training an agent from scratch
As mentioned earlier, simply resetting the agent for each task is not effective in continual learning. While this method may lead to effective learning of the current task, it results in a significant loss of information for previously learned tasks, ultimately leading to very poor performance.
### Question 5: What is the evidence for Remark 1?
Please refer to the global comment for a response to Remark 1.
## Comments for Reviewer oDoz
### Weakness 1: Why is R&D not compared with CReLU or InFeR?
Thank you for the valuable suggestion. We have included the results of R&D in the Supplementary Materials. In these results, R&D effectively resolves the negative transfer: although some tasks (e.g., the tasks that come after those in the 'Sweep' group) are still hard to learn with SAC, the tasks in the 'Sweep' group can now be learned when they appear as the second task. Furthermore, many tasks do not suffer from negative transfer with PPO.
### Weakness 2: Suggestions about writing and visualization
- Firstly, apologies for the confusion caused by the typo. We also used 10 random seeds in the experiments for Figure 2.
- We will take into account the other suggestions later on. Thank you for the feedback.
### Weakness 3: Why has the p-value not been provided?
- While we did not directly provide p-values in the Supplementary Materials, we have reported standard deviations, which provide enough information for such an inference.
- It seems that the term "statistical significance" may have caused some misunderstanding. We will make corrections to clarify this.
## Comments for Reviewer p3gB
### Weakness 1: The novelty of the negative transfer problem in CRL
- As we already mentioned in the Introduction, we agree that problems similar to negative transfer, such as loss of plasticity or capacity loss, have been studied. However, our experiments show that a proper solution for resolving negative transfer in the CRL setting has not been proposed in prior work: neither CReLU nor InFeR solves this problem, and we stress that it cannot be explained via plasticity or capacity loss. Furthermore, as reviewer p3gB mentioned, this problem has not been extensively explored by previous methods, and we believe we have explored it extensively in the Meta-World environment.
- For the more in-depth analysis about the distinction between negative transfer and capacity/plasticity loss, please refer to the global comment.
### Weakness 2: The proposed method just sidesteps the negative transfer problem
As mentioned in the conclusion, we agree that our method is not capable of positive transfer. However, addressing negative transfer is a prerequisite before contemplating positive transfer. Although previous studies, including P&C, attempt to achieve positive transfer, our experiments revealed that such efforts do not effectively tackle negative transfer. We have demonstrated that our method outperforms existing baselines by solely addressing negative transfer, without explicitly incorporating positive transfer. Hence, we assert that our contribution lies in this aspect.
### Weakness 3: There is a critical flaw in the implementation of P&C
We acknowledge that we were not aware of the reference to a reset in P&C, and we have run additional experiments with our implementation of P&C; the results are available in the Supplementary Materials of the revised manuscript. Our conclusion is that we cannot agree that P&C solves the negative transfer issue. In our experimental setup, we implemented the algorithm proposed by P&C as is, attaching an adaptor to the policy that receives features from the knowledge-base policy (`policy_kb`). When learning a new task, we reset the policy and the Q-function (or the value function) and trained again. Note that when learning the second task, the adaptor is trained from a randomly initialized network, whereas the third task is learned with the adaptor pre-trained on the second task. In the results, the task is not learned at all, which we attribute to the information from `policy_kb` being passed through the adaptor. It might seem that applying resetting to P&C would solve the negative transfer, but in our experiments it did not. Therefore, P&C neither maintains positive transfer nor avoids the adverse effects of negative transfer.
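For reference, a very rough sketch of the kind of adaptor layer used in our re-implementation (simplified; the gating and scaling terms of the original P&C formulation are omitted, and the class name is illustrative):

```python
import torch
import torch.nn as nn

class AdaptorLayer(nn.Module):
    """One hidden layer of the active policy with a lateral adaptor from the knowledge base.

    The active column combines its own features with adapted features from the (frozen)
    knowledge-base policy (`policy_kb`), which is where we suspect the negative transfer enters.
    """
    def __init__(self, in_dim: int, kb_dim: int, out_dim: int):
        super().__init__()
        self.own = nn.Linear(in_dim, out_dim)        # active column's own transformation
        self.adaptor = nn.Sequential(                # adaptor applied to knowledge-base features
            nn.Linear(kb_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, h_active: torch.Tensor, h_kb: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.own(h_active) + self.adaptor(h_kb))
```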
### Weakness 4: There is no distinct Related Work section
Our intention was to have the Introduction section serve as an introduction to the related work, but this seems to have been confusing. We will add a new Related Work section to describe more about works such as transfer learning later on.
### Weakness 5: The choice of tasks that can be trained within 3 million steps is well suited only for an algorithm that achieves no forward transfer
We focused our experiments on the occurrence and resolution of negative transfer. We excluded tasks that cannot be learned from scratch within 3M steps because they would not allow us to see the extent of negative transfer. Although it would be better to consider harder and more complex tasks to demonstrate forward transfer, we think that resolving negative transfer is our major challenge, and achieving forward transfer remains future work.
### Question 1: Questions regarding the Introduction
- Firstly, the results in Figure 1 are from conducting only finetuning. We are highlighting the issue of negative transfer persisting even in the simple finetuning scenarios. Similar concerns were previously raised in the context of capacity/plasticity loss.
- Secondly, we think what you are saying is somewhat similar to what we are trying to say: as the agent fits one task, it gets better at some tasks and worse at others. In this case, from the network's point of view, the only difference from learning from scratch is the parameter initialization. This is the reason we directed our attention towards re-initialization of the network.
### Question 2: Our work roughly boils down to the idea of [1]
- [1] discusses the difficulty of learning two tasks with the same state space and opposing behaviors. While there may be some similarity in addressing dissimilarity between tasks, there are significant differences in our work.
- To simplify, let's consider the scenario of sequentially learning two tasks, A and B, with the same state space and opposing target behaviors in the order of A→B. [1] does not address the situation where learning B is hindered by A. Instead, it focuses on situations where, as B is well-learned, A is forgotten since it requires opposite actions. In contrast, the negative transfer proposed in our work deals with a scenario where learning B becomes impossible due to prior learning of A. This is not simply because A and B target opposite behaviors; our experiments demonstrated the inability to learn B even when all information about A was lost during fine-tuning without any regularization.
- Such a problem cannot be resolved by methods like OWL mentioned in [1]. OWL, in brief, attaches a head for each task to the network and applies EWC or other continual learning methods to the shared parameters. In fact, all of our experiments use multi-head agents (see the sketch after this list), but negative transfer is still not solved.
- Additionally, interference due to opposite behavior, as mentioned in [1], seems intuitively plausible but somewhat contradicts our experimental results. For example, when sequentially learning 'window-close' and 'window-open', which have exactly the same states and opposite goals, we observed successful learning without negative transfer.
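For clarity, a minimal sketch of what we mean by a multi-head agent (simplified; our actual actor outputs an action distribution rather than a deterministic action, and the class name is illustrative):

```python
import torch
import torch.nn as nn

class MultiHeadActor(nn.Module):
    """Shared trunk with one output head per task."""
    def __init__(self, state_dim: int, action_dim: int, num_tasks: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hidden, action_dim) for _ in range(num_tasks)])

    def forward(self, state: torch.Tensor, task_id: int) -> torch.Tensor:
        # The shared trunk is where interference between tasks can occur.
        return self.heads[task_id](self.trunk(state))
```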
### Question 3: Why Behavioral Cloning is mentioned in Section 2?
Since we mainly use the Behavioral Cloning (BC) approach proposed in [2] to prevent forgetting in our method, we explain the details of BC as a preliminary. If it is awkward to mention BC in this section, we will instead add a brief introduction to BC in the Related Work section.
### Question 4: Why are the first words supposed to denote whether tasks are similar?
We determined task similarity based on the similarity of the visualized states and the task characteristics rather than the similarity of the actions. For example, the tasks in the 'faucet' group are 'faucet-close' and 'faucet-open'; both simply turn the faucet to the left or to the right. Although 'button-press' and 'coffee-button' may seem similar, they only share the **actions**, not the overall type of task or the states.
### Question 5: Questions about task groups
- First, the number of tasks in each group varies across all task groups, and all task groups contain at least 2 tasks. We will clarify this part in our camera-ready version.
- For the results on the 13 tasks, we only trained two tasks sequentially; our intention was to visualize the negative transfer between two individual tasks rather than between task groups. The interpretation of these results is no different from the group-wise results; they are simply a task-wise view of negative transfer.
### Question 6: Questions about Figure 3 and 5
- To begin with, the results depicted in Figures 3 and 5 represent the success rate averaged across all 8 tasks.
- As you understood, the average success rate of R&D in Figures 3 and 5 is the performance of the offline actor. It is true that the offline actor remains unchanged during the 3M steps in which the online actors are trained. However, if the success rate of the offline actor were drawn as a step function over environment frames, as you suggested, it could be misread as the offline actor being used while individual tasks are being learned. That is why we put markers at each point where the offline actor is trained and connected them with lines for visibility.
- Furthermore, even though the offline actor is not trained while an individual task is being learned, this does not mean that no information is gained while training the online actor. If we trained the online actor for only 1M steps, i.e., 1/3 of the original training time, and then distilled it into the offline actor, there would be a performance improvement corresponding to those 1M steps. Therefore, although the offline actor is trained in a discrete manner, we believe the improvement in performance can also be interpreted as somewhat continuous.
### Question 7: The use of "long task sequence" to refer to sequences of 8 tasks is underwhelming
Our intention regarding the "long sequence" is to present results for a greater number of tasks than discussed in the previous chapter, where we covered experiments with two tasks. While the sequences of length 8 may fall short of what you consider sufficient, we believe it was adequate to showcase the results we aim to provide.
### Question 8: "To check that either CReLU or InFeR ..." - what does this mean?
The intent of this sentence is to show whether CL methods equipped with CReLU or InFeR can resolve both catastrophic negative transfer and forgetting in the long-sequence experiment. If it is confusing, we will revise the sentence to be clearer.
[1] Kessler et al., "Same State, Different Task: Continual Reinforcement Learning without Interference", AAAI, 2022.
[2] Wołczyk et al., "Disentangling Transfer in Continual Reinforcement Learning", NeurIPS, 2022.