Dear Reviewer gakY,

We appreciate your feedback regarding the need for comparisons with algorithms such as DADS, MUSIC, and LSD. Initially, as discussed in our response, we classified skill discovery methods into two broad categories:

1. Algorithms that use a mutual-information-based $r_\text{skill}$ while focusing on constructing an effective auxiliary exploration reward $r_\text{exploration}$.
2. Methods that seek to redefine $r_\text{skill}$ beyond mutual information.

In our earlier communication, we positioned DISCO-DANCE in the former category, believing that its closest counterparts were the algorithms targeting the exploration reward. This informed our initial decision to exclude DADS, MUSIC, and LSD from our comparisons, as they are primarily aligned with the latter category.

However, prompted by your feedback and upon deeper reflection, we recognize the value of broadening our comparative analysis. Even though DADS, MUSIC, and LSD primarily fall into the second category, these algorithms may also overcome the inherent pessimism associated with the mutual information objective. In response, we have decided to extend our comparative analysis to ensure a more comprehensive evaluation of skill discovery algorithms. Among the methods the reviewer suggested, we selected LSD, which is known to be the top performer. Due to time constraints, we narrowed our focus to the Ant-$\Pi$ maze, the environment in which the agent struggles most with exploration.

| $r_\text{skill}$ | $r_\text{exploration}$ | State coverage |
| -------- | -------- | -------- |
| DIAYN | None | 22.50±3.34 |
| DIAYN | DISCO-DANCE | 39.00±4.85 |
| LSD | None | 38.80±3.34 |
| LSD | DISCO-DANCE | 45.80±3.34 |

The table above illustrates the performance comparison with LSD. A noteworthy observation is that, unlike in the 2D bottleneck maze (Figure (c) in the attached PDF), DISCO-DANCE and LSD exhibit comparable performance in the Ant-$\Pi$ maze. This potentially underscores LSD's capability to mitigate the inherent pessimism of the mutual information objective.

While the performance of LSD is noteworthy, it does not diminish the significance of DISCO-DANCE. This is because $r_\text{skill}$, when decoupled from the mutual information objective, and $r_\text{exploration}$, as guided by DISCO-DANCE, can work in tandem and complement each other. We tested this by combining DISCO-DANCE with LSD, which led to significant performance improvements. Thus, in environments that are challenging to explore, DISCO-DANCE can play a key role in improving exploration.

In summary, we acknowledge that our initial experiments primarily used MI-based algorithms, specifically DIAYN. Yet, as shown in our additional experiments, DISCO-DANCE not only outperforms LSD in both the 2D bottleneck maze and the Ant-$\Pi$ maze, but also yields further performance improvements when integrated with LSD. We are also in the process of conducting experiments with MUSIC to provide a more comprehensive evaluation across various environments. In our revised manuscript, we will incorporate these results, along with additional ablation studies in which DISCO-DANCE is applied on top of different skill discovery objectives.

We hope our response alleviates your concerns.

Best,
Paper 12428 authors

---

Dear Reviewer gakY,

Previous skill discovery algorithms attempt to address the inherent pessimistic exploration problem of MI-based methods (described in Section 2.1).
These strategies can be categorized into two principal classes: (i) one concentrates on formulating an effective auxiliary exploration reward that encourages the exploration of skills (i.e., $r_\text{exploration}$), and (ii) the other focuses on modifying the MI-based skill learning objective (i.e., $r_\text{skill}$). DISCO-DANCE belongs to the first category. Therefore, throughout our manuscript, we fixed $r_\text{skill}$ to an MI-based method (i.e., DIAYN) and focused on comparing DISCO-DANCE against baselines that also aim at modeling an effective $r_\text{exploration}$. However, the reviewer requested a comparison between DISCO-DANCE and methods that we believe fall within the second category.

To alleviate the reviewer's concerns regarding the effectiveness of DISCO-DANCE against the suggested baselines, we performed additional experiments to verify the effectiveness of our approach against the methods the reviewer pointed out. Due to the limited amount of time, we were only able to perform additional experiments for LSD, the best-performing algorithm among the methods recommended by the reviewer, and found that DISCO-DANCE outperforms LSD in the 2D bottleneck maze (as shown in Figure (c)).

We would also like to point out that since DISCO-DANCE attempts to improve $r_\text{exploration}$ and LSD attempts to improve $r_\text{skill}$, DISCO-DANCE and LSD are not mutually exclusive. It is possible to apply $r_\text{exploration}^\text{DISCO-DANCE}$ with LSD, and we provide empirical evidence that such a learning objective is superior to the naive MI-based approach (i.e., DIAYN). Therefore, to demonstrate that DISCO-DANCE can be used as an add-on module on top of an arbitrary $r_\text{skill}$ methodology, we compared DIAYN, DIAYN + $r_\text{exploration}^\text{DISCO-DANCE}$, LSD, and LSD + $r_\text{exploration}^\text{DISCO-DANCE}$ in the Ant-$\Pi$ maze.

| $r_\text{skill}$ | $r_\text{exploration}$ | State coverage |
| -------- | -------- | -------- |
| DIAYN | None | 22.50±3.34 |
| DIAYN | DISCO-DANCE | 39.00±4.85 |
| LSD | None | 38.80±3.34 |
| LSD | DISCO-DANCE | 45.80±3.34 |

As shown in the table above, $r_\text{exploration}^\text{DISCO-DANCE}$ benefits both $r_\text{skill}$ methods. Moreover, using a better $r_\text{skill}$ together with $r_\text{exploration}^\text{DISCO-DANCE}$ achieves the best performance. Of course, we acknowledge that due to the short time frame of the rebuttal, we were unable to compare more $r_\text{skill}$ methods and therefore have limited experimental support. We will add more comparisons with other algorithms such as MUSIC in our final manuscript.

In summary, there might be apprehensions regarding the robustness of DISCO-DANCE because all of our experiments were conducted on top of MI-based algorithms (i.e., DIAYN). However, as evident from the experiment above, DISCO-DANCE can also enhance the performance of different $r_\text{skill}$ methods such as LSD. Thus, considering its capacity to promote effective exploration, we believe DISCO-DANCE holds considerable merit.

We hope this additional experiment addresses your concerns well, and we welcome any additional comments and clarifications.

Best,
Paper 12428 authors

-----

Existing skill discovery research can be divided into (i) methods that add an auxiliary exploration objective on top of a mutual-information-based skill discovery objective, such as a, b, and c, and (ii) methods such as MUSIC and LSD that use a skill discovery objective not based on mutual information.
In our work, we fixed the skill discovery objective to mutual information and focused on building a better $r_\text{exploration}$.
- [ ] The reviewer then requested a comparison with skill objectives that do not use mutual information.
- [ ] In our initial response, we therefore selected LSD, the best-performing skill discovery method that does not use a mutual information objective, as the reference algorithm, ran it in the 2D bottleneck maze, and reported that its performance was worse than that of DISCO-DANCE.

Although we reported in the initial response that DISCO-DANCE produced better results than LSD in hard-exploration settings such as the bottleneck maze, this result does not mean the two are mutually exclusive. If we use an objective better than mutual information (such as LSD) for $r_\text{skill}$ and an algorithm strong at hard exploration (such as DISCO-DANCE) for $r_\text{exploration}$, we believe this can bring additional benefit. Therefore, to check whether DISCO-DANCE can be used as an add-on module on top of methods such as LSD, we ran experiments in the 2D bottleneck maze and the Ant $\Pi$-maze, combining MI and LSD as $r_\text{skill}$ with either nothing or DISCO-DANCE as $r_\text{exploration}$.

Table 1. State coverage

| $r_\text{skill}$ | $r_\text{exploration}$ | Ant-$\Pi$ maze |
| -------- | -------- | -------- |
| DIAYN | None | 22.50±3.34 |
| DIAYN | DISCO-DANCE | 39.00±4.85 |
| LSD | None | 38.80±3.34 |
| LSD | DISCO-DANCE | 45.80±3.34 |

The results are blah blah; therefore, DISCO-DANCE works well as an add-on module. Of course, we acknowledge that since we only experimented in two environments, the experimental support may be somewhat limited. However, since these are the two hardest environments, we are confident it will also work well in the others. Although we could not run them during the current rebuttal period, we promise that the revised manuscript will include a comprehensive comparison with MUSIC and LSD, including the MUSIC baseline you mentioned, as well as results with DISCO-DANCE on top of MUSIC and LSD, respectively. In conclusion, although one may worry whether DISCO-DANCE is a robustly good skill discovery algorithm because our experiments were conducted on top of mutual-information-based skill discovery rewards, given that DISCO-DANCE substantially improves LSD's results in the Ant maze and the 2D maze, we believe that DISCO-DANCE's encouragement of additional skill exploration has real value. Thank you.

----

<!-- **Answer for R1: Additional experiments: Integration of $r_\text{exploration}^\text{DISCO-DANCE}$ with $r_\text{skill}^\text{LSD}$**

We appreciate the reviewer's feedback that comparing other baselines (which utilize their own $r_\text{skill}$) is important. Indeed, methods like DADS, MUSIC, and LSD have made considerable improvements over DIAYN. As evidenced in the LSD paper, utilizing $r_\text{skill}^\text{LSD}$ in an environment with no obstacles surpasses the results obtained with $r_\text{skill}^\text{DIAYN}$. Specifically,
- **DADS** introduces $r_\text{skill}^\text{DADS}$ = the lower bound of $I(\text{next } s; z \mid s)$, to maximize the diversity of transitions produced in the environment.
- **MUSIC** suggests $r^\text{MUSIC}$ = the lower bound of $I(S^a; S^s)$, to maximize the mutual information between the agent state $S^a$ and the surrounding state $S^s$.
- **LSD** proposes $r_\text{skill}^\text{LSD} = (\phi(s_{t+1}) - \phi(s_t))^T z$, to encourage the agent to prefer skills with a large traveled distance.

However, while these methods perform better than DIAYN, they have not entirely resolved the exploration challenge of unsupervised skill discovery. That is, simply adopting $r_\text{skill}^\text{LSD}$ (or $r_\text{skill}^\text{DADS}$, $r^\text{MUSIC}$) does not fully address the exploration challenge. To bring clarity, our main paper's Figure 1 can be referenced. As Figures 1b and 1c show, regardless of the presence of an explorable skill with the highest potential to access the unvisited states (i.e., the red skill in Figure 1d), previous baseline algorithms, including DADS, MUSIC, and LSD, do not exploit that skill for further exploration.
In the bottom row of Figure (c) in the attached PDF (LSD, 2D bottleneck maze), while a particular skill navigates to the upper-right room, this skill does not assist other skills in getting out of the initial room. This is the core issue we highlighted through Figure 1, and it is pervasive across several skill discovery methodologies, including LSD, MUSIC, and DADS. In such contexts, our $r_\text{exploration}^\text{DISCO-DANCE}$ serves as a generally applicable mechanism that can enhance exploration. Since DISCO-DANCE is an algorithm that (i) identifies the most explorable skill (i.e., the skill that reaches the upper-right room in Figure (c)) and (ii) guides the other skills (i.e., the many skills in the first room in Figure (c)), DISCO-DANCE can be applied not only to $r_\text{skill}^\text{DIAYN}$ but also to other $r_\text{skill}$ (e.g., LSD in the 2D bottleneck maze of Figure (c)). To demonstrate that DISCO-DANCE is generally complementary to diverse types of $r_\text{skill}$, we conducted additional experiments in the 2D bottleneck maze and the Ant $\Pi$-maze. Table 1 shows how significantly LSD benefits from the integration of $r_\text{exploration}^\text{DISCO-DANCE}$. [Experiment description.] While we were restricted by time from conducting extensive tests on LSD in multiple environments, we are motivated to expand this scope. In our final manuscript, we will add experiments combining DADS, MUSIC, and LSD with DISCO-DANCE on our three benchmarks, which will enrich our paper. -->

1. Thank the AC for enabling this communication.
2. Say that, in addition to our earlier answer and the additional LSD experiment, we are running further experiments to show that $r_\text{exploration}^\text{DISCO-DANCE}$ can also have synergy when combined with $r_\text{skill}$ such as LSD, and that we will upload the results before the discussion ends (by tomorrow).

Dear Reviewer gakY,

Regarding the comparison with LSD, in addition to our prior LSD experiments on the 2D bottleneck maze, we are in the process of organizing an extra experiment on the Ant-$\Pi$ maze to demonstrate the robustness of $r_\text{exploration}^\text{DISCO-DANCE}$. We hope this additional experiment will address Reviewer gakY's concerns well. We intend to share the outcomes by Aug 20th, 2 am EDT.

Best,
Paper 12428 authors

2D: DIAYN << LSD << LSD + DISCO-DANCE < DIAYN + DISCO-DANCE
Ant: DIAYN << LSD == DIAYN + DISCO-DANCE << LSD + DISCO-DANCE

**Response to all reviewers**

We deeply appreciate all five reviewers for their thoughtful feedback and valuable suggestions. R1, R2, R3, R4, and R5 denote reviewer gakY, reviewer adLT, reviewer b566, reviewer EWQN, and reviewer oWju, respectively.

Reviewers identified the following strengths in our submission:
- The main idea is intuitive and well-motivated (R2, R3, R4, R5).
- The proposed method was evaluated against several methods in various domains, and the empirical results are promising (R2, R4, R5).

At the same time, reviewers identified the following weaknesses in our submission:
- Scalability of the random walk process to high-dimensional state spaces (R1, R3).
- Missing baselines and missing experiments in manipulation environments (R1).
- Sample inefficiency of the random walk process (R4).

We hope our responses address all reviewers' concerns, and we welcome any additional comments and clarifications.

**General Response: Random walk process in high-dimensional state space**

Reviewers R1 and R3 raised questions regarding the scalability of our random walk process in high-dimensional state spaces. To address this, we have identified two essential questions to consider:
1. When the agent starts to explore from an arbitrary terminal state, can it visit a diverse range of states through the random walk process?
2. Given the diverse range of visited states, is it possible to identify the terminal state (guide skill) within the least explored region?

In response, we conducted a synthetic experiment on Montezuma's Revenge, a high-dimensional pixel-based environment characterized by an 84×84×3 input and well known for its exploration challenges. Figure (a) in the attached PDF provides a visual illustration of our random walk approach within Montezuma's Revenge. In this experiment, we:
1. Randomly reset the agent in the initial room of Montezuma's Revenge.
2. Execute a random walk for 100 steps from the randomly initialized state.
3. Iterate steps 1 and 2 for N cycles.

This experimental design mirrors the algorithmic design of DISCO-DANCE, where each reinitialized point indicates the terminal state of a different skill. Correspondingly, each skill undergoes a random walk for 100 steps, aligning with the parameters used in our paper (i.e., P=N, R=100, M=1 in Equation 4).

> 1. When the agent starts to explore from an arbitrary terminal state, can it visit a diverse range of states through the random walk process?

As illustrated by Figure (a), even in such a high-dimensional state space, the agent was able to visit a variety of states within just 100 random steps. For instance, with skill 'I' the agent was able to move to another room, and with skill 'J' the agent successfully picked up the key. This result supports the versatility of the random walk process, even when the environment is high-dimensional.

> 2. Given the diverse range of visited states, is it possible to identify the terminal state (guide skill) within the least explored region?

After the random walk, our aim shifts to identifying the guide skill within the least explored area. For density estimation, we employed the cell-centric technique from Go-Explore [1]: (i) segment the aggregated states into discrete cells, (ii) count each cell's visitations, and (iii) select the least visited cell. As emphasized in Figure (a), the cell marked in red is selected, indicating that skill 'I' is the prime candidate for the guide skill in DISCO-DANCE. This combined approach of exploring through a random walk and then identifying unique states (which is used in DISCO-DANCE) parallels the approach adopted by Go-Explore, a technique that has consistently demonstrated its efficacy across varied domains, including Atari games and robotic settings.
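For concreteness, the sketch below illustrates this cell-centric guide skill selection. It is our own simplified illustration, not code from the paper or from Go-Explore: it assumes a single 100-step walk per skill (R=100, M=1), a hypothetical `random_walk` helper that returns the visited observations, and an 84×84×3 observation format.

```python
# Illustrative sketch of Go-Explore-style cell counting used to pick the guide skill
# after the random-walk phase (our own simplification, not the paper's implementation).
import numpy as np

def to_cell(obs, bins=8):
    """Map an 84x84x3 observation to a coarse, hashable cell (downsample + quantize)."""
    small = obs[::12, ::12].mean(axis=-1)             # 84x84x3 -> 7x7 grayscale
    return tuple((small // (256 // bins)).astype(np.int64).ravel())

def select_guide_skill(terminal_states, random_walk, n_steps=100):
    """Random-walk from each skill's terminal state, count cell visitations,
    and return the skill whose walk reached the least-visited cell."""
    visit_counts = {}                                  # cell -> number of visits
    cells_per_skill = {}                               # skill id -> cells its walk reached
    for skill, s_T in terminal_states.items():
        states = random_walk(start=s_T, n_steps=n_steps)
        cells = [to_cell(s) for s in states]
        cells_per_skill[skill] = cells
        for c in cells:
            visit_counts[c] = visit_counts.get(c, 0) + 1
    rarest = min(visit_counts, key=visit_counts.get)   # least explored cell
    return next(z for z, cells in cells_per_skill.items() if rarest in cells)
```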
In summary, we believe our synthetic experiment affirms the scalability of the random walk process in DISCO-DANCE, even within a high-dimensional pixel-based environment. In the revised manuscript, we will integrate these insights into the 'Limitations and Future Directions' section.

[1] First return, then explore. Ecoffet et al., Nature 2021.

**Dear reviewer 1**

We appreciate your valuable feedback. Please let us know if you have any further comments or feedback. We will do our best to address them.

> Q1. Why not having compare with DADS, MUSIC and LSD? We would also have like to see experiments in more known challenging environment like manipulation where MI is the most in trouble.

A1. We agree that including more baseline algorithms such as DADS, MUSIC, and LSD [1,2,3] would further increase the impact of DISCO-DANCE, and it is indeed a valuable direction for subsequent work. Our primary focus in this work, however, is to tackle the pessimistic exploration issue inherent in MI-based skill discovery algorithms. As highlighted in Section 5 (lines 297-302 and 312-315), our chosen baselines are algorithms that focus on devising an effective auxiliary reward ($r_\text{exploration}$) to help exploration. In contrast, DADS, MUSIC, and LSD mainly focus on modeling the representation of skills ($r_\text{skill}$). This is why these algorithms are not included in our paper as baselines.

Furthermore, our experimental environments were chosen to measure the effectiveness of the devised $r_\text{exploration}$. To evaluate how well $r_\text{exploration}$ helps overcome pessimistic exploration, we used environments such as the 2D and Ant mazes. In these settings, our approach outperformed the baseline algorithms, validating the effectiveness of $r_\text{exploration}^\text{DISCO-DANCE}$ in enhancing exploration. Moreover, to demonstrate that DISCO-DANCE aids not only navigation tasks but also the learning of useful skills in more diverse settings, we conducted experiments on a widely used continuous control benchmark, URLB [4], where DISCO-DANCE also surpassed all baselines.

To clarify the significance of devising a well-designed $r_\text{exploration}$, we conducted additional experiments on LSD (with discrete skills) in the 2D bottleneck maze (please refer to Figure (c) in the attached PDF). Using their open-source code, we tuned three important hyperparameters (learning rate, entropy coefficient of SAC ($\alpha$), and skill dimension). As the main objective of LSD is to obtain 'far-reaching' skills, such skills are learned well in an obstacle-free empty environment (similar to Figure 4 in the LSD paper). However, Figure (c) shows that LSD struggles to navigate efficiently in the bottleneck maze, underscoring that replacing $r_\text{skill}^\text{DIAYN}$ with $r_\text{skill}^\text{LSD}$ is not enough to solve hard-exploration environments. This emphasizes the importance of a well-designed $r_\text{exploration}$, regardless of which $r_\text{skill}$ is used.

[1] Dynamics-Aware Unsupervised Discovery of Skills. Sharma et al., ICLR 2020
[2] Mutual Information State Intrinsic Control. Zhao et al., ICLR 2021
[3] Lipschitz-constrained Unsupervised Skill Discovery. Park et al., ICLR 2022
[4] URLB: Unsupervised Reinforcement Learning Benchmark. Laskin et al., NeurIPS 2021

> Q2. Is the random walk used at the last step of the guide skill discovery is scalable to larger state environments ?

A2. Please refer to General Response #1, titled **General Response: Random walk process in high-dimensional state space**.

**Dear reviewer 2**

We appreciate your insightful questions and positive support. Please let us know if you have any further comments or feedback. We will do our best to address them.

> Q1. Appendix F describes the additional costs of the random walk process, and proposes an efficient method. Are the additional steps used to perform random walks, even for the temporary buffer, included as environment steps in the overall training budget? If not, this seems an unfair advantage given exclusively to DISCO-DANCE when comparing to other methods.

A1. Yes, the additional steps used to perform random walks are included in the overall count of environment steps for training.
> Q2. The hyperparameters used for UPSIDE (from Appendix B) seem to limit the number of learned skills to 8 while DISCO-DANCE seems to start with more skills and has the possibility to expand. Can you comment on this choice? Wouldn't a higher number of skills for UPSIDE reduce the described fine-tuning difficulty by allowing more coverage?

A2. UPSIDE sets the initial state as the parent node, from which it generates $N_\text{start}=2$ skills (i.e., leaf nodes) that move a short distance from the parent node. UPSIDE then incrementally adds new leaf nodes to sufficiently cover the state space around the parent node, up to a maximum of $N_\text{max}=8$ nodes. Subsequently, among these leaf nodes, the most discriminable skill is chosen and set as a new parent node, which then generates its own leaf nodes. This procedure of skill-tree expansion is executed iteratively. Hence, if there is remaining state space in the environment, UPSIDE can continue to increase the number of skills through this process until the end of training. In our experiments, we performed a hyperparameter search for $N_\text{max}$ from 6 to 10 and found minimal difference in performance. Therefore, we selected $N_\text{max}=8$ (the value used in the original paper).

> Q3. DISCO-DANCE seems to require skills to reliably end in a terminal state. How are those states tracked and determined when the skills are still being trained?

A3. As outlined in Algorithm 1 on page 5, line 5, the guide skill is chosen only once the majority of skills are discriminable enough (i.e., the discriminator accuracy is high). This indicates that the terminal states reached by each skill are consistent, and it ensures that, when selecting the guide skill, most skills will end in (almost) the same terminal state across multiple rollouts.
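To make this trigger concrete, here is a minimal sketch of the check described in A3. It is our own illustration under stated assumptions (a `discriminator` callable that returns a probability vector over skills, a per-skill buffer of recent terminal states, and illustrative thresholds), not the paper's code.

```python
# Illustrative sketch of the "most skills are discriminable enough" trigger
# (Algorithm 1, line 5); thresholds and helper names are placeholders.
import numpy as np

def most_skills_discriminable(discriminator, terminal_buffer, eps=0.9, majority=0.8):
    """Return True when at least a `majority` fraction of skills satisfy
    q_phi(z | s_T) > eps on average over their recent terminal states."""
    num_ok = 0
    for z, states in terminal_buffer.items():
        probs = [discriminator(s)[z] for s in states]   # q_phi(z | s_T) per rollout
        if np.mean(probs) > eps:
            num_ok += 1
    return num_ok / len(terminal_buffer) >= majority
```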
**Dear reviewer 3**

Thank you very much for your time and insightful comments. Please let us know if you have any further comments or feedback. We will do our best to address them.

> Q1. The exploration issue with the mutual information (MI) objective could be more than what is described in this work. In theory, the MI objective is not supposed to contribute to the exploration meaningfully, especially in continuous control environments (Park et al. [21]), which can make this method mostly rely on the random walk processes for its exploration.

A1. According to the findings presented in Park et al. [21], "the MI objective can be fully maximized even with small differences in states as long as different $z$'s correspond to even marginally different $s_T$'s, not necessarily encouraging more 'interesting' skills". We believe this aligns with our discussion in lines 29-32. However, if there are any differences between the two that we might have overlooked, we would greatly appreciate it if the reviewer could point them out. We will reflect this in the main paper.

> Q2. I believe one important weakness of this submission is the random walk process. The manuscript mentions that the rise of the environmental complexity makes existing skill discovery methods less effective and motivates this work, but ironically, in complex environments (e.g., with high-dimensional state spaces), random walk would be one of the main bottlenecks in encouraging exploration. In such environments, this algorithm could require a large number of iterations.

A2. Please refer to General Response #1, titled **General Response: Random walk process in high-dimensional state space**.

> Q3. In terms of writing, I think it is not very fair to call the state spaces of the environments used for the benchmark high-dimensional. They are higher-dimensional compared to the 2D maze environment, but labeling them high-dimensional in general may not be a good standard for the field.

A3. We agree. We will revise it in the final manuscript.

> Q4. Do you think there could be an alternative exploration strategy other than the random walk process?

A4. Yes, there are alternative strategies, such as RND [1] and ICM [2], for choosing the skill that visits the state closest to the least dense states. However, such approaches require additional network training. Since the purpose of the guide skill selection process is merely to identify the skill with the highest potential to access unexplored states, we find that a simple random walk process is sufficient. In General Response #1, we show that even in Montezuma's Revenge, which features a high-dimensional pixel-based state space and is notorious for its exploration challenges, 100 random-walk steps are enough to select the guide skill. We have shown that in our benchmarks (e.g., Ant mazes, DMC, and pixel-based Atari in General Response #1), the simple random walk process successfully identifies the guide skill without additional training. We believe that in more challenging environments (e.g., real-world robotics), employing a well-designed exploration strategy would be beneficial, which we leave for future work.

[1] Exploration by Random Network Distillation. Burda et al., arXiv 2018
[2] Curiosity-driven Exploration by Self-supervised Prediction. Pathak et al., ICML 2017
[3] Self-Supervised Exploration via Disagreement. Pathak et al., ICML 2019

> Q5. The authors state some limitations of the proposed method (difficulty in high-dimensional state spaces and stochastic environments), but I encourage the authors to consider taking the points I listed in the Weaknesses section into account.

A5. We will add (i) an in-depth analysis of the advantages and limitations of the random walk process (including General Response #1) to the section "Limitations and Future Directions", and (ii) a more detailed explanation of the exploration issue with mutual information to Section 2.

**Dear reviewer 4**

We appreciate the thorough review and thoughtful comments. Please let us know if you have any further comments or feedback. We will do our best to address them.

> Q1. Measuring the density of the state distribution via generating random walk arrival states from terminal states is not sample efficiency.
> To select guide skills for exploration, the proposed method assumes the terminal state is resettable and needs to measure the density of the state by generating about O(PRM) random walks arrival states from terminal states as section 3.1 mentioned. I wonder whether is there any more sample-efficient and easy ways to do that. e.g. Is it equivalent to picking a guide skill by directly estimating the density of terminal states sampled from history episodes or fixed-length horizon? if the skill is discriminable (MI reward is well optimized), the terminal states from the same skill may locate in a subregion of (unexplored) states.

A1. We really appreciate your feedback. To begin, we wish to clarify that our approach does not strictly assume that the terminal state is resettable. While leveraging the simulation environment may allow hard resets to the terminal state of each skill, we consider such an assumption unrealistic.
Instead, we simply roll out each skill to reach its terminal state. Specifically, in the context of the efficient random walk process (detailed in Appendix F), during training, once the selected skill $z_i$ is clearly distinguishable (i.e., $q_\phi(z^i|S_T) > \epsilon$), we perform an additional $0.2T$ random-walk steps from the terminal state and perform the density estimation on these $0.2T$ collected states.

As the reviewer pointed out, there is an alternative approach in which we could directly estimate the density using the states in the replay buffer, without the random walk process. However, such a strategy could lead to a scenario like the one in Figure 2(b). As corroborated by the findings presented in Table 2 and Figure 8, direct estimation from the replay buffer (i.e., no random walk) during guide skill selection can give rise to situations where the skill nearest to the unexplored states remains unidentified; instead, the skill farthest from the other skills might be chosen, consequently leading to diminished performance. This underscores that the random walk process is necessary for DISCO-DANCE.

In addition, the random walk process is triggered only when the skills are sufficiently distinctive. Therefore, during training, the random walk process is not executed frequently (e.g., 8 times in the 2D maze, 12 times in AntMaze, and 18 times in DMC). Moreover, each efficient random walk process requires only an additional $0.2T$ steps per skill. Considering its frequency and the number of required steps per skill, the total steps taken by the efficient random walk process are not extensive. In summary, the efficient random walk process consumes a relatively minor portion of the overall training steps. It is important to note that while the random walk process does introduce some extra steps, the exploration signals it provides make DISCO-DANCE a more efficient explorer than the other baselines. Consistently, DISCO-DANCE surpasses the performance of the other baselines when measured against the same number of environment steps.
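As an illustration of the step accounting described above, the sketch below shows how a skill's regular $T$-step rollout could be extended by $0.2T$ random actions only when that skill is already discriminable. It is our own simplified sketch under assumptions (a gym-style `env`, a `policy(obs, z)` callable, and a `discriminator(obs)` returning a probability vector over skills), not the code used in the paper.

```python
# Illustrative sketch of the efficient random walk process (Appendix F): the extra
# 0.2*T random-walk steps are appended to a skill's rollout only once that skill is
# clearly distinguishable, and the visited states feed the density estimation.
def rollout_with_random_tail(env, policy, discriminator, z, T, eps=0.9, tail_ratio=0.2):
    obs = env.reset()
    for _ in range(T):                          # regular T-step skill rollout
        obs, _, done, _ = env.step(policy(obs, z))
        if done:
            return []                           # no tail if the episode ends early
    temp_buffer = []
    if discriminator(obs)[z] > eps:             # q_phi(z | s_T) > eps: skill z is discriminable
        for _ in range(int(tail_ratio * T)):    # extra 0.2*T random-walk steps
            obs, _, done, _ = env.step(env.action_space.sample())
            temp_buffer.append(obs)
            if done:
                break
    return temp_buffer                          # states used later for density estimation
```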
<!-- We really appreciate your feedback. In fact, we have incorporated considerations of efficient random walk processes in our experiments, as can be seen in the DMC experiment (Figure 7 in the main paper) and Appendix F. First, we do not strictly assume the terminal state is resettable (e.g., via the simulation environment). As outlined in Appendix F, the total number of environment steps in a single random walk process is $P \text{ skills} \times (T \text{ (horizon)} + 0.2T \text{ (random walk)}) \times M \text{ (repetitions)}$. This means we do not reset the environment to the terminal state for the random walk; instead, we take $1.2T$ environment steps from the initial state. In both the 2D and Ant mazes, these extra random-walk steps are included in the total training steps, and DISCO-DANCE outperforms the other baselines. Although the random walk process requires extra environment steps, DISCO-DANCE can explore the environment more effectively by accessing a direct exploration signal, unlike the other baselines. As a result, DISCO-DANCE outperforms the other baselines for the same number of environment steps. However, as the reviewer pointed out, this may not be sample efficient in environments with a longer horizon $T$ (e.g., $T=1000$ in DMC). Therefore, in DMC, we adopted an 'efficient random walk process' (**elaborated in Appendix F**) that does not need the additional $P \times 1.2T \times M$ steps. Specifically, if a selected skill $z_i$ is clearly distinguishable (i.e., $q_\phi(z^i|S_T) > \epsilon$), we just perform an extra $0.2T$ random-walk steps at the end of the horizon and store these states in a 'temporary buffer'. Guide skill selection is conducted with the states in the temporary buffer, without additional environment steps. This significantly reduces the number of environment steps compared to the original random walk process. Our DMC experiments demonstrate that the efficient random walk process still outperforms the other baselines. The reviewer's example (picking a guide skill by directly estimating the density of terminal states) corresponds to Figure 2(b). It is important to note that the random walk (either the original or the efficient version) is necessary for DISCO-DANCE. As demonstrated in Table 2 and Figure 8, the absence of the random walk during guide skill selection could lead to a scenario where the skill nearest to unexplored states is not identified. Instead, the skill farthest from the other skills might be chosen, consequently leading to diminished performance. -->

> Q2. The method needs to select a guide skill that is most adjacent to the unexplored states. In the bottleneck maze tasks in Figure 3, what if all skills including the guide skill cannot pass the first room? Will this method also encourages effective exploration?

A2. Yes. To demonstrate that DISCO-DANCE is beneficial in such scenarios, we provide snapshots of our 2D bottleneck maze experiments (please refer to the qualitative Figure (b) in the attached PDF). As shown in the figure, even when all skills cannot pass the first room, the skill with the highest potential to approach the unexplored states (i.e., the next room) is chosen as the guide skill through the random walk process. This, in turn, motivates the apprentice skills to move towards more explorable states.

> Q3. What does the y-axis mean in Figure 3(b)?

A3. The y-axis represents each individual skill policy that the agent has (in no particular order). We will add more explanation in the main text.

**Dear reviewer 5**

We appreciate your insightful and constructive feedback. <!-- Please let us know if you have any further comments or feedback. We will do our best to address them. -->

> Q1. Why use common benchmarks (AntMaze, DMC) to evaluate the exploration ability?

<!-- > Q1. Complexity of the locomotion tasks (AntMaze and DMC).
> Q1. The authors tend to address the problem unsupervised discovery in complex environments where existing methods are no longer effective, but experiments are mainly conducted on common benchmarks. I acknowledge that several maps in the navigation task are challenging, but the locomotion tasks (AntMaze and DMC) are not. Actually, baseline algorithms can outperform the proposed algorithm in terms of downstream task performance (see Appendix G). -->

A1. As the reviewer pointed out, the 2D mazes (especially the bottleneck maze) have more complex layouts than the other two benchmarks. Despite this, the other two environments still require exploration for an MI-based skill discovery agent to be effective.

For AntMaze, while its layout may seem simpler than the 2D maze, the dimensionality of the state and action spaces is considerably higher, which makes it more challenging to optimize the RL policy. Consequently, AntMaze is a complex environment for unsupervised skill discovery agents and necessitates exploration strategies.
In practice, as shown in Figure 7, we can see that the agent fails to move far from the initial state as the maze layout becomes more complex (Empty maze $\rightarrow$ U-maze $\rightarrow$ $\Pi$-maze). This empirical trend underscores the challenging nature of the AntMaze environment and affirms its suitability for measuring the agent's exploration ability.

<!-- For AntMaze, the environment layout is simpler than the 2D maze, but the state and action spaces are higher-dimensional. That is, in the 2D maze the policy is relatively easy to optimize (two-dimensional state space), so we used a more complex and difficult maze layout, whereas in AntMaze optimizing the RL policy itself is harder than in the 2D maze, so even a layout like the Ant $\Pi$-maze is not easy for an unsupervised skill agent to explore. Nevertheless, Figure 7 (in Appendix L) shows that as the maze layout becomes more complex (Empty maze $\rightarrow$ U-maze $\rightarrow$ $\Pi$-maze), the agent fails to travel far from the initial point, which shows that the Ant mazes are also suitable environments for measuring exploration ability. -->

We agree that DMC is not typically used as a hard-exploration environment like Atari's Montezuma's Revenge in the general RL literature. However, for an unsupervised skill discovery (USD) agent, DMC presents a non-trivial exploration challenge. For example, in Cheetah, a USD agent may easily stay near the initial location due to the pessimistic exploration problem. This occurs because the discriminator can easily distinguish skills by observing 'slight movements' (e.g., marginally lifting joint 2 and not moving further). As a result, the agent may not learn to move further (i.e., run), since skills that stay near the starting point are already easily distinguishable. Therefore, without additional exploration signals (e.g., an exploration reward), learning running skills is not easy.

Moreover, URLB involves a variety of tasks within a single environment (e.g., Cheetah: run, run_backward, flip, flip_backward). Therefore, merely excelling in one task (e.g., run) does not guarantee high performance in the others (e.g., flip_backward). The DMC benchmark thus evaluates the diversity of the learned skills: agents must learn a suitable set of skills for all downstream tasks in order to achieve consistently high scores across all tasks. The results indicate that DISCO-DANCE outperforms the other baseline methods, as evidenced by the aggregated IQM value (i.e., DISCO-DANCE learns diverse skills that quickly adapt to diverse tasks).

<!-- While AntMaze might seem simpler, it is worth noting that due to the complex movements involving multiple joints for navigation in both the x and y planes, the Ant $\Pi$-maze is used as a challenging environment in goal-conditioned RL studies [1,2]. The significance of the AntMaze experiment lies in the progressive complexity of its layout, where baseline performance tends to deteriorate with increased complexity. As the reviewer pointed out, while SMM slightly leads DISCO-DANCE in the Empty maze, it is substantially surpassed by DISCO-DANCE in the $\Pi$-maze. Although DMC is not conventionally regarded as a challenging environment in the RL literature, it is a difficult environment in the unsupervised skill discovery literature for learning various behaviors such as run, flip, and jump, because learning proceeds without reward. Consequently, our experiments aim to assess whether the USD algorithm effectively acquired these diverse behaviors through efficient exploration. We accomplished this by comparing the cumulative scores across various downstream tasks, serving as our primary metric.
The results indicate that DISCO-DANCE outperforms the other baseline methods, as evidenced by the aggregated IQM value. -->

> Q2. How do you add a new skill? Do you perform a neural network surgery or specify a sufficiently large number of skills at the beginning of training?

A2. We set a maximum number of skills that DISCO-DANCE can acquire (e.g., 100) and initialize the network accordingly (e.g., the first FC layer of the policy is nn.Linear(observation_dimension + 100, hidden_dimension)).
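For illustration, a minimal sketch of this kind of initialization is given below. It is our own example, not the paper's code; the dimensions, the one-hot skill encoding, and the two-layer architecture are placeholder assumptions.

```python
# Illustrative sketch: the policy is sized for a fixed maximum number of skills up
# front, and each skill is fed as a one-hot vector concatenated to the observation,
# so adding a skill later requires no network surgery.
import torch
import torch.nn as nn

class SkillConditionedPolicy(nn.Module):
    def __init__(self, obs_dim, action_dim, max_skills=100, hidden_dim=256):
        super().__init__()
        self.max_skills = max_skills
        self.net = nn.Sequential(
            nn.Linear(obs_dim + max_skills, hidden_dim),  # first FC layer sized for max_skills
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, obs, skill_id):
        # skill_id: LongTensor of shape (batch,) holding the index of each skill
        z = torch.zeros(obs.shape[0], self.max_skills, device=obs.device)
        z[torch.arange(obs.shape[0]), skill_id] = 1.0     # one-hot skill encoding
        return self.net(torch.cat([obs, z], dim=-1))

# Usage with hypothetical dimensions: a batch of 4 observations, skills 0-3.
# policy = SkillConditionedPolicy(obs_dim=29, action_dim=8)
# actions = policy(torch.randn(4, 29), torch.tensor([0, 1, 2, 3]))
```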
> Q3. Are the total number of skills trained by DISCO-DANCE and baselines the same in experiments?

A3. Yes. We first conducted experiments with DISCO-DANCE for each environment. Based on the maximum number of skills DISCO-DANCE acquired for each environment, we set the total number of skills for the other algorithms accordingly.

> Q4. In the first sentence of abstract, I think diminished bonuses (e.g. DIAYN) are not equivalent to "penalties".

A4. We agree. We will revise this to "a significant reduction in reward acquisition".

> Q5. What is the x/y-axis of fig.3(b)?

A5. The x-axis shows how far the cheetah has moved horizontally. The y-axis represents each individual skill policy that the agent has (in no particular order). We will add more explanation in the main text.

> Q6. Fig.6 is wierd. Why not present, e.g., the success rate over 100 trials averaged over 20 seeds with error bars?

A6. We agree that the reviewer's feedback is valid. However, unlike typical goal-conditioned RL, where trials are performed 100 times per seed with varying goals and success rates are averaged across all seeds, our approach focuses on a single, fixed goal: the point most distant from the initial state. This design choice was made to evaluate the effectiveness of previously acquired skills in reaching challenging states (i.e., the farthest state from the initial state). As a result, if a particular seed learned a policy that reaches this fixed goal, it would succeed in nearly all of its 100 trials; on the other hand, if it has not yet learned such a policy, it would barely succeed in any of them (i.e., the success rate for each seed is almost always either 1 or 0). Therefore, in Fig. 6, instead of average success rates, we plotted the number of seeds that succeeded in reaching the goal at each timestep for simplicity.

> Q7. In Algorithm 1, the guide skill z* is not defined if the "most skills are discriminable enough" condition is not satisfied.

A7. Thank you for the correction. We will add "$\text{Initialize guide skill } z^* = \text{None}$" in line 1 and "$\text{If guide skill } z^* \text{ is not None:}$" between lines 7 and 8.

> Q8. Link main paper and the Appendix.

A8. Thank you for recognizing the depth of our results in the Appendix. We will add explicit references to the Appendix content in the main paper.

> Q9. The selection of the guide skill depends on the final state visited existing skills. It does not seem to be a general solution to select guide skills even for state-based environments.

A9. As outlined in Algorithm 1 on page 5, line 5, the guide skill is chosen only once the majority of skills are discriminable enough (i.e., the discriminator accuracy is high). This indicates that the skills reliably end in the same terminal state across multiple rollouts. However, as we mention in Appendix I, in a highly stochastic environment it would not be straightforward to select the guide skill just by employing the random walk process: the learned skills will visit different states on each rollout, which is a problem not only for DISCO-DANCE but for all current skill discovery algorithms. We believe this remains an important direction for future work.

[1] First return, then explore. Ecoffet et al., Nature 2021.
