Dear Reviewer gakY,

We appreciate your feedback regarding the need for comparisons with algorithms such as DADS, MUSIC, and LSD. Initially, as discussed in our response, we classified skill discovery methods into two broad categories:

1. Algorithms that use a mutual-information-based $r_\text{skill}$ while focusing on constructing an effective auxiliary exploration reward $r_\text{exploration}$.
2. Methods that seek to redefine $r_\text{skill}$ beyond mutual information.

In our earlier communication, we positioned DISCO-DANCE in the former category, believing that its closest counterparts were the algorithms targeting the exploration reward. This informed our initial decision to exclude DADS, MUSIC, and LSD from our comparisons, as they are primarily aligned with the latter category.

However, prompted by your feedback and upon deeper reflection, we recognize the value of broadening our comparative analysis. Even though DADS, MUSIC, and LSD primarily fall into the second category, these algorithms may also overcome the inherent pessimism associated with the mutual information objective. In response, we have decided to extend our comparative analysis to ensure a more comprehensive evaluation of skill discovery algorithms. Among the methods the reviewer suggested, we selected LSD, which is known to be the top performer. Due to time constraints, we narrowed our focus to the Ant-$\Pi$ maze, the environment in which the agent struggles most with exploration.

| $r_\text{skill}$ | $r_\text{exploration}$ | State coverage |
| -------- | -------- | -------- |
| DIAYN | None | 22.50±3.34 |
| DIAYN | DISCO-DANCE | 39.00±4.85 |
| LSD | None | 38.80±3.34 |
| LSD | DISCO-DANCE | 45.80±3.34 |

The table above illustrates the performance comparison with LSD. A noteworthy observation is that, unlike in the 2D bottleneck maze (Figure (c) in the attached PDF), DISCO-DANCE and LSD exhibit comparable performance in the Ant-$\Pi$ maze. This potentially underscores LSD's capability to mitigate the inherent pessimism of the mutual information objective.

While the performance of LSD is noteworthy, it does not diminish the significance of DISCO-DANCE. This is because $r_\text{skill}$, when decoupled from the mutual information objective, and $r_\text{exploration}$, as guided by DISCO-DANCE, can work in tandem and complement each other. We tested this by combining DISCO-DANCE with LSD, which led to significant performance improvements. Thus, in environments that are challenging to explore, DISCO-DANCE can play a key role in improving exploration.

In summary, we acknowledge that our initial experiments primarily used MI-based algorithms, specifically DIAYN. Yet, as shown in our additional experiments, DISCO-DANCE not only outperforms LSD in both the 2D bottleneck maze and the Ant-$\Pi$ maze, but also yields further performance improvements when integrated with LSD. We are also in the process of conducting experiments with MUSIC to provide a more comprehensive evaluation across various environments. In our revised manuscript, we will incorporate these results, along with additional ablation studies in which DISCO-DANCE is applied on top of different skill discovery objectives.

We hope our response alleviates your concerns.

Best,
Paper 12428 authors

---

Dear Reviewer gakY,

Previous skill discovery algorithms attempt to address the inherent pessimistic exploration problem of MI-based methods (described in Section 2.1).
These strategies can be categorized into two principal classes: (i) one concentrates on formulating an effective auxiliary exploration reward that encourages the exploration of skills (i.e., $r_\text{exploration}$), and (ii) the other focuses on modifying the MI-based skill learning objective (i.e., $r_\text{skill}$). DISCO-DANCE belongs to the first category. Therefore, throughout our manuscript, we fixed $r_\text{skill}$ to an MI-based method (i.e., DIAYN) and focused on comparing DISCO-DANCE against baselines that also aim at modeling an effective $r_\text{exploration}$. However, the reviewer requested a comparison between DISCO-DANCE and methods that we believe fall within the second category.

To alleviate the reviewer's concerns regarding the effectiveness of DISCO-DANCE against the suggested baselines, we performed additional experiments to verify the effectiveness of our approach against the methods the reviewer pointed out. Due to the limited amount of time, we were only able to perform additional experiments for LSD, the best-performing algorithm among the methods recommended by the reviewer, and found that DISCO-DANCE outperforms LSD in the 2D bottleneck maze (as shown in Figure (c)).

We would also like to point out that since DISCO-DANCE attempts to improve $r_\text{exploration}$ and LSD attempts to improve $r_\text{skill}$, DISCO-DANCE and LSD are not mutually exclusive. It is possible to apply $r_\text{exploration}^\text{DISCO-DANCE}$ with LSD, and we provide empirical evidence that such a learning objective is superior to the naive MI-based approach (i.e., DIAYN). Therefore, to demonstrate that DISCO-DANCE can be used as an add-on module on top of an arbitrary $r_\text{skill}$ methodology, we compared DIAYN, DIAYN + $r_\text{exploration}^\text{DISCO-DANCE}$, LSD, and LSD + $r_\text{exploration}^\text{DISCO-DANCE}$ in the Ant-$\Pi$ maze.

| $r_\text{skill}$ | $r_\text{exploration}$ | State coverage |
| -------- | -------- | -------- |
| DIAYN | None | 22.50±3.34 |
| DIAYN | DISCO-DANCE | 39.00±4.85 |
| LSD | None | 38.80±3.34 |
| LSD | DISCO-DANCE | 45.80±3.34 |

As shown in the table above, $r_\text{exploration}^\text{DISCO-DANCE}$ benefits both $r_\text{skill}$ methods. Moreover, using a better $r_\text{skill}$ together with $r_\text{exploration}^\text{DISCO-DANCE}$ achieves the best performance. Of course, we acknowledge that due to the short time frame of the rebuttal, we were unable to compare more $r_\text{skill}$ methods and therefore have limited experimental support. We will add more comparisons with other algorithms such as MUSIC in our final manuscript.

In summary, there might be apprehensions regarding the robustness of DISCO-DANCE because all of our experiments were conducted on top of MI-based algorithms (i.e., DIAYN). However, as evident from the experiment above, DISCO-DANCE can also enhance the performance of different $r_\text{skill}$ methods such as LSD. Thus, considering its capacity to promote effective exploration, we believe DISCO-DANCE holds considerable merit.

We hope this additional experiment addresses your concerns well, and we welcome any additional comments and clarifications.

Best,
Paper 12428 authors

-----

Existing skill discovery research can be divided into (i) methods that add an auxiliary exploration objective on top of a mutual-information-based skill discovery objective, such as a, b, and c, and (ii) methods such as MUSIC and LSD that use a skill discovery objective not based on mutual information.
In our work, we fixed the skill discovery objective to mutual information and focused on building a better $r_\text{exploration}$.
- [ ] The reviewer then requested a comparison with skill objectives that do not use mutual information.
- [ ] In our initial response, we therefore selected LSD, the best-performing skill discovery method that does not use a mutual information objective, as the reference algorithm, ran it in the 2D bottleneck maze, and reported that its performance was worse than that of DISCO-DANCE.

Although we reported in the initial response that DISCO-DANCE produced better results than LSD in hard-exploration settings such as the bottleneck maze, this result does not mean the two are mutually exclusive. If we use an objective better than mutual information (such as LSD) for $r_\text{skill}$ and an algorithm strong at hard exploration (such as DISCO-DANCE) for $r_\text{exploration}$, we believe this can bring additional benefit. Therefore, to check whether DISCO-DANCE can be used as an add-on module on top of methods such as LSD, we ran experiments in the 2D bottleneck maze and the Ant $\Pi$-maze, combining MI and LSD as $r_\text{skill}$ with either nothing or DISCO-DANCE as $r_\text{exploration}$.

Table 1. State coverage

| $r_\text{skill}$ | $r_\text{exploration}$ | Ant-$\Pi$ maze |
| -------- | -------- | -------- |
| DIAYN | None | 22.50±3.34 |
| DIAYN | DISCO-DANCE | 39.00±4.85 |
| LSD | None | 38.80±3.34 |
| LSD | DISCO-DANCE | 45.80±3.34 |

The results are blah blah; therefore, DISCO-DANCE works well as an add-on module. Of course, we acknowledge that since we only experimented in two environments, the experimental support may be somewhat limited. However, since these are the two hardest environments, we are confident it will also work well in the others. Although we could not run them during the current rebuttal period, we promise that the revised manuscript will include a comprehensive comparison with MUSIC and LSD, including the MUSIC baseline you mentioned, as well as results with DISCO-DANCE on top of MUSIC and LSD, respectively. In conclusion, although one may worry whether DISCO-DANCE is a robustly good skill discovery algorithm because our experiments were conducted on top of mutual-information-based skill discovery rewards, given that DISCO-DANCE substantially improves LSD's results in the Ant maze and the 2D maze, we believe that DISCO-DANCE's encouragement of additional skill exploration has real value. Thank you.

----

<!-- **Answer for R1: Additional experiments: Integration of $r_\text{exploration}^\text{DISCO-DANCE}$ with $r_\text{skill}^\text{LSD}$**

We appreciate the reviewer's feedback that comparing other baselines (which utilize their own $r_\text{skill}$) is important. Indeed, methods like DADS, MUSIC, and LSD have made considerable improvements over DIAYN. As evidenced in the LSD paper, utilizing $r_\text{skill}^\text{LSD}$ in an environment with no obstacles surpasses the results obtained with $r_\text{skill}^\text{DIAYN}$. Specifically,
- **DADS** introduces $r_\text{skill}^\text{DADS}$ = the lower bound of $I(\text{next } s; z \mid s)$, to maximize the diversity of transitions produced in the environment.
- **MUSIC** suggests $r^\text{MUSIC}$ = the lower bound of $I(S^a; S^s)$, to maximize the mutual information between the agent state $S^a$ and the surrounding state $S^s$.
- **LSD** proposes $r_\text{skill}^\text{LSD} = (\phi(s_{t+1}) - \phi(s_t))^T z$, to encourage the agent to prefer skills with a large traveled distance.

However, while these methods perform better than DIAYN, they have not entirely resolved the exploration challenge of unsupervised skill discovery. That is, simply adopting $r_\text{skill}^\text{LSD}$ (or $r_\text{skill}^\text{DADS}$, $r^\text{MUSIC}$) does not fully address the exploration challenge. To bring clarity, our main paper's Figure 1 can be referenced. As Figures 1b and 1c show, regardless of the presence of an explorable skill with the highest potential to access the unvisited states (i.e., the red skill in Figure 1d), previous baseline algorithms, including DADS, MUSIC, and LSD, do not exploit that skill for further exploration.
In the bottom row of Figure (c) in the attached PDF (LSD, 2D bottleneck maze), while a particular skill navigates to the upper-right room, this skill does not assist other skills in getting out of the initial room. This is the core issue we highlighted through Figure 1, and it is pervasive across several skill discovery methodologies, including LSD, MUSIC, and DADS. In such contexts, our $r_\text{exploration}^\text{DISCO-DANCE}$ serves as a generally applicable mechanism that can enhance exploration. Since DISCO-DANCE is an algorithm that (i) identifies the most explorable skill (i.e., the skill that reaches the upper-right room in Figure (c)) and (ii) guides the other skills (i.e., the many skills in the first room in Figure (c)), DISCO-DANCE can be applied not only to $r_\text{skill}^\text{DIAYN}$ but also to other $r_\text{skill}$ (e.g., LSD in the 2D bottleneck maze of Figure (c)). To demonstrate that DISCO-DANCE is generally complementary to diverse types of $r_\text{skill}$, we conducted additional experiments in the 2D bottleneck maze and the Ant $\Pi$-maze. Table 1 shows how significantly LSD benefits from the integration of $r_\text{exploration}^\text{DISCO-DANCE}$. [Experiment description.] While we were restricted by time from conducting extensive tests on LSD in multiple environments, we are motivated to expand this scope. In our final manuscript, we will add experiments combining DADS, MUSIC, and LSD with DISCO-DANCE on our three benchmarks, which will enrich our paper. -->

1. Thank the AC for enabling this communication.
2. Say that, in addition to our earlier answer and the additional LSD experiment, we are running further experiments to show that $r_\text{exploration}^\text{DISCO-DANCE}$ can also have synergy when combined with $r_\text{skill}$ such as LSD, and that we will upload the results before the discussion ends (by tomorrow).

Dear Reviewer gakY,

Regarding the comparison with LSD, in addition to our prior LSD experiments on the 2D bottleneck maze, we are in the process of organizing an extra experiment on the Ant-$\Pi$ maze to demonstrate the robustness of $r_\text{exploration}^\text{DISCO-DANCE}$. We hope this additional experiment will address Reviewer gakY's concerns well. We intend to share the outcomes by Aug 20th, 2 am EDT.

Best,
Paper 12428 authors

2D: DIAYN << LSD << LSD + DISCO-DANCE < DIAYN + DISCO-DANCE
Ant: DIAYN << LSD == DIAYN + DISCO-DANCE << LSD + DISCO-DANCE

**Response to all reviewers**

We deeply appreciate all five reviewers for their thoughtful feedback and valuable suggestions. R1, R2, R3, R4, and R5 denote reviewer gakY, reviewer adLT, reviewer b566, reviewer EWQN, and reviewer oWju, respectively.

Reviewers identified the following strengths in our submission:
- The main idea is intuitive and well-motivated (R2, R3, R4, R5).
- The proposed method was evaluated against several methods in various domains, and the empirical results are promising (R2, R4, R5).

At the same time, reviewers identified the following weaknesses in our submission:
- Scalability of the random walk process to high-dimensional state spaces (R1, R3).
- Missing baselines and missing experiments in manipulation environments (R1).
- Sample inefficiency of the random walk process (R4).

We hope our responses address all reviewers' concerns, and we welcome any additional comments and clarifications.

**General Response: Random walk process in high-dimensional state space**

Reviewers R1 and R3 raised questions regarding the scalability of our random walk process in high-dimensional state spaces. To address this, we have identified two essential questions to consider:
1. When the agent starts to explore from an arbitrary terminal state, can it visit a diverse range of states through the random walk process?
2. Given the diverse range of visited states, is it possible to identify the terminal state (guide skill) within the least explored region?

In response, we conducted a synthetic experiment on Montezuma's Revenge, a high-dimensional pixel-based environment characterized by an 84×84×3 input and well known for its exploration challenges. Figure (a) in the attached PDF provides a visual illustration of our random walk approach within Montezuma's Revenge. In this experiment, we:
1. Randomly reset the agent in the initial room of Montezuma's Revenge.
2. Execute a random walk for 100 steps from the randomly initialized state.
3. Iterate steps 1 and 2 for N cycles.

This experimental design mirrors the algorithmic design of DISCO-DANCE, where each reinitialized point indicates the terminal state of a different skill. Correspondingly, each skill undergoes a random walk for 100 steps, aligning with the parameters used in our paper (i.e., P=N, R=100, M=1 in Equation 4).

> 1. When the agent starts to explore from an arbitrary terminal state, can it visit a diverse range of states through the random walk process?

As illustrated by Figure (a), even in such a high-dimensional state space, the agent was able to visit a variety of states within just 100 random steps. For instance, with skill 'I' the agent was able to move to another room, and with skill 'J' the agent successfully picked up the key. This result supports the versatility of the random walk process, even when the environment is high-dimensional.

> 2. Given the diverse range of visited states, is it possible to identify the terminal state (guide skill) within the least explored region?

After the random walk, our aim shifts to identifying the guide skill within the least explored area. For density estimation, we employed the cell-centric technique from Go-Explore [1]: (i) segment the aggregated states into discrete cells, (ii) count each cell's visitations, and (iii) select the least visited cell. As emphasized in Figure (a), the cell marked in red is selected, indicating that skill 'I' is the prime candidate for the guide skill in DISCO-DANCE. This combined approach of exploring through a random walk and then identifying unique states (which is used in DISCO-DANCE) parallels the approach adopted by Go-Explore, a technique that has consistently demonstrated its efficacy across varied domains, including Atari games and robotic settings.
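For concreteness, the sketch below illustrates this cell-centric guide skill selection. It is our own simplified illustration, not code from the paper or from Go-Explore: it assumes a single 100-step walk per skill (R=100, M=1), a hypothetical `random_walk` helper that returns the visited observations, and an 84×84×3 observation format.

```python
# Illustrative sketch of Go-Explore-style cell counting used to pick the guide skill
# after the random-walk phase (our own simplification, not the paper's implementation).
import numpy as np

def to_cell(obs, bins=8):
    """Map an 84x84x3 observation to a coarse, hashable cell (downsample + quantize)."""
    small = obs[::12, ::12].mean(axis=-1)             # 84x84x3 -> 7x7 grayscale
    return tuple((small // (256 // bins)).astype(np.int64).ravel())

def select_guide_skill(terminal_states, random_walk, n_steps=100):
    """Random-walk from each skill's terminal state, count cell visitations,
    and return the skill whose walk reached the least-visited cell."""
    visit_counts = {}                                  # cell -> number of visits
    cells_per_skill = {}                               # skill id -> cells its walk reached
    for skill, s_T in terminal_states.items():
        states = random_walk(start=s_T, n_steps=n_steps)
        cells = [to_cell(s) for s in states]
        cells_per_skill[skill] = cells
        for c in cells:
            visit_counts[c] = visit_counts.get(c, 0) + 1
    rarest = min(visit_counts, key=visit_counts.get)   # least explored cell
    return next(z for z, cells in cells_per_skill.items() if rarest in cells)
```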
In summary, we believe our synthetic experiment affirms the scalability of the random walk process in DISCO-DANCE, even within a high-dimensional pixel-based environment. In the revised manuscript, we will integrate these insights into the 'Limitations and Future Directions' section.

[1] First return, then explore. Ecoffet et al., Nature 2021.

**Dear reviewer 1**

We appreciate your valuable feedback. Please let us know if you have any further comments or feedback. We will do our best to address them.

> Q1. Why not having compare with DADS, MUSIC and LSD? We would also have like to see experiments in more known challenging environment like manipulation where MI is the most in trouble.

A1. We agree that including more baseline algorithms such as DADS, MUSIC, and LSD [1,2,3] would further increase the impact of DISCO-DANCE, and it is indeed a valuable direction for subsequent work. Our primary focus in this work, however, is to tackle the pessimistic exploration issue inherent in MI-based skill discovery algorithms. As highlighted in Section 5 (lines 297-302 and 312-315), our chosen baselines are algorithms that focus on devising an effective auxiliary reward ($r_\text{exploration}$) to help exploration. In contrast, DADS, MUSIC, and LSD mainly focus on modeling the representation of skills ($r_\text{skill}$). This is why these algorithms are not included in our paper as baselines.

Furthermore, our experimental environments were chosen to measure the effectiveness of the devised $r_\text{exploration}$. To evaluate how well $r_\text{exploration}$ helps overcome pessimistic exploration, we used environments such as the 2D and Ant mazes. In these settings, our approach outperformed the baseline algorithms, validating the effectiveness of $r_\text{exploration}^\text{DISCO-DANCE}$ in enhancing exploration. Moreover, to demonstrate that DISCO-DANCE aids not only navigation tasks but also the learning of useful skills in more diverse settings, we conducted experiments on a widely used continuous control benchmark, URLB [4], where DISCO-DANCE also surpassed all baselines.

To clarify the significance of devising a well-designed $r_\text{exploration}$, we conducted additional experiments on LSD (with discrete skills) in the 2D bottleneck maze (please refer to Figure (c) in the attached PDF). Using their open-source code, we tuned three important hyperparameters (learning rate, entropy coefficient of SAC ($\alpha$), and skill dimension). As the main objective of LSD is to obtain 'far-reaching' skills, such skills are learned well in an obstacle-free empty environment (similar to Figure 4 in the LSD paper). However, Figure (c) shows that LSD struggles to navigate efficiently in the bottleneck maze, underscoring that replacing $r_\text{skill}^\text{DIAYN}$ with $r_\text{skill}^\text{LSD}$ is not enough to solve hard-exploration environments. This emphasizes the importance of a well-designed $r_\text{exploration}$, regardless of which $r_\text{skill}$ is used.

[1] Dynamics-Aware Unsupervised Discovery of Skills. Sharma et al., ICLR 2020
[2] Mutual Information State Intrinsic Control. Zhao et al., ICLR 2021
[3] Lipschitz-constrained Unsupervised Skill Discovery. Park et al., ICLR 2022
[4] URLB: Unsupervised Reinforcement Learning Benchmark. Laskin et al., NeurIPS 2021

> Q2. Is the random walk used at the last step of the guide skill discovery is scalable to larger state environments ?

A2. Please refer to General Response #1, titled **General Response: Random walk process in high-dimensional state space**.

**Dear reviewer 2**

We appreciate your insightful questions and positive support. Please let us know if you have any further comments or feedback. We will do our best to address them.

> Q1. Appendix F describes the additional costs of the random walk process, and proposes an efficient method. Are the additional steps used to perform random walks, even for the temporary buffer, included as environment steps in the overall training budget? If not, this seems an unfair advantage given exclusively to DISCO-DANCE when comparing to other methods.

A1. Yes, the additional steps used to perform random walks are included in the overall count of environment steps for training.
> Q2. The hyperparameters used for UPSIDE (from Appendix B) seem to limit the number of learned skills to 8 while DISCO-DANCE seems to start with more skills and has the possibility to expand. Can you comment on this choice? Wouldn't a higher number of skills for UPSIDE reduce the described fine-tuning difficulty by allowing more coverage?

A2. UPSIDE sets the initial state as the parent node, from which it generates $N_\text{start}=2$ skills (i.e., leaf nodes) that move a short distance from the parent node. UPSIDE then incrementally adds new leaf nodes to sufficiently cover the state space around the parent node, up to a maximum of $N_\text{max}=8$ nodes. Subsequently, among these leaf nodes, the most discriminable skill is chosen and set as a new parent node, which then generates its own leaf nodes. This procedure of skill-tree expansion is executed iteratively. Hence, if there is remaining state space in the environment, UPSIDE can continue to increase the number of skills through this process until the end of training. In our experiments, we performed a hyperparameter search for $N_\text{max}$ from 6 to 10 and found minimal difference in performance. Therefore, we selected $N_\text{max}=8$ (the value used in the original paper).

> Q3. DISCO-DANCE seems to require skills to reliably end in a terminal state. How are those states tracked and determined when the skills are still being trained?

A3. As outlined in Algorithm 1 on page 5, line 5, the guide skill is chosen only once the majority of skills are discriminable enough (i.e., the discriminator accuracy is high). This indicates that the terminal states reached by each skill are consistent, and it ensures that, when selecting the guide skill, most skills will end in (almost) the same terminal state across multiple rollouts.
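To make this trigger concrete, here is a minimal sketch of the check described in A3. It is our own illustration under stated assumptions (a `discriminator` callable that returns a probability vector over skills, a per-skill buffer of recent terminal states, and illustrative thresholds), not the paper's code.

```python
# Illustrative sketch of the "most skills are discriminable enough" trigger
# (Algorithm 1, line 5); thresholds and helper names are placeholders.
import numpy as np

def most_skills_discriminable(discriminator, terminal_buffer, eps=0.9, majority=0.8):
    """Return True when at least a `majority` fraction of skills satisfy
    q_phi(z | s_T) > eps on average over their recent terminal states."""
    num_ok = 0
    for z, states in terminal_buffer.items():
        probs = [discriminator(s)[z] for s in states]   # q_phi(z | s_T) per rollout
        if np.mean(probs) > eps:
            num_ok += 1
    return num_ok / len(terminal_buffer) >= majority
```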
**Dear reviewer 3**

Thank you very much for your time and insightful comments. Please let us know if you have any further comments or feedback. We will do our best to address them.

> Q1. The exploration issue with the mutual information (MI) objective could be more than what is described in this work. In theory, the MI objective is not supposed to contribute to the exploration meaningfully, especially in continuous control environments (Park et al. [21]), which can make this method mostly rely on the random walk processes for its exploration.

A1. According to the findings presented in Park et al. [21], "the MI objective can be fully maximized even with small differences in states as long as different $z$'s correspond to even marginally different $s_T$'s, not necessarily encouraging more 'interesting' skills". We believe this aligns with our discussion in lines 29-32. However, if there are any differences between the two that we might have overlooked, we would greatly appreciate it if the reviewer could point them out. We will reflect this in the main paper.

> Q2. I believe one important weakness of this submission is the random walk process. The manuscript mentions that the rise of the environmental complexity makes existing skill discovery methods less effective and motivates this work, but ironically, in complex environments (e.g., with high-dimensional state spaces), random walk would be one of the main bottlenecks in encouraging exploration. In such environments, this algorithm could require a large number of iterations.

A2. Please refer to General Response #1, titled **General Response: Random walk process in high-dimensional state space**.

> Q3. In terms of writing, I think it is not very fair to call the state spaces of the environments used for the benchmark high-dimensional. They are higher-dimensional compared to the 2D maze environment, but labeling them high-dimensional in general may not be a good standard for the field.

A3. We agree. We will revise it in the final manuscript.

> Q4. Do you think there could be an alternative exploration strategy other than the random walk process?

A4. Yes, there are alternative strategies, such as RND [1] and ICM [2], for choosing the skill that visits the state closest to the least dense states. However, such approaches require additional network training. Since the purpose of the guide skill selection process is merely to identify the skill with the highest potential to access unexplored states, we find that a simple random walk process is sufficient. In General Response #1, we show that even in Montezuma's Revenge, which features a high-dimensional pixel-based state space and is notorious for its exploration challenges, 100 random-walk steps are enough to select the guide skill. We have shown that in our benchmarks (e.g., Ant mazes, DMC, and pixel-based Atari in General Response #1), the simple random walk process successfully identifies the guide skill without additional training. We believe that in more challenging environments (e.g., real-world robotics), employing a well-designed exploration strategy would be beneficial, which we leave for future work.

[1] Exploration by Random Network Distillation. Burda et al., arXiv 2018
[2] Curiosity-driven Exploration by Self-supervised Prediction. Pathak et al., ICML 2017
[3] Self-Supervised Exploration via Disagreement. Pathak et al., ICML 2019

> Q5. The authors state some limitations of the proposed method (difficulty in high-dimensional state spaces and stochastic environments), but I encourage the authors to consider taking the points I listed in the Weaknesses section into account.

A5. We will add (i) an in-depth analysis of the advantages and limitations of the random walk process (including General Response #1) to the section "Limitations and Future Directions", and (ii) a more detailed explanation of the exploration issue with mutual information to Section 2.

**Dear reviewer 4**

We appreciate the thorough review and thoughtful comments. Please let us know if you have any further comments or feedback. We will do our best to address them.

> Q1. Measuring the density of the state distribution via generating random walk arrival states from terminal states is not sample efficiency.
> To select guide skills for exploration, the proposed method assumes the terminal state is resettable and needs to measure the density of the state by generating about O(PRM) random walks arrival states from terminal states as section 3.1 mentioned. I wonder whether is there any more sample-efficient and easy ways to do that. e.g. Is it equivalent to picking a guide skill by directly estimating the density of terminal states sampled from history episodes or fixed-length horizon? if the skill is discriminable (MI reward is well optimized), the terminal states from the same skill may locate in a subregion of (unexplored) states.

A1. We really appreciate your feedback. To begin, we wish to clarify that our approach does not strictly assume that the terminal state is resettable. While leveraging the simulation environment may allow hard resets to the terminal state of each skill, we consider such an assumption unrealistic.
Instead, we simply roll out each skill to reach its terminal state. Specifically, in the context of the efficient random walk process (detailed in Appendix F), during training, once the selected skill $z_i$ is clearly distinguishable (i.e., $q_\phi(z^i|S_T) > \epsilon$), we perform an additional $0.2T$ random-walk steps from the terminal state and perform the density estimation on these $0.2T$ collected states.

As the reviewer pointed out, there is an alternative approach in which we could directly estimate the density using the states in the replay buffer, without the random walk process. However, such a strategy could lead to a scenario like the one in Figure 2(b). As corroborated by the findings presented in Table 2 and Figure 8, direct estimation from the replay buffer (i.e., no random walk) during guide skill selection can give rise to situations where the skill nearest to the unexplored states remains unidentified; instead, the skill farthest from the other skills might be chosen, consequently leading to diminished performance. This underscores that the random walk process is necessary for DISCO-DANCE.

In addition, the random walk process is triggered only when the skills are sufficiently distinctive. Therefore, during training, the random walk process is not executed frequently (e.g., 8 times in the 2D maze, 12 times in AntMaze, and 18 times in DMC). Moreover, each efficient random walk process requires only an additional $0.2T$ steps per skill. Considering its frequency and the number of required steps per skill, the total steps taken by the efficient random walk process are not extensive. In summary, the efficient random walk process consumes a relatively minor portion of the overall training steps. It is important to note that while the random walk process does introduce some extra steps, the exploration signals it provides make DISCO-DANCE a more efficient explorer than the other baselines. Consistently, DISCO-DANCE surpasses the performance of the other baselines when measured against the same number of environment steps.
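As an illustration of the step accounting described above, the sketch below shows how a skill's regular $T$-step rollout could be extended by $0.2T$ random actions only when that skill is already discriminable. It is our own simplified sketch under assumptions (a gym-style `env`, a `policy(obs, z)` callable, and a `discriminator(obs)` returning a probability vector over skills), not the code used in the paper.

```python
# Illustrative sketch of the efficient random walk process (Appendix F): the extra
# 0.2*T random-walk steps are appended to a skill's rollout only once that skill is
# clearly distinguishable, and the visited states feed the density estimation.
def rollout_with_random_tail(env, policy, discriminator, z, T, eps=0.9, tail_ratio=0.2):
    obs = env.reset()
    for _ in range(T):                          # regular T-step skill rollout
        obs, _, done, _ = env.step(policy(obs, z))
        if done:
            return []                           # no tail if the episode ends early
    temp_buffer = []
    if discriminator(obs)[z] > eps:             # q_phi(z | s_T) > eps: skill z is discriminable
        for _ in range(int(tail_ratio * T)):    # extra 0.2*T random-walk steps
            obs, _, done, _ = env.step(env.action_space.sample())
            temp_buffer.append(obs)
            if done:
                break
    return temp_buffer                          # states used later for density estimation
```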
<!-- We really appreciate your feedback. In fact, we have incorporated considerations of efficient random walk processes in our experiments, as can be seen in the DMC experiment (Figure 7 in the main paper) and Appendix F. First, we do not strictly assume the terminal state is resettable (e.g., via the simulation environment). As outlined in Appendix F, the total number of environment steps in a single random walk process is $P \text{ skills} \times (T \text{ (horizon)} + 0.2T \text{ (random walk)}) \times M \text{ (repetitions)}$. This means we do not reset the environment to the terminal state for the random walk; instead, we take $1.2T$ environment steps from the initial state. In both the 2D and Ant mazes, these extra random-walk steps are included in the total training steps, and DISCO-DANCE outperforms the other baselines. Although the random walk process requires extra environment steps, DISCO-DANCE can explore the environment more effectively by accessing a direct exploration signal, unlike the other baselines. As a result, DISCO-DANCE outperforms the other baselines for the same number of environment steps. However, as the reviewer pointed out, this may not be sample efficient in environments with a longer horizon $T$ (e.g., $T=1000$ in DMC). Therefore, in DMC, we adopted an 'efficient random walk process' (**elaborated in Appendix F**) that does not need the additional $P \times 1.2T \times M$ steps. Specifically, if a selected skill $z_i$ is clearly distinguishable (i.e., $q_\phi(z^i|S_T) > \epsilon$), we just perform an extra $0.2T$ random-walk steps at the end of the horizon and store these states in a 'temporary buffer'. Guide skill selection is conducted with the states in the temporary buffer, without additional environment steps. This significantly reduces the number of environment steps compared to the original random walk process. Our DMC experiments demonstrate that the efficient random walk process still outperforms the other baselines. The reviewer's example (picking a guide skill by directly estimating the density of terminal states) corresponds to Figure 2(b). It is important to note that the random walk (either the original or the efficient version) is necessary for DISCO-DANCE. As demonstrated in Table 2 and Figure 8, the absence of the random walk during guide skill selection could lead to a scenario where the skill nearest to unexplored states is not identified. Instead, the skill farthest from the other skills might be chosen, consequently leading to diminished performance. -->

> Q2. The method needs to select a guide skill that is most adjacent to the unexplored states. In the bottleneck maze tasks in Figure 3, what if all skills including the guide skill cannot pass the first room? Will this method also encourages effective exploration?

A2. Yes. To demonstrate that DISCO-DANCE is beneficial in such scenarios, we provide snapshots of our 2D bottleneck maze experiments (please refer to the qualitative Figure (b) in the attached PDF). As shown in the figure, even when all skills cannot pass the first room, the skill with the highest potential to approach the unexplored states (i.e., the next room) is chosen as the guide skill through the random walk process. This, in turn, motivates the apprentice skills to move towards more explorable states.

> Q3. What does the y-axis mean in Figure 3(b)?

A3. The y-axis represents each individual skill policy that the agent has (in no particular order). We will add more explanation in the main text.

**Dear reviewer 5**

We appreciate your insightful and constructive feedback. <!-- Please let us know if you have any further comments or feedback. We will do our best to address them. -->

> Q1. Why use common benchmarks (AntMaze, DMC) to evaluate the exploration ability?

<!-- > Q1. Complexity of the locomotion tasks (AntMaze and DMC).
> Q1. The authors tend to address the problem unsupervised discovery in complex environments where existing methods are no longer effective, but experiments are mainly conducted on common benchmarks. I acknowledge that several maps in the navigation task are challenging, but the locomotion tasks (AntMaze and DMC) are not. Actually, baseline algorithms can outperform the proposed algorithm in terms of downstream task performance (see Appendix G). -->

A1. As the reviewer pointed out, the 2D mazes (especially the bottleneck maze) have more complex layouts than the other two benchmarks. Despite this, the other two environments still require exploration for an MI-based skill discovery agent to be effective.

For AntMaze, while its layout may seem simpler than the 2D maze, the dimensionality of the state and action spaces is considerably higher, which makes it more challenging to optimize the RL policy. Consequently, AntMaze is a complex environment for unsupervised skill discovery agents and necessitates exploration strategies.
In practice, as shown in Figure 7, we can see that the agent fails to move far from the initial state as the maze layout becomes more complex (Empty maze $\rightarrow$ U-maze $\rightarrow$ $\Pi$-maze). This empirical trend underscores the challenging nature of the AntMaze environment and affirms its suitability for measuring the agent's exploration ability.

<!-- For AntMaze, the environment layout is simpler than the 2D maze, but the state and action spaces are higher-dimensional. That is, in the 2D maze the policy is relatively easy to optimize (two-dimensional state space), so we used a more complex and difficult maze layout, whereas in AntMaze optimizing the RL policy itself is harder than in the 2D maze, so even a layout like the Ant $\Pi$-maze is not easy for an unsupervised skill agent to explore. Nevertheless, Figure 7 (in Appendix L) shows that as the maze layout becomes more complex (Empty maze $\rightarrow$ U-maze $\rightarrow$ $\Pi$-maze), the agent fails to travel far from the initial point, which shows that the Ant mazes are also suitable environments for measuring exploration ability. -->

We agree that DMC is not typically used as a hard-exploration environment like Atari's Montezuma's Revenge in the general RL literature. However, for an unsupervised skill discovery (USD) agent, DMC presents a non-trivial exploration challenge. For example, in Cheetah, a USD agent may easily stay near the initial location due to the pessimistic exploration problem. This occurs because the discriminator can easily distinguish skills by observing 'slight movements' (e.g., marginally lifting joint 2 and not moving further). As a result, the agent may not learn to move further (i.e., run), since skills that stay near the starting point are already easily distinguishable. Therefore, without additional exploration signals (e.g., an exploration reward), learning running skills is not easy.

Moreover, URLB involves a variety of tasks within a single environment (e.g., Cheetah: run, run_backward, flip, flip_backward). Therefore, merely excelling in one task (e.g., run) does not guarantee high performance in the others (e.g., flip_backward). The DMC benchmark thus evaluates the diversity of the learned skills: agents must learn a suitable set of skills for all downstream tasks in order to achieve consistently high scores across all tasks. The results indicate that DISCO-DANCE outperforms the other baseline methods, as evidenced by the aggregated IQM value (i.e., DISCO-DANCE learns diverse skills that quickly adapt to diverse tasks).

<!-- While AntMaze might seem simpler, it is worth noting that due to the complex movements involving multiple joints for navigation in both the x and y planes, the Ant $\Pi$-maze is used as a challenging environment in goal-conditioned RL studies [1,2]. The significance of the AntMaze experiment lies in the progressive complexity of its layout, where baseline performance tends to deteriorate with increased complexity. As the reviewer pointed out, while SMM slightly leads DISCO-DANCE in the Empty maze, it is substantially surpassed by DISCO-DANCE in the $\Pi$-maze. Although DMC is not conventionally regarded as a challenging environment in the RL literature, it is a difficult environment in the unsupervised skill discovery literature for learning various behaviors such as run, flip, and jump, because learning proceeds without reward. Consequently, our experiments aim to assess whether the USD algorithm effectively acquired these diverse behaviors through efficient exploration. We accomplished this by comparing the cumulative scores across various downstream tasks, serving as our primary metric.
The results indicate that DISCO-DANCE outperforms the other baseline methods, as evidenced by the aggregated IQM value. -->

> Q2. How do you add a new skill? Do you perform a neural network surgery or specify a sufficiently large number of skills at the beginning of training?

A2. We set a maximum number of skills that DISCO-DANCE can acquire (e.g., 100) and initialize the network accordingly (e.g., the first FC layer of the policy is nn.Linear(observation_dimension + 100, hidden_dimension)).
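For illustration, a minimal sketch of this kind of initialization is given below. It is our own example, not the paper's code; the dimensions, the one-hot skill encoding, and the two-layer architecture are placeholder assumptions.

```python
# Illustrative sketch: the policy is sized for a fixed maximum number of skills up
# front, and each skill is fed as a one-hot vector concatenated to the observation,
# so adding a skill later requires no network surgery.
import torch
import torch.nn as nn

class SkillConditionedPolicy(nn.Module):
    def __init__(self, obs_dim, action_dim, max_skills=100, hidden_dim=256):
        super().__init__()
        self.max_skills = max_skills
        self.net = nn.Sequential(
            nn.Linear(obs_dim + max_skills, hidden_dim),  # first FC layer sized for max_skills
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, obs, skill_id):
        # skill_id: LongTensor of shape (batch,) holding the index of each skill
        z = torch.zeros(obs.shape[0], self.max_skills, device=obs.device)
        z[torch.arange(obs.shape[0]), skill_id] = 1.0     # one-hot skill encoding
        return self.net(torch.cat([obs, z], dim=-1))

# Usage with hypothetical dimensions: a batch of 4 observations, skills 0-3.
# policy = SkillConditionedPolicy(obs_dim=29, action_dim=8)
# actions = policy(torch.randn(4, 29), torch.tensor([0, 1, 2, 3]))
```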
> Q3. Are the total number of skills trained by DISCO-DANCE and baselines the same in experiments?

A3. Yes. We first conducted experiments with DISCO-DANCE for each environment. Based on the maximum number of skills DISCO-DANCE acquired for each environment, we set the total number of skills for the other algorithms accordingly.

> Q4. In the first sentence of abstract, I think diminished bonuses (e.g. DIAYN) are not equivalent to "penalties".

A4. We agree. We will revise this to "a significant reduction in reward acquisition".

> Q5. What is the x/y-axis of fig.3(b)?

A5. The x-axis shows how far the cheetah has moved horizontally. The y-axis represents each individual skill policy that the agent has (in no particular order). We will add more explanation in the main text.

> Q6. Fig.6 is wierd. Why not present, e.g., the success rate over 100 trials averaged over 20 seeds with error bars?

A6. We agree that the reviewer's feedback is valid. However, unlike typical goal-conditioned RL, where trials are performed 100 times per seed with varying goals and success rates are averaged across all seeds, our approach focuses on a single, fixed goal: the point most distant from the initial state. This design choice was made to evaluate the effectiveness of previously acquired skills in reaching challenging states (i.e., the farthest state from the initial state). As a result, if a particular seed learned a policy that reaches this fixed goal, it would succeed in nearly all of its 100 trials; on the other hand, if it has not yet learned such a policy, it would barely succeed in any of them (i.e., the success rate for each seed is almost always either 1 or 0). Therefore, in Fig. 6, instead of average success rates, we plotted the number of seeds that succeeded in reaching the goal at each timestep for simplicity.

> Q7. In Algorithm 1, the guide skill z* is not defined if the "most skills are discriminable enough" condition is not satisfied.

A7. Thank you for the correction. We will add "$\text{Initialize guide skill } z^* = \text{None}$" in line 1 and "$\text{If guide skill } z^* \text{ is not None:}$" between lines 7 and 8.

> Q8. Link main paper and the Appendix.

A8. Thank you for recognizing the depth of our results in the Appendix. We will add explicit references to the Appendix content in the main paper.

> Q9. The selection of the guide skill depends on the final state visited existing skills. It does not seem to be a general solution to select guide skills even for state-based environments.

A9. As outlined in Algorithm 1 on page 5, line 5, the guide skill is chosen only once the majority of skills are discriminable enough (i.e., the discriminator accuracy is high). This indicates that the skills reliably end in the same terminal state across multiple rollouts. However, as we mention in Appendix I, in a highly stochastic environment it would not be straightforward to select the guide skill just by employing the random walk process: the learned skills will visit different states on each rollout, which is a problem not only for DISCO-DANCE but for all current skill discovery algorithms. We believe this remains an important direction for future work.

[1] First return, then explore. Ecoffet et al., Nature 2021.
