# Reviewer 1 (rating: 8, confidence 3)

### Summary:

This paper introduces DORAEMON, a novel domain randomization technique in reinforcement learning, designed to enhance policy generalization across varied environment dynamics. DORAEMON strategically increases the entropy of the training distribution, conditioned on maintaining a threshold probability of success and subject to constrained entropy updates. The aim is to balance the entropy of the dynamics parameter distribution against task proficiency. Empirical tests across OpenAI Gym benchmarks and on a real-world robotic task highlight DORAEMON's superior adaptability to diverse dynamic settings compared to conventional domain randomization approaches, and also demonstrate how the success rate threshold and success definition (i.e., lower-bound return threshold) impact performance. This work convincingly demonstrates the potential benefit of applying DORAEMON to systems where sim-to-real policy transfer is important.

##### Soundness: 4 excellent
##### Presentation: 2 fair
##### Contribution: 4 excellent

### Strengths:

**Originality.** The paper introduces a novel and innovative approach to domain randomization. This method is an advancement in the field of RL, not only within the scope of domain randomization but potentially in research areas beyond it (e.g., meta-RL). The method's innovation stems from its entropy maximization technique, which enables a policy to generalize across a broader range of dynamics while ensuring that the entropy of the dynamics parameter distribution grows in a manner that does not compromise the policy's probability of success.

**Quality.** The proposed algorithm, DORAEMON, has been tested on OpenAI Gym benchmarks and in a sim-to-real robotics task, generally demonstrating superior generalization compared to existing domain randomization techniques.

**Clarity.** The paper has a clear definition of success, and the presentation of the results is well-structured. The mathematical foundations of the paper are clear and sound. The figures and empirical results support the authors' claims of the superiority of their method over traditional domain randomization methods.

**Significance.** This paper is significant to the development of autonomous systems and has potential for real-world application in industry.

**Summary.** DORAEMON stands out as an original, high-quality research work with significant implications for both theory and application in reinforcement learning and robotics.

### Weaknesses:

The paper presents empirical tests across OpenAI Gym benchmarks and a real-world task. However, there may be a need for more diverse environmental tests to fully understand the limits of DORAEMON's generalization ability. For instance, testing in environments with higher-dimensional state spaces or more complex dynamics could provide a more comprehensive picture of the algorithm's robustness.

The research presented is certainly complex and holds significant value to the field. However, some sections of the text may benefit from further clarification to enhance the paper's accessibility to a broader audience. In particular, the density of technical terms and concepts could be balanced with more detailed explanations or simplified language. This could include adding definitions or providing more background information for non-expert readers.
Such revisions would likely make the paper's contributions even more impactful and ensure that a wider range of readers can fully grasp the innovative work you have presented.

### Questions:

How sensitive is DORAEMON to its hyperparameters (e.g., trust region size, trajectories per distribution update, definition of success, etc.), and what process was followed to select them? Did you perform a sensitivity analysis?

Are there potential negative impacts of DORAEMON that should be discussed, especially regarding its application to real-world systems? While it is beyond the scope of this paper, have you considered the safety of the sim-to-real policy in the real-world robotics task? How does it compare to previous domain randomization methods?

---

##### Flag For Ethics Review: No ethics review needed.
##### Rating: 8: accept, good paper
##### Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
##### Code Of Conduct: Yes

---

### OUR NOTES:

- It might be easy to overlook, but the sim2sim tasks we consider, while being standard benchmarks, are much harder than usual, as their difficulty grows with the number of randomized dimensions. It is therefore easy to make these same environments far more complex to solve, e.g., by asking a single policy to solve the task under very different friction coefficients or masses. Please refer to our Tables X and Y in the appendix: we go up to 17 dimensions of randomized dynamics, with significantly wide ranges. These ranges have been chosen as ..., but even more complex scenarios may be considered.
- Changed the env titles in the figures to show the number of randomized parameters for each env. "We think that this will make it clearer to reason about the difficulty of the tasks also in terms of the number of randomized parameters."
- We also ablate the parametric distribution, and try DORAEMON with Beta distributions.

---

# RESPONSE

We highly appreciate the reviewer's time in evaluating our work and providing valuable feedback. In the following, we address the concerns raised by the reviewer.

> For instance, testing in environments with higher-dimensional state spaces or more complex dynamics could provide a more comprehensive picture of the algorithm's robustness.

The OpenAI Gym benchmarks, and in particular the considered tasks, feature highly nonlinear dynamics, and altering the various dynamics parameters of each task (see Table 2) leads to significantly challenging problems. This can be seen from the failure cases of Fixed-DR throughout our experimental analysis (see Fig. 2 and Table 1), which sometimes generalizes even worse than a policy trained with no randomization (see No-DR vs. Fixed-DR in Fig. 11). Moreover, our sim-to-real task allows us to test various dynamics properties of a contact-rich manipulation task (e.g., the box's response to contacts under variable centers of mass). Note that we will release the public codebase with the details of our PandaPush setup for the community to use as a novel sim-to-real benchmark. While more environments would provide additional evidence of the effectiveness of DORAEMON, we deem that the current experiments already highlight the benefits of this approach in both sim2sim and sim2real applications.
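As an illustration of what this degree of randomization entails, the sketch below shows the basic mechanics of re-sampling a high-dimensional dynamics configuration at every episode reset. Parameter names and ranges here are hypothetical placeholders, not the actual values from Table 2:

```python
import numpy as np

# Illustrative only: parameter names and ranges are hypothetical,
# not the actual randomization bounds from Table 2 of the paper.
RANDOMIZED_BOUNDS = {
    "torso_mass":    (1.0, 10.0),  # kg
    "foot_friction": (0.1, 2.5),
    "joint_damping": (0.5, 5.0),
    # ... up to 17 randomized dynamics dimensions in the hardest tasks
}

def sample_dynamics(rng: np.random.Generator) -> dict:
    """Draw one dynamics configuration from a uniform (Fixed-DR style)
    distribution over all randomized dimensions."""
    return {name: rng.uniform(lo, hi)
            for name, (lo, hi) in RANDOMIZED_BOUNDS.items()}

# At every episode reset, the simulator is re-parameterized with a fresh
# sample, so a single policy must cope with the full product of ranges.
rng = np.random.default_rng(0)
episode_dynamics = sample_dynamics(rng)
```

Because each added dimension multiplies the space of behaviors a single policy must cover, widening these ranges quickly makes the standard benchmarks substantially harder.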
> In particular, the density of technical terms and concepts could be balanced with more detailed explanations or simplified language. This could potentially include adding definitions, or providing more background information for non-expert readers.

While we did our best to clearly explain each concept and experiment in the main manuscript, we agree that the content may generally feel a bit tight to fit into 9 pages. In our revision, we made significant additions to the Appendix in order to provide more detailed explanations of the algorithm's behavior (see the novel Sections A.3, A.4, and B).

> How sensitive is DORAEMON to its hyperparameters (e.g., trust region size, trajectories per distribution update, definition of success, etc.), and what process was followed to select them? Did you perform a sensitivity analysis?

Based on the reviewer's comment, we **complemented Section A.2 in the Appendix with a sensitivity analysis of the trust region size $\epsilon$** (see Fig. 8), as empirical evidence of the process that we followed to select DORAEMON's hyperparameters. In particular, we tune the value of $\epsilon$ individually for each environment, out of a predefined set of 5 candidates, $\{0.1, 0.05, 0.01, 0.005, 0.001\}$. Likewise, we perform the same search for the baseline methods AutoDR and LSDR over their respective trust-region-size equivalents (please refer to A.2 for a detailed explanation of the selection process). \
The sensitivity analysis demonstrates that DORAEMON is considerably robust to the choice of $\epsilon$, i.e., it is able to maintain the desired in-distribution success rate even for large trust region sizes. However, the value of $\epsilon$ highly affects the pace at which the distribution grows; therefore, we suspect that the optimal value may change from task to task. \
Finally, we also complemented the original sensitivity analysis on the $\alpha$ hyperparameter by **adding new experiments for the Walker2D and Halfcheetah environments in the new Section A.3**.

> Are there potential negative impacts of DORAEMON that should be discussed, especially regarding its application to real-world systems? While it is beyond the scope of this paper, have you considered the safety of the sim-to-real policy in the real-world robotics task? How does it compare to previous domain randomization methods?

While DORAEMON trains policies to solve a task over a maximally diverse distribution of dynamics, there are no guarantees that the real-world dynamics will also be captured. However, in the absence of real data and of prior knowledge of the real-world dynamics, we claim that DORAEMON would likely be the best go-to option, representing an important step towards solving the sim-to-real problem, which is yet to be fully solved. In particular, future work may focus on complementing DORAEMON with inference-based methods, e.g., collecting data from the real world to infer a maximum-entropy distribution located around high-likelihood regions of the dynamics space.
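To make the interplay between the success-rate threshold $\alpha$ and the trust region size $\epsilon$ discussed above more concrete, below is a minimal, self-contained sketch of a DORAEMON-style distribution update. We use a diagonal Gaussian for closed-form entropy and KL and a generic SciPy solver; the names, the parametric family, and the solver are our illustrative choices here, not the paper's implementation (which, e.g., also supports Beta distributions):

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint
from scipy.stats import multivariate_normal

def doraemon_style_update(mu, sigma, xi, successes, alpha=0.5, eps=0.05):
    """One simplified distribution update: maximize the entropy of a
    diagonal-Gaussian dynamics distribution subject to (i) an
    importance-sampled success-rate constraint and (ii) a KL trust
    region around the current distribution.

    mu, sigma : current mean / std of the dynamics distribution, shape (d,)
    xi        : dynamics parameters sampled in the last batch, shape (n, d)
    successes : boolean success indicator per trajectory, shape (n,)
    """
    d = len(mu)
    old = multivariate_normal(mean=mu, cov=np.diag(sigma**2))

    def unpack(theta):
        return theta[:d], np.exp(theta[d:])  # log-std parameterization

    def neg_entropy(theta):
        _, s = unpack(theta)
        # Gaussian entropy: 0.5 * d * log(2*pi*e) + sum(log s)
        return -(0.5 * d * np.log(2 * np.pi * np.e) + np.log(s).sum())

    def est_success_rate(theta):
        m, s = unpack(theta)
        new = multivariate_normal(mean=m, cov=np.diag(s**2))
        w = np.exp(new.logpdf(xi) - old.logpdf(xi))  # importance weights
        return np.mean(w * successes)

    def kl_to_old(theta):
        m, s = unpack(theta)  # KL(new || old) for diagonal Gaussians
        return 0.5 * np.sum(
            (s**2 + (m - mu)**2) / sigma**2 - 1.0 + 2.0 * np.log(sigma / s)
        )

    theta0 = np.concatenate([mu, np.log(sigma)])
    res = minimize(
        neg_entropy, theta0, method="SLSQP",
        constraints=[
            NonlinearConstraint(est_success_rate, alpha, np.inf),
            NonlinearConstraint(kl_to_old, -np.inf, eps),
        ],
    )
    return unpack(res.x)

# Example call with synthetic batch data (purely illustrative):
rng = np.random.default_rng(0)
mu0, sigma0 = np.zeros(3), np.ones(3)
xi = rng.normal(mu0, sigma0, size=(256, 3))
succ = xi.sum(axis=1) < 1.0  # stand-in success indicator
new_mu, new_sigma = doraemon_style_update(mu0, sigma0, xi, succ)
```

In this formulation, $\epsilon$ directly caps how far the distribution may move per update, which is why larger values accelerate entropy growth, while the success-rate constraint keeps the distribution anchored to dynamics the current policy can still solve.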