
# Reviewer 4 (rating: 6, confidence 5)

### Summary:
The paper introduces domain randomization via entropy maximization, a constrained optimization framework that directly maximizes the entropy of the training distribution while retaining generalization capabilities. The authors empirically evaluate their method in several simulated control environments. Additionally, they successfully showcase zero-shot transfer in a robotic manipulation task under unknown real-world parameters, emphasizing its practical applicability.

##### Soundness: 3 good
##### Presentation: 3 good
##### Contribution: 3 good

### Strengths:
The proposed framework utilizes entropy maximization to gradually enlarge the randomization range, which is sound. Experiments compared to control results demonstrate the effectiveness of the proposed methods. The robot manipulation experiments indicate that the proposed method has the potential for use in real-world tasks.

### Weaknesses:
Only Beta distributions are considered in the experiments. It is encouraged to add more distribution types to the experiments.

Additional visualizations need to be included to illustrate the trade-off between performance and entropy. For example, in Figure 2, the performance of Walker2D and Swimmer decreases when the entropy increases. It is important to explore the relationship between the randomized variable and performance, and to determine the range within which performance decreases.

Adding more real-world experiments is encouraged. PushCube is a relatively easy task in robot manipulation.

### Questions:
Can this domain randomization method potentially be applied to object randomness? For example, in ManiSkill environments, some tasks include variations in objects. Can we use maximum entropy to gradually learn from these different objects?

---

##### Flag For Ethics Review: No ethics review needed.
##### Rating: 6: marginally above the acceptance threshold
##### Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
##### Code Of Conduct: Yes

---

### OUR NOTES:
- discrete object distributions:
  - We are actually interested in hidden tasks where the dynamics are not observable. So let's just make it clear for the case of different objects.
  - it's also something that could be possible by combining different moving parts (you choose the object, then you randomize its parameters). However, discrete unobservable objects would be a rather significant change.
  - this is an interesting point that we also mention in the paper, but we left its empirical investigation for future work. Cite where we said in the paper!

---

---

---

# RESPONSE
We thank the reviewer for taking the time to evaluate our work and providing insightful comments.

- > Only Beta distributions are considered in the experiments. It is encouraged to add more distribution types to the experiments.

  While we chose Beta distributions as they can better capture the probability mass of the target uniform distribution, in principle we can use any parametric distribution whose entropy (for the objective function) and KL-divergence (for the trust region constraint) can be computed straightforwardly. In response to the reviewer's comment, we implemented DORAEMON with Gaussian distributions and report the results in the **newly added Section A.4**. We also depict a comparison of the converged distributions in the two cases (Beta and Gaussian) in Fig. 12.

- > Additional visualizations need to be included to illustrate the trade-off between performance and entropy.

  We provide more details on the trade-off between the performance of DORAEMON and the entropy in our new appendix sections.
  In particular, we provide a thorough investigation on multiple environments (including Walker2D, as suggested by the reviewer), measuring the in-distribution success rate, the entropy, and the global success rate for varying values of $\epsilon$ and $\alpha$. This analysis sheds light on the effect of the backup optimization problem in maintaining a feasible performance constraint, and on how this does not always translate to a better global success rate. Overall, we complemented the original manuscript with:

  1. **A new sensitivity analysis in Sec. A.2**, where we specifically show the effects of the backup optimization problem in Fig. 8 and Fig. 9 for changes of $\epsilon$;
  2. **A new Section A.3** discussing the trade-off between in-distribution success rate and entropy for the Walker2D environment (Fig. 10), as suggested by the reviewer;
  3. **A new ablation analysis of DORAEMON in Fig. 13**, demonstrating that all combinations that do not incorporate a backup optimization problem may not maintain a feasible performance constraint during training.

  In particular, we point out the results in Fig. 10 for the Walker2D environment, where DORAEMON steadily maintains a feasible in-distribution success rate during training, regardless of the choice of the hyperparameter $\alpha$. However, we observed that successfully maintaining the in-distribution success rate does not necessarily translate to a better global success rate, which still shows a decreasing trend. We investigate this phenomenon in Fig. 9, which depicts a pair of distributions found at different times during training: although the two have seemingly similar entropy values according to the plot, the backup optimization problem significantly changes the distribution parameters to keep the performance constraint feasible and, in turn, moves the distribution to easier regions of the dynamics landscape. As a result, the global success rate is also affected. We further ablate the presence of the backup optimization problem in Fig.
  13, and find that it does not negatively affect the global success rate, which would decline regardless if we simply kept training the policy when the performance constraint is violated (i.e., as in the SPDL baselines).

- > Adding more real-world experiments is encouraged. PushCube is a relatively easy task in robot manipulation.

  While adding more real-world experiments would certainly help to test the limits of the algorithm, we believe our pushing task is a particularly representative testbed for our analysis, and likely more complicated than it looks at first glance. Indeed, it is easy to overlook the complexity of the task from the Domain Randomization point of view: while a simple cube-pushing task would not be considered particularly challenging in the RL literature, it becomes immediately more troublesome when tested under unknown real-world parameters (e.g., center of mass, friction), hence requiring the policy to generalize over multiple dynamics. This is reflected by the poor performance of an agent that randomizes parameters too heavily (Fixed-DR), and by the lack of generalization of agents that do not randomize parameters at all (No-DR). Given the complexity of setting up real-world experiments and designing a representative simulation environment within the limited rebuttal period, we defer further investigation of DORAEMON in a variety of real-world manipulation environments to a journal extension of this work.

- > Can this domain randomization method potentially be applied to object randomness? For example, in ManiSkill environments, some tasks include variations in objects. Can we use maximum entropy to gradually learn from these different objects?

  We thank the reviewer for this comment. As we have already stated in the final sentences of Sec. 4.1 (*"[...]
  the formulation is not restricted to a particular family of parametric distributions or even continuous random variables–e.g., discrete distributions over object shapes could be used."*), it is indeed possible to apply DORAEMON to random object shapes. For example, one could consider a categorical distribution over $N$ object types with low initial entropy, i.e., most of the probability mass is assigned to a single object and the rest is spread across the others. Then, DORAEMON can be used to progressively increase the entropy of the distribution, e.g., by parametrizing it through a softmax of $N$ continuous parameters. Ultimately, DORAEMON will attempt to converge to a uniform distribution over all $N$ object shapes, while ensuring desirably high performance along the way. Overall, this is a rather interesting point, and it opens the possibility of testing DORAEMON on discrete dynamics distributions beyond object shapes. We leave this as an open direction for future work.

---

---

Thank you for your comprehensive experiments and clarifications. My primary concerns have been addressed. However, the practicality and challenges of applying this method, especially in challenging robot manipulation tasks, remain uncertain without further experimentation. The authors should consider applying the method to manipulation tasks if the paper is not accepted this time.

---

---

We thank the reviewer for the quick response and for considering our new experiments and clarifications. We are extremely interested in applying DORAEMON to more challenging manipulation tasks, as we believe the method will be an important contribution to solving sim-to-real problems, especially for dynamics-sensitive tasks. As such experiments are highly time-consuming, we plan to apply DORAEMON to more sim-to-real experiments in a journal extension. For example, we plan on considering tasks with unknown obstacle shapes, and contact-rich dynamics with deformable objects.
We hope the reviewer will support the current conference version, which already provides strong evidence of state-of-the-art performance in six sim2sim tasks and one sim2real task: a novel cube-pushing DR task with 17 dynamics parameters, whose codebase is publicly released.
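
### Sketch: entropy growth for a categorical object distribution

The categorical parametrization described in the answer on object randomness (a softmax over $N$ continuous parameters whose entropy is progressively increased) can be illustrated with a minimal sketch. It is not the DORAEMON implementation: the performance constraint and the backup optimization problem are deliberately omitted, and only the entropy objective with a KL trust region of size `epsilon` is kept. The function names (`softmax`, `entropy`, `kl`, `entropy_step`), the learning rate, the step-halving rule, and the direction of the KL term are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Map N real-valued parameters to a categorical distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def kl(p, q):
    """KL(p || q) between two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def entropy_step(logits, epsilon, lr=1.0):
    """One entropy-ascent step, halved until KL(new || old) <= epsilon.

    Illustrative only: the performance constraint is not modelled here.
    """
    p = softmax(logits)
    h = entropy(p)
    # Gradient of H(softmax(z)) w.r.t. z_i is -p_i * (log p_i + H).
    grad = [-pi * (math.log(pi) + h) for pi in p]
    step = lr
    for _ in range(50):
        candidate = [z + step * g for z, g in zip(logits, grad)]
        if kl(softmax(candidate), p) <= epsilon:
            return candidate
        step *= 0.5
    return list(logits)  # no admissible step found: keep the old parameters


# Hypothetical usage: four object types, most mass initially on object 0.
logits = [3.0, 0.0, 0.0, 0.0]
for _ in range(200):
    logits = entropy_step(logits, epsilon=0.01)
final_distribution = softmax(logits)
```

In this toy setting nothing stops the distribution from drifting all the way to uniform. In the setting discussed in the response, each such step would additionally be subject to the performance constraint, which is what limits how quickly the entropy can grow.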