# Reviewer 6 (rating: 3, confidence 3)
### Summary:
This paper presents a new domain randomization method that tries to overcome the performance and generalization gap by maximizing the entropy of the distribution of dynamic parameters while retaining certain success probability during training. The authors conduct simulation experiments as well as sim-to-real experiments.
##### Soundness: 2 fair
##### Presentation: 3 good
##### Contribution: 2 fair
### Strengths:
- The article presents its content in a clear and concise manner.
- The method exhibits novelty and has been well formalized.
### Weaknesses:
- The baselines utilized in this study appear to be somewhat outdated, which raises the question of whether more recent advancements in domain randomization have been considered. It is highly recommended that the authors explicitly address this concern by providing a specific clarification on the existence of any updated domain randomization approaches, and it would be beneficial for the authors to incorporate additional baseline experiments that encompass these newer methodologies.
- While the method proposed in this study demonstrates a certain level of innovation, it does not appear to be exceptionally groundbreaking. Therefore, it is imperative to present more compelling experimental results. Conducting additional simulation experiments, as well as sim-to-real experiments, would be highly recommended.
### Questions:
- In Table 1, the accuracy of the Fixed-DR approach is notably low, prompting the need for the authors to provide further explanations, if not overlooked, in order to clarify any potential factors contributing to this outcome.
---
##### Flag For Ethics Review: No ethics review needed.
##### Rating: 3: reject, not good enough
##### Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
##### Code Of Conduct: Yes
---
### OUR NOTES:
- Gabriele: the baselines are not outdated. AutoDR is considered the state-of-the-art DR algorithm in the absence of real-world data (which is the setting we consider). This is because NVIDIA took AutoDR and published a GPU-parallelized extension of it last year (https://arxiv.org/abs/2210.13702) for effective and efficient sim-to-real transfer in complex manipulation dynamics.
- To the best of our knowledge, all unsupervised DR algorithms have been listed in the related works (I'll do further research though, just to be sure).
- The reviewer claims these baselines are outdated without citing any more recent work for the same setting.
- DR is a vastly popular approach, but it is dominated by papers that simply tune the training distribution manually. Despite being somewhat effective, smarter applications of DR are therefore harder to find: no public AutoDR code exists, and there is no straightforward alternative. The usual fallback lies in collecting real-world data and fine-tuning the simulator accordingly. Unfortunately, collecting more data isn't always an option, and we believe DORAEMON to be an important milestone for zero-shot sim-to-real transfer. Besides this, if real-world data can be collected, we believe DORAEMON would likely be the best way to obtain a zero-shot data-collection policy that is safest to run on real hardware for the first time. Please refer to this RECENT survey (learning from randomized simulators) for further details.
---
---
---
# RESPONSE
We highly appreciate the reviewer's feedback.
- > The baselines utilized in this study appear to be somewhat outdated, which raises the question of whether more recent advancements in domain randomization have been considered
We have carefully reviewed the literature during this work, and we are particularly confident about the existing works in the field of Domain Randomization (DR) for sim-to-real transfer.
Despite DR being a popular approach to cross the reality gap in robotics, the vast majority of papers still carry out a "Fixed-DR" approach (often simply called UDR)---i.e. uniform boundaries are manually engineered in a back-and-forth tuning process to obtain desirable performance. This likely stems from (1) the lack of public code implementations (as in the case of AutoDR), or (2) the requirement of real-world data for inference-based DR methods. Related works that follow the latter direction have been thoroughly described in Sec. 2, but crucially depart from our setting where no real-world data is available.
In particular, a thorough survey was published just last year by F. Muratore et al. [2], carefully describing the landscape of DR approaches in the literature. The survey importantly distinguishes approaches that make use of real-world time series data for parameter inference (Adaptive Domain Randomization) from approaches that keep a static DR distribution (Static Domain Randomization).
In the absence of real-world data, AutoDR has been considered the state-of-the-art method since its release, and it was adopted just last year by NVIDIA [1] as their go-to algorithm for sim-to-real transfer. In this newer work [1], the authors still compare AutoDR with manual DR (or Fixed-DR), and demonstrate the benefits of training on a progressively wider distribution by AutoDR. As a side note, the work in [1] proposed a more efficient parallelized version of AutoDR w.r.t. the original method, and adopted manually tuned, parameter-specific step sizes $\Delta$ for each dimension. As these changes simply require either more tuning or higher-end compute, we stick to the comparison with the original AutoDR method in our paper.
While it received less attention, LSDR was published in IROS 2020 and proposed a novel approach to optimize DR distributions without using real-world data. However, LSDR departs from DORAEMON and AutoDR in that the M-projection of the KL-divergence is considered---yielding a maximum likelihood objective instead of a maximum entropy one---and it requires expensive Monte Carlo policy evaluations to step the current distribution.
Overall, we kindly ask the reviewer to point out any newer work that we may have overlooked, as, to the best of our knowledge, we already compare to the most recent baselines in the field.
- > While the method proposed in this study demonstrates a certain level of innovation, it does not appear to be exceptionally groundbreaking. Therefore, it is imperative to present more compelling experimental results. Conducting additional simulation experiments, as well as sim-to-real experiments, would be highly recommended.
In our work, we have shown that DORAEMON is able to outperform all baselines in all tasks (in terms of maximum global success rate) and crucially demonstrates an impressive zero-shot sim-to-real transfer in our complex pushing task. In addition, DORAEMON has fewer hyperparameters than the other methods (see Sec. A.2).
Nevertheless, we have conducted several more experiments to provide novel useful insights on the performance of DORAEMON. In particular, we added:
1. **A novel sensitivity analysis of DORAEMON in Sec. A.2**;
2. **A novel Section A.3** to further discuss the trade-off between the in-distribution success rate vs. entropy;
3. **A novel Section A.4** with an implementation of DORAEMON with Gaussian distributions, instead of the Beta distributions used in the main manuscript;
4. **A novel ablation analysis of DORAEMON in Fig. 13**, demonstrating the effect of the backup optimization problem in (6).
We claim DORAEMON will open interesting and promising future work directions, which can ultimately provide "exceptionally groundbreaking" results in the field. Such directions include, e.g., the combination of a maximum entropy objective like DORAEMON's with inference-based methods such as SimOpt [3] or DROPO [4]. This way, policies can be learned to best generalize across dynamics parameters located around high-likelihood regions, which can be inferred from real-world data.
We believe that our detailed explanations and additional results clarify the contribution of our approach, and we hope that the reviewer will consider them.
- > In Table 1, the accuracy of the Fixed-DR approach is notably low, prompting the need for the authors to provide further explanations, if not overlooked, in order to clarify any potential factors contributing to this outcome.
The usual way of approaching domain randomization with fixed (static) distributions is extremely sensitive to the choice of the uniform boundaries. This is a well-known problem in the literature, and it is the main motivation for works like ours, AutoDR, and LSDR. In particular, choosing uniform boundaries that are too wide notoriously leads to performance collapse, as the learner cannot deal with so much variability in the dynamics of the environment (i.e., a form of over-regularization). Interestingly, we noticed that by progressively growing the training distribution as in DORAEMON, this effect is largely mitigated---e.g., compare the results of DORAEMON with Fixed-DR when the entropy converges to the maximum. This finding is in line with the experimental analysis in AutoDR's work [5].

In the Pushing task mentioned by the reviewer in particular, almost all Fixed-DR policies were completely unable to learn any meaningful behavior, and ended up moving very slowly or simply staying still. Refer to Fig. 11 in the appendix for the full sim-to-sim results in the PandaPush task. As stated in Sec. C, we found that higher friction parameters significantly hinder the training process, and we primarily attribute the poor results of Fixed-DR policies to this phenomenon. Moreover, as the reward function includes a penalty for aggressive policy actions, Fixed-DR policies may end up learning a "do nothing" policy as a suboptimal way to deal with the severe randomization.
[1] Handa, A., et al. "DeXtreme: Transfer of Agile In-hand Manipulation from Simulation to Reality." arXiv (2022).
[2] Muratore, Fabio, et al. "Robot learning from randomized simulations: A review." Frontiers in Robotics and AI (2022): 31.
[3] Chebotar, Yevgen, et al. "Closing the sim-to-real loop: Adapting simulation randomization with real world experience." 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019.
[4] Tiboni, G. et al. "DROPO: Sim-to-real transfer with offline domain randomization." Robotics and Autonomous Systems 166 (2023): 104432.
[5] Akkaya, Ilge, et al. "Solving rubik's cube with a robot hand." arXiv preprint arXiv:1910.07113 (2019).
---
- > Thank you for the detailed explanation, which has addressed my initial concerns to some extent. I am willing to raise the score accordingly.
---
---
---
---
We kindly thank the Reviewer for the quick response and the increased score based on our new experiments and clarifications.
Although we did not have the time to set up additional sim-to-real experiments, we hope the Reviewer will consider supporting the current version of the paper with more confidence, as it already provides strong evidence of state-of-the-art performance in six sim2sim tasks and one sim2real task---a novel 17-dynamics-parameter DR cube-pushing task whose codebase is released to the public---rendering it a relevant contribution to the field.