# Reviewer 2 (rating: 3, confidence 5)

### Summary:

Domain Randomization (DR) is a common technique used to reduce the gap between simulation and reality in Reinforcement Learning (RL), which involves varying dynamics parameters in simulation. The effectiveness of DR, however, largely depends on the chosen sampling distribution for these parameters. Too much variation can regularize an agent's actions but may also result in overly conservative strategies if the parameters are randomized too much. This paper introduces a new method for enhancing sim-to-real transfer, dubbed DOmain RAndomization via Entropy MaximizatiON (DORAEMON). DORAEMON is a constrained optimization framework that aims to maximize the entropy of the training distribution while also maintaining the agent's ability to generalize. It accomplishes this by incrementally expanding the range of dynamics parameters used for training, provided that the current policy maintains a high likelihood of success. Experiments show that DORAEMON outperforms several DR benchmarks in terms of generalization and showcase its application in a robotic manipulation task with previously unseen real-world dynamics.

##### Soundness: 3 good
##### Presentation: 2 fair
##### Contribution: 1 poor

### Strengths:

The paper is well-written and easy to follow. The authors also conducted real-world robotics experiments.

### Weaknesses:

- Limited technical novelty. The formulation (Eq. 4) is very similar to that of SPRL [1], SPDL [2], CURROT [3], and GRADIENT [4]. Setting the target distribution to be uninformative -- uniform distribution could transform their objectives into something very similar to Eq. 4, and they do not necessarily converge to the final target distribution.
- Beta distribution is often not a reasonable choice. It cannot handle multi-modal distribution, while many existing works can handle arbitrary empirical distributions [1,2,3,4].
- Given the similarity to the existing work as discussed in the first point, the authors should compare DORAEMON to them.
- Missing related work:
  - Klink, Pascal, et al. "Curriculum reinforcement learning via constrained optimal transport." International Conference on Machine Learning. PMLR, 2022.
  - Huang, Peide, et al. "Curriculum reinforcement learning using optimal transport via gradual domain adaptation." Advances in Neural Information Processing Systems 35 (2022): 10656-10670.
  - Cho, Daesol, Seungjae Lee, and H. Jin Kim. "Outcome-directed Reinforcement Learning by Uncertainty & Temporal Distance-Aware Curriculum Goal Generation." arXiv preprint arXiv:2301.11741 (2023).

Ref:

[1] Klink, Pascal, et al. "Self-paced contextual reinforcement learning." Conference on Robot Learning. PMLR, 2020.
[2] Klink, Pascal, et al. "Self-paced deep reinforcement learning." Advances in Neural Information Processing Systems 33 (2020): 9216-9227.
[3] Klink, Pascal, et al. "Curriculum reinforcement learning via constrained optimal transport." International Conference on Machine Learning. PMLR, 2022.
[4] Huang, Peide, et al. "Curriculum reinforcement learning using optimal transport via gradual domain adaptation." Advances in Neural Information Processing Systems 35 (2022): 10656-10670.

### Questions:

See weaknesses.

---

##### Flag For Ethics Review: No ethics review needed.
##### Rating: 3: reject, not good enough
##### Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
##### Code Of Conduct: Yes
---

### OUR NOTES:

- Carlo: if SPRL, SPDL, CURROT, and GRADIENT are too similar to our approach but deal with slightly different settings, then this is also true for these papers themselves. If anything, we are addressing a different problem, whereas they are still all dealing with curriculum learning (a target distribution is known, a starting point is known, policies are conditioned on the context).
- Gabriele: Sure, setting up SPRL (1) with a uniform target, and (2) adding the backtracking mechanism would make the two opt. problems equivalent. I'm sure we can all come up with examples of papers where you can make a couple of changes and suddenly you end up with the same method as another paper, one which even addresses the same problem. In our case, we are also targeting another problem. For example, DROID (https://arxiv.org/abs/2102.11003) is equivalent to SimOpt (https://arxiv.org/abs/1810.05687) if you replay real-world actions instead of rolling out the policy. Likewise, the same holds for most incremental papers.
- Gabriele: Rev3 sees a major "conceptual" similarity with AutoDR, which we agree with. If both rev2 and rev3 were right, then this would mean that SPDL/SPRL/CL methods are conceptually similar to AutoDR. However, OpenAI and NVIDIA used AutoDR both in 2019 and 2022 and never cited any of those works, suggesting fundamentally different problem settings.
- Curriculum learning methods attempt to solve a different problem. We can try to clearly state the differences in the problem setting, such that we motivate the non-trivial transition from one to the other:
  - CL seeks convergence to a predefined target distribution. In our case, the goal is to generalize to the widest range of tasks, regardless of where they are located in the task space.
  - CL assumes that policies have access to the context at inference time. This is definitely not true for the DR setting, and it has been addressed before with ad-hoc methods (ICLR 2019: https://arxiv.org/abs/1810.05751).
  - DR induces a POMDP, which can benefit from memory-based policies. This problem has been addressed before by AutoDR and other papers, whereas context in CL may or may not be inferred from previous observations in general.
- Pascal:
  - CURROT: CURROT has to predict the performance in every sampled task, which is technically different from what we are doing.
  - GRADIENT: it would grow all dimensions equivalently, which is a bit trivial.
- Georgia: sure, the algorithm is similar. But it hasn't been shown for this problem, which is significantly different. Not even when AutoDR was published.
- SOTA baselines in DR have limitations, and by bringing in SPRL we are solving those issues. While not groundbreaking theoretically, it is an important contribution to the field.
- Maybe add in the appendix a section on CL vs DR (and since it's so similar, also sell it as an ablation essentially and make the connection between the two fields): if the reviewers like it, we might as well just add this section at this point. Here, we can dive into the fact that the algorithms are very similar, and mention the fundamentally different problem we are trying to solve. Also, show the results vs. SPRL (DORAEMON without the backup opt. problem) and vs. CURROT.
- Carlo agrees that maybe the reviewer would be more agreeable if they had known Pascal is a co-author. Carlo gave very good points on how to keep this in mind when replying (see discord messages Nov 13th at 13:00).
  - We could even state something like "We are very familiar with Pascal Klink's works. However, such an application has never been shown before, and it led us to introduce multiple algorithmic differences, making it effectively a novel algorithm (POMDP, backup opt. problem, no target is known)."
- Having uninformative (uniform) targets doesn't ALWAYS mean that SPRL becomes equivalent to DORAEMON. For example, if we use Gaussians, maximizing the entropy of the Gaussian != minimizing the KL with a target uniform. There will be a target variance in the latter case, whereas the former case wants to increase the variance indefinitely (see the worked check right after these notes).
- It's not just an application, it's not trivial to apply it: history-based policies, unfeasible dynamics that would require backtracking (backup opt. problem). Let's cite everything we can leverage to make it clear that this is not a straightforward application.
- Unfeasible dynamics: CURROT has high flexibility thanks to particle-based distributions, whereas we use the backup opt. problem to move within trust regions and potentially backtrack.
- Let's make it clear that we are experts, and that we have no problem acknowledging that, yes, the algorithm is similar.
- Differences: max entropy not always equivalent to min KL (demonstrate this for different support distributions!); history-based policies (and cite ICLR 21); trust region + backup opt. problem for backtracking.
- I think we're getting there. SPDRL lacks a number of things that we did in DORAEMON and that are perhaps not so obvious (we could even sell the Beta as a contribution): from Gaussian to Beta (max entropy != min KL). Do an ablation with all these components added one after the other, within the same plot!
- Make a summary table in the OpenReview comment with the BEST GSR for all the SPDRL ablations, all the way up to DORAEMON. Put the different environments in different columns! Or even the final GSR, if we want.
- Alternatively, I could also have SPDR and DORAEMON in two rows, and then the potential components as columns with ticks and crosses depending on which one is added.
- "We respect the reviewer's concern of preventing plagiarism, but it couldn't be further from the truth in our case."
- If the reviewer thinks we are trying to plagiarize and copy someone else's work, this couldn't be further from the truth. As a matter of fact, we ourselves are very familiar with all 4 of the works cited. DORAEMON draws inspiration from the self-paced approach, but crucially departs from it in the problem formulation and key algorithmic details. The different problem setting and the current limitations of DR approaches make this self-paced formulation an extremely promising approach for DR that has not been attempted before by any other work (AutoDR is still considered state of the art; cite NVIDIA).
- [The crucial technical and theoretical contribution of the paper lies in understanding how to algorithmically adapt the idea of the self-paced approach to a fundamentally different problem setting (curriculum -> sim2real).]
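A quick worked check of the "max entropy != min KL" point above, for the scalar Gaussian case (our own back-of-the-envelope sketch; $a$ and $b$ denote the bounds of a hypothetical uniform target):

$$
H\big(\mathcal{N}(\mu,\sigma^2)\big) = \tfrac{1}{2}\ln\big(2\pi e\,\sigma^2\big) \to \infty \quad \text{as } \sigma \to \infty,
$$

$$
\min_{\mu,\sigma}\, \mathrm{KL}\big(\mathcal{U}(a,b)\,\|\,\mathcal{N}(\mu,\sigma^2)\big) \;\Rightarrow\; \mu^\star = \tfrac{a+b}{2}, \qquad (\sigma^\star)^2 = \tfrac{(b-a)^2}{12}.
$$

So the M-projection onto Gaussians settles on a finite variance (moment matching), while the entropy objective keeps widening the distribution indefinitely; the I-projection $\mathrm{KL}\big(\mathcal{N}(\mu,\sigma^2)\,\|\,\mathcal{U}(a,b)\big)$ is not even finite, since the Gaussian puts mass outside $[a,b]$.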
---

# RESPONSE

Thank you for taking the time to evaluate our work.

> Limited technical novelty. The formulation (Eq. 4) is very similar to that of SPRL [1], SPDL [2], CURROT [3], and GRADIENT [4]. Setting the target distribution to be uninformative -- uniform distribution could transform their objectives into something very similar to Eq. 4, and they do not necessarily converge to the final target distribution.
The primary goal of the paper is to present a novel solution to the sim-to-real problem, and our empirical results make DORAEMON highly promising and relevant for the field. As a matter of fact, our method outperforms AutoDR, currently considered the state-of-the-art method for sim-to-real transfer in the absence of real-world data---recently adopted by NVIDIA in [1] as their go-to DR method for sim-to-real transfer, well after self-paced methods were introduced.

To achieve such results, DORAEMON **does take inspiration from the self-paced optimization problem in the curriculum setting, but our method is *far* from a straightforward application of self-paced methods to a sim-to-real problem**, in contrast to what the reviewer's comment might suggest. To prove this, we implemented SPDL and compared it to DORAEMON in a thorough analysis that adds individual (and combinations of) components to SPDL, clearly shedding light on the differences between the two algorithms (see Fig. 13). The results demonstrate that a naive application of self-paced methods is impractical or even impossible in the domain randomization setting, and that a number of technical novelties---history-based policies, asymmetric actor-critic, and a backup optimization problem---have to be introduced to obtain desirable performance and prevent violation of the performance constraint.

Overall, we report a summarized list of the key aspects where our sim-to-real problem formulation departs from curriculum learning (CL) settings (see Sec. B for details):

- **Unknown target distribution**: the knowledge of a target distribution is a strict requirement for CL methods. Setting uninformative uniform targets could be an option to bridge their formulation to a sim-to-real problem. However, this prompts the user to define the uniform boundaries as a hyperparameter of the method: while the boundaries should be designed to be as wide as possible, SPDL's practical implementation would likely suffer from uniforms that approach infinite support. This also holds for LSDR, which by definition requires a target uniform distribution to be defined. In contrast, DORAEMON drops such a dependency.
- **Contextual MDPs vs. LMDPs**: CL methods such as SPDL work in a contextual RL setting where a context variable that represents the current task is in principle observable and available to the learner---e.g. goal states that progressively get harder to reach. Conversely, the DR formulation induces the solution of a Latent MDP [2], as dynamics parameters may not be directly observed. Such theoretical change effectively leads to fundamental performance differences for the learning agent (see Fig. 13).
- **I-projection vs. M-projection**: While SPDL cannot be directly applied to a sim-to-real problem due to the different assumptions, it shares similarities with both the DORAEMON and LSDR optimization problems. Interestingly, we observe how SPDL makes use of the I-projection formulation to move the training distribution towards the target, while LSDR proposes the M-projection counterpart. In turn, SPDL may drive distributions towards maximum entropy when considering target uniforms (but is limited to parametric families of bounded support for the optimization), while LSDR can work with unlimited-support distributions (but results in a maximum likelihood objective that would not grow the entropy indefinitely). Overall, DORAEMON does not rely on a KL-divergence objective, hence (1) it can converge to maximum-entropy uniform distributions, and (2) it can be implemented with any parametric family, including unbounded ones (see the new Section A.4 for DORAEMON with Gaussian distributions).
- **Backup optimization problem**: the practical implementation of self-paced methods does not allow the entropy to decrease along the process. In turn, the occurrence of violated performance constraints is not explicitly addressed, and the agent simply keeps training in such cases. Our novel backup optimization problem introduced in DORAEMON overcomes this issue (a schematic sketch of the resulting update structure follows this list). The new empirical analysis in Fig. 13 demonstrates the effectiveness of DORAEMON in consistently moving within the feasible region of the optimization landscape, whereas SPDL quickly suffers from training performance collapse.
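To make this structure concrete, below is a deliberately simplified, self-contained sketch of an entropy-maximizing DR update with a success-rate constraint, a KL trust region, and a backup step. The toy scalar dynamics parameter, the stand-in success model, and all names are ours for illustration only and do not reproduce the exact implementation in the paper.

```python
# Schematic sketch of an entropy-maximizing DR update with a success-rate
# constraint, a KL trust region, and a backup step. Toy 1-D problem with a
# Beta sampling distribution; NOT the implementation from the paper.
import numpy as np
from scipy import optimize, special, stats

LOW, HIGH = 0.5, 2.0        # support of the randomized dynamics parameter xi
ALPHA = 0.7                 # required in-distribution success rate
EPS_KL = 0.1                # trust-region size between consecutive updates
U = np.random.default_rng(0).uniform(size=512)  # common random numbers

def kl_beta(p, q):
    """Closed-form KL( Beta(p) || Beta(q) )."""
    a1, b1 = p
    a2, b2 = q
    return (special.betaln(a2, b2) - special.betaln(a1, b1)
            + (a1 - a2) * special.digamma(a1)
            + (b1 - b2) * special.digamma(b1)
            + (a2 - a1 + b2 - b1) * special.digamma(a1 + b1))

def entropy(p):
    # The entropy of the scaled distribution on [LOW, HIGH] differs only by the
    # constant log(HIGH - LOW), which does not change the argmax.
    return stats.beta(*p).entropy()

def success_prob(xi):
    """Stand-in for rolling out the policy: succeeds more often near xi = 1."""
    return np.exp(-(((xi - 1.0) / 0.5) ** 2))

def est_success(p):
    """Monte-Carlo success estimate under Beta(p), reparameterized for smoothness."""
    xi = LOW + (HIGH - LOW) * stats.beta.ppf(U, *p)
    return success_prob(xi).mean()

def update(p_old):
    bounds = [(1e-2, 100.0)] * 2
    trust = {"type": "ineq", "fun": lambda p: EPS_KL - kl_beta(p, p_old)}
    perf = {"type": "ineq", "fun": lambda p: est_success(p) - ALPHA}
    # Main problem: maximize entropy s.t. success >= ALPHA and KL <= EPS_KL.
    res = optimize.minimize(lambda p: -entropy(p), p_old, method="SLSQP",
                            bounds=bounds, constraints=[trust, perf])
    if res.success and est_success(res.x) >= ALPHA:
        return res.x
    # Backup problem: maximize estimated success within the trust region,
    # allowing the entropy to shrink (i.e. backtrack toward feasible dynamics).
    res = optimize.minimize(lambda p: -est_success(p), p_old, method="SLSQP",
                            bounds=bounds, constraints=[trust])
    return res.x

p = np.array([40.0, 40.0])  # start from a narrow distribution around xi = 1.25
for it in range(30):
    p = update(p)
    print(it, p.round(2), round(float(entropy(p)), 3), round(float(est_success(p)), 2))
```

The backup step is what lets the distribution backtrack (reduce entropy) whenever the success constraint can no longer be met inside the trust region.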
While not in-depth, these differences have also been highlighted in the recent survey on DR for sim-to-real transfer by F. Muratore et al. [3], in their Sec. 4.1. We thank the reviewer for giving us the chance to add such an in-depth discussion to the paper, which would likely be helpful for future readers. Furthermore, we are currently doing the best we can to run more experiments as the rebuttal period goes on. In particular, we plan on complementing the analysis in Sec. B for the remaining sim-to-sim environments.

Finally, note that GRADIENT and CURROT, despite trying to solve the same problem as SPDL, work on particle-based distributions and bring significant technical novelties in their optimization problems---namely, framing it as an optimal transport problem. While this could serve as inspiration to potentially find extensions of DORAEMON that analogously frame it as an optimal transport problem, this is certainly out of the scope of this work and left as a future work direction.

#### Beyond Beta distributions

> Beta distribution is often not a reasonable choice. It cannot handle multi-modal distribution, while many existing works can handle arbitrary empirical distributions [1,2,3,4].

DORAEMON can work with any family of parametric distributions for which the entropy (for the objective function) and the KL-divergence (for the trust region constraint) may be conveniently computed, just like SPDL. In fact, **we tested DORAEMON with Gaussian distributions and report the results in the newly added Section A.4.**
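For completeness, both families considered in our experiments admit closed-form expressions for the two quantities DORAEMON needs (standard results, reported here only for reference):

$$
H\big(\mathrm{Beta}(\alpha,\beta)\big) = \ln B(\alpha,\beta) - (\alpha-1)\,\psi(\alpha) - (\beta-1)\,\psi(\beta) + (\alpha+\beta-2)\,\psi(\alpha+\beta),
\qquad
H\big(\mathcal{N}(\mu,\sigma^2)\big) = \tfrac{1}{2}\ln\big(2\pi e\,\sigma^2\big),
$$

with $B$ the Beta function and $\psi$ the digamma function; the KL divergence between two members of either family is likewise available in closed form, which is all the entropy objective and the trust-region constraint require.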
We believe that our detailed explanations and additional results clarify the contribution of our approach, and we hope that the reviewer will consider them for a thorough re-assessment of our work.

#### Missing related works

> Missing related work: [...]

We included citations of the aforementioned works in our latest revision: CURROT and OUTPACE in Sec. 2, and GRADIENT in Sec. B.

[1] Handa, A., et al. "DeXtreme: Transfer of Agile In-hand Manipulation from Simulation to Reality." arXiv (2022).
[2] Chen, Xiaoyu, et al. "Understanding Domain Randomization for Sim-to-real Transfer." ICLR (2021).
[3] Muratore, Fabio, et al. "Robot learning from randomized simulations: A review." Frontiers in Robotics and AI (2022): 31.

---

# RESPONSE BY REVIEWER

Thank the authors for the detailed response. However, my concerns regarding the technical novelty (w.r.t. self-paced RL) and the limitation induced by the parametric family still hold.

First, I think the Beta distribution is also bounded, and you need to choose the boundaries as well. I understand that the proposed method also works with unbounded parametric distributions if KL and entropy can be computed efficiently, in principle. However, the original choice of distribution is not convincing.

Second, it is not obvious to me why "Such theoretical change (LMDP vs. CMDP) effectively leads to fundamental performance differences for the learning agent." I think the authors could provide more theoretical justification, or maybe the authors could give more detailed insight into Chen, Xiaoyu, et al.

Third, as discussed in the rebuttal, being a parametric family of distributions is not necessarily an advantage. The added experiment with Gaussian distributions does not address my concern: it is still uni-modal, and any assumptions on the underlying distribution may limit the flexibility of the proposed method and thus the final effects.

---

Notes:

- Have you even read Section B? Have you even seen Fig. 13?
- For the last point, we can say we're already beating AutoDR and LSDR in the same setting, where AutoDR not only works on parametric distributions, it can only work with UNIFORMS! We add so much more flexibility already; how come the paper should be rejected because we go from uniforms to all parametric families, and not to particle-based distributions?
- For the second point, we literally show this with 6 figures and X number of new experiments. Point them to it and make them understand the figure.
- So our paper should be rejected because we don't consider particle-based distributions: essentially, that's what the reviewer is saying.

---

# RESPONSE BY US

We thank the reviewer for the quick response.

> However, the original choice of distribution is not convincing.

We compare our method with the LSDR and Fixed-DR baselines, which by definition require a fixed target uniform distribution, as they aim to achieve the best generalization on it. As it wouldn't be fair to assess their generalization capabilities on distributions different than their corresponding target, we therefore benchmark all methods in terms of the success rate achieved on the same target distribution, including those methods which do not necessarily require boundaries (AutoDR and DORAEMON). This is done in order to diminish the number of external factors affecting the comparison. In other words, all methods can only sample parameters within the target distribution, making it a much fairer comparison.

Furthermore, having a reference distribution for generalization makes it possible for us to benchmark the methods in the first place: how would we compute the ability of the methods to generalize across dynamics otherwise? A target, reference distribution needs to be defined for the sake of comparison, and to allow different methods to move within the same space of dynamics parameters. For this reason, DORAEMON has originally been tested with Beta distributions, and AutoDR---which can only work with uniform distributions---is prevented from growing the boundaries wider than the target distribution in our implementation.

Beyond this analysis, we then demonstrate that (unbounded) Gaussian distributions may also be used for DORAEMON and achieve the same performance as the Beta implementation (see Fig. 11), highlighting that the method's superior performance does not result from the particular choice of the distribution.
Overall, our thorough empirical analysis clearly shows that (1) our method can in principle work with any family of parametric distributions (in particular, Betas and Gaussians), and that (2) this particular design choice did not significantly affect the results in our case, attributing the superior performance to the remaining components of the algorithm. We also remind the reviewer that LSDR is likewise implemented with Gaussian distributions.

> Second, it is not obvious to me why "Such theoretical change (LMDP vs. CMDP) effectively leads to fundamental performance differences for the learning agent."

We are happy to provide more details on this, as we took the effort to write an in-depth novel Section B with a thorough ablation analysis and comparison with self-paced methods, in response to the reviewer's concerns.

The contextual RL setting in curriculum learning often provides observable variables that determine the task difficulty, hence one may train policies conditioned on the task parameters explicitly. For example, a curriculum over goal states allows policies to be trained with input knowledge of the current goal. The partially observable setting of Domain Randomization, however, completely prohibits this: dynamics parameters shall be considered latent variables of the underlying distribution of MDPs, and are unobservable and unknown at test time. This problem setting has recently been formulated with the introduction of Latent MDPs in [3]. We provide clear empirical evidence of this problem in Fig. 13: the SPDL-Oracle policies $\pi(a|s,\xi)$ are conditioned on the sampled dynamics parameters $\xi$, while SPDL reflects the performance of a simple Markovian policy $\pi(a|s)$. The evident performance difference between the two approaches demonstrates that a naive application of self-paced methods to Domain Randomization would perform incredibly poorly.

We then discuss, based on recent theoretical studies [3], how Latent MDPs can be dealt with by conditioning policies on the history of state-action pairs previously experienced. This allows policies to gather information on the dynamics of the current MDP and act accordingly, despite not knowing the dynamics parameters directly (i.e. a form of implicit system identification can occur). Despite the sound theoretical grounding provided by [3], our experiments in Fig. 13 do not yet show significant improvements when adding history alone. We attribute this phenomenon to the higher complexity of the state space, which now comprises a much higher number of dimensions (we consider 5-timestep histories). Overall, we find that the best combination is achieved when adding both history and the asymmetric actor-critic paradigm.

Unfortunately, such integrations to SPDL still do not ensure that policies move within a feasible landscape of the constrained optimization problem---see how the in-distribution success rate in Fig. 13 falls into the unfeasible region below $\alpha$ and keeps a steadily decreasing trend for all SPDL baselines. We therefore notice how the backup optimization problem defined in Eq. (6) allows DORAEMON to both retain generalization capabilities and maintain the desired performance threshold with impressive consistency across multiple environments, multiple trust-region sizes, and multiple values of $\alpha$ (see Fig. 10 for the latter claim).
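As a minimal illustration of what conditioning on history means in practice (dimensions and names below are hypothetical; the actual architecture in the paper may differ), the policy input can simply stack the last $k$ state-action pairs:

```python
# Minimal illustration of a history-conditioned policy input: stack the last
# k state-action pairs so the agent can implicitly identify the latent
# dynamics. Dimensions and names are hypothetical.
import numpy as np
from collections import deque

k, state_dim, action_dim = 5, 11, 3
history = deque(maxlen=k)

def policy_input(state, prev_action):
    history.append(np.concatenate([state, prev_action]))
    stacked = np.concatenate(list(history))
    # Zero-pad at the start of an episode, until k transitions are available.
    pad = np.zeros(k * (state_dim + action_dim) - stacked.size)
    return np.concatenate([pad, stacked])

obs = policy_input(np.zeros(state_dim), np.zeros(action_dim))
print(obs.shape)  # (k * (state_dim + action_dim),) -> (70,)
```

With such an input, the critic can additionally receive the true dynamics parameters during training (the asymmetric actor-critic setup mentioned above), while the actor only ever sees observable quantities.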
> Third, as discussed in the rebuttal, being a parametric family of distributions is not necessarily an advantage. The added experiment with Gaussian distributions does not address my concern: it is still uni-modal, and any assumptions on the underlying distribution may limit the flexibility of the proposed method and thus the final effects.

The reviewer raises a concern about the limited flexibility of DORAEMON. Yet, DORAEMON demonstrates that it can work with any family of parametric distributions, which is far better than AutoDR, the current state-of-the-art method, which is limited to uniform distributions **only**. LSDR, on the other hand, could work with any parametric family, but has only been shown with Gaussian distributions or simple discrete distributions.

Furthermore, the reviewer's statement *"Beta distribution is often not a reasonable choice. It cannot handle multi-modal distribution, while many existing works can handle arbitrary empirical distributions [1,2,3,4]."* is technically wrong: [1] and [2] make the same parametric assumptions on the distribution as DORAEMON and, as they require the minimization of a KL-divergence objective, a closed-form solution is likely needed to perform the optimization. In turn, this makes [1] and [2] impractical for multi-modal distributions for which the KL cannot be computed in closed form (e.g. mixtures of Gaussians) or approximated easily. In fact, for multi-modal distributions for which KL computation is possible, DORAEMON could readily be used as well.

To conclude, the reviewer essentially converges to the point that DORAEMON is not mature enough because multi-modal distributions may not be captured, even though all considered baselines in the field completely disregard this limitation, or even have stricter assumptions than ours (i.e. AutoDR). While more flexible representations could provide benefits to the algorithm, it is unclear why this point is presented as a primary weakness of the method: DORAEMON introduces a technically sound constrained optimization problem with strong empirical evidence that superior sim-to-real transfer can be obtained vs. state-of-the-art methods, while allowing for more distribution flexibility than current methods in the field, and fewer hyperparameters---rendering it a highly relevant contribution for sim-to-real problems.

Ref:

[1] Klink, Pascal, et al. "Self-paced contextual reinforcement learning." Conference on Robot Learning. PMLR, 2020.
[2] Klink, Pascal, et al. "Self-paced deep reinforcement learning." Advances in Neural Information Processing Systems 33 (2020): 9216-9227.
[3] Chen, Xiaoyu, et al. "Understanding Domain Randomization for Sim-to-real Transfer." ICLR (2021).
