# Reviewer 5 (rating: 6, confidence 3)
### Summary:
The paper proposes a novel approach to addressing the "sim-to-real transfer" challenge in reinforcement learning. The presented approach, DORAEMON, aims to maximize the diversity of dynamics parameters during training by incrementally increasing randomness while ensuring a sufficiently high success probability of the current policy. This results in highly adaptable and generalizable policies that perform well across a wide range of dynamics parameters.
##### Soundness: 3 good
##### Presentation: 3 good
##### Contribution: 3 good
### Strengths:
A straightforward method to explore the environment dynamics parameters is proposed for RL algorithms to enhance their generalization.
The method is simple, and its effectiveness is demonstrated in the toy example and the experiments.
### Weaknesses:
The method is heuristic and lacks theoretical analysis.
### Questions:
- The proposed method calls the RL algorithm to update the policy for every dynamics parameter sample, which might lead to an inefficient algorithm. Maybe the embedded RL algorithm only needs to return a relatively approximate solution? Does this work for DORAEMON? This should be made clear.
- The authors adopt univariate Beta distributions for $\nu_{\phi}$, which might simplify problem (6). But for more general distributions, solving (6) might be challenging. Maybe this should be further investigated. Or is it the case that some commonly used distributions, e.g., Gaussian, are effective while keeping (6) tractable? Or could some variational methods be adopted?
---
##### Flag For Ethics Review: No ethics review needed.
##### Rating: 6: marginally above the acceptance threshold
##### Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
##### Code Of Conduct: Yes
---
### OUR NOTES:
---
# RESPONSE
We highly appreciate the reviewer's feedback.
- > The method is heuristic and lacks theoretical analysis.
We agree with the Reviewer that the paper lacks a theoretical analysis. As the primary goal of our work is to present a novel solution to the sim-to-real problem, we focus on a strong empirical analysis to highlight the relevance of our method to the field.
In particular, while a theoretical analysis is not provided, DORAEMON introduces a technically sound optimization problem that works in a more principled manner than the previous algorithmic heuristic proposed by AutoDR---e.g., AutoDR can only work with uniform distributions and simply steps the distribution with fixed $\Delta$ increments.
In our latest revision, we also provide a number of new experiments to further complement the experimental evaluation:
- Added a sensitivity analysis w.r.t. the hyperparameters $\epsilon$ (Sec. A.2) and $\alpha$ (Sec. A.3);
- Added experiments on DORAEMON with Gaussian distributions instead of Betas (Sec. A.4);
- Added an ablation analysis of DORAEMON (Fig. 13), and a discussion of its connections to curriculum learning (Sec. B).
We believe that these new analyses, together with the already thorough experimentation of the method in 6 sim2sim environments and a complex sim2real task, make our contribution highly relevant. A detailed theoretical analysis is left for a journal extension of our work.
- > The proposed method calls the RL algorithm to update the policy for every dynamics parameter sample, which might lead to an inefficient algorithm. Maybe the embedded RL algorithm only needs to return a relatively approximate solution?
DORAEMON does not change anything about the underlying RL subroutine, which, in fact, is not even "informed" about the change of the DR distribution. This allows us to use any RL algorithm of choice, and simply call DORAEMON's optimization problem in Eq. (4) every time $K$ new trajectories have been collected by the training agent (line 3 of Algorithm 1). We can interpret this as a standard RL procedure where the learning agent is progressively presented with MDPs $\in \mathcal{U}$ of varying transition dynamics. Therefore, there is no need to worry about the intermediate approximate solutions in between iterations, as DORAEMON automatically adjusts the distribution updates according to the performance of the current agent. In other words, the distribution is not updated as drastically (less entropy increase, if any) when the agent only barely satisfies the minimum desired success rate, and vice versa (this phenomenon can be clearly seen in our newly introduced analysis in Fig. 10).
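For illustration, below is a minimal toy sketch of this interplay. All names, the dummy rollout, and the stand-in distribution update are placeholders for exposition only (they are not the actual codebase API): the point is that the RL update is an untouched black box, while the DR distribution parameters are adjusted only after each batch of $K$ trajectories.
```python
# Toy sketch only: placeholder agent/rollout, not the actual codebase API.
import numpy as np

rng = np.random.default_rng(0)
K, N_ITERS = 10, 5                              # trajectories per batch, outer iterations


def collect_trajectories(policy, phi, k):
    """Placeholder rollout: sample k dynamics parameters from Beta(phi) and
    record a (fake) success flag for each resulting episode."""
    dynamics = rng.beta(phi[0], phi[1], size=k)
    successes = rng.random(k) < 0.8             # dummy agent performance
    return list(zip(dynamics, successes))


def rl_update(policy, trajectories):
    """Any off-the-shelf RL algorithm; it is never told that phi changed."""
    return policy


def doraemon_update(phi, trajectories, alpha=0.5):
    """Stand-in for the Eq. (4) step: widen the Beta (towards Beta(1, 1),
    i.e. higher entropy) only if the observed success rate stays above alpha."""
    success_rate = np.mean([s for _, s in trajectories])
    return np.maximum(phi - 0.2, 1.0) if success_rate >= alpha else phi


policy, phi = None, np.array([5.0, 5.0])        # initially narrow Beta(a, b)
for _ in range(N_ITERS):
    trajs = collect_trajectories(policy, phi, K)
    policy = rl_update(policy, trajs)           # RL subroutine is unchanged
    phi = doraemon_update(phi, trajs)           # distribution update every K trajectories
    print(phi)
```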
- > The authors adopt univariate Beta distributions for $\nu_{\phi}$, which might simplify problem (6). But for more general distributions, solving (6) might be challenging. Maybe this should be further investigated. Or is it the case that some commonly used distributions, e.g., Gaussian, are effective while keeping (6) tractable? Or could some variational methods be adopted?
DORAEMON can work with any family of parametric distributions, as long as the entropy (for the objective function) and the KL-divergence (for the trust-region constraint) can be conveniently computed. In our experiments, we use independent Beta distributions, and therefore deal with $2$ parameters per dimension to be optimized by our problem in (4) and (6). This results in $4$, $14$, $26$, $16$, $14$, and $34$ optimized parameters for the CartPole, Hopper, Walker2D, HalfCheetah, Swimmer, and PandaPush environments, respectively (see Tab. 2 and Tab. 3).
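As a concrete illustration of why independent Betas keep both quantities cheap, the self-contained sketch below (not taken from our codebase; rescaling to physical parameter bounds is omitted, which only adds a constant log-width term per dimension to the entropy and leaves the KL unchanged) computes them in closed form with SciPy:
```python
# Illustrative sketch: entropy and KL for products of independent Beta marginals.
import numpy as np
from scipy.special import betaln, digamma
from scipy.stats import beta


def product_beta_entropy(a, b):
    """Entropy (nats) of a product of independent Beta(a_i, b_i) marginals on [0, 1]."""
    return sum(beta(ai, bi).entropy() for ai, bi in zip(a, b))


def beta_kl(a1, b1, a2, b2):
    """Closed-form KL( Beta(a1, b1) || Beta(a2, b2) )."""
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))


def product_beta_kl(a1, b1, a2, b2):
    """KL between two products of independent Beta marginals (sum over dimensions)."""
    return sum(beta_kl(*t) for t in zip(a1, b1, a2, b2))


if __name__ == "__main__":
    a_old, b_old = np.array([5.0, 5.0]), np.array([5.0, 5.0])   # narrow
    a_new, b_new = np.array([2.0, 2.0]), np.array([2.0, 2.0])   # wider
    print(product_beta_entropy(a_new, b_new))                   # higher than for (5, 5)
    print(product_beta_kl(a_new, b_new, a_old, b_old))
```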
Overall, we find our optimization problem to be particularly efficient. We computed statistics over $500$ runs of our optimization problem in the Hopper environment and observed an average duration of $4.19 \pm 1.72$ seconds. In particular, we make use of the SciPy library and its convenient "Trust Region Constrained Algorithm"---see line 897 of `doraemon/doraemon.py` in our anonymous codebase in the supplementary material.
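For reference, the constrained step can be posed with SciPy's `trust-constr` solver roughly as in the sketch below. This is an illustrative reconstruction, not the code in `doraemon/doraemon.py`: a single Beta dimension is used for brevity, and the success-rate estimate (which DORAEMON obtains from the collected trajectories) is replaced by a smooth dummy surrogate.
```python
# Illustrative sketch only: placeholder success-rate surrogate, not the
# implementation in doraemon/doraemon.py.
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint
from scipy.special import betaln, digamma
from scipy.stats import beta

EPS, ALPHA = 0.1, 0.5                     # trust-region radius, success threshold
phi_old = np.array([6.0, 6.0])            # current Beta(a, b); one dimension for brevity


def entropy(phi):
    return beta(phi[0], phi[1]).entropy()


def kl(phi_p, phi_q):
    """Closed-form KL( Beta(phi_p) || Beta(phi_q) )."""
    a1, b1, a2, b2 = phi_p[0], phi_p[1], phi_q[0], phi_q[1]
    return (betaln(a2, b2) - betaln(a1, b1) + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1) + (a2 - a1 + b2 - b1) * digamma(a1 + b1))


def success_rate(phi):
    """Placeholder: DORAEMON estimates this from the collected trajectories;
    here a smooth surrogate that decays as the candidate distribution widens."""
    return float(np.exp(-entropy(phi) - 1.0))


res = minimize(
    lambda phi: -entropy(phi),            # maximize entropy
    x0=phi_old,
    method="trust-constr",
    bounds=[(1.0, 100.0), (1.0, 100.0)],
    constraints=[
        NonlinearConstraint(lambda phi: kl(phi, phi_old), 0.0, EPS),   # trust region
        NonlinearConstraint(success_rate, ALPHA, 1.0),                 # min success rate
    ],
)
print(res.x, entropy(res.x), kl(res.x, phi_old))
```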
Finally, we tested DORAEMON with Gaussian distributions and **reported the results in the new Appendix Section A.4**. This analysis further demonstrates that our optimization problem remains tractable with other parametric distributions, and leads to similar results.