# Reviewer 3 (rating: 6, confidence 3)
### Summary:
The paper introduces DORAEMON, which revisits domain randomization from the perspective of entropy maximization. Specifically, instead of maximizing the expected total return over the distribution of the dynamics parameters, this paper proposes a constrained optimization problem that directly maximizes the entropy of the dynamics distribution subject to a constraint on the success probability. Based on this formulation, this paper proceeds to offer an algorithmic implementation that decouples this maximization into two subroutines: (1) Update the policy by any off-the-shelf RL algorithm under the current dynamics parameter; (2) Under the current policy, update the dynamics parameter to improve the entropy with the help of a KL-based trust region. Accordingly, a toy experiment is provided to demonstrate the dynamics distribution that DORAEMON converges to. The proposed algorithm is then evaluated on both sim-to-sim (MuJoCo) and sim-to-real tasks (PandaPush) against multiple baseline DR methods.
##### Soundness: 3 good
##### Presentation: 3 good
##### Contribution: 3 good
### Strengths:
The method introduced in this paper is quite intuitive and reasonable in concept and avoids some inherent issues of DR. Specifically, as the standard DR requires a pre-configured fixed prior distribution over the support of the environment parameter (which would require some prior domain knowledge), the proposed DORAEMON framework learns to maximize the entropy of dynamics distribution and hence naturally obviates this issue. (That said, in the meantime, the threshold needed for defining a successful trajectory also requires some domain knowledge, but probably a bit less)
The proposed algorithm is evaluated in a variety of domains (including both sim-to-sim and sim-to-real scenarios), and the empirical results demonstrate quite promising performance of the DORAEMON framework (in terms of success rate).
The paper is well-written and very easy to follow, with justification and explanation whenever needed in most places.
### Weaknesses:
Overall, I appreciate the proposed reformulation of DR, but there are some concerns regarding the algorithm:
- DORAEMON appears to be conceptually very similar to AutoDR, or ADR in the original paper (Akkaya et al., 2019). Both define custom indicators of success and iteratively increase the entropy of the dynamics distribution. With that said, DORAEMON appears to be yet another, somewhat different, implementation of the idea highlighted by ADR.
- Based on the above, while the two approaches arise from similar ideas, DORAEMON appears to achieve a better success rate across almost all environments. It is not immediately clear which specific part of the design the performance improvement comes from, or whether it is just a matter of different hyperparameter choices. While there is a one-sentence discussion of the authors’ conjecture in Section 5.2 (about the potential data inefficiency), a deeper dive into the root cause of this performance difference is expected.
- The success rate for certain tasks, e.g., Walker in Figure 2 and PandaPush in Figure 11, declines after reaching maximum entropy. However, the algorithm does not dynamically reduce the entropy in response to a decrease in the success rate, which might be necessary for maintaining performance consistency. This appears inconsistent with the objective in (4). As the discussion in Section 5.2 does not fully address this phenomenon, more explanation would be needed.
- Another concern lies in the constraint based on the success rate. Specifically, the use of success rate largely ignores the effect of the poor trajectories, which could be arbitrarily poor and degrade the robustness of the learned policy. By contrast, in the standard DR, the objective is to consider the expected total return over all the possible trajectories. As the experimental results reported in the paper all focus on the “success rate”, the robustness issue is thereby largely neglected.
### Questions:
Detailed comments/questions:
- In practice, is it computationally easy to optimize (4)? The constrained problem does not seem to be convex (even under Beta distribution)?
- Could the authors specify the alpha values for Figure 2 and Figure 11? When the entropy matches the max entropy, the global success rate aligns with the local success rate. If the alpha is set to 0.5, why does the global success rate drop below 0.5 when entropy is at its maximum?
- How to design the entropy jointly for multiple dynamics parameters? (For example, simply taking the product of multiple univariate distributions like AutoDR?)
- In Section 3: The notation of reward function shall be consistent (mathcal or not)
----
##### Flag For Ethics Review: No ethics review needed.
##### Rating: 6: marginally above the acceptance threshold
##### Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
##### Code Of Conduct: Yes
---
### OUR NOTES:
- AutoDR and DORAEMON are conceptually similar, but very different in how they try to solve the problem. Besides the whole algorithm being different, here's a list of other non-trivial observations (see doc for the full list):
- ADR has an additional hyperparameter to handle "backtracking" (it requires a low-reward threshold, together with a task-solved threshold)
- ADR is not flexible, you always step the distribution by "delta". DORAEMON can move slower or faster, such that constraints are met
- Less efficient use of samples: ADR can only update one dimension at a time. One boundary trajectory can only be used to update that boundary. Also, sometimes data is discarded altogether when it's neither above nor below the reward thresholds
- ADR can use uniform distributions only (no correlations, no multimodality, no other shapes). This also prevents the algorithm from training on problematic parameters, as it can only experience these if the performance is good enough. I remember Pascal saying that some work showed that it is actually beneficial to experience some infeasible or problematic contexts while training. So in this sense, ADR is too conservative.
- median vs avg: with this formulation, we can easily allow custom success notions to be defined. E.g. the plane environment counts a trajectory as a success if the cart stays balanced for more than 25 timesteps; achieving this with an average timestep count above 25 is ill-posed and does not serve the purpose. Therefore, we view this formulation as more general. Of course, if you go back to using a return threshold as a success notion, this simply means that you consider the median instead of the average. BUT, you could always change this implementation detail if you really wanted to; we experimented with it and found this to be more flexible.
- We now report the alpha sensitivity for Walker as well, which shows how the entropy seemingly stays the same while the global success rate goes down. This has to do with the distribution shrinking in entropy and moving to easier regions. At the same time, though, this allows the in-distribution success rate to be maintained. More sophisticated ways to avoid catastrophic forgetting will be investigated in future work.
---
---
---
# RESPONSE
We thank the reviewer for their valuable feedback. We address the raised concerns below.
- > It is not immediately clear which specific part of the design the performance improvement comes from, or whether it is just a matter of different hyperparameter choices. While there is a one-sentence discussion of the authors’ conjecture in Section 5.2 (about the potential data inefficiency), a deeper dive into the root cause of this performance difference is expected.
As the reviewer rightly pointed out, DORAEMON builds on the same intuition as AutoDR, namely progressively increasing the entropy of the training distribution, but follows a drastically different algorithm to do so. In particular, let us break down the fundamental differences between DORAEMON and AutoDR:
1. **Uniform distributions only**: by definition, AutoDR is limited to a training distribution parametrized as uniform. This is a major limitation of the algorithm, as such distributions cannot capture correlations, multimodal effects, or unbounded dynamics parameters. In contrast, DORAEMON's optimization problem can work with any family of parametric distributions. To support this claim with further evidence, **we added experiments on DORAEMON with Gaussian distributions in the newly introduced Section A.4**.
2. **Fixed step size**: AutoDR can only step the distribution boundaries by a fixed step size $\Delta$, which considerably limits the flexibility of the algorithm. This resulted in a much higher variance of the entropy curves across seeds (see Fig. 2). Instead, DORAEMON leverages the solution of an explicit optimization problem to step the current distribution while remaining within a trust region in KL-divergence space. In turn, DORAEMON can dynamically step all dimensions of the distribution as much as needed to maintain a feasible performance constraint. A recent work by NVIDIA [1] implemented AutoDR with per-dimension $\Delta$ values, which provides more flexibility at the expense of drastically more hyperparameter tuning.
3. **Inefficient use of training samples**: AutoDR does not leverage all training data as information to update the distribution. Rather, it alternates between pure training samples from the uniform distribution (50%) and dynamics parameters sampled at the boundaries (50%). While all samples are used to update the current policy, only the latter provide information on the success rate at the boundary to step the distribution. Moreover, these samples may even be discarded if the success rate is not sufficiently high to increase the entropy of the distribution---see the $CLEAR(D_i)$ instruction in Alg. 1 of [2]. Conversely, DORAEMON uses all training samples of the current iteration to step the distribution (it never discards them), and does not bias the sampling process to occur at the boundaries half of the time.
4. **More domain knowledge for backtracking**: AutoDR introduces an additional low return threshold $t_L$ to shrink the entropy of the distribution when performance is too low. This mechanism is analogous to the backup optimization problem introduced by DORAEMON, although our method does not require an additional hyperparameter and relies on a more flexible, explicit optimization (for the same reasons as in the **"Fixed step size"** point). The original paper [2] simply sets $t_L$ to half the high-return threshold $t_H$, but we suspect that this hyperparameter likely needs to be tuned with task-specific domain knowledge. A paraphrased sketch of AutoDR's boundary-update rule, which makes points 2--4 concrete, is given right after this list.
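For concreteness, below is a minimal, paraphrased sketch of AutoDR's per-boundary update rule, written in our own notation (it is not the authors' code); the names `bounds`, `buffers`, `side`, the buffer size `m`, and the thresholds `t_low`/`t_high` are ours, chosen to mirror Alg. 1 of [2]. It illustrates how only one boundary is stepped at a time, by a fixed $\Delta$, and how the buffered data is cleared after every decision.

```python
# Paraphrased sketch of AutoDR's boundary update (cf. Alg. 1 of [2]); names are ours.
def adr_update_boundary(bounds, buffers, dim, side, t_low, t_high, delta, m):
    """bounds[dim] = {"low": ..., "high": ...} for one dynamics dimension;
    buffers[(dim, side)] collects per-episode performance measured at that boundary."""
    buf = buffers[(dim, side)]
    if len(buf) < m:                       # wait until enough boundary episodes are collected
        return
    avg_perf = sum(buf) / len(buf)
    buf.clear()                            # data is discarded either way (CLEAR(D_i) in Alg. 1)
    sign = +1 if side == "high" else -1
    if avg_perf >= t_high:                 # expand this single boundary by a fixed step
        bounds[dim][side] += sign * delta
    elif avg_perf <= t_low:                # backtrack: shrink this single boundary
        bounds[dim][side] -= sign * delta
    # if t_low < avg_perf < t_high, nothing happens: the data served only this one decision
```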
- > The success rate for certain tasks, e.g., Walker in Figure 2 and PandaPush in Figure 11, declines after reaching maximum entropy. However, the algorithm does not dynamically reduce the entropy in response to a decrease in the success rate, which might be necessary for maintaining performance consistency.
We agree that the limited explanation in the main manuscript makes it hard to fully understand this phenomenon, despite our best effort to include as much content as possible given the page limit.
In our updated manuscript, and in response to the reviewer's comment, we delve into the details of this phenomenon in the Appendix:
1. **We added a novel sensitivity analysis in Sec. A.2**, where we show in particular the effects of the backup optimization problem in Fig. 8 and Fig. 9 for varying values of $\epsilon$;
2. **We added a novel Section A.3** to discuss the trade-off between in-distribution success rate and entropy for the Walker2D environment (Fig. 10), as suggested by the reviewer;
3. **We added an ablation analysis of DORAEMON in Fig. 13**, demonstrating that all variants that do not incorporate a backup optimization problem may fail to maintain a feasible performance constraint during training.
In particular, we point to the results in Fig. 10 for the Walker2D environment: notice how DORAEMON can steadily maintain a feasible in-distribution success rate during training, regardless of the choice of the hyperparameter $\alpha$. However, we observed that soundly tracking the in-distribution success rate does not necessarily translate to a better global success rate, which still shows a decreasing trend. We investigate this phenomenon in Fig. 9, which depicts a pair of distributions found at different times during training: despite seemingly having similar entropy values according to the plot, the backup optimization problem significantly changes the distribution parameters to maintain a feasible performance constraint and, in turn, moves to easier regions of the dynamics landscape. As a result, the global success rate is also affected. We further ablate the presence of the backup optimization problem in Fig. 13, and find that its contribution does not negatively affect the global success rate, which would decline regardless, even if we simply kept training the policy when the performance constraint is violated (i.e., as in the SPDL baselines).
- > Another concern lies in the constraint based on the success rate.
We believe this point to be essentially a matter of design choice. Our formulation makes use of the success rate to allow custom success notions to be defined, hence offering more flexibility. For example, consider the inclined plane task in Sec. 4.2: a trajectory is defined as successful when the cart is balanced around the center of the plane for longer than 25 timesteps. An average (expected) return formulation would be impractical in this case. Furthermore, a success notion allows the designer to encode task-specific tolerance to errors in the problem.
Overall, DORAEMON may still be used with success notions that are simply defined as a return threshold (as in our sim-to-sim analysis in Sec. 5.2), which would be equivalent to considering a *median* return formulation as opposed to the *average* return formulation suggested by the reviewer. This likely makes DORAEMON's optimization problem less affected by catastrophic returns (which do not affect the median performance), which we argue could be beneficial. Nevertheless, the importance-sampling estimator in Eq. (5) can be easily adjusted to consider an average-return performance constraint, as sketched below.
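For clarity, here is a minimal, hypothetical sketch of such an importance-sampling estimate in the spirit of Eq. (5); the function name and parametrization (`log_pdf`, `xis`, `outcomes`) are ours and do not mirror our codebase. Passing binary success indicators yields the success-rate constraint used in the paper; passing per-episode returns instead would yield the average-return variant.

```python
import numpy as np

def is_performance_estimate(nu_cand, nu_cur, xis, outcomes, log_pdf):
    """Reweight episodes collected under the current distribution nu_cur to
    estimate performance under a candidate distribution nu_cand.
    xis:      dynamics parameters sampled during the last training iteration
    outcomes: 1/0 success indicators (or per-episode returns for the average-return variant)
    log_pdf:  log-density of the parametric dynamics distribution, log p_nu(xi)"""
    weights = np.exp(log_pdf(nu_cand, xis) - log_pdf(nu_cur, xis))  # importance weights
    return float(np.mean(weights * outcomes))
```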
- > In practice, is it computationally easy to optimize (4)? The constrained problem does not seem to be convex (even under Beta distribution)?
DORAEMON is particularly efficient. We computed statistics over $500$ runs of our optimization problem (4), and observed an average duration of $4.19 \pm 1.72$ seconds. As we run (4) iteratively $150$ times per training session---$15$M timesteps, with $100000$ training timesteps in between each iteration---this amounts to approximately $10$ minutes of compute, compared to a total training time of about $16$ hours.
In particular, we make use of the SciPy library and its convenient "Trust-Region Constrained" algorithm (`method="trust-constr"`)---see line $897$ of `doraemon/doraemon.py` in our anonymous codebase in the supplementary material.
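For illustration only, the snippet below is a minimal, hypothetical sketch (not the code at line $897$ of `doraemon/doraemon.py`) of how such a constrained entropy-maximization step can be set up with SciPy's `trust-constr` solver. The packing of independent Beta parameters as `[a_1, b_1, a_2, b_2, ...]`, the KL direction, the bounds, and the default values are assumptions made for the sake of the example; `est_success_rate` would be an importance-sampling estimate like the one sketched above.

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint, Bounds
from scipy.special import betaln, digamma
from scipy.stats import beta

def neg_entropy(x):
    # Negative joint entropy of independent Beta factors (sum of marginal entropies)
    a, b = x[0::2], x[1::2]
    return -np.sum(beta.entropy(a, b))

def kl_betas(x, x_ref):
    # Closed-form KL( Beta(a1,b1) || Beta(a2,b2) ), summed over independent dimensions
    a1, b1, a2, b2 = x[0::2], x[1::2], x_ref[0::2], x_ref[1::2]
    return np.sum(betaln(a2, b2) - betaln(a1, b1)
                  + (a1 - a2) * digamma(a1) + (b1 - b2) * digamma(b1)
                  + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

def step_distribution(x_cur, est_success_rate, alpha=0.5, kl_eps=0.1):
    # Maximize entropy s.t. estimated success rate >= alpha, within a KL trust region of size kl_eps
    x0 = np.asarray(x_cur, dtype=float)
    constraints = [NonlinearConstraint(est_success_rate, alpha, np.inf),
                   NonlinearConstraint(lambda x: kl_betas(x, x0), 0.0, kl_eps)]
    result = minimize(neg_entropy, x0, method="trust-constr", constraints=constraints,
                      bounds=Bounds(np.full(x0.shape, 1e-3), np.full(x0.shape, np.inf)))
    return result.x
```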
- > Could the authors specify the alpha values for Figure 2 and Figure 11? When the entropy matches the max entropy, the global success rate aligns with the local success rate. If the alpha is set to 0.5, why does the global success rate drop below 0.5 when entropy is at its maximum?
We use $\alpha=0.5$ across all our experiments. The only parameter that is specifically tuned for each environment is the trust region size (see Sec. A.2 for more information).
Please refer to our comments above in this post for a detailed explanation of why the global success rate may decrease even though the entropy **appears** to be at its maximum.
- > How to design the entropy jointly for multiple dynamics parameters? (For example, simply taking the product of multiple univariate distributions like AutoDR?)
Correct. As we work with independent univariate Beta distributions, the joint distribution whose entropy needs to be computed is the product of these PDFs. We then simply sum the univariate entropies---a convenient property of the differential entropy when dealing with independent distributions, which can be easily verified (see the illustrative check below).
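As an illustration of this property (with arbitrary, purely illustrative parameter values of our own choosing), the short check below compares the sum of marginal Beta entropies against a Monte Carlo estimate of the joint differential entropy of the product distribution.

```python
import numpy as np
from scipy.stats import beta

# Arbitrary Beta parameters for three independent dynamics dimensions (illustrative only)
params = [(2.0, 5.0), (1.5, 1.5), (4.0, 2.0)]

# Sum of marginal differential entropies
sum_of_entropies = sum(beta.entropy(a, b) for a, b in params)

# Monte Carlo estimate of the joint entropy H(p) = -E[log p(x)], with p the product of marginals
rng = np.random.default_rng(0)
samples = [beta.rvs(a, b, size=200_000, random_state=rng) for a, b in params]
log_joint = sum(beta.logpdf(s, a, b) for s, (a, b) in zip(samples, params))
mc_entropy = -log_joint.mean()

print(sum_of_entropies, mc_entropy)  # the two values should closely agree
```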
- > In Section 3: The notation of reward function shall be consistent (mathcal or not)
Thank you for spotting the typo. We fixed the notation.
[1] Handa, A., et al. "DeXtreme: Transfer of Agile In-hand Manipulation from Simulation to Reality." arXiv (2022).
[2] Akkaya, I., et al. "Solving Rubik's Cube with a Robot Hand." arXiv (2019).