# ICLR'22 DiffSkill rebuttal

## General Response

We would like to thank the reviewers for their thoughtful feedback. We are glad to see that reviewers generally appreciated our paper: the rationality and novelty of our method (reviewers H1Qf, DHQA, wVGi, ifUw), the difficulty of the tasks our method can solve (reviewers H1Qf, DHQA, wVGi, ifUw), the contribution to the robotics community (reviewers H1Qf, DHQA), the generality of our method across different tasks (reviewers wVGi, H1Qf), and the clear presentation (reviewers H1Qf, DHQA).

In addition to the responses to individual reviewers, we summarize the added experiments and revisions to the paper here:

**[New experiments and discussion]**

- Comparison with an additional model-free RL baseline, i.e., SAC (Haarnoja et al., 2018), in Table 1.
- Comparison of DiffSkill with the Traj Opt and RL baselines on single-tool tasks in Appendix B. (Reviewer ifUw)
- Comparison of pre-training the VAE versus training the VAE jointly with the other losses in Appendix C. (Reviewer ifUw)
- Our website (https://sites.google.com/view/iclr2022diffskill) is updated with visualizations of the baseline methods. (Reviewers H1Qf, ifUw)
- Added a discussion of the limitations and future directions of our work in the conclusion section. (Reviewers H1Qf, wVGi)

We hope our responses have convincingly addressed all reviewers' concerns. We thank all reviewers for their time and effort! Please don't hesitate to let us know of any additional comments on the manuscript or the changes.

## R4 (Reviewer ifUw)

Thank you for your detailed feedback. We are glad that you find our method "inspirational". Below, we address each of your concerns in the weakness section with new experiments and clarifications.

> Is a skill dependent on the tool alone or the observations together as well? Specifically, if you have different start/end positions of the same short-horizon task using only one tool, are they considered the same skill?

We define separate skills based on the tool alone and not the observations. In our experiments, a single skill is applied to different observations. For example, in the LiftSpread task, the skill of using the rolling pin can be applied either when the dough is on the right or when the dough is on the cutting board. The same skill can also generalize to different environment configurations where the shape of the dough or the start/end positions of the tool are different.

> Why not training the VAE alone? Generally speaking, this module should be standalone.

Thank you for the great suggestion. We have performed an additional experiment comparing two approaches: one is our method in the initial submission, which trains the VAE jointly with the other modules; the other is to train the VAE first and then freeze the VAE encoder when training the other modules. The planning performance on the different tasks is shown below (Normalized Performance / Success Rate):

| | LiftSpread | GatherTransport | CutRearrange |
| ---------- | ----------- | ----------- | ----------- |
| Joint Training | 0.450 / 100% | 0.663 / 60% | 0.367 / 20% |
| Pre-train VAE | 0.438 / 100% | 0.654 / 60% | 0.450 / 40% |

We find that pre-training the VAE provides a slight performance gain, although it takes longer to train overall since we need to train the VAE first. We will include this comparison in the final version of our paper.
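To make the two training schemes concrete, here is a minimal PyTorch-style sketch. The `vae`, `feas_head`, and `reward_head` objects, their `.loss` interfaces, and all hyperparameters are hypothetical placeholders, not our actual implementation:

```python
# Minimal sketch (PyTorch-style) of the two training schemes compared above.
# Module interfaces and hyperparameters are hypothetical placeholders.
import torch

def train_joint(vae, feas_head, reward_head, loader, epochs=10, lr=1e-4):
    """Joint training: one optimizer over all modules, one combined loss."""
    params = list(vae.parameters()) + list(feas_head.parameters()) + list(reward_head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for batch in loader:
            z, recon_loss, kl_loss = vae(batch["obs"])           # shared encoder
            loss = (recon_loss + kl_loss
                    + feas_head.loss(z, batch)                    # feasibility loss
                    + reward_head.loss(z, batch))                 # reward loss
            opt.zero_grad(); loss.backward(); opt.step()

def train_pretrained(vae, feas_head, reward_head, loader, epochs=10, lr=1e-4):
    """Pre-train the VAE alone, then freeze its encoder for the other heads."""
    vae_opt = torch.optim.Adam(vae.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            _, recon_loss, kl_loss = vae(batch["obs"])
            vae_opt.zero_grad(); (recon_loss + kl_loss).backward(); vae_opt.step()
    for p in vae.parameters():                                    # freeze the VAE
        p.requires_grad_(False)
    head_opt = torch.optim.Adam(list(feas_head.parameters()) + list(reward_head.parameters()), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            with torch.no_grad():
                z, _, _ = vae(batch["obs"])                       # frozen encoder
            loss = feas_head.loss(z, batch) + reward_head.loss(z, batch)
            head_opt.zero_grad(); loss.backward(); head_opt.step()
```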
> How do you determine H in each long-horizon task? How does different H in one task impact the performance?

Currently, H is manually specified according to how many stages each task takes. For example, the LiftSpread task requires a two-stage execution of first lifting up the dough and then spreading it. Using a smaller H leads to complete failure of the task, since these challenging tasks inherently require multiple stages of execution with different tools. On the other hand, increasing H unnecessarily increases the difficulty of planning, and indeed we observe a decrease in performance with a larger H. Planning over more steps would be an interesting direction for future work. We have added a discussion on this point in the conclusion section.

> The method exhaustively search over all combinations of different skills in each small step. Does it mean that in some specific configurations where H and num_skills are large, the proposed method can actually be even slower than the trajectory optimization provided by differentiable physics? I would like to see more discussions or experiments regarding this matter.

First, we want to clarify the differences between our method (DiffSkill) and gradient-based trajectory optimization (GBTO): GBTO requires the full state information in the simulator, while DiffSkill directly takes RGBD images as input, so DiffSkill can be applied in the more general case where the full state is not known. This is important because estimating the full state in the real world can be very challenging. Additionally, as shown in Table 1, DiffSkill is able to solve the long-horizon tasks while GBTO cannot. For these two reasons, merely comparing the computation speed of DiffSkill and GBTO does not tell the full story. Nevertheless, we conducted experiments on LiftSpread (with two skills); the planning time for different plan steps is shown below.

| Plan step | 1 | 2 | 3 | 4 |
| ---------- | ----------- | ----------- | ----------- | ----------- |
| Planning time (s) | 11 | 31 | 107 | 223 |

The planning time for DiffSkill does grow exponentially as H increases (on the other hand, GBTO cannot solve this task at all). We have included this point in the limitation section. A potential future direction is to incorporate a policy or value function for more efficient planning. To reiterate, even though GBTO could theoretically be faster for a sufficiently large H, GBTO does not solve the discrete planning problem of which tool to use and thus would not solve any long-horizon multi-tool task directly.

> How did DiffSkill perform on single tool experiments?

We have conducted experiments comparing DiffSkill with the RL baseline on single-tool tasks in Appendix B, and we found that DiffSkill can robustly complete these tasks. Additionally, the RL baseline we compare to can also solve the easier task of lifting.

> Why Trajectory Opt has 0.544 improvement and 0% success rate in 'Tool A Only' while the numbers changed to 0.385/20% in 'Multi-Tool'? Aren't they positively correlated?

You are correct; thank you for pointing this out! This specific entry for Trajectory Opt with Multi-tool on the GatherTransport task has a typo, and the score should be 0.503. We have checked the rest of the entries carefully and there are no other typos in our results. We have updated this entry in the latest version of our paper. For completeness, we paste here the raw performance of Trajectory Opt using either Tool A only or Multi-tool for each of the 5 trials. Multi-tool has a 20% success rate as its performance surpasses the threshold of 0.65 on trial 5.
| | trial 1 | trial 2 | trial 3 | trial 4 | trial 5 | average |
| ---------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| Tool A only | 0.4927 | 0.5312 | 0.564 | 0.5046 | 0.6255 | 0.544 |
| Multi-tool | 0.4197 | 0.4357 | 0.4405 | 0.5679 | 0.6535 | 0.503 |

> It seems that all previous baselines somewhat fails in the experiments designed. I wonder how they would perform in simpler ones where they actually have achieved something meaningful (i.e. success rate>0), and how the proposed method would compare in these examples.

All baselines fail due to the challenges presented by the long-horizon manipulation tasks. The updated videos on our website may help illustrate how the baselines fail. To make our comparison with RL baselines more comprehensive, we compare with an additional model-free RL baseline on our multi-stage tasks: Soft Actor-Critic (SAC) [1]. The results are updated in Table 1 of the revised paper. We can see that SAC performs better than TD3 but is still much worse than DiffSkill. Videos of how the SAC baseline fails can be found on our website: SAC performs reasonable actions but gets stuck in local optima due to the long-horizon nature of the task. To further demonstrate the correctness of our SAC implementation, we compare with SAC on single-tool, single-stage tasks in Appendix B, where SAC solves the Lift task very well. Videos for the single-tool tasks are also updated on our website.

[1] Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." International conference on machine learning. PMLR, 2018.

**We hope that our response has addressed your concerns and turns your assessment to the positive side.**

*If you have any additional questions, please feel free to let us know during the rebuttal window.*

Best,
Authors

## R3 (Reviewer wVGi)

Thank you for your helpful feedback. We address each of your concerns below:

> Correct the typos In Equation 2, should it be maximizing ? and should C(k, z) be negative or positive?

Thank you for pointing out this typo. We have corrected Equation 2 in the updated version of the paper.

> When optimizing zi in equation (2), is it efficient/correct to first treat the problem as unconstrained and then project variables back to the constraint set? Can you directly solve it as a constrained optimization?

- Projected gradient descent (PGD) is a common approach for solving constrained optimization. When the objective function is convex and $\beta$-smooth on a constraint set that is also convex, PGD converges to the optimal solution (correctness) and has the same convergence rate as in the unconstrained case. We refer to Sections 3.1, 3.2 and Theorem 3.7 of the book by Bubeck (2015): http://sbubeck.com/Bubeck15.pdf.
- We use PGD instead of other constrained optimization methods because projection onto an l2-constrained set is simple and straightforward.
- Additionally, we note that PGD has also been used with neural network functions in other works, such as in the case of adversarial attacks [1].

[1] Madry, Aleksander, et al. "Towards deep learning models resistant to adversarial attacks." ICLR 2018.
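To make the projection step concrete, here is a minimal sketch of the planning loop: exhaustive search over the discrete skill sequences (as discussed in our response to Reviewer ifUw above), combined with gradient steps on the latent subgoals where each update is projected back onto an l2 ball. The function `plan_score` is a hypothetical differentiable stand-in for the feasibility and reward terms of Eqn. 2; this is a sketch, not our actual implementation.

```python
# Minimal sketch: exhaustive search over skill sequences plus projected
# gradient ascent on the latent subgoals. `plan_score` is a hypothetical
# differentiable stand-in for the feasibility and reward predictors.
import itertools
import torch

def project_l2(z, radius=1.0):
    """Project each latent vector back onto an l2 ball of the given radius."""
    norm = z.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    return torch.where(norm > radius, z * (radius / norm), z)

def plan(plan_score, num_skills, horizon, latent_dim, steps=100, lr=0.1):
    best_seq, best_z, best_val = None, None, -float("inf")
    # Exhaustive search: num_skills ** horizon discrete skill sequences.
    for seq in itertools.product(range(num_skills), repeat=horizon):
        z = torch.randn(horizon, latent_dim, requires_grad=True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            loss = -plan_score(seq, z)        # maximize the objective of Eqn. 2
            opt.zero_grad(); loss.backward(); opt.step()
            with torch.no_grad():
                z.copy_(project_l2(z))        # projection step of PGD
        val = plan_score(seq, z).item()
        if val > best_val:
            best_seq, best_z, best_val = seq, z.detach(), val
    return best_seq, best_z
```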
> What could be the possible limitation and future work of this method.

We have updated the conclusion section of our paper with a discussion of the limitations and future work. We paste it below:

- There are a few interesting directions for future work. First, DiffSkill currently uses exhaustive search for planning over the discrete space. As the planning horizon and the number of skills grow larger for more complex tasks, exhaustive search quickly becomes intractable. A potential solution is to incorporate a policy or value function for more efficient planning. Second, similar to many other data-driven methods, while the neural skill abstraction gives good prediction results in scenarios where a lot of data is available, it performs worse when tested on situations that differ more from training. This can be remedied either by collecting more diverse training data in simulation, for example under an online reinforcement learning framework, or by using a more structured representation beyond RGBD images, such as an object-centric representation. Third, we hope to extend our current results to the real world by using a more transferable representation, such as just a depth map or a point-cloud representation. Finally, we hope to see DiffSkill applied to other similar tasks, such as those related to cloth manipulation.

*We sincerely appreciate your comments. Please let us know if you have further feedback.*

Best,
Authors

## R2 (Reviewer H1Qf)

Thank you for your insightful comments and for finding our work "novel" and "beneficial for the robotics community". We address each of your concerns below.

> For the optimization problem in (2), shouldn't you maximize the cost function C(k,z) instead of minimizing it?

Thank you for pointing out this typo. We have corrected it in Eqn. 2 in the revised version of the paper.

> A video comparison of the resulting policies to the chosen baselines would be nice (RL, BC and trajectory optimizer). This could illustrate where the other methods fail.

Please see our updated website for videos of the baselines. The baselines are unable to complete the task as they get stuck in local optima.

> What are the possible limitations of this approach in practice?

We have updated the conclusion section of our paper with a discussion of the limitations and future work. We also paste it below:

- There are a few interesting directions for future work. First, DiffSkill currently uses exhaustive search for planning over the discrete space. As the planning horizon and the number of skills grow larger for more complex tasks, exhaustive search quickly becomes infeasible. An interesting direction is to incorporate a heuristic policy or value function for more efficient planning. Second, similar to many other data-driven methods, while the neural skill abstraction gives good prediction results in regions where a lot of data is available, it performs worse when tested on situations that differ more from training. This can be remedied either by collecting more diverse training data in simulation, for example under an online reinforcement learning framework, or by using a more structured representation beyond RGBD images, such as an object-centric representation. Third, we hope to extend our current results to the real world by using a more transferable representation, such as just a depth map or a point-cloud representation. Finally, we hope to see DiffSkill applied to other similar tasks, such as those related to cloth manipulation.

> Is the goal representation as an RGB-D image practical for example for real robot applications?

It is difficult to directly transfer our current RGB-D models to the real robot, since our simulator is not photorealistic and we do not render any robot arm.
There are two potential directions for better sim2real transfer. The first is to make the rendering in the simulator more photorealistic, or to use domain randomization [1]. The second, as discussed in the conclusion section, is to use a more transferable representation, such as just a depth map or a point-cloud representation [2].

[1] Tobin, Josh, et al. "Domain randomization for transferring deep neural networks from simulation to the real world." IROS 2017.

[2] Lin, Xingyu, et al. "Learning Visible Connectivity Dynamics for Cloth Smoothing." CoRL 2021.

> How critical is resetting tools to initial poses at the end of each skill policy execution?

We apply this reset procedure because it is easy to do both in simulation and on real robots, and it generally simplifies the planning problem. Without the reset procedure, the planner would also need to reason about collisions of the tools with each other and plan extra steps to avoid such collisions.

*We sincerely appreciate your comments. Please let us know if you have further feedback.*

Best,
Authors

## R1 (Reviewer DHQA)

Thank you for your positive feedback. We are glad to hear that you find our work to be "a great contribution to the robotic manipulation literature". We address each of your concerns below.

> W.1 Maybe this is just me but the writing could be a bit more polished for accessibility. It took me 2 reads for it to "click" that when you write "neural skill abstractor" in the abstract and intro, what you mean is that you're learning image-based control from a state-based ("expert") policy with the purpose of applying this in an image-based setting. I think just adding a sentence or two to explain this or create more of an intuition why we need to learn this would go a long way. The same is true imho for the fact that you're discovering intermediate goals via searching over z.

Thank you for the writing suggestion. We have updated the corresponding text in the abstract and introduction in our revised version, highlighted in red.

> W.2 There's a couple typos in the main body of the text but I'm sure you can fix those. Also in the first sentence, calling elder care "deformable object manipulation" is... interesting.

We have fixed all the typos that we found and clarified the elder care example in the introduction. To also clarify here, the task of elder care involves assistive dressing, feeding, or bed bathing [1], which involve manipulation of fabrics and food or lifting and cleaning the human body, all of which are deformable/non-rigid.

[1] Erickson, Zackory, et al. "Assistive gym: A physics simulation framework for assistive robotics." ICRA 2020.

> When mentioning differentiable physics and planning therein, it would be nice to also mention the recent work on GradSim (Jatavallabhula, Krishna Murthy, Miles Macklin, Florian Golemo, Vikram Voleti, Linda Petrini, Martin Weiss, Breandan Considine et al. "gradSim: Differentiable simulation for system identification and visuomotor control." ICLR 2021.)

Thank you for the reference. We have added this citation.

> Why did you decide to train all network components simultaneously with one big loss term? Did you try training them separately? Are there any upsides/downsides to that?

* The reason for training them with a single loss is that the feasibility predictor, the reward predictor, and the VAE model in our framework share the same encoder, so the losses for training each prediction head can affect one another. The architecture is described more formally in Section 3.3 of the paper, under the variational auto-encoder block.
* Sharing the encoder also allows us to do planning more efficiently. During planning, DiffSkill searches for the intermediate goal images in the latent space $\mathbf{z}$, as seen in Eqn. 2. Sharing the encoder among the different modules means that the latent space is also shared, which enables us to directly evaluate the feasibility and reward of each plan using the latent vectors as input, instead of first decoding them into images.
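For illustration, a minimal sketch of such a shared-encoder model with separate prediction heads might look as follows. The architecture, dimensions, inputs, and loss weighting here are hypothetical placeholders and far simpler than our actual model:

```python
# Minimal sketch of a shared-encoder model with separate prediction heads.
# Names, dimensions, and inputs are hypothetical placeholders.
import torch
import torch.nn as nn

class SharedEncoderModel(nn.Module):
    def __init__(self, obs_dim=1024, latent_dim=32, num_skills=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 2 * latent_dim))   # mean and log-variance
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, obs_dim))
        # Both heads consume latent vectors, so plans can be scored without decoding images.
        self.feasibility = nn.Sequential(nn.Linear(2 * latent_dim + num_skills, 128),
                                         nn.ReLU(), nn.Linear(128, 1))
        self.reward = nn.Sequential(nn.Linear(2 * latent_dim, 128),
                                    nn.ReLU(), nn.Linear(128, 1))

    def encode(self, obs):
        mu, logvar = self.encoder(obs).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()           # reparameterization
        return z, mu, logvar

    def loss(self, obs, goal, skill_onehot, feasible_label, reward_label):
        z_obs, mu, logvar = self.encode(obs)
        z_goal, _, _ = self.encode(goal)
        recon = ((self.decoder(z_obs) - obs) ** 2).mean()
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        feas = nn.functional.binary_cross_entropy_with_logits(
            self.feasibility(torch.cat([z_obs, z_goal, skill_onehot], dim=-1)).squeeze(-1),
            feasible_label)
        rew = ((self.reward(torch.cat([z_obs, z_goal], dim=-1)).squeeze(-1) - reward_label) ** 2).mean()
        return recon + kl + feas + rew   # one combined loss over the shared encoder
```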
*We sincerely appreciate your comments. Please let us know if you have further feedback.*

Best,
Authors

## Experiments / Visualization to add

- [x] Running SAC baseline, or show RL working on easier examples
  - Updated new results in Appendix B.
- [x] Add visualization of RL, show why they fail
  - See here: https://www.notion.so/notebookxingyu/SAC-Visualization-726880941aa0445d8863b720bdf53318
- [x] Add visualization of Traj opt and BC
- [x] Compare DiffSkill with RL baselines on single-stage tasks
- [x] Run RL on single-stage task. Directly run single-stage policy?
- [x] Try different horizon H for DiffSkill
  - Preliminary result
- [x] Compare performance using a pre-trained VAE vs. training the VAE along with other losses
  - In progress

## Modification to the paper

- [x] Added future work and limitations in the conclusion
- [x] Modify introduction
- [x] Fix all typos
