# 2022-05-22 Pupperfetch Progress & Lessons

### Base+arm policy

We wanted to make a 3-way comparison between different ways of training the base+arm policy:

- (a) training base and arm separately
- (b) training base first, then arm
- (c) training both together <-- This would be the obvious winner

For (a, "training base and arm separately"), we want to include the arm when training but either zero it out or send random actions. Surprisingly, this doesn't work at all. The policy needs control over the arm, otherwise it doesn't learn to compensate for it - both when the arm is zeroed and when it's moving randomly. (There is possibly some bug in here somewhere, but the task is very, very similar to the other two and those work fine. Will revisit later.)

Training a reaching policy with the arm on its own was easy and leads to a better overall grasping range than when the arm is mounted on the pupper. This makes sense: the fully stretched-out arm can tip the Go1 over and terminate the episode, whereas when training only the arm, its base is fixed.

![](https://i.imgur.com/0aHnY4M.gif)

Fig. 1 - Benchmarking the standalone Reactor arm reaching policy, qualitative. Goals are tested sequentially (negative X, positive X, negative Y, positive Y, etc.).

![](https://i.imgur.com/taSXBmk.png)

Fig. 2 - Benchmarking the standalone Reactor arm reaching policy, quantitative (red line is the desired target, blue line is the actual position). Each column is one axis (x/y/z) and they are tested sequentially, not in parallel.

For (b, "training base first, then arm"), we give the policy control over the arm but attach no reward to arm movement for the first part of training. Once that is fully trained, we add an additional reward term for the arm. Interestingly, in phase 1 the base learns to use the arm for balance.

![](https://i.imgur.com/IIm3TVx.gif)

Fig. 3a - Benchmark of the phase-1 policy on A1+ErgoJr, qualitative. The walking part is fine. The policy uses the arm for balance.

![](https://i.imgur.com/tWtBBlO.png)

Fig. 3b - Benchmark of the phase-1 policy, quantitative. As expected, the arm performance is random.

![](https://i.imgur.com/tHRWnw5.gif)

Fig. 4a - Same thing but with the Go1+ReactorX.

---

For phase 2, I tried using the same rewards as the walking policy and only adding the arm-reach term, but this fails and learns nothing for the arm. After some experimentation, the only setting that learns anything for the arm is scaling the arm reward term to 8x its normal value - and then it unlearns the walking policy, sits down, and only does reaching.

![](https://i.imgur.com/n340Mv0.gif)

Fig. 5a - Benchmark of the phase-2 policy (A1+ErgoJr), qualitative. The walking is completely gone and it just sits down, even though the goal moves with the robot. The reason is probably the recoil from the arm: when the arm moves towards the goal, the base can tip over, and the most stable way to prevent episode termination is to just sit.

![](https://i.imgur.com/tUr3teY.png)

Fig. 5b - Benchmark of the phase-2 policy, quantitative. The reaching performance is still not great; it only covers a fraction of the total range.

![](https://i.imgur.com/QqGSxdY.gif)

Fig. 6 - Same thing on Go1+ReactorX.

For (c, "training both together"), this worked fine after an HP search over the 3 main reward terms (velocity-command-following, heading-command-following, and arm-command-following).
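For concreteness, here is a minimal sketch of how these three terms might combine into a single scalar reward. The term names, exponential shaping, and weights are assumptions for illustration only - the actual tuned values came out of the HP search.

```python
import numpy as np

# Hypothetical reward combination for variant (c). Names, shaping, and
# weights are illustrative assumptions, not the tuned values.
REWARD_WEIGHTS = {
    "velocity_command": 1.0,  # follow the commanded base linear velocity
    "heading_command": 0.5,   # follow the commanded base heading
    "arm_command": 1.0,       # reach the commanded end-effector target
}

def combined_reward(state, command, weights=REWARD_WEIGHTS):
    """Weighted sum of the three command-following terms."""
    r_vel = np.exp(-np.sum((state["base_lin_vel"] - command["lin_vel"]) ** 2))
    r_heading = np.exp(-(state["base_heading"] - command["heading"]) ** 2)
    r_arm = np.exp(-np.sum((state["ee_pos"] - command["ee_target"]) ** 2))
    return (weights["velocity_command"] * r_vel
            + weights["heading_command"] * r_heading
            + weights["arm_command"] * r_arm)

# In the phase-2 experiments above, only scaling the arm term to ~8x its
# normal weight produced any arm learning - at the cost of the walking gait.
```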
It also works better when using a grasping curriculum that steadily increases the reach of the arm, compared to one that samples goals in the entire reach of the arm.

![](https://i.imgur.com/Ueiak7p.gif)

Fig. 7 - Benchmark of the joint policy, qualitative. This works pretty well on all fronts.

![](https://i.imgur.com/xBBhmLX.png)

Fig. 8 - Benchmark of the joint policy, quantitative. Except for negative X on the gripper, these are all great, and that one is very hard to reach.

So in conclusion, (a) <<< (b) <<< (c). They are miles apart from each other, and it feels kinda wrong to even compare these approaches.

---

### Visual Navigation

Simon trained three policies (a, b, c are independent from the above):

- (a) one that just learns control with camera input (without floor lidar), with velocity commands. -> this works
- (b) one with the goal encoded in the state (e.g. "the goal is 2m in front of you and 0.5m to the right"), and also the full proprioceptive state. -> this works
- (c) one with the goal encoded only in the image, as a red ball that's visible from the start, and also the full proprioceptive state. -> this just doesn't work at all. The policy doesn't learn any movement and just keeps crashing.

To address that problem, we'll try two things:

- Since (a) worked and we can control the robot from vision and state, we can learn a high-level policy that outputs velocity commands for the low-level policy that outputs joint commands.
- If that doesn't work or is too hard, we can try behavior cloning: create a dataset from the policy trained in (b) and train a network to imitate, from images and states, what that policy does from states alone (see the sketch below).
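If we go the behavior-cloning route, the setup could look roughly like the following: roll out the state-based teacher from (b), log what the image+state student would see together with the teacher's action, and regress the student onto those actions. Everything here (the env API, network shapes, and function names) is a hypothetical sketch under those assumptions, not existing code.

```python
import torch
import torch.nn as nn

def collect_dataset(env, teacher_policy, num_steps=100_000):
    """Roll out the state-based teacher (variant b) and record what the
    student will later see: camera image + proprioceptive state."""
    data = []
    obs = env.reset()
    for _ in range(num_steps):
        action = teacher_policy(obs["state_with_goal"])      # teacher sees goal-in-state
        data.append((obs["image"], obs["proprio"], action))  # student input + target
        obs, _, done, _ = env.step(action)
        if done:
            obs = env.reset()
    return data  # batching / tensor conversion omitted for brevity

class StudentPolicy(nn.Module):
    """Image + proprio -> action, imitating the teacher."""
    def __init__(self, proprio_dim, action_dim):
        super().__init__()
        self.encoder = nn.Sequential(   # small CNN over the camera image
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(32 + proprio_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, image, proprio):
        return self.head(torch.cat([self.encoder(image), proprio], dim=-1))

def train_bc(student, loader, epochs=10, lr=1e-4):
    """Plain behavior cloning: regress student actions onto teacher actions.
    `loader` is assumed to yield batched tensors (image, proprio, action)."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for image, proprio, teacher_action in loader:
            loss = nn.functional.mse_loss(student(image, proprio), teacher_action)
            opt.zero_grad()
            loss.backward()
            opt.step()
```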