# STEVE

## 2022-05-31

https://hackmd.io/qZtEpw97Sc6sRIsA_duS-g

https://wandb.ai/yifu/ocrl/reports/5-31-2022--VmlldzoyMDkxMzI4?accessToken=po774gotw2h8yq7jghx4hc8b0s2to1sq5ifmmjx65tdoyvz4x4p377b6fnz6i41t

- Avoidance w/ color change
    - Agent gets a positive reward when interacting with the green ball and a negative reward when interacting with the blue ball
    - (Optionally) the green and blue balls swap colors when they interact
    - The agent now needs to be aware of the other balls' colors and how they interact
        - A local (e.g. last-2-timestep) policy may not work as well
    - Can be used as a transfer task

![](https://i.imgur.com/PB1NXll.gif) ![](https://i.imgur.com/EeI3LWw.gif) ![](https://i.imgur.com/0BOSvkd.gif) ![](https://i.imgur.com/z851BYi.gif)

## 2022-05-24

https://wandb.ai/yifu/ocrl/reports/5-24-2022--VmlldzoyMDU1ODcx?accessToken=w8uw9q12xnfct2049sp359dqexjq83ih8obsuziz6suqszjoaogkilccf9hgmas3

## 2022-05-10

- Try other OC models
    - CSWM
    - Lower-dim slots
    - Try a pretrained CNN VAE
- Try predicting properties (color of balls)
    - Expect the CNN to ignore them if only trained with reward
    - Train a feature classifier
    - Investigate CNN features
    - What is possible:
        - Size
        - Shape
        - Color
        - Position?
- First predict properties, then try OOD generalization RL based on those properties
- Avoidance task, previous result:

| Model | 3 Balls | 4 Balls | 5 Balls |
| -------- | -------- | -------- | ------- |
| Pixels (Conv) | -1.62 (1.32) | -6.78 (14.03) | -10.14 (14.28) |
| Self-attention | -1.51 (1.43) | -3.5 (3.67) | -7.65 (10.84) |
| Deepset | -1.49 (1.35) | -1.84 (1.41) | -3.62 (5.56) |
| TF (cls token) | -1.44 (0.94) | -2.35 (2.13) | -3.48 (3.21) |
| MLP | -3.45 (8.95) | -- | -- |

- After tuning, pixels (CNN) does much better, even for generalization:

![](https://i.imgur.com/g7DaWMl.png)

- Pixels currently also perform better in the IL setting

## 2022-05-03

### CausalWorld

- GT states
    - Random initial position working now ![](https://i.imgur.com/zK5YtlP.png)
    - Green: fixed start positions ![](https://i.imgur.com/hu6f0YF.gif)
    - Blue: fixed start tool, random goal ![](https://i.imgur.com/4W2BVC5.gif)
    - Red: random start tool, random goal ![](https://i.imgur.com/4PMROFq.gif)
- STEVE + robot
    - Still not learning correctly (colors are different hyperparameters) ![](https://i.imgur.com/Z98xuDh.png)
    - Added some caching to speed up training (3M steps take ~16 hours, compared to 36 hours previously; GT takes ~3 hrs)
- Idea: try imitation learning (see the sketch at the end of this entry)
    - Obtain an "expert" policy by training with GT states
    - Then use behavioral cloning to train on pixels (slots)
    - Can try generalizing to more objects than were trained on (?)
    - Compare CNN vs. slots vs. GT (oracle)
    - MLP vs. Deep Set vs. self-attention vs. GNN
    - Sample efficiency vs. pixels?
    - Fine-tuning with RL?
- Previous papers that used IL for policy learning
    - Policy Architectures for Compositional Generalization in Control
    - The (Un)Surprising Effectiveness of Pre-Trained Vision Models for Control
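A minimal sketch of the proposed distillation, assuming a GT-state expert and a pretrained STEVE encoder are already available; `expert_policy`, `steve_encoder`, the data loader, and the deep-set aggregation are all placeholder choices, not a final design:

```python
import torch
import torch.nn as nn

class SlotBCPolicy(nn.Module):
    """Deep-set policy over slots, trained by behavioral cloning (sketch)."""
    def __init__(self, slot_dim, action_dim, hidden=128):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(slot_dim, hidden), nn.ReLU())
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))

    def forward(self, slots):                 # slots: (B, num_slots, slot_dim)
        return self.rho(self.phi(slots).sum(dim=1))

def bc_epoch(policy, steve_encoder, expert_policy, loader, opt):
    """One behavioral-cloning epoch: regress the GT-state expert's actions
    from slot features. `loader` yields (frames, gt_state) pairs."""
    for frames, gt_state in loader:
        with torch.no_grad():
            slots = steve_encoder(frames)     # assumed: (B, num_slots, slot_dim)
            target = expert_policy(gt_state)  # expert actions as labels
        loss = nn.functional.mse_loss(policy(slots), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Because the policy is a set function over slots, the same weights can in principle be evaluated on scenes with more objects than seen during training, which is the generalization test mentioned above.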
### OP3

- Done:
    - CATER
    - CATERTex
    - Movi-E
- Re-training:
    - Bouncing Sprites
    - Movi-Solid
    - Movi-Tex
- TODO: generate an OOD dataset for Movi-Solid

## 2022-04-26

### CausalWorld

- Training from GT states using SAC converges much more quickly ![](https://i.imgur.com/H3xtxhS.png)
    - Red: SAC using dense reward (delta object distance to goal, delta gripper distance to object, fractional overlap) ![](https://i.imgur.com/uQPrNk2.gif =200x)
    - Pink: SAC using the distance-to-goal reward only (distreward) ![](https://i.imgur.com/BfYR4jl.gif =200x)
    - Orange / Blue: PPO using dense reward (different hyperparameters) ![](https://i.imgur.com/L96k2Ga.gif =220x)
    - Gray: SAC using distreward w/ random start positions ![](https://i.imgur.com/hjVljsP.gif =200x)
    - Green: SAC using dense reward w/ random start positions ![](https://i.imgur.com/uUPDVOH.gif =200x)
- Using image + robot state is not working yet ![](https://i.imgur.com/DOYy1nn.png) ![](https://i.imgur.com/x0aXUXe.gif =200x)
- STEVE + robot: still finishing the implementation
- OP3 progress
    - Finished training for all seeds except Movi-E
    - Need to run eval

## 2022-04-20

### CausalWorld

- Training **from GT states** is able to start solving the task near 100M timesteps ![](https://i.imgur.com/zwywk7F.png) ![](https://i.imgur.com/Vn9HD9t.gif)
- So far, using images or slots has not learned yet, but those runs are only around 10-15M steps in (~1.5 days) ![](https://i.imgur.com/qWoLyWA.png)
- The Representation Learning for OOD paper is able to learn with 3M frames using:
    - pretrained VAEs
    - SAC (instead of PPO)
    - the target position as input (instead of visual input)
    - allowing only 1 finger to move in RL experiments (TODO)
- IsaacGym (https://tetexiao.com/projects/mvp)
    - Simpler mechanics (gripper)
    - Solves the task more quickly using their model (a ViT pretrained on in-the-wild data) ![](https://i.imgur.com/JIh1s3e.png)
    - 16 million steps in ~10 hrs (vs. 36 hrs for CausalWorld)

### Avoidance

- Stacking slots (3 slots) seems to help (see the sketch at the end of this entry):
    - Previous policies only stayed in the corner

![](https://i.imgur.com/a27l4FQ.gif) ![](https://i.imgur.com/KMypCQu.png)

### STEVE

- Also running OP3, but stopped due to a disk error; will restart.
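A rough sketch of the slot-stacking idea from the Avoidance section above, under the assumption that it means feeding the policy the slots from the last few timesteps rather than only the current ones (the class and names are hypothetical):

```python
from collections import deque
import torch

class SlotStacker:
    """Keep the slots from the last k timesteps and concatenate them per
    slot along the feature dimension before they go to the policy (sketch)."""
    def __init__(self, k=3):
        self.k = k
        self.buf = deque(maxlen=k)

    def reset(self):
        self.buf.clear()

    def __call__(self, slots):            # slots: (num_slots, slot_dim)
        self.buf.append(slots)
        while len(self.buf) < self.k:     # at episode start, pad by repeating
            self.buf.appendleft(slots)
        # -> (num_slots, k * slot_dim); assumes STEVE keeps slot identities
        #    consistent across timesteps, so index i refers to the same object
        return torch.cat(list(self.buf), dim=-1)
```

The policy still sees one vector per slot, but each vector now carries short-horizon motion information that single-timestep slots may lack.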
## 2022-04-14

- CausalWorld ![](https://i.imgur.com/BcTiYcN.png)
    - Green: using GT state (blocks + robot)
    - Blue: image w/ robot state
    - Orange: slots w/ robot state (self-attention)
- Concern about training time:
    - The Amazon paper trained for several hundred million timesteps ![](https://i.imgur.com/7ejFwrH.png) ![](https://i.imgur.com/tcaURLO.png)
    - With GT state, this may take ~1.2 days per 100 million steps
    - Using images is ~5-10x slower (because of PyBullet rendering time)
    - Potentially look at the IsaacGym env (https://tetexiao.com/projects/mvp) ![](https://i.imgur.com/SHspnAG.png)
- GT state policies ![](https://i.imgur.com/wv2PDfQ.gif) ![](https://i.imgur.com/UMpxsie.gif) ![](https://i.imgur.com/2LkZQ5A.gif)
- Image policies ![](https://i.imgur.com/M1UwDZP.gif) ![](https://i.imgur.com/YJZyEbI.gif) ![](https://i.imgur.com/4tEVlrP.gif)
- Slot policies ![](https://i.imgur.com/jUTYYGS.gif) ![](https://i.imgur.com/MuGdVs8.gif) ![](https://i.imgur.com/gYJLMxH.gif)

Debugging notes:

- Important to use an embedding for the robot state (see the sketch at the end of this entry)
    - I think this helps scale the input for the policy network
    - Otherwise, the robot falls into a degenerate policy
- Avoidance task ![](https://i.imgur.com/Uk89KK2.png)
    - Orange: pixels
    - Others: slots
    - Currently, slots converge to a worse policy than pixels
    - Possibly the slots only contain the information needed to reconstruct (no dynamics?) -> try using previous timesteps' slots as well
    - Need to add caching to speed up learning
- Other thoughts
    - STEVE ok to use? Maybe SAVi or even SLATE?
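A sketch of the robot-state embedding trick from the debugging notes above, assuming the proprioceptive state is a flat vector; the layer widths and names are guesses rather than the exact architecture used:

```python
import torch
import torch.nn as nn

class FusedPolicyInput(nn.Module):
    """Embed the raw robot state with a small MLP before concatenating it
    with visual/slot features, instead of feeding raw joint values directly
    (sketch of the scaling trick from the debugging notes)."""
    def __init__(self, robot_dim, embed_dim=64):
        super().__init__()
        self.robot_embed = nn.Sequential(
            nn.Linear(robot_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
        )

    def forward(self, visual_feat, robot_state):
        # visual_feat: (B, feat_dim) pooled slots or CNN features
        # robot_state: (B, robot_dim) raw joint positions/velocities
        return torch.cat([visual_feat, self.robot_embed(robot_state)], dim=-1)
```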
## 2022-04-07

- Able to get a policy that avoids the balls
    - Use TD(0) instead of TD($\lambda$) (i.e. one-step temporal difference)

![](https://i.imgur.com/NHGkNwx.gif) ![](https://i.imgur.com/BsduuJz.gif) ![](https://i.imgur.com/3wyDQXK.gif) ![](https://i.imgur.com/3F6QULW.gif) ![](https://i.imgur.com/HkMDjiI.gif)

- Slot-based methods train more quickly in terms of number of steps, but each step is much slower (~3x slower per step) ![](https://i.imgur.com/ppqrIPT.png)
- Deepset, self-attention, and Transformer perform similarly so far ![](https://i.imgur.com/QrLE5FE.png)
- Fine-tuning
    - Original attention (before fine-tuning) ![](https://i.imgur.com/XY2w8fd.gif) ![](https://i.imgur.com/Su36pgT.gif)
    - After some fine-tuning ![](https://i.imgur.com/z8yZmZj.gif) ![](https://i.imgur.com/XAyWPxM.gif)
    - From scratch ![](https://i.imgur.com/7dJCVeN.gif) ![](https://i.imgur.com/am7sn0U.gif)
        - Looks like early-stage training of STEVE
    - Slot-only (no decoder) ![](https://i.imgur.com/ldecdpB.gif) ![](https://i.imgur.com/Pa40iCh.gif)
- To try: alternate training the model + policy (similar to Dreamer)

## 2022-03-29

| Model | 3 Balls | 4 Balls | 5 Balls |
| -------- | -------- | -------- | ------- |
| Pixels (Conv) | -1.62 (1.32) | -6.78 (14.03) | -10.14 (14.28) |
| Self-attention | -1.51 (1.43) | -3.5 (3.67) | -7.65 (10.84) |
| Deepset | -1.49 (1.35) | -1.84 (1.41) | -3.62 (5.56) |
| TF (cls token) | -1.44 (0.94) | -2.35 (2.13) | -3.48 (3.21) |
| MLP | -3.45 (8.95) | -- | -- |

- Converged policies still move the agent to the corner
    - The reward is sparse once the agent is in the corner
- Trying 2 modifications (see the reward sketch after this entry):
    - Dense reward depending on the distance to the closest ball ![](https://i.imgur.com/killldD.gif) ![](https://i.imgur.com/5meWiTs.gif) ![](https://i.imgur.com/nfzIE0K.gif)
    - Small positive reward per timestep, and a large negative reward + episode end when a ball hits the agent ![](https://i.imgur.com/bdtMtH6.gif) ![](https://i.imgur.com/Y6wMxb7.gif) ![](https://i.imgur.com/rzmm3Hy.gif)
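A hedged sketch of the two reward modifications above, assuming the environment exposes agent/ball positions and a collision flag; the function names, signs, and magnitudes are illustrative only:

```python
import numpy as np

def dense_avoidance_reward(agent_pos, ball_positions):
    """Variant 1: dense reward that grows with the distance to the
    closest ball (one plausible shaping; could also be cast as a penalty)."""
    dists = np.linalg.norm(np.asarray(ball_positions) - np.asarray(agent_pos), axis=-1)
    return float(dists.min())

def survival_reward(agent_hit, alive_bonus=0.1, hit_penalty=10.0):
    """Variant 2: small positive reward every timestep; large negative
    reward and episode termination when a ball hits the agent.
    Returns (reward, done)."""
    if agent_hit:
        return -hit_penalty, True
    return alive_bonus, False
```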
## 2022-03-24

- Updated results using the framestacking setup: the policy is given 3 frames per action. STEVE is run over the 3 frames and the last frame's slots are used for the actor/critic. The previous result used slots as a recurrent state between policy steps.

![](https://i.imgur.com/9svmEfK.png)

- Light blue: pixels
- Orange: self-attention
- Gray: Deepset
- Pink: Transformer, pre-predictor slots
- Green: Transformer
- Dark blue: Transformer, completely separate actor + critic networks
- Red: MLP

All models trained with 3 balls:

| Model | 3 Balls | 4 Balls | 5 Balls |
| -------- | -------- | -------- | ------- |
| Pixels | -2.18 (8.65) | -6.86 (13.16) | -9.92 (16.78) |
| Self-attention | -1.36 (1.42) | -2.93 (5.26) | -10.43 (20.27) |
| Deepset | -1.38 (1.14) | -2.66 (3.11) | -3.5 (5.17) |

- Upon closer inspection, there is a bug in the environment where balls sometimes get stuck. This may significantly affect RL + evaluation. Will fix and re-run.

![](https://i.imgur.com/M4dORAy.gif)

- Sample policies:
    - Pixels: ![](https://i.imgur.com/YwSy4FC.gif) ![](https://i.imgur.com/KmOgCTn.gif)
    - Self-attention: ![](https://i.imgur.com/wIxVFAD.gif) ![](https://i.imgur.com/X503gKZ.gif)
    - Deepset: ![](https://i.imgur.com/lUKoYCw.gif) ![](https://i.imgur.com/v9P3syx.gif)

## 2022-03-17

- STEVE on the avoidance task
    - Cannot stably train when the balls are always the same color ![](https://i.imgur.com/k9fxJPG.gif =400x) ![](https://i.imgur.com/hjTa93C.gif =400x) ![](https://i.imgur.com/72OT1kj.gif =400x)
    - Varying the colors of the balls helps:
        - Random colors for the non-agent balls ![](https://i.imgur.com/4lvNQ3D.gif)
            - The agent (red) gets the same slot as the background
        - Random colors for all balls ![](https://i.imgur.com/Ff19XA0.gif)
- RL results ![](https://i.imgur.com/Nv5girW.png)
    - Orange: from pixels
    - Dark blue: Transformer w/ cls token
    - Higher red: Transformer w/ cls token, using pre-predictor slots
    - Light blue: Transformer w/ cls token (small)
    - Green: MLP
    - Gray: DeepSet
    - Lower red: self-attention (Transformer w/ sum)

## 2022-03-10

- Ball avoidance task ![](https://i.imgur.com/k8vkpYy.gif) ![](https://i.imgur.com/Qb2APC4.png =200x)
    - Orange: ConvNet, trained with 3 balls
        - Eval 3 balls: mean -3.31, std 2.41
        - Eval 4 balls: mean -6.23, std 3.59
    - Pink: pretrained slots
        - Bad slot segmentations: ![](https://i.imgur.com/2OgltFI.gif =200x)
    - Also tried training E2E (slightly different dataset and architecture), but currently also not getting great slot representations: ![](https://i.imgur.com/qOs2sWb.gif)

## 2022-03-01

- CausalWorld
    - Still unable to learn in pixel space
    - Tried changing the dense reward to match the Amazon paper: IoU + delta distance to goal + curiosity
    - Takes about 2 days to reach 10 million steps
        - The Amazon paper took 10 million steps with object states (not images)
- STOVE ball avoidance task ![](https://i.imgur.com/UsPzRH6.gif)
    - Image pixels: ![](https://i.imgur.com/DJI4LCQ.png)

## 2022-02-23

STEVE + RL

- Implemented using RLlib ![](https://i.imgur.com/D9exEGx.png)
    - Gray: image A2C
    - Orange: image PPO
    - Light blue: slot A2C
    - Pink: slot PPO
- Sample trajectories ![](https://i.imgur.com/DkhRtbh.gif =200x) ![](https://i.imgur.com/4oSpITw.gif =200x)
- Currently pretraining STEVE and freezing its weights (see the sketch after this entry) ![](https://i.imgur.com/0qdVDFm.gif)
- Other experiments to try
    - SAVi
    - Jointly train
    - Pretrain with grad
    - Train with reward only (?)
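A sketch of the frozen-encoder setup, independent of the RLlib plumbing: wrap the pretrained STEVE model, disable its gradients, and expose pooled slot features to the actor/critic heads. The `steve` callable and its output shape are assumptions.

```python
import torch
import torch.nn as nn

class FrozenSteveFeatures(nn.Module):
    """Pretrained STEVE as a frozen feature extractor for RL (sketch).
    `steve` is assumed to map an image batch to slots (B, num_slots, slot_dim)."""
    def __init__(self, steve, slot_dim, out_dim=256):
        super().__init__()
        self.steve = steve.eval()
        for p in self.steve.parameters():       # freeze pretrained weights
            p.requires_grad_(False)
        self.head = nn.Sequential(nn.Linear(slot_dim, out_dim), nn.ReLU())

    def forward(self, obs):
        with torch.no_grad():                   # no gradients into STEVE
            slots = self.steve(obs)
        return self.head(slots).sum(dim=1)      # permutation-invariant pooling
```

In RLlib this would live inside a custom model feeding the policy and value heads; the freezing is what separates the current runs from the "pretrain with grad" and joint-training variants listed above.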
## 2022-02-15

### CATER

| | MLP Predictor | RNN Predictor | Transformer Predictor | OCVT |
| ---- | ------------- | ------------- | --------------------- | --- |
| Top1 | .12 | .44 | .42 | .76 |
| Top5 | .40 | .74 | .80 | .95 |

- The slots themselves may not contain temporal information
    - For RL, this may not work with POMDP tasks

### CausalWorld

![](https://i.imgur.com/7LfGSgw.gif =200x)

- The task is goal-conditioned
    - For the multi-object case, we need to specify which object to move
- Try a simple version with 1 object, where the goal is always in the same place, with a dense reward
    - The dense reward is based on the distance of the object to the goal and the distance of the end effectors to the object
- A2C and PPO on just images: currently the policy is just learning to avoid the object ![](https://i.imgur.com/eIinwew.gif) ![](https://i.imgur.com/dvEJ0TE.gif)
- Next:
    - Debug image RL
    - Attach the STEVE / SAVi encoder as a feature extractor
- STEVE (12 slots) ![](https://i.imgur.com/VZFEMAJ.gif)
- STEVE (6 slots) ![](https://i.imgur.com/GQgApRZ.gif)
- SAVi (6 slots) ![](https://i.imgur.com/ZfEKfTs.gif)

---

## 2022-01-19

- G-SWM (FG-ARI numbers; see the metric sketch at the end of these notes)
    - CATER (T=6): FG-ARI ~97
    - CATERTex (T=6): FG-ARI ~65
    - TexSprites (T=24):
        - In distribution: FG-ARI ~86
        - Out-of-distribution textures: FG-ARI ~65
        - Out-of-distribution # objects: FG-ARI ~81
    - Textured Movi++: ![](https://i.imgur.com/XOzv9g6.jpg)
- OP3
    - Some versioning issues with the environment (resolved)
    - Not working correctly on a simple dataset yet (debugging) ![](https://i.imgur.com/Q4MQXzR.png)

## 2021-12-15

- CATER
    - STEVE ![](https://i.imgur.com/x0sIBX1.gif)
    - SAVi ![](https://i.imgur.com/fCIhS2w.gif)
    - Retraining with 10 slots
- Kubric
    - Docker issue on our servers (resolved)
    - Still unable to get the HDRI background working (in progress)
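For reference, a sketch of how FG-ARI is conventionally computed for results like those above: the adjusted Rand index between predicted and ground-truth segmentations, restricted to ground-truth foreground pixels (the exact masking and averaging here are our assumptions; `sklearn` provides the ARI):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def fg_ari(true_seg, pred_seg, bg_label=0):
    """FG-ARI for one frame/video: adjusted Rand index over ground-truth
    foreground pixels only. `true_seg` and `pred_seg` are integer
    segmentation maps of the same shape; `bg_label` marks background in
    the ground truth. Scores are usually averaged over the eval set."""
    true = np.asarray(true_seg).ravel()
    pred = np.asarray(pred_seg).ravel()
    fg = true != bg_label                 # keep only foreground pixels
    return adjusted_rand_score(true[fg], pred[fg])
```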