# STEVE
## 2022-05-31
https://hackmd.io/qZtEpw97Sc6sRIsA_duS-g
https://wandb.ai/yifu/ocrl/reports/5-31-2022--VmlldzoyMDkxMzI4?accessToken=po774gotw2h8yq7jghx4hc8b0s2to1sq5ifmmjx65tdoyvz4x4p377b6fnz6i41t
- Avoidance w/ color change
- Agent gets positive reward when interacting with green ball and negative reward when interacting with blue ball
- (Optionally) the green and blue balls swap colors on interaction
- The agent now needs to be aware of the colors of the other balls and how they interact
- A local policy (e.g. conditioned on only the last 2 timesteps) may not work as well
- Can be used as a transfer task
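A minimal sketch of the proposed reward / color-swap logic, assuming a contact-based interaction check (the `Ball` class, reward magnitudes, and swap interpretation below are placeholders, not the actual environment code):
```python
from dataclasses import dataclass

GREEN, BLUE = 0, 1
REWARD_GREEN, REWARD_BLUE = +1.0, -1.0   # assumed magnitudes

@dataclass
class Ball:
    color: int   # GREEN or BLUE

def interaction_reward(touched_balls, swap_colors=True):
    """Reward for one step, given the balls the agent touched this step.

    +1 for touching a green ball, -1 for a blue ball. If swap_colors is set,
    a touched ball also flips color, so the policy has to track current
    colors rather than memorizing positions. (One reading of the note; the
    alternative is that green and blue balls swap when they hit each other.)
    """
    reward = 0.0
    for ball in touched_balls:
        reward += REWARD_GREEN if ball.color == GREEN else REWARD_BLUE
        if swap_colors:
            ball.color = BLUE if ball.color == GREEN else GREEN
    return reward
```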




## 2022-05-24
https://wandb.ai/yifu/ocrl/reports/5-24-2022--VmlldzoyMDU1ODcx?accessToken=w8uw9q12xnfct2049sp359dqexjq83ih8obsuziz6suqszjoaogkilccf9hgmas3
## 2022-05-10
- Try other OC models
- CSWM
- Lower dim slots
- Try a pretrained CNN VAE
- Try predicting object properties (color of balls)
- Expect the CNN to ignore these properties if trained with reward only
- Train a feature classifier (probe) on frozen features (see the sketch at the end of this section)
- Investigate CNN features
- What is possible:
- Size
- Shape
- Color
- Position?
- First predict properties, then try OOD generalization RL based on those properties
- Avoidance Task
Previous results (from 2022-03-29):
| | 3 Balls | 4 Balls | 5 Balls |
| -------- | -------- | -------- | ------- |
| Pixels (Conv) | -1.62(1.32) | -6.78(14.03) | -10.14(14.28)|
| Self-attention | -1.51(1.43) | -3.5(3.67) | -7.65(10.84) |
| Deepset | -1.49(1.35) | -1.84(1.41) | -3.62(5.56) |
| TF (cls token) | -1.44(0.94) | -2.35(2.13) | -3.48(3.21) |
| MLP | -3.45(8.95) | -- | -- |
- After tuning, pixels (CNN) does much better, even for generalization:

- Pixels currently also perform better in the IL setting
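A minimal sketch of the property-prediction probe mentioned above: freeze the policy's CNN (or the slot encoder) and train a small linear classifier to predict per-ball properties such as color. The `encoder`, loader layout, and dimensions are assumptions for illustration, not the project's actual code:
```python
import torch
import torch.nn as nn

# Linear probe on frozen CNN features (hypothetical shapes).
# encoder: pretrained/frozen policy CNN, maps (B, 3, H, W) -> (B, D)
# loader:  yields (images, labels), e.g. the color class of a designated ball
def train_probe(encoder, loader, feat_dim, num_classes, epochs=10):
    probe = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    encoder.eval()
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():          # keep the encoder frozen
                feats = encoder(images)
            loss = loss_fn(probe(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```
Probe accuracy (vs. chance) then indicates which properties (size, shape, color, position) survive reward-only training.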
## 2022-05-03
### CausalWorld
- GT States - Random initial position working now

- Green: Fixed start positions

- Blue: Fixed start tool, random goal

- Red: Random start tool, random goal

- STEVE + robot
- Still not learning correctly (curve colors correspond to different hyperparameter settings)

- Added some caching to speed up training: 3M steps now take ~16 hours vs. ~36 hours previously (GT states take ~3 hours)
- Idea: try imitation learning (see the BC sketch at the end of this subsection)
- Obtain an "expert" policy by training on GT states
- Then use behavioral cloning to train on pixels (slots)
- Can try generalizing to more objects than were trained on (?)
- Compare CNN vs slots vs GT (oracle)
- MLP vs. Deep Set vs. Self Attention vs. GNN
- Sample Efficiency vs. Pixels?
- Finetuning with RL?
- Previous papers that used IL for policy learning
- Policy Architectures for Compositional Generalization in Control
- The (Un)Surprising Effectiveness of Pre-Trained Vision Models for Control
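A minimal behavioral-cloning sketch for the idea above: roll out the GT-state "expert", store (slot/pixel observation, expert action) pairs, then regress the actions. The dict-style observation keys, `expert.act`, and the continuous-action MSE loss are assumptions, not the actual env/policy interfaces:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def collect_demos(env, expert, episodes=100):
    """Roll out the GT-state expert while storing the observation the
    student will see (assumed dict obs with 'gt_state' and 'slots' keys)."""
    demos = []
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            action = expert.act(obs["gt_state"])
            demos.append((torch.as_tensor(obs["slots"], dtype=torch.float32),
                          torch.as_tensor(action, dtype=torch.float32)))
            obs, _, done, _ = env.step(action)
    return demos

def behavioral_cloning(policy, demos, epochs=50, lr=3e-4):
    """Regress expert actions from slot observations (continuous actions)."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for slots, action in demos:
            loss = F.mse_loss(policy(slots), action)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy
```
The same demos can be replayed through different student architectures (MLP vs. DeepSet vs. self-attention vs. GNN) for the comparison above.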
### OP3
- Done:
- CATER
- CATERTex
- Movi-E
- Re-training:
- Bouncing Sprites
- Movi-Solid
- Movi-Tex
- TODO: generate OOD dataset for Movi-Solid
## 2022-04-26
### CausalWorld
- Training from gt states using SAC converges much more quickly

- Red: SAC using dense reward (delta object-to-goal distance, delta gripper-to-object distance, fractional overlap; see the reward sketch at the end of this section)

- Pink: SAC using distance to goal reward only (distreward)

- Orange / Blue: PPO using dense reward (diff hyperparam)

- Gray: SAC using distreward w/ random start pos

- Green: SAC using dense reward w/ random start pos

- Using image + robot state is not working yet


- STEVE + robot: still finishing implementation
- OP3 progress
- Finished training for all seeds except Movi-E
- Need to run eval
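A hedged sketch of the dense reward terms listed above (delta object-to-goal distance, delta gripper-to-object distance, fractional overlap); the weights, dict keys, and helper names are made up for illustration, not CausalWorld's actual reward code:
```python
import numpy as np

def dense_reward(prev, curr, w_goal=1.0, w_reach=0.5, w_overlap=1.0):
    """prev/curr: dicts with hypothetical keys for object, goal, and gripper
    positions plus the fractional overlap of the object with the goal."""
    def dist(a, b):
        return np.linalg.norm(np.asarray(a) - np.asarray(b))
    # positive when the object moved closer to the goal this step
    delta_goal = dist(prev["object"], prev["goal"]) - dist(curr["object"], curr["goal"])
    # positive when the gripper moved closer to the object this step
    delta_reach = dist(prev["gripper"], prev["object"]) - dist(curr["gripper"], curr["object"])
    return w_goal * delta_goal + w_reach * delta_reach + w_overlap * curr["overlap"]
```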
## 2022-04-20
### CausalWorld
- Training **from gt states** is able to start solving the task near 100M timesteps


- So far, policies using images or slots have not learned yet, but they have only run ~10-15M steps (~1.5 days)

- Representation Learning for OOD paper is able to learn with 3M frames using:
- pretrained VAEs
- SAC (instead of PPO)
- target position as input (instead of visual input)
- allow only 1 finger to move in RL experiments (TODO)
- IsaacGym (https://tetexiao.com/projects/mvp)
- Simpler mechanics (gripper)
- Solves more quickly using their model (pretrained in-the-wild ViT)

- 16 million steps in ~10 hrs (vs. 36 hrs for CausalWorld)
### Avoidance
- Stacking slots (3 slots) seems to help:
- Previous policies only stayed in the corner


### STEVE
- Also running OP3, but stopped due to disk error. Will restart
## 2022-04-14
- CausalWorld

- Green: Using gt state (blocks + robot)
- Blue: Image w/ robot state
- Orange: Slots w/ robot state (self attention)
- Concern about training time:
- Amazon paper trained for several hundred million timesteps


- With GT state, it may take ~1.2 days per 100 million steps
- Using images is ~5-10x slower (because of PyBullet rendering time)
- Potentially look at the IsaacGym env (https://tetexiao.com/projects/mvp)

- GT state policies



- Image policies



- Slot policies



Debugging notes:
- Important to use an embedding for the robot state (minimal sketch below)
- I think this helps to scale the input for the policy network
- Otherwise, the robot falls into a degenerate policy
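A minimal sketch of that robot-state embedding: project the raw proprioceptive vector through a small MLP before concatenating it with the visual/slot features, rather than feeding raw joint values directly. Names and dimensions are placeholders, not the actual policy code:
```python
import torch
import torch.nn as nn

class PolicyEncoder(nn.Module):
    """Embed the raw robot state so its scale matches the visual features
    before the two are concatenated for the actor/critic heads."""
    def __init__(self, robot_dim, embed_dim=64):
        super().__init__()
        self.robot_embed = nn.Sequential(
            nn.Linear(robot_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, visual_feat, robot_state):
        # visual_feat: (B, feat_dim) from the CNN or pooled slots
        # robot_state: (B, robot_dim) raw joint positions/velocities
        return torch.cat([visual_feat, self.robot_embed(robot_state)], dim=-1)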
- Avoidance task

- Orange: pixels
- Others: slots
- Currently, slots converge to a worse policy than pixels
- Possibly the slots only contain the information needed to reconstruct the frame (no dynamics?) -> try using the previous timesteps' slots as well
- Need to add caching to speed up learning
- Other thoughts
- STEVE ok to use? Maybe SAVi or even SLATE?
## 2022-04-07
- Able to get a policy that avoids the balls
- Use TD(0) instead of TD($\lambda$) (i.e. one-step temporal difference; targets written out at the end of this section)





- Slot-based methods train more quickly in terms of environment steps, but wall-clock time per step is much slower (~3x slower per step)

- DeepSet, self-attention, and Transformer perform similarly so far

- Fine-tuning
- Original attention (before fine tuning)


- After some fine tuning


- From scratch


- Looks like early stage training of STEVE
- Slot-only (no decoder)


- To try: alternate training the model and the policy (similar to Dreamer)
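For reference, the two value targets compared above (standard definitions, not code from this repo); with $\lambda = 0$ the mixture reduces to the one-step target:

$$
\text{TD}(0):\quad G_t = r_t + \gamma V(s_{t+1})
$$

$$
\text{TD}(\lambda):\quad G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},
\qquad G_t^{(n)} = \sum_{k=0}^{n-1}\gamma^{k} r_{t+k} + \gamma^{n} V(s_{t+n})
$$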
## 2022-03-29
| | 3 Balls | 4 Balls | 5 Balls |
| -------- | -------- | -------- | ------- |
| Pixels (Conv) | -1.62(1.32) | -6.78(14.03) | -10.14(14.28)|
| Self-attention | -1.51(1.43) | -3.5(3.67) | -7.65(10.84) |
| Deepset | -1.49(1.35) | -1.84(1.41) | -3.62(5.56) |
| TF (cls token) | -1.44(0.94) | -2.35(2.13) | -3.48(3.21) |
| MLP | -3.45(8.95) | -- | -- |
- Converged policies still move the agent to a corner
- Reward becomes sparse once the agent is in the corner
- Trying 2 modifications:
- Dense reward depending on distance to closest ball



- Small positive reward per timestep, and a large negative reward + episode termination when a ball hits the agent (sketch below)
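A minimal sketch of the two reward variants; the scale, bonus, and penalty constants are assumptions, not the environment's actual values:
```python
import numpy as np

def dense_distance_reward(agent_pos, ball_positions, scale=0.1):
    """Variant 1: shaped reward proportional to the distance to the closest
    ball, so the agent is rewarded for keeping its distance anywhere in the arena."""
    dists = [np.linalg.norm(np.asarray(agent_pos) - np.asarray(p)) for p in ball_positions]
    return scale * min(dists)

def survival_reward(hit_by_ball, alive_bonus=0.01, hit_penalty=-10.0):
    """Variant 2: small positive reward per timestep, large negative reward
    and episode termination when a ball hits the agent."""
    if hit_by_ball:
        return hit_penalty, True     # (reward, done)
    return alive_bonus, False
```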



## 2022-03-24
- Updated results using the framestacking setup: the policy is given 3 frames per action, STEVE is run over the 3 frames, and the last frame's slots are used for the actor/critic (rough sketch below). The previous result used slots as a recurrent state between policy steps.
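A rough sketch of that framestacking setup (the STEVE interface here is assumed, not its actual API): run the model over the K stacked frames and hand only the last frame's slots to the actor/critic.
```python
import torch

def slots_from_framestack(steve, frames):
    """frames: (B, K, C, H, W) stack of the K most recent frames.
    Assumes the (frozen) STEVE model exposes an encode() returning
    per-frame slots of shape (B, K, num_slots, slot_dim); the real
    interface may differ."""
    with torch.no_grad():
        slots = steve.encode(frames)      # (B, K, num_slots, slot_dim)
    last = slots[:, -1]                   # keep only the last frame's slots
    return last.flatten(start_dim=1)      # (B, num_slots * slot_dim) for the policy
```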

- Light blue: Pixels
- Orange: Self-attention
- Gray: Deepset
- Pink: Transformer, pre-predictor slots
- Green: Transformer
- Dark Blue: Transformer, completely separate actor + critic networks
- Red: MLP
All models trained with 3 balls
| | 3 Balls | 4 Balls | 5 Balls |
| -------- | -------- | -------- | ------- |
| Pixels | -2.18 (8.65) | -6.86 (13.16) | -9.92 (16.78) |
| Self-attention | -1.36 (1.42) | -2.93 (5.26) | -10.43 (20.27) |
| Deepset | -1.38 (1.14) | -2.66 (3.11) | -3.5 (5.17) |
- Upon closer inspection, there is a bug in the environment where balls sometimes get stuck. This may significantly affect RL training and evaluation. Will fix and re-run.

- Sample policy:
Pixels:


Self-attention:


Deepset:


## 2022-03-17
- STEVE on avoidance task
- Cannot stably train when balls are always the same color



- Varying colors of balls helps:
- Random colors for non-agent balls

- The agent (red) gets assigned the same slot as the background
- Random colors for all balls

- RL Results

- Orange: From pixels
- Dark blue: Transformer w/ cls token
- Higher Red: Transformer w/ cls token - use pre-predictor slots
- Light blue: Transformer w/ cls token small
- Green: MLP
- Gray: DeepSet
- Lower Red: SelfAttention (transformer w/ sum)
## 2022-03-10
- Ball Avoidance Task


- Orange: Convnet
- trained with 3 balls
- Eval 3 balls - mean: -3.31, std: 2.41
- Eval 4 balls - mean: -6.23, std: 3.59
- Pink: Pretrained Slots
- Bad slot segmentations:

- Also tried training E2E (slightly different dataset and architecture), but currently also not getting great slot representations:

## 2022-03-01
- Causal World
- Still unable to learn in pixel space
- Tried changing dense reward to match Amazon paper
- IoU + Delta distance to goal + Curiosity
- Takes about 2 days to reach 10 million steps
- Amazon paper took 10 million steps with object states (not images)
- STOVE ball avoidance task

Image pixels:

## 2022-02-23
STEVE + RL
- Implement using RLlib

- Gray: Image A2C
- Orange: Image PPO
- Light Blue: Slot A2C
- Pink: Slot PPO
- Sample Trajectories


- Currently pretraining STEVE and freezing its weights (see the RLlib wrapper sketch at the end of this section)

- Other experiments to try
- SAVi
- Jointly train
- Pretrain with grad
- Train with reward only (?)
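A rough sketch of how a frozen, pretrained STEVE encoder can be plugged into RLlib's ModelV2-style custom model API. The encoder loading, the `custom_model_config` contents, and the slot shapes are assumptions for illustration (in practice the encoder would be loaded from a checkpoint, since RLlib configs are shipped to remote workers); this is not the project's actual model code:
```python
import torch
import torch.nn as nn
from ray.rllib.models import ModelCatalog
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2

class FrozenSteveModel(TorchModelV2, nn.Module):
    """Actor-critic heads on top of a frozen, pretrained STEVE encoder."""
    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs, model_config, name)
        nn.Module.__init__(self)
        cfg = model_config["custom_model_config"]
        self.encoder = cfg["steve_encoder"]   # placeholder: pretrained encoder module
        for p in self.encoder.parameters():
            p.requires_grad = False           # keep STEVE weights frozen
        feat_dim = cfg["feat_dim"]            # assumed num_slots * slot_dim
        self.policy_head = nn.Linear(feat_dim, num_outputs)
        self.value_head = nn.Linear(feat_dim, 1)
        self._features = None

    def forward(self, input_dict, state, seq_lens):
        with torch.no_grad():                 # no gradients into the encoder
            slots = self.encoder(input_dict["obs"])   # assumed (B, num_slots, slot_dim)
        self._features = slots.flatten(start_dim=1)
        return self.policy_head(self._features), state

    def value_function(self):
        return self.value_head(self._features).squeeze(-1)

# Registration; the trainer config would then set "custom_model": "frozen_steve".
ModelCatalog.register_custom_model("frozen_steve", FrozenSteveModel)
```
Unfreezing the encoder (or removing the `no_grad`) would correspond to the "jointly train" / "pretrain with grad" variants listed above.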
## 2022-02-15
### CATER
| | MLP Predictor | RNN Predictor | Transformer Predictor | OCVT |
| ---- | ------------- | ------------- | --------------------- | --- |
| Top1 | .12 | .44 | .42 | .76 |
| Top5 | .40 | .74 | .80 | .95 |
- Slots themselves may not contain temporal information; for RL, this may not work on POMDP tasks
### CausalWorld

- Task is goal-conditioned
- For the multi-object case, need to specify which object to move
- Try a simple version with 1 object where the goal is always in the same place, with dense reward
- Dense reward is based on the distance of the object to the goal and the distance of the end effectors to the object
- A2C and PPO on raw images: currently the policy just learns to avoid the object


- Next:
- Debug image RL
- Attach STEVE / SAVi encoder as feature extractor
- STEVE (12 slots) 
- STEVE (6 slots) 
- SAVi (6 slots)
---
## 2022-01-19
- G-SWM (FG-ARI; see the metric sketch at the end of this section)
- CATER (T=6):
- FG-ARI: ~97
- CATERTex(T=6):
- FG-ARI: ~65
- TexSprites (T=24):
- In Distribution - FG-ARI: ~86
- Out of Distribution Textures - FG-ARI: ~65
- Out of Distribution # Obj - FG-ARI: ~81
- Textured Movi++:

- OP3
- Some versioning issues with environment (resolved)
- Not working correctly on simple dataset yet (debugging)
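For reference, a minimal sketch of how FG-ARI is typically computed (adjusted Rand index over ground-truth foreground pixels only, via scikit-learn); this is the standard definition, not necessarily the exact evaluation code used here:
```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def fg_ari(true_masks, pred_masks, bg_label=0):
    """true_masks, pred_masks: integer segmentation maps of shape (T, H, W)
    for one video. FG-ARI scores the predicted segmentation against the
    ground truth on foreground pixels only (the background id is excluded)."""
    true_flat = np.asarray(true_masks).reshape(-1)
    pred_flat = np.asarray(pred_masks).reshape(-1)
    fg = true_flat != bg_label            # keep only ground-truth foreground pixels
    return adjusted_rand_score(true_flat[fg], pred_flat[fg])
```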

## 2021-12-15
- CATER
- STEVE

- SAVi

- Retraining with 10 slots
- Kubric
- Docker issue on our servers (resolved)
- Still unable to get HDRI background working (in progress)