# Humanoid Control
## Commands
Evaluating the clip snippet experts:
```bash=
python mocapact/clip_expert/evaluate.py \
--policy_root 00-share-data/humanoid_control/experts/policies/[CLIP_SNIPPET]/eval_rsi/model \
--n_eval_episodes 500 \
--n_workers [N_CPU] \
--device cuda \
--novisualize \
--act_noise 0. \
--eval_save_path 00-share-data/humanoid_control/experts/evaluations_take_2/deterministic/[CLIP_SNIPPET].npz
```
Evaluating the distilled policy on the clips (deterministic evaluation):
```bash=
python humanoid_control/distillation/evaluate.py \
--policy_path 00-share-data/humanoid_control/distillation/few/npmp/upsampled/rwr/correlation0_kl0.1/0/2022-05-30_06-27-04/eval/train_rsi/best_model.ckpt \
--clip_snippets [CLIP] \
--n_eval_episodes 500 \
--n_workers [N_CPU] \
--device cuda \
--novisualize \
--act_noise 0. \
--eval_save_path 00-share-data/humanoid_control/distillation/clip_evaluations/deterministic/[CLIP].npz
```
Evaluating the distilled policy on the snippets:
```bash=
python humanoid_control/distillation/evaluate.py \
--policy_path 00-share-data/humanoid_control/distillation/few/npmp/upsampled/rwr/correlation0_kl0.1/0/2022-05-30_06-27-04/eval/train_rsi/best_model.ckpt \
--clip_snippets [CLIP_SNIPPET] \
--n_eval_episodes 500 \
--n_workers [N_CPU] \
--device cuda \
--novisualize \
--act_noise 0. \
--eval_save_path 00-share-data/humanoid_control/distillation/evaluations/deterministic/[CLIP_SNIPPET].npz
# Noisy evaluation: replace the last two flags above with
#   --act_noise 0.1
#   --eval_save_path 00-share-data/humanoid_control/distillation/evaluations/noisy/[CLIP_SNIPPET].npz
```
Rolling out the experts:
```bash=
python humanoid_control/distillation/rollout_experts.py \
--input_dirs 00-share-data/humanoid_control/experts/policies \
--n_workers [N_CPU] \
--min_steps 30 \
--log_all_proprios \
--device cuda \
--separate_clips \
# "Large" dataset (200 rollouts per snippet)
--n_start_rollouts 100 \
--n_rsi_rollouts 100 \
--output_path 00-share-data/humanoid_control/distillation/large/ignore.hdf5 \
# "Small" dataset (20 rollouts per snippet)
--n_start_rollouts 10 \
--n_rsi_rollouts 10 \
--output_path 00-share-data/humanoid_control/distillation/small/ignore.hdf5
```
Training clip expert:
```bash=
python humanoid_control/clip_expert/train.py \
--clip_id [CLIP_ID] \
--start_step [START_STEP] \
--max_steps [MAX_STEPS] \
--min_steps 10 \
--total_timesteps 150000000 \
--learning_rate.start_val 3e-5 \
--learning_rate.decay_half_life 0.2 \
--learning_rate.min_val 1e-6 \
--n_workers 8 \
--n_steps 8192 \
--batch_size 512 \
--n_epochs 10 \
--clip_range 0.25 \
--gae_lambda 0.95 \
--eval.freq 1000000 \
--eval.min_steps 30 \
--eval.n_rsi_episodes 1000 \
--eval.n_start_episodes 100 \
--eval.start_eval_act_noise 0.01 \
--eval.early_stop.ep_length_threshold 0.98 \
--eval.early_stop.min_reward_delta 0.01 \
--log_root [LOG_ROOT]
```
# 5/26/2022
## Distillation
We examine which distillation weighting scheme gives the best policy evaluation performance. We first observe that all policies reach their best performance within a small number of supervision steps. Next, we see that the weighting schemes rank as follows (best to worst): RWR, CWR, AWR, BC.

Next, we choose RWR as our weighting algorithm and vary the correlation and KL hyperparameters. First, we notice that a KL weight of 0.03 gives the best performance, with performance not depending on the correlation hyperparameter. Next, with a KL weight of 0.1, we see that a correlation hyperparameter of 0.95 gives 1% better performance than a value of 0.

## Transfer to "Go-To-Target" Task
We fix the low-level policy from the distillation phase and train a new high-level policy to maximize the reward $r = \exp(-d_{\text{goal}})$, where $d_{\text{goal}}$ is the distance to the goal.
[Video](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/EZb83KlroqBHmqg8VAhgXFEBuVVuFeAzvDqGEb60BcZA7A?e=18IBc9). The humanoid does a fairly good job of walking to the goal and remaining standing over it. When the goal is far away, the humanoid will attempt to run and will fall over. This is likely because the low-level policy has not learned to run over long distances.
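As a concrete illustration of the reward above, here is a minimal sketch assuming the humanoid's root position and the goal are given as planar coordinates (the function name and interface are hypothetical):
```python=
import numpy as np

def go_to_target_reward(root_xy: np.ndarray, goal_xy: np.ndarray) -> float:
    """r = exp(-distance to goal): equals 1 on the goal and decays with distance."""
    distance = np.linalg.norm(root_xy - goal_xy)
    return float(np.exp(-distance))

# Example: standing 2 m from the goal gives a reward of exp(-2) ~= 0.135.
print(go_to_target_reward(np.array([0.0, 0.0]), np.array([2.0, 0.0])))
```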

## Joint Training with "Stand-Up" Task
Here, the RL task uses only the low-level decoder policy and feeds it random noise $z \sim N(0, I)$ to act as synthetic motor intentions.
[Video](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/EeUW0oVRQNdMniJ5gPw2pcoB_uoFV5elzysgPYpIFTMLDw?e=TcdIF0). The humanoid does a good job of remaining standing and about half the time can get off the ground.
However, mocap tracking performance seriously deteriorates.
[CMU_016_22](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/ERFR2-GdQ-ZAgFbsDihHnowB0mTuRXVwiCw_SyXkSJGwFA?e=MXPIhz)
[CMU_069_22](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/EffVef5EpxtEupx-9UeAsxkBlO8u98V-HYXwdUsBitAtiw?e=gnEUsd)
[CMU_139_16](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/EW7IfpAP-g1PgFg3zN2gDuMBOQJDnlMv2ANLHdwP-1bzUw?e=RHTtJg)
# 5/19/2022
## Distillation Update
Surprisingly, it does not take many supervision steps for the policy performance to plateau (and even decrease thereafter). We can usually get away with about 20K steps (30 minutes on a GPU). On the full dataset (50M datapoints) with a batch size of 256, this corresponds to 10% of the dataset. I decided to make a smaller dataset with 10% of the demonstrations to see if there were "too many" demonstrations per clip.
In the plots below, we see there are no qualitative differences between the large and small datasets. In all cases, the validation supervision loss is minimized in only 2.5K steps and then quickly increases. The policy evaluation on the validation set does go down, but so does the evaluation on the training set (despite the training supervision loss improving). It's likely that the experts have some "conflict" with each other that gets exacerbated as training continues.

# 5/12/2022
## Joint Training with Stand-Up Task
[Get up](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/EUFqGYaktIlOnG6akbAQSBMB8abe7mYT0PpMA5guShRYaA?e=MHfrZb): 50% of the time the agent is initialized standing, and the other 50% it is initialized at some random point in a get-up clip. The agent does learn to get off the ground, though it uses fairly strange strategies to do so (bending its arms behind its back, moving its legs in extremely acrobatic ways). The walking motion is again idiosyncratic.
The policy maximizes the reward in ~30M steps (10 hours).

To help speed up training, I only take a gradient step on the expert data once every 100 iterations (produces about 2.5x speedup). The behavior cloning MSE does slowly increase over time and will likely worsen policy performance on the mocap data. I wouldn't be surprised if the MSE would keep increasing as training continued.

The action noise only slowly decreases throughout training. I think this huge amount of noise is the source of the humanoid's strange motion. However, this amount of noise may be needed to help exploration. Perhaps it would be better to anneal the noise instead of relying on gradient descent.

## Distillation (Weighted Regression)
clip snippet $c$ $\to$ expert $\pi_c$ $\to$ performance index $J = \mathbb{E}_{\pi_c}\left[\sum_t r_t\right]$ $\to$ weight $w = e^J$
Clip-weighted regression gives 5% improvement

Reward-weighted regression ($w = e^Q$) gives 10% improvement

Advantage-weighted regression ($w = e^A$) gives no improvement
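For concreteness, a minimal sketch of how these weights could enter the distillation objective; the weight definitions follow the notes above, while the squared-error loss, tensor interface, and function names are assumptions rather than the actual implementation:
```python=
import torch

def weighted_distillation_loss(policy, obs, expert_actions, weights):
    """Weighted behavioral cloning: each (obs, expert action) pair is scaled by a weight.
    Plain BC corresponds to weights that are all ones."""
    pred_actions = policy(obs)                             # hypothetical policy forward pass
    per_sample_error = ((pred_actions - expert_actions) ** 2).mean(dim=-1)
    return (weights * per_sample_error).mean()

def make_weights(scheme, J=None, Q=None, A=None):
    """Weights from the notes above: CWR uses the expert's clip return J (w = e^J),
    RWR uses Q (w = e^Q), and AWR uses the advantage A (w = e^A).
    Inputs are per-sample tensors taken from the rollout dataset."""
    x = {"cwr": J, "rwr": Q, "awr": A}[scheme]
    return torch.exp(x)
```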

# Joint Training with Stand-Up Task (5/11)
$r = \exp(-(h-h^*)^2)$, where $h$ is the height of the humanoid's head and $h^*$ is the target height.
- [Remain standing](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/EXrVSY5Zd5BLksdbLen-WRIB8EhpeD0UAV8reKcfTnd11A?e=RI2i36): The humanoid is initialized in a walking pose to make the task easier. While it can learn to do the task, the walking motion is very idiosyncratic, almost like a caveman. The shoulders also rock back-and-forth in a strange way.
- [Get up](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/Ecu8VylkaJBLuFQ4_9vsB2kBye-5Ogf8D4aTfINfxh8VeA?e=unehYl): The humanoid is initialized at random points in "get-up" clips. Despite the reward being moderately high (0.9 per step, average height gap of 0.32 m or 13 inches), the humanoid cannot get up from the ground and even has trouble staying standing. It does not appear to be using the learned skills from the dataset.
# GPT Motion Generation (4/21)
Prompt: 32 steps (1 second) from clip expert
## Demo on CMU_069_22
- [Without dropout](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/EVjwS4ORNDdIoKEFWUxKkfkBrpdGpswWk-Njg1DrWLi6Cg?e=cMl4DY): Humanoid goes in a perfect circle.
- [With dropout](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/EYpILsf_Gr9GjEX-4aUftMMBmgj_Ju1ZAq4ALUF_9x7r6A?e=NUWmjd): Humanoid does keep turning, though at a more irregular rate, and then stops and looks around.
## Demo on CMU_038_03
- [Video](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/Ec9dylzp2SNPoa-EPpOM2q0BGlGN_K8ohMU0bUzh2XPjkg?e=sFrc3U): Humanoid is able to keep running in circles and can recover from mild mistakes.
- [Video when overfitting to this clip](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/EVYCY45jCGhAlZ8LeyjlolkBOv_kuCA2nfaHBh9XZNJx3w?e=bkAT71): Humanoid can keep running for a little bit, but falls over much earlier than the other GPT.
## Other videos
- [CMU_016_22](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/Ebu6AnVP_z5EqDT2aon8gG0B_YHlWlkdtA16nWeq2vhteQ?e=D6LnVQ): Short walking clip. Humanoid tries to keep walking forward. Sometimes it is able to do so for several seconds, other times it falls over pretty quickly.
- [CMU_016_55](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/EQk6_L13xSxOh-Cey2ge5yYBe1vhSWygBM_8YYiFEX2aiQ?e=5yhakQ): Short running clip. Humanoid falls over pretty quickly after end of clip. This is likely because all running clips are pretty short.
- [CMU_056_05](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/Eba3BEeGPE9AglBv14eRMksBVc7xMi95tM_PJOk0-XRVMA?e=JIFtec): This is a clip of vignettes. For the most part, the humanoid falls over in a few seconds, though it does try to repeat behaviors in the clip, such as waving its arms. This clip is very unusual and is likely underrepresented in the dataset.
- [CMU_069_22](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/EWk3uqIkGYNFr29s6XUW1REBf2dYikfzLQv8P2mszU8OaQ?e=8Xh6jw): Repeated walking and turning. The humanoid attempts to repeat the pattern, though it does keep falling over in relatively short order.
- [CMU_069_42](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/EWTB0XxugO9IvAHtKU_RHDEBaXVm559jvB5xMduL7mlGEQ?e=zFHYt4): Side-stepping in circles. The humanoid tries to repeat the behavior, but tends to fall over in a few seconds. The humanoid remains upright longer towards the end of the clip.
- [CMU_075_01](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/EQvu6nBf4Z5EurCOjdHWFIoB01Km9z-MArtJr3xkUcbJgA?e=8xm7n7): Run, jump, land, and turn. At best, the humanoid clones the clip and falls over at the end. Many times the humanoid falls over while attempting to land.
- [CMU_091_05](): "Zombie" walking. The humanoid attempts to keep walking slowly with arms straight forward, even when the ghost is not doing so. The humanoid falls over in less than 5 seconds.
- [CMU_143_16](): Walk and step over. The humanoid attempts to repeat this behavior and usually falls over after the first or second attempt.
## Meeting Notes (3/17)
- divide clips by behavior (walk, run, dance, etc.) and give stats
## Meeting Notes (2/24)
- put fine-tuning graph in appendix, mention that earlier plot had buggy evaluation, put expert return histogram in main
- social media, blog
- provide interesting tasks
- user should choose reward function, new dataset generated
# Joint Training with Stand-Up Task
Augment the policy distillation with a stand-up task to be solved with RL. Reward = height of humanoid head.
$$\mathrm{loss} = \text{distillation loss} + \text{RL loss}$$
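A minimal sketch of how the two terms might be combined in a single update; the behavioral-cloning form of the distillation term and the generic RL surrogate are placeholders, not the actual training code:
```python=
def joint_loss(policy, distill_batch, rl_batch, rl_surrogate, rl_weight=1.0):
    """loss = distillation loss + RL loss (stand-up reward = height of humanoid head).
    Inputs are torch tensors; `rl_surrogate` is any policy-gradient loss (e.g., PPO)."""
    # Distillation term: regress onto expert actions from the mocap rollout dataset.
    obs, expert_actions = distill_batch
    distillation_loss = ((policy(obs) - expert_actions) ** 2).mean()

    # RL term: policy-gradient surrogate computed on stand-up rollouts.
    rl_loss = rl_surrogate(policy, rl_batch)

    return distillation_loss + rl_weight * rl_loss
```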
First, try a monolithic network on the get-up clips.

[CMU_139_16](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/EXyLKkuzEatMoOtCesSknwEBhY3HZ3ATPQWdWiuvRk26zA?e=jzx9h6)
[All get-up clips](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/EXyLKkuzEatMoOtCesSknwEBhY3HZ3ATPQWdWiuvRk26zA?e=jzx9h6)
Next, hierarchical policy with a separate encoder for the RL task. state = $(s,z)$, action = $(a, z')$

[CMU_139_16](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/EchnOC2Cln9NrKghotxVXYQBsftoU-DwuXmi_QDLgfIaQQ?e=pNnukk)
[Stand up](https://gtvault-my.sharepoint.com/:v:/g/personal/nwagener3_gatech_edu/ESK0oSIzh9dBrTqOwiKOq1kBTk84Qq_ps6PffWNEWRbDRw?e=t0luWo)
# Training Curves
CoMic paper training curves

My clip expert results
# Motion Generation Steps

Input: current proprioceptive information and context of recent observations (proprioceptive or images)
Action: joint angles sent to motor controllers
The main challenge is that the policy must figure out how to achieve the desired motion in a physics-based domain. We instead *separate* the policy into a high-level policy that outputs a "motor intention" $z$, which is then fed to a low-level policy that turns it into the original low-level action.

First, we have to figure out a good representation for $z$ and the low-level policy:
1. For each clip, learn an expert policy, which is a feedforward policy with the pictured structure.

2. For each clip, generate noisy rollouts from the expert and collect them into a supervision dataset.

3. Perform policy distillation by training the pictured policy on the supervision dataset. Note the reference encoder takes *desired future states* as input so that the motor intention becomes an encoding of the desired future states. The reference encoder is then thrown away.
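A rough sketch of this encoder-decoder structure, with hypothetical layer sizes, embedding dimension, and interfaces (the actual pictured policy may differ):
```python=
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Encodes desired future reference states into a stochastic motor intention z."""
    def __init__(self, ref_dim, z_dim=60, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(ref_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))

    def forward(self, future_refs):
        mean, log_std = self.net(future_refs).chunk(2, dim=-1)
        return mean + log_std.exp() * torch.randn_like(mean)  # reparameterized sample

class LowLevelPolicy(nn.Module):
    """Maps (proprioceptive observation, motor intention z) to a low-level action."""
    def __init__(self, proprio_dim, z_dim, act_dim, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(proprio_dim + z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, act_dim))

    def forward(self, proprio, z):
        return self.net(torch.cat([proprio, z], dim=-1))

# Distillation trains both: z = encoder(future reference states), action = decoder(proprio, z).
# Afterwards the reference encoder is thrown away and z becomes the interface for
# new high-level policies (e.g., a GPT or an RL-trained task policy).
```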

Now we freeze the low-level policy and want to train a network (e.g., GPT) to output good motor intentions $z_t$ for motion completion.

A couple ideas:
- Do behavioral cloning. This is what we've usually done and is simple to implement. However, this may prioritize copying the mocap clips instead of generating good motion completions.
- Run GAIL to directly train the high-level policy for motion completion. We simply want to generate plausible-looking motion completions, so this approach makes more sense (a rough sketch of the discriminator objective is given after this list). The discriminator can use either the low-level state information or image observations to determine whether the generated motion is plausible.
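To make the GAIL idea concrete, here is a minimal, hypothetical sketch of the discriminator objective; the feature choice (state windows vs. images) and the network sizes are assumptions:
```python=
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionDiscriminator(nn.Module):
    """Scores whether a window of low-level states (or image features) looks like mocap."""
    def __init__(self, feature_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, features):
        return self.net(features).squeeze(-1)  # logits

def discriminator_loss(disc, mocap_features, generated_features):
    """Standard GAIL discriminator objective: mocap windows are "real", rollouts are "fake"."""
    real_logits = disc(mocap_features)
    fake_logits = disc(generated_features)
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) +
            F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

# The high-level policy is then rewarded for fooling the discriminator,
# e.g., r_t = -log(1 - sigmoid(disc(features_t))).
```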
# Chat with Andrey


# CoMic keys
```python=
HL_PROPRIO_FIELDS = [
    'walker/joints_pos',
    'walker/joints_vel',
    'walker/sensors_velocimeter',
    'walker/sensors_gyro',
    'walker/end_effectors_pos',
    'walker/world_zaxis',
    'walker/body_height',
    'walker/sensors_touch',
    'walker/sensors_torque',
    'walker/actuator_activation',
    'walker/reference_rel_bodies_pos_local',
    'walker/reference_rel_bodies_quats'
]
LL_PROPRIO_FIELDS = [
    'walker/joints_pos',
    'walker/joints_vel',
    'walker/sensors_velocimeter',
    'walker/sensors_gyro',
    'walker/end_effectors_pos',
    'walker/world_zaxis',
    'walker/sensors_touch',
    'walker/sensors_torque',
    'walker/actuator_activation',
]
```
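As a small usage example, the fields above could be concatenated into flat observation vectors roughly like this (the helper itself is an assumption, not part of the CoMic code; `time_step.observation` follows dm_control's dict-of-arrays convention):
```python=
import numpy as np

def flatten_observation(observation, fields):
    """Concatenate the listed observation fields (e.g., HL_PROPRIO_FIELDS or
    LL_PROPRIO_FIELDS) into a single flat vector."""
    return np.concatenate([np.ravel(observation[k]) for k in fields])

# Hypothetical usage with a dm_control time step:
#   hl_obs = flatten_observation(time_step.observation, HL_PROPRIO_FIELDS)
#   ll_obs = flatten_observation(time_step.observation, LL_PROPRIO_FIELDS)
```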
# Embedding Idea (7/6)

The CoMic paper notably encodes the (future) reference poses into a stochastic embedding $z_t$ that is used to produce coherent control signals. With this in mind, perhaps for our task we should encode the **context** (i.e., past poses) into a stochastic embedding that summarizes the context and is useful for predicting the future desired poses/joint angles (e.g., should be able to predict the next five steps).

The GPT (or any other network) would act as an encoder that outputs a Gaussian distribution. We can then sample an embedding $z$ from this distribution and pass the sample through the rest of the diagram as usual.
The objective can include a regularization term $\mathrm{KL}\big(\mathrm{GPT}(z_t \mid s_{t-H:t}) \,\|\, N(0, I)\big)$ to ensure the clips have enough "overlap" in the learned embedding.
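For reference, a minimal sketch of that regularizer for a diagonal-Gaussian encoder output, using the closed-form KL to a standard normal (the encoder interface is hypothetical):
```python=
import torch

def kl_to_standard_normal(mean, log_std):
    """KL( N(mean, diag(std^2)) || N(0, I) ), summed over the embedding dimension."""
    var = (2.0 * log_std).exp()
    return 0.5 * (var + mean ** 2 - 1.0 - 2.0 * log_std).sum(dim=-1)

# Hypothetical usage:
#   mean, log_std = gpt_encoder(past_states)   # encoder head outputs a diagonal Gaussian
#   loss = prediction_loss + beta * kl_to_standard_normal(mean, log_std).mean()
```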
# RL Notes (7/1)
- CoMic observations and their keys:
  - joint angles: `walker/joints_pos`
  - joint angular velocities: `walker/joints_vel`
  - velocimeter observation: `walker/sensors_velocimeter`, maybe also `walker/actuator_velocimeter`
  - gyro observation: `walker/sensors_gyro`, maybe also `walker/gyro_control`
  - end effector positions: `walker/end_effectors_pos`
  - "up" direction in the frame of the humanoid: `walker/world_zaxis`
  - actuator state: `walker/actuator_activation`
  - touch sensors: `walker/sensors_touch`
  - torque sensors: `walker/sensors_torque`
  - reference poses: `walker/reference_rel_joints`, perhaps other keys
# Transitioning from Physics-Free to Physics-Based (6/24)
## Physics-Free Setup

Currently, the transformer takes as input a 1-second sequence of robot orientations and joint angles. The transformer then directly predicts the next robot pose and joint angles. While there are some signs of success in the physics-free domain, the following issues appear:
- The robot joints regularly freeze during a rollout.
- The robot can "glide" or levitate, which is not physically consistent.
- The robot can clip through itself.
## A Fix for Physics-Free

One issue is that the pose and joint angle predictions are decoupled, which is a likely cause of the "gliding" artifacts. In the physics-based domain, the pose is changed by changing the joint angles, so incorporating that structure into the transformer would likely help. While our initial domain does not have physics, the mocap data does contain some physics information, so we should be able to learn the relation between joint angles and pose from the data.
Accordingly, we can use the pictured architecture and perform behavioral cloning from the data (maybe including multi-step training; I'll have to think about it more).
## Going to Physics-Based Domain
The good news for our particular problem is that the action spaces of the physics-free and physics-based domains are the same (desired joint angles). So there is some hope of transferring the transformer to the physics-based domain after training in the physics-free domain. However, direct transfer is unlikely to work well, since the physics-based domain has gravity, contact physics, the potential for falling over, and joint controller effects.
The hope, though, is that the transformer can at least set good reference joint angles that some simpler policy can then "translate" to appropriate setpoints for the joint controllers. The image below shows this factored approach, with the additional network shown in the blue box.

I propose we use the following hybrid objective function:
$$
\text{loss} = \text{physics-free loss} + \text{physics-based loss}
$$
The physics-free loss would encourage the transformer to predict joint angles that closely match the clip's, while the physics-based loss would force the transformer to produce joint angles that are realizable by the MuJoCo humanoid. Having a physics-free loss around will also provide a more "consistent" signal to the transformer since the physics-based loss is inherently more difficult to optimize.
The physics-free loss can likely be a simple supervision objective (e.g., behavioral cloning with a single-step or multi-step loss).
The physics-based loss, though conceptually simple, may be less straightforward to optimize. This loss seeks to find actions that would cause the humanoid to closely match the mocap clip. However, this is not a supervised learning objective since we don't have the correct actions, only the observations from the clip. Thus, we may have to use reinforcement learning (e.g., policy gradient) to get a gradient for this term.
For the physics-based domain, we may be able to do some pre-training of the MLP in the blue box to serve as a good initialization for the hybrid objective. For example, we can do behavioral cloning of the MLP from the scipy optimizer. Alternatively, we can perform RL on the MLP using the reference joints from the clip as input to the network instead of the GPT predictions. For the final hybrid objective, we can also include gradients from this "pre-training" to help the transformer+MLP find good actions for the physics-based domain.
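A minimal sketch of how the hybrid objective might be written, assuming the physics-free term is behavioral cloning on the clip's joint angles and the physics-based term is a REINFORCE-style policy-gradient surrogate on a tracking reward; all names and interfaces are placeholders, not an actual implementation:
```python=
def hybrid_loss(model, clip_batch, rollout_batch, pg_weight=1.0):
    """loss = physics-free loss + physics-based loss. Inputs are torch tensors."""
    # Physics-free term: predict the clip's next joint angles from the context.
    context, target_joint_angles = clip_batch
    predicted_joint_angles = model(context)
    physics_free = ((predicted_joint_angles - target_joint_angles) ** 2).mean()

    # Physics-based term: a REINFORCE-style surrogate on MuJoCo rollouts, rewarding
    # actions whose simulated outcome closely tracks the mocap clip.
    log_probs, tracking_rewards = rollout_batch   # collected by running the current policy
    physics_based = -(log_probs * tracking_rewards).mean()

    return physics_free + pg_weight * physics_based
```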
For the physics-based domain, we may be able to do some pre-training of the MLP in the blue box to serve as a good initialization for the hybrid objective. For example, we can do behavioral cloning of the MLP from the scipy optimizer. Alternatively, we can perform RL on the MLP using the reference joints from the clip as input to the network instead of the GPT predictions. For the final hybrid objective, we can also include gradients from this "pre-training" to also aid the transformer+MLP in finding good actions for the physics-based domain.