# RL / ML-Agents notes
[ref](https://github.com/Unity-Technologies/ml-agents/blob/release_17_branch/docs/Training-Configuration-File.md)
## Common Hyper-Parameters
* max_steps
* default = 500000
* Typical range: 5e5 - 1e7
* time_horizon
* default = 64; Typical range: 32 - 2048
* steps of experience to collect per-agent before adding it to the experience buffer; if this limit is reached before the episode ends, a value estimate is used to predict the overall expected reward
* batch_size
* buffer_size
* learning_rate
* default = 3e-4
* Typical range: 1e-5 - 1e-3
* threaded
* default = false
* Allow environments to step while updating the model
* This might result in a training speedup, especially when using SAC
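For orientation, a minimal YAML sketch of where the settings above sit in a trainer config (assuming the release_17 layout); the behavior name and concrete values are placeholders, not from these notes:

```yaml
behaviors:
  MyBehavior:              # placeholder behavior name
    trainer_type: ppo      # or sac
    max_steps: 500000
    time_horizon: 64
    threaded: false
    hyperparameters:
      batch_size: 1024     # placeholder value
      buffer_size: 10240   # placeholder value
      learning_rate: 3.0e-4
```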
## PPO
* beta
* epsilon
* lambd
* num_epoch
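The PPO-specific keys go in the same `hyperparameters` block; a sketch using what I believe are the documented defaults (treat the values as assumptions):

```yaml
    hyperparameters:
      beta: 5.0e-3     # entropy regularization strength (assumed default)
      epsilon: 0.2     # PPO clip range (assumed default)
      lambd: 0.95      # GAE lambda (assumed default)
      num_epoch: 3     # passes over the buffer per policy update (assumed default)
```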
## SAC
* buffer_init_steps
* default = 0; Typical range: 1000 - 10000
* Number of experiences to collect into the buffer before updating the policy model
* init_entcoef
* initial entropy coefficient
* default = 1.0; Typical range: (Continuous): 0.5 - 1.0; (Discrete): 0.05 - 0.5
* save_replay_buffer
* default = false
* helps resumes go more smoothly, as the collected experiences won't be wiped when training is resumed
* Requires large disk space
* tau
* default = 0.005; Typical range: 0.005 - 0.01
* low: Stable
* high: Fast learning
* steps_per_update
* lower will improve sample efficiency (reduce the number of steps required to train) but increase the CPU time spent performing updates
* For fast environments (like the example environments)
* setting it equal to the number of agents in the scene is a good balance
* For slow environments
* Reduce steps_per_update
* reward_signal_num_update
* default = `steps_per_update`
* For IL ([ref](https://arxiv.org/pdf/1809.02925.pdf))
* set it to N / M
* where N is `steps_per_update`
* and M is the desired number of GAIL (reward-signal) updates per policy update
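Sketch of the SAC-specific keys above in config form; values are the documented defaults as far as I recall, so treat them as assumptions:

```yaml
    hyperparameters:
      buffer_init_steps: 0
      init_entcoef: 1.0
      save_replay_buffer: false
      tau: 0.005
      steps_per_update: 1
      reward_signal_num_update: 1   # defaults to steps_per_update
```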
## GAIL Intrinsic Reward
* demo_path
* strength
* gamma
* network_settings: for GAIL discriminator
* learning_rate: used to update the discriminator
* use_actions (default = false)
* whether the discriminator should discriminate based on both observations and actions, or just observations
* use_vail (default = false)
* makes learning more stable, but increases training time
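Sketch of a GAIL reward-signal block using the keys above; the demo path, hidden_units, and learning_rate values are placeholders:

```yaml
    reward_signals:
      gail:
        strength: 1.0                  # scale of the GAIL reward (assumed default)
        gamma: 0.99                    # discount for the GAIL reward (assumed default)
        demo_path: Demos/Expert.demo   # placeholder path
        use_actions: false
        use_vail: false
        network_settings:
          hidden_units: 128            # discriminator size (placeholder)
        learning_rate: 3.0e-4          # discriminator learning rate (placeholder)
```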
## Behavioral Cloning
* demo_path
* strength
* steps (default = 0; 0 behaves like max_steps, i.e. BC stays active for the whole training run)
* batch_size
* num_epoch: Typical range: 3 - 10
* samples_per_update
* Typical range: buffer_size
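Sketch of the behavioral_cloning block with the keys above; the demo path and values are placeholders:

```yaml
    behavioral_cloning:
      demo_path: Demos/Expert.demo   # placeholder path
      strength: 0.5                  # placeholder; scales BC relative to the RL updates
      steps: 0                       # 0 = BC active for the whole run
      batch_size: 512                # placeholder; falls back to the trainer batch_size if unset
      num_epoch: 3                   # placeholder
      samples_per_update: 0          # 0 = train over all demonstrations each update (assumed default)
```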
## Memory-enhanced Agents using Recurrent Neural Networks
* network_settings -> memory -> memory_size (default = 128)
* Typical range: 32 - 256
* network_settings -> memory -> sequence_length (default = 64)
* Typical range: 4 - 128
1. use discrete actions for better results
2. It is recommended to decrease num_layers when using recurrent networks
3. It is required that memory_size be divisible by 2.
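Sketch of where the memory settings sit under network_settings; the values are placeholders within the typical ranges above:

```yaml
    network_settings:
      memory:
        memory_size: 128      # must be divisible by 2
        sequence_length: 64
```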
# Model
## 2021-11-11
### Set-up
### Goal
Scoop the Goal Point, fill the bucket with particles, and then align the bucket with the ground
### Actions
4 discrete action branches with 3 actions each
1. Move left, do nothing, move right
2. Move up, do nothing, move down
3. Move front, do nothing, move back
4. Rotate positive, do nothing, Rotate negative
### Observations (19 + camera)
1. Camera (84x84) (Front camera)
2. Goal Position wrt excavator (3)
3. Goal Position wrt bucket zoom0 (3)
4. Goal Position wrt bucket zoom1 (3)
5. Goal Position wrt bucket zoom2 (3)
6. Bucket Position wrt excavator (3)
7. Bucket Weight (1)
8. Bucket collision (1)
9. Bucket tip direction (2)
<!-- 3. Bucket Angle(3) (need to confirm) -->
<!-- 5. Bucket Velocity(3+1) -->
### Rewards
* Agent Reward Function:
*
* Benchmark Mean Reward: 15.0
