# Nepal Winter School 2019, RL lab assignment

## Setup

### ONLY ON WINDOWS:

Go to https://www.microsoft.com/en-us/download/details.aspx?id=100593, download the program and install it.

Also make sure you have Anaconda Python 3.6 or 3.7 installed. You also need to install some additional Python packages:

```
conda install -c anaconda mpi4py
conda install -c conda-forge pystan
conda install -c conda-forge pyglet
conda install -c conda-forge swig
pip install Box2D    # to use Box2D physics-based tasks
pip install atari-py # to use Atari game tasks
```

### Everyone:

I assume that you have Python 3.6/3.7 and git installed. If you don't, check out Miniconda: https://docs.conda.io/en/latest/miniconda.html

Download and install OpenAI Gym:

```
git clone https://github.com/openai/gym.git
cd gym
# install gym in developer mode
pip install -e .
```

Download and install OpenAI Spinning Up:

```
# IF you're in the gym directory, go one level up
cd ..

# ONLY IF you're on Ubuntu
sudo apt-get update && sudo apt-get install libopenmpi-dev

# ONLY IF you're on Mac
brew install openmpi

git clone https://github.com/openai/spinningup.git
cd spinningup
pip install -e .
```

Verify the installation by training a simple PPO agent:

```
python -m spinup.run ppo --hid "[16,16]" --env Pendulum-v0 --exp_name installtest --gamma 0.999
# THIS WILL TRAIN FOR QUITE A FEW MINUTES (~30min on my Mac),
# BUT please stop it after 3min at most. This is just to verify everything works.

# Look at the (partially) learned policy in action
python -m spinup.run test_policy data/installtest/installtest_s0

# Plot the reward over time
python -m spinup.run plot data/installtest/installtest_s0
```

## Assignment

### 1) Understand the environment

- While you're doing the next few steps, train a PPO agent on this environment, like we did above, but give it 10-15min, and change `installtest` to `pendulum` or something else you like
- Create a new Python script
- Implement the basic gym loop:

```python
import gym

env = gym.make("Pendulum-v0")

while True:
    obs = env.reset()
    done = False
    while not done:
        # in practice the action comes from your policy
        action = env.action_space.sample()
        obs, rew, done, misc = env.step(action)
        # optional
        env.render()
```

- Inspect the observation and action dimensions:

```python
# this gives you the dimensions but not the lower/upper bounds
print(env.observation_space)

# get the limits on the observations
print(env.observation_space.low)
print(env.observation_space.high)

# same for the action space
print(env.action_space)
print(env.action_space.low)
print(env.action_space.high)

# sample a few actions and print them
for i in range(5):
    print(env.action_space.sample())
```

- Look at the environment definition and understand what the observations mean: `gym/gym/envs/classic_control/pendulum.py`
- In particular, look at the `step` function (line 32) and where the observations come from (the `_get_obs` function, line 57)
- What do the different components in the observations stand for? What is `self.state[1]`/`thetadot` in the environment file, and what does it mean?
- Is this a discrete action ("turn left/right") or a continuous action ("turn left/right at \[0-1\]x speed")?

### 2) Modify the reward function

- At this point, look at the trained PPO policy
- If it didn't work, start it again but with more training epochs: `python -m spinup.run ppo --hid "[32,32]" --env Pendulum-v0 --exp_name pend2 --gamma 0.999 --epochs 200`
- The reward calculation ("costs", line 42) has different components. What do they stand for and how are they weighted?
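For orientation, the cost computation inside `step` looks roughly like the snippet below (check line 42 of your local `pendulum.py`; the exact formatting may differ between gym versions). The environment returns `-costs` as the reward, so larger costs mean lower reward:

```python
# th is the pendulum angle, thdot its angular velocity, u the applied torque
# (roughly as in gym's pendulum.py -- verify against your local copy)
costs = angle_normalize(th) ** 2 + 0.1 * thdot ** 2 + 0.001 * (u ** 2)
```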
Option A:
- In the reward calculation, increase the weight (penalty) for the angle offset and decrease the weight for the velocity.
- Train again, does it work better or worse?

or Option B:
- In the reward calculation, add another term that discourages rightward rotation (add a small penalty for negative actions `u`).
- Train again, does it work better or worse?

Tip: when training, don't just look at the agent's rollouts (video) but also look at the reward-over-time plot. (A rough sketch of both options follows below.)
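If you want a concrete starting point, here is a minimal sketch of how the `costs` line in `pendulum.py` could be changed for either option. These fragments are meant to replace the existing line inside the `step` function, not to run on their own; the modified weights (2.0, 0.01, 0.1) are just example values I picked for illustration, not prescribed ones:

```python
# Option A (sketch): weight the angle offset more heavily, the velocity less
costs = 2.0 * angle_normalize(th) ** 2 + 0.01 * thdot ** 2 + 0.001 * (u ** 2)

# Option B (sketch): keep the original terms and add a small penalty for
# negative torques u (i.e. discourage rightward rotation); 0.1 is a guess
costs = (angle_normalize(th) ** 2 + 0.1 * thdot ** 2 + 0.001 * (u ** 2)
         + 0.1 * max(0.0, -u))
```

Because gym was installed in developer mode (`pip install -e .`), edits to `pendulum.py` take effect immediately. After editing, retrain with the same command as before (use a new `--exp_name`, e.g. `pend_modified`) and compare the reward plots of the original and modified runs.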