# Nepal Winter School 2019, RL lab assignment
## Setup
### ONLY ON WINDOWS:
Go to https://www.microsoft.com/en-us/download/details.aspx?id=100593, download the installer, and run it. Also make sure you have Anaconda Python 3.6 or 3.7 installed.
You also need to install a few additional Python packages:
```
conda install -c anaconda mpi4py
conda install -c conda-forge pystan
conda install -c conda-forge pyglet
conda install -c conda-forge swig
pip install Box2D     # to use Box2D physics-based tasks
pip install atari-py  # to use Atari game tasks
```
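If you want to double-check that these packages went in correctly, you can try importing them from Python; this is just an optional sanity check:
```python
# optional sanity check: these imports should succeed after the installs above
import mpi4py
import pystan
import pyglet
import Box2D
import atari_py
print("all extra packages imported successfully")
```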
### Everyone:
I assume that you have Python 3.6/3.7 and git installed. If you don't, check out Miniconda: https://docs.conda.io/en/latest/miniconda.html
Download and install OpenAI Gym:
```
git clone https://github.com/openai/gym.git
cd gym
# install gym in developer mode
pip install -e .
```
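Optionally, confirm that gym is importable before moving on:
```python
# optional: check that gym is importable and which version you got
import gym
print(gym.__version__)
```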
Download and install OpenAI Spinning Up:
```
# IF you're in the gym directory, go one level up
cd ..
# ONLY IF you're on Ubuntu
sudo apt-get update && sudo apt-get install libopenmpi-dev
# ONLY IF you're on Mac
brew install openmpi
git clone https://github.com/openai/spinningup.git
cd spinningup
pip install -e .
```
Verify the installation by training a simple PPO agent:
```
python -m spinup.run ppo --hid "[16,16]" --env Pendulum-v0 --exp_name installtest --gamma 0.999
# THIS WILL TRAIN FOR QUITE A FEW MINUTES (~30min on my Mac)
# BUT please stop it after max 3min. This is just to verify everything works

# Look at the (partially) learned policy in action
python -m spinup.run test_policy data/installtest/installtest_s0

# Plot the reward over time
python -m spinup.run plot data/installtest/installtest_s0
```
## Assignment
### 1) Understand the environment
- While you're working through the next few steps, train a PPO agent on this environment in the background, like we did above, but give it 10-15 minutes and change `installtest` to `pendulum` or another name you like; see the example command below.
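  - for example: `python -m spinup.run ppo --hid "[16,16]" --env Pendulum-v0 --exp_name pendulum --gamma 0.999`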
- Create a new Python script
- Implement the basic gym loop:
```python
import gym

env = gym.make("Pendulum-v0")
while True:
    obs = env.reset()
    done = False
    while not done:
        # in practice the action comes from your policy
        action = env.action_space.sample()
        obs, rew, done, misc = env.step(action)
        # optional: visualize the environment
        env.render()
```
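If you're curious how well (or badly) the random policy does, a small variation of the loop above also sums up the reward per episode; this is optional:
```python
import gym

# same basic loop, but tracking the return (sum of rewards) of each episode
env = gym.make("Pendulum-v0")
for episode in range(3):
    obs = env.reset()
    done = False
    ep_ret = 0.0
    while not done:
        action = env.action_space.sample()  # random policy
        obs, rew, done, misc = env.step(action)
        ep_ret += rew
    print("episode", episode, "return", ep_ret)
```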
- Inspect the observation dimensions and action dimensions:
```python
# this gives you the dimensions but not the lower/upper bounds
print(env.observation_space)
# get the limits on the observations
print(env.observation_space.low)
print(env.observation_space.high)
# same for the action space
print(env.action_space)
print(env.action_space.low)
print(env.action_space.high)
# sample a few actions and print them
for i in range(5):
    print(env.action_space.sample())
```
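In case you want to double-check what you're seeing, the spaces also expose their shapes directly (for Pendulum-v0 the observation is 3-dimensional and the action is 1-dimensional); this uses the `env` from above:
```python
# the Box spaces also expose their shape directly
print(env.observation_space.shape)  # (3,) for Pendulum-v0
print(env.action_space.shape)       # (1,) for Pendulum-v0
```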
- Look at the environment definition and understand what the observations mean:
`gym/gym/envs/classic_control/pendulum.py`
  - in particular, look at the `step` function (line 32) and at where the observations come from (the `get_obs` function, line 57)
- What do the different components of the observation stand for? What is `self.state[1]` / `thetadot` in the environment file, and what does it mean?
- Is this a discrete action ("turn left/right") or a continuous action ("turn left/right at \[0-1\]x speed")?
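As a hint for reading `pendulum.py`: the observation is built from the internal state `(theta, thetadot)`. The snippet below paraphrases that conversion; check the `get_obs` function in the file for the exact code:
```python
import numpy as np

# paraphrase of how Pendulum-v0 turns its internal state into an observation:
# state = (theta, thetadot), observation = [cos(theta), sin(theta), thetadot]
theta, thetadot = 0.5, 0.0  # example values for the internal state
obs = np.array([np.cos(theta), np.sin(theta), thetadot])
print(obs)
```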
### 2) Modify the reward function
- At this point, look at the trained PPO policy
- If it didn't work, start it again but with more training epochs:
`python -m spinup.run ppo --hid "[32,32]" --env Pendulum-v0 --exp_name pend2 --gamma 0.999 --epochs 200`
- The reward calculation ("costs", line 42) has different components. What do they stand for, and how are they weighted?
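To get a feel for the structure of that cost before you edit it, here is a sketch that mirrors it outside the environment. The names `w_angle`, `w_vel`, `w_torque` are placeholders introduced here (they don't appear in the gym source); the actual values are what the question above asks you to find:
```python
import numpy as np

def angle_normalize(x):
    # maps an angle to the range [-pi, pi); the same helper exists in pendulum.py
    return ((x + np.pi) % (2 * np.pi)) - np.pi

def pendulum_costs(th, thdot, u, w_angle, w_vel, w_torque):
    # sketch of the structure of the Pendulum-v0 cost (the env returns reward = -costs);
    # w_angle, w_vel, w_torque stand in for the real weights in pendulum.py
    return w_angle * angle_normalize(th) ** 2 + w_vel * thdot ** 2 + w_torque * (u ** 2)

# example: upright (theta = 0) with no velocity and no torque has zero cost
print(pendulum_costs(0.0, 0.0, 0.0, w_angle=1.0, w_vel=1.0, w_torque=1.0))
```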
Option A:
- In the reward calculation, increase the weight (penalty) for the angle offset and decrease the weight for the velocity.
- Train again; does it work better or worse?
or
Option B:
- In the reward calculation, add another term that discourages rightward rotation, i.e. a small penalty for negative actions `u` (see the sketch below for one possible form).
- Train again; does it work better or worse?
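For Option B, one possible form of such a term is sketched below; the weight `0.01` is an arbitrary example value, not something from the gym source, so experiment with it:
```python
import numpy as np

def rightward_penalty(u, weight=0.01):
    # hypothetical extra cost term for Option B: penalize negative torques (rightward rotation);
    # the weight 0.01 is an arbitrary example value
    return weight * np.maximum(-u, 0.0)

print(rightward_penalty(-1.0))  # negative action -> positive penalty
print(rightward_penalty(1.0))   # positive action -> no penalty
```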
Tip: when training, don't just watch the agent's rollouts (video), but also look at the reward-over-time plot.