# Reproduce DR-DRL
[TOC]
---
### Goals
| Saliency Maps | Reward curve |
| -------- | -------- |
|  |  |
### (5/9) Meeting Notes
**Goals:**
Rainbow reproduction.
**Completed:**
YB, JY: Rainbow code (GitHub: https://github.com/Kaixhin/Rainbow)
CH: Visualization tool (GitHub: https://github.com/greydanus/visualize_atari)
**Next week's goals (deadline 5/16):**
YB: Complete the code.
CH: Rainbow reproduction.
JY: Paper survey for reward decomposition.
---
### (5/16) Meeting Notes
**Last week's goal:**
Implementation sprint 1.
(Progress reports: YB -> CY -> CH)
**Next week's goals (deadline 5/23):**
Implementation sprint 2.
YB: Confirm the DR-DRL code runs correctly; pick one game and train it for 2,500,000 steps (10 epochs) as a test. Next week's progress report must include a reward curve.
CH: Rainbow training; summarize the training details and hyper-parameters (deadline: 5/18 23:59). The saliency maps must be running by next week.
JY: Paper survey for reward decomposition; look for papers that cite DR-DRL. Once YB's code is ready, decide whether to help with training.
---
### (5/23) Meeting Notes
**Last week's goal:**
Implementation sprint 2.
(Progress reports: YB -> JY -> CH)
**Next week's goals (deadline 5/30):**
Final week: finish the work!
YB: Confirm the DR-DRL code runs correctly; pick one game and train it for 2,500,000 steps (10 epochs) as a test. Next week's progress report must include a reward curve. Summarize how to use your code and which commands to run for training.
CH: Write up the training guide, train Rainbow on Hero (100 epochs), and fix the saliency map code.
JY: Paper survey for reward decomposition.
---
### (5/31) Meeting Notes
**Last week's goal:**
Final week: finish the work!
(With Anna shut down, we really couldn't get anywhere.)
**Next week's goals (deadline 6/07):**
Final week again: finish the work!
YB: Basically a continuation of last week. Confirm the DR-DRL code runs correctly; pick one game and train it for 2,500,000 steps (10 epochs) as a test. Next week's progress report must include a reward curve. Summarize how to use your code and which commands to run for training. (Report when done, and we can discuss the follow-up training (? if Anna is available).)
CH: Train Rainbow (100 epochs). If YB finishes the code, help with DR-DRL training.
JY: Paper survey for reward decomposition.
---
### (6/7) Meeting Notes
**Johnson**
Bugs in the current DR-DRL code (a sketch for items 2 and 4 follows this list):
1. Colorful saliency
2. agent.act: should action selection run under no_grad?
3. model.forward: does not return out (the intermediate tensor)
4. model.embedding_setup: the one-hot embeddings may be altered by gradient descent
5. No additional variable_states field in model => return multiple values instead
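
A minimal PyTorch sketch of possible fixes for items 2 and 4; the class shapes and names (`online_net`, `embed_dim`) are illustrative assumptions, not the actual DR-DRL code.
```python=
import torch
import torch.nn as nn

class Agent:
    def __init__(self, online_net):
        self.online_net = online_net

    def act(self, state):
        # Item 2: action selection is pure inference, so wrap it in no_grad
        # to avoid building a computation graph while acting.
        with torch.no_grad():
            q = self.online_net(state.unsqueeze(0))
            return q.argmax(1).item()

class Model(nn.Module):
    def __init__(self, num_channels, embed_dim):
        super().__init__()
        # Item 4: register the channel one-hots as a buffer (not a Parameter),
        # so gradient descent cannot alter them.
        self.register_buffer('onehots', torch.eye(num_channels, embed_dim))
```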
---
### (6/14) Meeting Notes
**Schedule**
| Date | Goal |
| -------- | -------- |
| June 20 | Day off |
| June 27 | DR-DRL code, Rainbow training, saliency maps |

**TODOs:**
- Continue paper survey (@FJY) (@Lance)
- DRDRL:
    - Debug policy network bug (@YB)
    - Write disentangled regularization term (@YB)
    - List regularization-term search space (@YB)
    - Run over the search space (@Johnson)
    - Finish tuning
- Saliency:
    - Colorful (@Lance)
- Rainbow:
    - Run on all maps (@Johnson) (2,500,000 steps, ? runs)
---
### (6/28) Meeting Notes
**Schedule**
Finish these tasks before 7/5 (Sun).
**TODOs:**
- Rainbow training (@Johnson)
    - Remember to keep `checkpoint.pth`, `metrics.pth`, `model.pth` (a small backup sketch follows this list).
- DRDRL training (@YB)
    - Log probability
    - Loss
    - Tuning
    - Remember to keep `checkpoint.pth`, `metrics.pth`, `model.pth`.
- Saliency maps (@Lance)
- Paper survey (@Lance) (@FJY)
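
A small convenience sketch for the "remember to keep" items, assuming the Rainbow-style `results/<id>/` layout; the backup path and game id are assumptions, not part of the repo.
```python=
# Copy the three files that must be kept into a timestamped backup folder.
import shutil, time
from pathlib import Path

run_dir = Path('./results/hero')                         # assumed result directory
backup_dir = Path(f'./backups/hero_{int(time.time())}')  # assumed backup location
backup_dir.mkdir(parents=True, exist_ok=True)
for name in ('checkpoint.pth', 'metrics.pth', 'model.pth'):
    shutil.copy(run_dir / name, backup_dir / name)
```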
### (7/12) Meeting Notes
**Schedule**
<img src='https://i.imgur.com/QirSgOK.png' width='550px'>
**DRDRL-2 (Seaquest)**
- $\lambda = 0.5$
- 10 epochs

**DRDRL-2 (Hero)**
- $\lambda = 6.25 \times 10^{-5}$

**7/13-7/17 TODOs (step by step)**
Weekly Goal: Tune $\lambda$ on NVIDIA NGC.
- Hyperparameter tuning (@Lance, @Tiger)
    - $\lambda$ (assuming we have 16-32 parallel machines)
- Read curve points or log with TensorBoard (@Lance)
- Checkpoint (load model) (@Lance)
    - `results/<game-id>/checkpoint.pth`
    - `results/<game-id>/model.pth`
    - How to load a model for testing (command)
- Code refinement (@Tiger) (an argparse sketch follows this list)
    - $\lambda$ as a command-line argument
    - Number of channels as a command-line argument (2 or 3)
- Build Docker (@Bob, @Johnson)
    - Write a Dockerfile for Rainbow & DR-DRL
- Paper survey (@FJY)
    - Reward decomposition
- Detailed notes (@Johnson)
- Tune $\lambda$ on NVIDIA NGC (@Johnson)
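
A minimal argparse sketch for the code-refinement items above; the flag names `--lambda-coef` and `--num-channels` and their defaults are assumptions, not the project's actual CLI.
```python=
import argparse

parser = argparse.ArgumentParser(description='DR-DRL options (illustrative)')
# Hypothetical flag: weight of the disentanglement regularization term.
parser.add_argument('--lambda-coef', type=float, default=6.25e-5,
                    help='weight of the disentanglement regularization term')
# Hypothetical flag: number of reward channels (2 or 3).
parser.add_argument('--num-channels', type=int, default=2, choices=[2, 3],
                    help='number of reward channels')
args = parser.parse_args()
print(args.lambda_coef, args.num_channels)
```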
### (7/20) Meeting Notes
@Johnson received the code & README.
Comments can be added inside the code.
In the README, the commands & args and their explanations can be more detailed (like an open-source project); environment setup steps and so on can also be included.
Experiments for @Johnson to run (all on 6 maps); a logging sketch follows this section.
- Performance & time (wall-clock time)
    - Time: sum of training time (training only)
- Log
    - After each round of testing, log the current timespan:
        - "Average test reward": (Timestep, Test Average Reward)
        - "Wall clock time": (Timestep, WallClockTime), recorded in seconds starting from zero?
    - Time: sum of testing time (testing only)
[Reproduce]
- Rainbow
- DR-DRL (2 channels)
- DR-DRL (3 channels)
[Ours]
- (+) DR-DRL (Addition)
- DR-DRL+FFT (2 channels)
- DR-DRL+FFT (3 channels)
- DR-DRL+ExpectationRegularization (2 channels, expectation instead of disentanglement)
- DR-DRL+ExpectationRegularization (3 channels, expectation instead of disentanglement)
[More...]
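
A hedged sketch of the logging format described above; the function name and dictionary keys are illustrative assumptions, not the project's actual logger.
```python=
import time

train_start = time.time()
metrics = {'Average test reward': [], 'Wall clock time': []}

def log_evaluation(timestep, test_rewards):
    # After each evaluation, record (timestep, average test reward) and
    # (timestep, wall-clock seconds elapsed since training started).
    avg_reward = sum(test_rewards) / len(test_rewards)
    metrics['Average test reward'].append((timestep, avg_reward))
    metrics['Wall clock time'].append((timestep, time.time() - train_start))

# Example: an evaluation at timestep 125000 with 10 evaluation episodes.
log_evaluation(125000, [200] * 10)
print(metrics)
```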
---
### Coding
- The empirical value of the loss is about 3.6.
- The disentanglement loss lies in roughly $[-36, 0]$ (a bound set by the 1e-8 epsilon; $2\log(10^{-8}) \approx -36.8$).
---
### Bugs
- (solved) conv1d implementation
<img src='https://i.imgur.com/kk6HY6x.png' width='550px'>
- (solved) outdated package replaced by Pillow
- (solved) regularization term modified (to avoid issues)
- (solved) conv1d modified to fully utilize GPUs
- (solved) explicit device assignment
- (solved) evaluation bugs
- <font color="red">**(unsolved)**</font> bugs in the C++ code when rendering
- (solved) disentanglement loss
    - (solved) taking the square root of a (potentially) negative loss: use exp instead
    - (solved) taking the log of zero (in the KL divergence for disentanglement): add 1e-8
```
jd12 tensor([[nan]])
jd21 tensor([[nan]])
jd tensor([[nan]])
```
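
A minimal sketch of the two fixes above, assuming per-channel categorical distributions `p` and `q` (illustrative tensors, not the actual DR-DRL variables):
```python=
import torch

eps = 1e-8
p = torch.softmax(torch.randn(4, 51), dim=1)
q = torch.softmax(torch.randn(4, 51), dim=1)

# Fix for log(0): add eps inside every log so the KL terms never produce NaN.
jd12 = (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=1)
jd21 = (q * (torch.log(q + eps) - torch.log(p + eps))).sum(dim=1)

# Fix for sqrt of a negative value: a quantity that can dip below zero is passed
# through exp (always positive) instead of sqrt before it enters the loss.
diff = jd12 - jd21            # may be negative
safe_term = torch.exp(diff)   # sqrt(diff) would yield NaN here; exp is safe
print(jd12, jd21, safe_term)
```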

---
### FJY Paper List
#### TODO
- Branches of reward decomposition papers (10-20 or more)
- Summary (2-3 points) of each paper
***0525***
- [Distributional reinforcement learning with linear function approximation](https://arxiv.org/abs/1902.03149), AISTATS 2019
- [Distributional Reinforcement Learning with Quantile Regression](https://arxiv.org/abs/1710.10044), AAAI 2018
- [Implicit Quantile Networks for Distributional Reinforcement Learning](https://arxiv.org/abs/1806.06923), ICML 2018
***0607***
- [RUDDER: Return Decomposition for Delayed Rewards](https://arxiv.org/abs/1806.07857), NeurIPS 2019
- [Distributional reinforcement learning with linear function approximation](https://arxiv.org/abs/1902.03149), AISTATS 2019
***0628***
- [Fully Parameterized Quantile Function for Distributional Reinforcement Learning](https://arxiv.org/abs/1911.02140), NeurIPS 2019
- [Distributional Deep Reinforcement Learning with a Mixture of Gaussians](https://ieeexplore.ieee.org/abstract/document/8793505), ICRA 2019
---
### Training Guide
This section walks you step by step through setting up that Rainbow code. (Don't ask why I'm not using the pretrained models at https://github.com/Kaixhin/Rainbow/releases ; I couldn't get them to run. If you know how, tell me. Besides, they don't include StarGunner, so we still have to train ourselves!)
- Step 1: Clone the repository: `git clone https://github.com/Kaixhin/Rainbow.git`
- Step 2: Make sure `atari-py`, `OpenCV Python`, `Plotly`, and `PyTorch` are installed.
- Step 3: Fix a small bug in the code. Use `ctrl+F` to find line A below and change it to B:
A: `parser.add_argument('--checkpoint-interval', default=0, help='How often to checkpoint the model, defaults to 0 (never checkpoint)')`
B: `parser.add_argument('--checkpoint-interval', type=int, default=0, help='How often to checkpoint the model, defaults to 0 (never checkpoint)')`
- Step 4: Start training. I have checked that the default hyper-parameter values match the Rainbow paper, so they do not need to be changed. This step mainly sets a few file locations and picks the game. The arguments to set are `--id`, `--game`, `--T-max`, `--evaluation-interval`, `--checkpoint-interval`, and `--memory`. Here is an example training recipe (training Hero):
    - First enter the freshly cloned Rainbow folder (`cd Rainbow`) and create a file named `memory_hero` under the results folder (`cd results`, `touch memory_hero`).
    - Go back to the Rainbow folder and run: `python main.py --id hero --game hero --T-max 25000000 --evaluation-interval 125000 --checkpoint-interval 1250000 --memory ./results/memory_hero`. It may seem unresponsive for a while at the start; it is loading the memory and running the steps before training begins (controlled by `--learn-start`, but don't change it, as it is also one of the hyper-parameters).
- Step 5: You should then see the following output, which means training has started.
<img src='https://i.imgur.com/eEw9cbT.png' width='450px'>
- Step 6: After training finishes, the training results appear under the `results/<id>` folder (with the example above, the id is `hero`).
<img src='https://i.imgur.com/FT8S2R2.png' width='200px'>
```bash=
# Useful commands:
# Evaluation:
python main.py --id hero \
--game hero \
--T-max 1625000 \
--model ./results/hero/model.pth \
--evaluation-interval 125000 \
--checkpoint-interval 500000 \
--memory ./results/memory_hero
# DR-DRL training
python main.py --id hero \
--game hero \
--T-max 25000000 \
--evaluation-interval 1000 \
--checkpoint-interval 20000 \
--memory ./results/memory_hero
# Gain saliency maps
python main.py --id seaquest \
--game seaquest \
--require-saliency-map \
--evaluate \
--model ./results/seaquest/model.pth \
--memory ./results/memory_seaquest
# ---------------------------------
# Check whether the GPU is being used
nvidia-smi -l 1
# Select the GPU
echo $CUDA_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=0,1
# Check the environment/versions
python -m torch.utils.collect_env
# ---------------------------------
# (new)
# Training...
python main.py --id hero \
--game hero \
--T-max 25000000 \
--learn-start 20000 \
--evaluation-interval 20000 \
--checkpoint-interval 1250000 \
--memory ./results/memory_hero_test \
--network drdrl \
--mode fft-convolution \
--num-channels 2
```
---
```python=
# read_curve.py
import torch
metrics = {'steps': [], 'rewards': [], 'Qs': [], 'best_avg_reward': -float('inf')}  # default schema
metrics = torch.load('./results/seaquest/metrics.pth')  # overwritten by the saved metrics
print(metrics['steps'])
'''
metrics['steps'] is a list of timesteps spaced by evaluation-interval (EI).
e.g. [125000, 250000, 375000, 500000, 625000, 750000, 875000, 1000000, 1125000,
1250000, 1375000, 1500000, 1625000, 1750000, ...]
...
metrics['rewards'] has one entry per evaluation step, and each entry contains
evaluation-episodes reward values (EI x EE).
e.g. [[200, 200, 200, 200, 200, 200, 200, 200, 200, 200],
[640, 640, 640, 640, 640, 640, 640, 640, 640, 640], ...]
...
'''
```
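
A short follow-up sketch that turns the metrics above into a reward curve by averaging the EE rewards at each evaluation step (assumes `matplotlib` is installed; the output filename is arbitrary):
```python=
import torch
import matplotlib.pyplot as plt

metrics = torch.load('./results/seaquest/metrics.pth')
steps = metrics['steps']
avg_rewards = [sum(r) / len(r) for r in metrics['rewards']]  # mean over EE episodes

plt.plot(steps, avg_rewards)
plt.xlabel('Timestep')
plt.ylabel('Average test reward')
plt.savefig('reward_curve_seaquest.png')  # hypothetical output path
```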
---
### Saliency Maps
<img src='https://i.imgur.com/ZKq2wxZ.gif' width='250px'>
---
### DRDRL code
- ***Network Structure***
<img src='https://i.imgur.com/cgKFr1q.png' width='550px'>
### Notes
$1\ \text{epoch}=0.25M\ \text{steps}$, so 10 epochs = 2.5M steps and 100 epochs = 25M steps (matching `--T-max 25000000`).