# Reproduce DR-DRL
[TOC]
---
### Goals
| Saliency Maps | Reward curve |
| -------- | -------- |
|  |  |
### (5/9) Meeting Notes
**Goals:**
Rainbow reproduction.
**Completed:**
YB, JY: Rainbow code (GitHub: https://github.com/Kaixhin/Rainbow)
CH: Visualization tool (GitHub: https://github.com/greydanus/visualize_atari)
**Next week's goals (deadline 5/16):**
YB: Complete the code.
CH: Rainbow reproduction.
JY: Paper survey for reward decomposition.
---
### (5/16) Meeting Notes
**Last week's goal:**
Implementation sprint 1.
(Progress reports: YB -> CY -> CH)
**Next week's goals (deadline 5/23):**
Implementation sprint 2.
YB: Confirm the DR-DRL code runs correctly; pick one game and train it for 2,500,000 steps (10 epochs) as a test. Next week's progress report must include a reward curve.
CH: Rainbow training; summarize the training details and hyper-parameters (deadline: 5/18 23:59). The saliency maps must be running by next week.
JY: Paper survey for reward decomposition; look for papers that cite DR-DRL. Once YB's code is ready, decide whether to help with training.
---
### (5/23) Meeting Notes
**Last week's goal:**
Implementation sprint 2.
(Progress reports: YB -> JY -> CH)
**Next week's goals (deadline 5/30):**
Final week: finish the work!
YB: Confirm the DR-DRL code runs correctly; pick one game and train it for 2,500,000 steps (10 epochs) as a test. Next week's progress report must include a reward curve. Summarize how to use your code and which commands to run for training.
CH: Write up the training guide, train Rainbow on Hero (100 epochs), and fix the saliency map code.
JY: Paper survey for reward decomposition.
---
### (5/31) Meeting Notes
**Last week's goal:**
Final week: finish the work!
(With Anna shut down, we really couldn't get anywhere.)
**Next week's goals (deadline 6/07):**
Final week again: finish the work!
YB: Basically a continuation of last week. Confirm the DR-DRL code runs correctly; pick one game and train it for 2,500,000 steps (10 epochs) as a test. Next week's progress report must include a reward curve. Summarize how to use your code and which commands to run for training. (Report when done, and we can discuss the follow-up training (? if Anna is available).)
CH: Train Rainbow (100 epochs). If YB finishes the code, help with DR-DRL training.
JY: Paper survey for reward decomposition.
---
### (6/7) Meeting Notes
**Johnson**
Bugs in the current DR-DRL code (a sketch for items 2 and 4 follows this list):
1. Colorful saliency
2. agent.act: should action selection run under no_grad?
3. model.forward: does not return out (the intermediate tensor)
4. model.embedding_setup: the one-hot embeddings may be altered by gradient descent
5. No additional variable_states field in model => return multiple values instead
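
A minimal PyTorch sketch of possible fixes for items 2 and 4; the class shapes and names (`online_net`, `embed_dim`) are illustrative assumptions, not the actual DR-DRL code.
```python=
import torch
import torch.nn as nn

class Agent:
    def __init__(self, online_net):
        self.online_net = online_net

    def act(self, state):
        # Item 2: action selection is pure inference, so wrap it in no_grad
        # to avoid building a computation graph while acting.
        with torch.no_grad():
            q = self.online_net(state.unsqueeze(0))
            return q.argmax(1).item()

class Model(nn.Module):
    def __init__(self, num_channels, embed_dim):
        super().__init__()
        # Item 4: register the channel one-hots as a buffer (not a Parameter),
        # so gradient descent cannot alter them.
        self.register_buffer('onehots', torch.eye(num_channels, embed_dim))
```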
---
### (6/14) Meeting Notes
**Schedule**
| Date | Goal |
| -------- | -------- |
| June 20 | Day off |
| June 27 | DR-DRL code, Rainbow training, saliency maps |

**TODOs:**
- Continue paper survey (@FJY) (@Lance)
- DRDRL:
    - Debug policy network bug (@YB)
    - Write disentangled regularization term (@YB)
    - List regularization-term search space (@YB)
    - Run over the search space (@Johnson)
    - Finish tuning
- Saliency:
    - Colorful (@Lance)
- Rainbow:
    - Run on all maps (@Johnson) (2,500,000 steps, ? runs)
---
### (6/28) Meeting Notes
**Schedule**
Finish these tasks before 7/5 (Sun).
**TODOs:**
- Rainbow training (@Johnson)
    - Remember to keep `checkpoint.pth`, `metrics.pth`, `model.pth` (a small backup sketch follows this list).
- DRDRL training (@YB)
    - Log probability
    - Loss
    - Tuning
    - Remember to keep `checkpoint.pth`, `metrics.pth`, `model.pth`.
- Saliency maps (@Lance)
- Paper survey (@Lance) (@FJY)
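
A small convenience sketch for the "remember to keep" items, assuming the Rainbow-style `results/<id>/` layout; the backup path and game id are assumptions, not part of the repo.
```python=
# Copy the three files that must be kept into a timestamped backup folder.
import shutil, time
from pathlib import Path

run_dir = Path('./results/hero')                         # assumed result directory
backup_dir = Path(f'./backups/hero_{int(time.time())}')  # assumed backup location
backup_dir.mkdir(parents=True, exist_ok=True)
for name in ('checkpoint.pth', 'metrics.pth', 'model.pth'):
    shutil.copy(run_dir / name, backup_dir / name)
```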
### (7/12) Meeting Notes
**Schedule**
<img src='https://i.imgur.com/QirSgOK.png' width='550px'>
**DRDRL-2 (Seaquest)**
- $\lambda = 0.5$
- 10 epochs

**DRDRL-2 (Hero)**
- $\lambda = 6.25 \times 10^{-5}$

**7/13-7/17 TODOs (step by step)**
Weekly Goal: Tune $\lambda$ on NVIDIA NGC.
- Hyperparameter tuning (@Lance, @Tiger)
    - $\lambda$ (assuming we have 16-32 parallel machines)
- Read curve points or log with TensorBoard (@Lance)
- Checkpoint (load model) (@Lance)
    - `results/<game-id>/checkpoint.pth`
    - `results/<game-id>/model.pth`
    - How to load a model for testing (command)
- Code refinement (@Tiger) (an argparse sketch follows this list)
    - $\lambda$ as a command-line argument
    - Number of channels as a command-line argument (2 or 3)
- Build Docker (@Bob, @Johnson)
    - Write a Dockerfile for Rainbow & DR-DRL
- Paper survey (@FJY)
    - Reward decomposition
- Detailed notes (@Johnson)
- Tune $\lambda$ on NVIDIA NGC (@Johnson)
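
A minimal argparse sketch for the code-refinement items above; the flag names `--lambda-coef` and `--num-channels` and their defaults are assumptions, not the project's actual CLI.
```python=
import argparse

parser = argparse.ArgumentParser(description='DR-DRL options (illustrative)')
# Hypothetical flag: weight of the disentanglement regularization term.
parser.add_argument('--lambda-coef', type=float, default=6.25e-5,
                    help='weight of the disentanglement regularization term')
# Hypothetical flag: number of reward channels (2 or 3).
parser.add_argument('--num-channels', type=int, default=2, choices=[2, 3],
                    help='number of reward channels')
args = parser.parse_args()
print(args.lambda_coef, args.num_channels)
```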
### (7/20) Meeting Notes
@Johnson received the code & README.
Comments can be added inside the code.
In the README, the commands & args and their explanations can be more detailed (like an open-source project); environment setup steps and so on can also be included.
Experiments for @Johnson to run (all on 6 maps); a logging sketch follows this section.
- Performance & time (wall-clock time)
    - Time: sum of training time (training only)
- Log
    - After each round of testing, log the current timespan:
        - "Average test reward": (Timestep, Test Average Reward)
        - "Wall clock time": (Timestep, WallClockTime), recorded in seconds starting from zero?
    - Time: sum of testing time (testing only)
[Reproduce]
- Rainbow
- DR-DRL (2 channels)
- DR-DRL (3 channels)
[Ours]
- (+) DR-DRL (Addition)
- DR-DRL+FFT (2 channels)
- DR-DRL+FFT (3 channels)
- DR-DRL+ExpectationRegularization (2 channels, expectation instead of disentanglement)
- DR-DRL+ExpectationRegularization (3 channels, expectation instead of disentanglement)
[More...]
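
A hedged sketch of the logging format described above; the function name and dictionary keys are illustrative assumptions, not the project's actual logger.
```python=
import time

train_start = time.time()
metrics = {'Average test reward': [], 'Wall clock time': []}

def log_evaluation(timestep, test_rewards):
    # After each evaluation, record (timestep, average test reward) and
    # (timestep, wall-clock seconds elapsed since training started).
    avg_reward = sum(test_rewards) / len(test_rewards)
    metrics['Average test reward'].append((timestep, avg_reward))
    metrics['Wall clock time'].append((timestep, time.time() - train_start))

# Example: an evaluation at timestep 125000 with 10 evaluation episodes.
log_evaluation(125000, [200] * 10)
print(metrics)
```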
---
### Coding
- The empirical value of the loss is about 3.6.
- The disentanglement loss lies in roughly $[-36, 0]$ (a bound set by the 1e-8 epsilon; $2\log(10^{-8}) \approx -36.8$).
---
### Bugs
- (solved) conv1d implementation
<img src='https://i.imgur.com/kk6HY6x.png' width='550px'>
- (solved) outdated package replaced by Pillow
- (solved) regularization term modified (to avoid issues)
- (solved) conv1d modified to fully utilize GPUs
- (solved) explicit device assignment
- (solved) evaluation bugs
- <font color="red">**(unsolved)**</font> bugs in the C++ code when rendering
- (solved) disentanglement loss
    - (solved) taking the square root of a (potentially) negative loss: use exp instead
    - (solved) taking the log of zero (in the KL divergence for disentanglement): add 1e-8
```
jd12 tensor([[nan]])
jd21 tensor([[nan]])
jd tensor([[nan]])
```
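
A minimal sketch of the two fixes above, assuming per-channel categorical distributions `p` and `q` (illustrative tensors, not the actual DR-DRL variables):
```python=
import torch

eps = 1e-8
p = torch.softmax(torch.randn(4, 51), dim=1)
q = torch.softmax(torch.randn(4, 51), dim=1)

# Fix for log(0): add eps inside every log so the KL terms never produce NaN.
jd12 = (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=1)
jd21 = (q * (torch.log(q + eps) - torch.log(p + eps))).sum(dim=1)

# Fix for sqrt of a negative value: a quantity that can dip below zero is passed
# through exp (always positive) instead of sqrt before it enters the loss.
diff = jd12 - jd21            # may be negative
safe_term = torch.exp(diff)   # sqrt(diff) would yield NaN here; exp is safe
print(jd12, jd21, safe_term)
```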

---
### FJY Paper List
#### TODO
- Branches of reward decomposition papers (10-20 or more)
- Summary (2-3 points) of each paper
***0525***
- [Distributional reinforcement learning with linear function approximation](https://arxiv.org/abs/1902.03149), AISTATS 2019
- [Distributional Reinforcement Learning with Quantile Regression](https://arxiv.org/abs/1710.10044), AAAI 2018
- [Implicit Quantile Networks for Distributional Reinforcement Learning](https://arxiv.org/abs/1806.06923), ICML 2018
***0607***
- [RUDDER: Return Decomposition for Delayed Rewards](https://arxiv.org/abs/1806.07857), NeurIPS 2019
- [Distributional reinforcement learning with linear function approximation](https://arxiv.org/abs/1902.03149), AISTATS 2019
***0628***
- [Fully Parameterized Quantile Function for Distributional Reinforcement Learning](https://arxiv.org/abs/1911.02140), NeurIPS 2019
- [Distributional Deep Reinforcement Learning with a Mixture of Gaussians](https://ieeexplore.ieee.org/abstract/document/8793505), ICRA 2019
---
### Training Guide
This section walks you step by step through setting up that Rainbow code. (Don't ask why I'm not using the pretrained models at https://github.com/Kaixhin/Rainbow/releases ; I couldn't get them to run. If you know how, tell me. Besides, they don't include StarGunner, so we still have to train ourselves!)
- Step 1: Clone the repository: `git clone https://github.com/Kaixhin/Rainbow.git`
- Step 2: Make sure `atari-py`, `OpenCV Python`, `Plotly`, and `PyTorch` are installed.
- Step 3: Fix a small bug in the code. Use `ctrl+F` to find line A below and change it to B:
A: `parser.add_argument('--checkpoint-interval', default=0, help='How often to checkpoint the model, defaults to 0 (never checkpoint)')`
B: `parser.add_argument('--checkpoint-interval', type=int, default=0, help='How often to checkpoint the model, defaults to 0 (never checkpoint)')`
- Step 4: Start training. I have checked that the default hyper-parameter values match the Rainbow paper, so they do not need to be changed. This step mainly sets a few file locations and picks the game. The arguments to set are `--id`, `--game`, `--T-max`, `--evaluation-interval`, `--checkpoint-interval`, and `--memory`. Here is an example training recipe (training Hero):
    - First enter the freshly cloned Rainbow folder (`cd Rainbow`) and create a file named `memory_hero` under the results folder (`cd results`, `touch memory_hero`).
    - Go back to the Rainbow folder and run: `python main.py --id hero --game hero --T-max 25000000 --evaluation-interval 125000 --checkpoint-interval 1250000 --memory ./results/memory_hero`. It may seem unresponsive for a while at the start; it is loading the memory and running the steps before training begins (controlled by `--learn-start`, but don't change it, as it is also one of the hyper-parameters).
- Step 5: You should then see the following output, which means training has started.
<img src='https://i.imgur.com/eEw9cbT.png' width='450px'>
- Step 6: After training finishes, the training results appear under the `results/<id>` folder (with the example above, the id is `hero`).
<img src='https://i.imgur.com/FT8S2R2.png' width='200px'>
```bash=
# Useful commands:
# Evaluation:
python main.py --id hero \
--game hero \
--T-max 1625000 \
--model ./results/hero/model.pth \
--evaluation-interval 125000 \
--checkpoint-interval 500000 \
--memory ./results/memory_hero
# DR-DRL training
python main.py --id hero \
--game hero \
--T-max 25000000 \
--evaluation-interval 1000 \
--checkpoint-interval 20000 \
--memory ./results/memory_hero
# Gain saliency maps
python main.py --id seaquest \
--game seaquest \
--require-saliency-map \
--evaluate \
--model ./results/seaquest/model.pth \
--memory ./results/memory_seaquest
# ---------------------------------
# Check whether the GPU is being used
nvidia-smi -l 1
# Select the GPU
echo $CUDA_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=0,1
# Check the environment/versions
python -m torch.utils.collect_env
# ---------------------------------
# (new)
# Training...
python main.py --id hero \
--game hero \
--T-max 25000000 \
--learn-start 20000 \
--evaluation-interval 20000 \
--checkpoint-interval 1250000 \
--memory ./results/memory_hero_test \
--network drdrl \
--mode fft-convolution \
--num-channels 2
```
---
```python=
# read_curve.py
import torch
metrics = {'steps': [], 'rewards': [], 'Qs': [], 'best_avg_reward': -float('inf')}  # default schema
metrics = torch.load('./results/seaquest/metrics.pth')  # overwritten by the saved metrics
print(metrics['steps'])
'''
metrics['steps'] is a list of timesteps spaced by evaluation-interval (EI).
e.g. [125000, 250000, 375000, 500000, 625000, 750000, 875000, 1000000, 1125000,
1250000, 1375000, 1500000, 1625000, 1750000, ...]
...
metrics['rewards'] has one entry per evaluation step, and each entry contains
evaluation-episodes reward values (EI x EE).
e.g. [[200, 200, 200, 200, 200, 200, 200, 200, 200, 200],
[640, 640, 640, 640, 640, 640, 640, 640, 640, 640], ...]
...
'''
```
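
A short follow-up sketch that turns the metrics above into a reward curve by averaging the EE rewards at each evaluation step (assumes `matplotlib` is installed; the output filename is arbitrary):
```python=
import torch
import matplotlib.pyplot as plt

metrics = torch.load('./results/seaquest/metrics.pth')
steps = metrics['steps']
avg_rewards = [sum(r) / len(r) for r in metrics['rewards']]  # mean over EE episodes

plt.plot(steps, avg_rewards)
plt.xlabel('Timestep')
plt.ylabel('Average test reward')
plt.savefig('reward_curve_seaquest.png')  # hypothetical output path
```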
---
### Saliency Maps
<img src='https://i.imgur.com/ZKq2wxZ.gif' width='250px'>
---
### DRDRL code
- ***Network Structure***
<img src='https://i.imgur.com/cgKFr1q.png' width='550px'>
### Notes
$1\ \text{epoch}=0.25M\ \text{steps}$, so 10 epochs = 2.5M steps and 100 epochs = 25M steps (matching `--T-max 25000000`).