Practical_RL - Lecture 4 : Value Function Approximation

# Practical_RL - Lecture 4 : Value Function Approximation [toc] &emsp; &emsp; &emsp; ## Large-Scale Reinforcement Learning 解決 Reforcement Learning 問題的主要方法 : - 建立 MDP 模型 - 計算 value function，並建構一個表 (e.g. Q-table) - 每一個 state 都有一個 V(s) - 每一對的狀態與行動都有一個 Q(s, a) &emsp; 現實中可能出現環境 : - 雙陸棋 : 10^20^ states - 圍棋 : 10^170^ states - 直升機 : 連續的狀態空間 &emsp; ![](https://i.imgur.com/bWvQ5Tk.png) &emsp; &emsp; 在實際環境中，環境的狀態數通常是非常大的，如果我們使用上述的方法，那麼數據的規模也會非常大。 - 有太多的狀態或是行動需要記憶 - 逐一地學習每個狀態的價值太多時間了 - 遇到沒有見過的狀態時，表現會很差 &emsp; &emsp; ## Solution for large MDPs &emsp; 找到一種能夠適用於真實情況 (狀態空間很大) 的 RL 方法建立一個如下所示的近似函數 ![](https://i.imgur.com/49tlTQy.png) - 在值函數中加入了權重向量 w - 如此一來，我們不再需要建構所有狀態或是每一對 (狀態;動作)的表格，並且還能預測我們不曾遇到的狀態 - 參數 w 的更新方式為 MC 或 TD 方法 &emsp; &emsp; &emsp; ![](https://i.imgur.com/sQyJ2BD.png) &emsp; &emsp; &emsp; &emsp; ## Which Function Approximator? 常用的方法有兩個 : - Linear combinations of features 特徵線性組合 - Neural network 神經網路 &emsp; &emsp; &emsp; ## Incremental Methods (增量法) &emsp; :::info 找出一個最合理的權重 w，讓 w 能夠對於每一個輸入的狀態或是 (狀態;動作) 得出合理的價值函數 ::: &emsp; approximate value fn : $\hat{v}$ (s,w) true value fn : $v_π$(s) Mean Squared Value Error : ![](https://i.imgur.com/cTHus2v.png) &emsp; ### Gradient Descent (梯度下降法) &emsp; J(w) 是參數向量 w 的 differentiable function gradient of J(w) : ![](https://i.imgur.com/mJyOecu.png) &emsp; &emsp; &emsp; 使用 GD ，以 w 為變量，可以表示為 : &emsp; ![](https://i.imgur.com/Lr7MDmj.png) &emsp; α is a positive step-size parameter ![](https://i.imgur.com/X6utXON.png) &emsp; &emsp; &emsp; ### Feature Vectors (特徵向量) 特徵向量提取自 agent 當前的狀態 state，得到當前狀態中有用的訊息，根據這些有用的訊息進行梯度下降 &emsp; Represent state by a feature vector : &emsp; ![](https://i.imgur.com/sHF88KS.png) 例子 : - 機器人與地標的距離 - 股市走勢 &emsp; &emsp; &emsp; 將特徵做線性組合，並使用於價值函數的近似 &emsp; ![](https://i.imgur.com/Eld0q5B.png) &emsp; &emsp; &emsp; ### Incremental Prediction Algorithms &emsp; 我們並不知道一個 state 的準確 value 是多少，即 $v_π$(s) 是未知的，因此我們用 MC 與 TD 方法中的 target 來代替 $v_π$(s) &emsp; - MC 方法中的 target 是累積獎勵 $G_t$ - 就像是將一對對的狀態和從環境得到的獎勵作為訓練數據來進行監督學習訓練 &emsp; ![](https://i.imgur.com/HitXroV.png) &emsp; ![](https://i.imgur.com/N3f1nyF.png) &emsp; - TD(0) 方法中的 target 是 $R_{t+1}$ + γ $\hat{v}$($S_{t+1}$,w) - 訓練數據 : &emsp; ![](https://i.imgur.com/DpZ5Jsb.png) &emsp; ![](https://i.imgur.com/DdE4yDB.png) &emsp; δ = R + γ $\hat{v}$(S^'^,w) - $\hat{v}$(S,w) &emsp; - TD(λ) 方法中的 target 是 λ-return $G_t$^λ^ - 訓練數據 : &emsp; ![](https://i.imgur.com/AAaDCgU.png) &emsp; ![](https://i.imgur.com/kOCRKhh.png) &emsp; &emsp; &emsp; &emsp; ## Batch Methods (批處理法) - GD 方法雖然簡單且有吸引力，但是它對於樣本的使用效率並不高，在Incremental Prediction 中，我們每得到一個經驗數據，就以此為基礎，只進行一次梯度更新，更新結束後，這個經驗數據就不再使用了。 - Batch Methods 和 Incremental Methods 最大的區別在於，每一個數據不是只使用一次，而是反覆的用同一批數據進行更新。 - 在Batch方法中，我们希望能夠找到可以 fitting 某一批數據的擬合函數，以此來提高數據的使用率。 &emsp; ### Experience Replay (經驗回放) - 會有一個 buffer 來儲存一些經驗數據，每一次要開始更新的時候，從中抽出一些樣本，然後進行更新，會一直更新到 loss function 達到最小之後並停止更新 &emsp; ![](https://i.imgur.com/kfDePWW.png) &emsp; ### Least Squares Prediction - value function approximation $\hat{v}$(s, w) ≈ $v_π$(s) - 我們使用 Experience Replay Buffer，儲存一定數量的樣本數據，即我們有一個 <state,value> 的 pairs : ![](https://i.imgur.com/Eg55TPe.png) - 使用最小平方誤差作為損失函數，用於 fitting 當前的 Experience Replay Buffer 裡面的整個數據集的數據分布 ![](https://i.imgur.com/0u385q1.png) - 通過隨機梯度下降法進行參數的更新，重複以下步驟： 1. 從 buffer 中採樣： sample <s, v> from D 2. 進行參數更新： ![](https://i.imgur.com/EHicLL8.png) &emsp; &emsp; &emsp; ### Experience Replay in Deep Q-Learning Network (DQN) &emsp; ![](https://i.imgur.com/IKJCIT0.png) &emsp; &emsp; DQN 使用神經網路來近似值函數，神經網路的輸入是 state (或是state 跟 action) ，輸出是 Q (s,a) 值函數網路與 ε - greedy policy 之間的聯繫 : 首先環境會給出一個 obs，agent 根據值函數網路得到關於這個 obs 的所有 Q(s,a)，然後利用 ε - greedy 選擇 action 並做出決策，環境收到此 action 後會給出一個獎勵 Reward 及下一個 obs 。這樣是一個 step ，此時我們根據 Reward 去更新值函數網路的參數。 &emsp; ![](https://i.imgur.com/Q29Plgb.png) Since using histories of arbitrary length as inputs to a neural network can be difficult, our Q-function instead works on fixed length representation of histories produced by a function φ. The function φ from algorithm 1 applies this preprocessing to the last 4 frames of a history and stacks them to produce the input to the Q-function. The input to the neural network consists is an 84 × 84 × 4 image produced by φ. &emsp; 初始化後，對於每次疊代 : - 使用 ε - greedy policy 選擇 action $a_t$ - 將樣本 ( $s_t$,$a_t$,$r_{t+1}$,$s_{t+1}$ ) 儲存到 replay memory D - 從 D 中隨機抽取一個 minibatch 的 ( $s_j$,$a_j$,$r_{j+1}$,$s_{j+1}$ ) - 算出新的 Q-target : $r_j$ + γ max<sub>a^'^</sub> Q ( $s_{j+1}$ , a^'^ ; w ) - loss function : [ $r_j$ + γ max<sub>a^'^</sub> Q ( $s_{j+1}$ , a^'^ ; w ) - Q ( $s_j$ , $a_j$ ; w ) ] ^2^ - 以 w 為變量，對 loss function 梯度下降 &emsp; &emsp; ![](https://i.imgur.com/1O6DQmd.png) &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; 參考資料 : www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html https://docs.google.com/presentation/d/1HEfIyKT0rIuUQCGAsR1PIVGirccDXu5LQvxhVUjuIqM/edit https://zhuanlan.zhihu.com/p/32617897 https://wanjun0511.github.io/2017/11/05/DQN/ https://arxiv.org/abs/1312.5602