# Practical_RL - Lecture 2 : Dynamic Programming

[TOC]

## Given dynamics, how to find an optimal policy ?

![](https://i.imgur.com/qPBtDES.png)

![](https://i.imgur.com/pjhNkKs.png)

![](https://i.imgur.com/OjelHP0.png)

### **Reward discounting : γ**

**0 ≤ γ < 1**

![](https://i.imgur.com/ByX2EeY.png)

- Discounting makes the sum of rewards finite.

![](https://i.imgur.com/TM80oPR.png)

:::info
**Why use a discount factor γ ?**

1. Discounted rewards are mathematically convenient to work with.
2. It avoids infinite returns in cyclic or infinite-horizon MDPs.
3. It expresses our uncertainty about the future.
4. If rewards are financial, immediate rewards can earn more interest than delayed rewards.
:::

- Reward only for WHAT, never for HOW.

When the objective is to maximize cumulative reward, the agent's behaviour can differ from what we predicted in unexpected ways. If we do not fully understand the agent and the environment and the reward is poorly designed, the agent may collect as much reward as possible by never completing the task, or by finishing it in a way we did not intend.

### **State- and Action-value functions :**

![](https://i.imgur.com/RdajUT7.png)

**State-value function v(s) :**

![](https://i.imgur.com/c4m9VnU.png)

**Action-value function q(s, a) :**

![](https://i.imgur.com/jHWS14K.png)

### **[Bellman](https://hackmd.io/MXCru1uRQ4iUeAnMJzVdZA) Expectation Equation :**

![](https://i.imgur.com/jOTOrRH.png)

![](https://i.imgur.com/IKWCac4.png)

### **Optimal Value Function :**

The central goal of reinforcement learning is to find the best policy, the one that maximizes reward.

![](https://i.imgur.com/EEZA8ij.png)

![](https://i.imgur.com/cpcWSNk.png)

![](https://i.imgur.com/qMUbcWd.png)

### **Generalized Policy Iteration :**

#### 1. Policy Evaluation
#### 2. Policy Improvement

![](https://i.imgur.com/R9irH46.png)

![](https://i.imgur.com/UPL5cTt.png)

![](https://i.imgur.com/S5JP6Cm.png)

Worked calculations for the small-gridworld policy-evaluation figure above (random policy, reward -1 per step, using the rounded values displayed on the slide; a runnable sketch of these backups is given at the end of these notes) :

- Computing -1.7 : 0.25 × [(-1) + (-1)] × 3 + 0.25 × [(-1) + 0] = -1.75
- Computing -2.0 : 0.25 × [(-1) + (-1)] × 4 = -2
- Computing -2.4 : 0.25 × [(-1) + (-2)] × 2 + 0.25 × [(-1) + 0] + 0.25 × [(-1) + (-1.7)] = -2.425

![](https://i.imgur.com/PPMNnbU.png)

References :

- http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
- http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/DP.pdf
- https://docs.google.com/presentation/d/1lz2oIUTvd2MHWKEQSH8hquS66oe4MZ_eRvVViZs2uuE
- https://blog.csdn.net/mmc2015/article/details/52859611
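### **Appendix : Generalized Policy Iteration sketch**

As a concrete companion to the Generalized Policy Iteration section, here is a minimal sketch of iterative policy evaluation plus one greedy improvement step on the 4 × 4 gridworld used in Silver's DP slides (terminal corner states, reward -1 per step, equiprobable random policy, undiscounted). The grid layout, function names, and the use of NumPy are assumptions for illustration, not code from the lecture; the values printed after two and three sweeps should match the -1.7 / -2.0 / -2.4 figures above up to rounding.

```python
import numpy as np

# Sketch (assumed setup, not lecture code): 4x4 gridworld, terminal corner
# states, reward -1 per transition, equiprobable random policy, gamma = 1.
N = 4                                          # grid is N x N
TERMINALS = {0, N * N - 1}                     # shaded corner states
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
GAMMA = 1.0                                    # undiscounted episodic task

def step(state, action):
    """Deterministic dynamics: moving off the grid leaves the state unchanged."""
    if state in TERMINALS:
        return state, 0.0                      # terminal states are absorbing
    row, col = divmod(state, N)
    d_row, d_col = action
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < N and 0 <= new_col < N:
        next_state = new_row * N + new_col
    else:
        next_state = state                     # bumped into the wall: stay put
    return next_state, -1.0

def policy_evaluation(sweeps):
    """Run `sweeps` synchronous Bellman expectation backups for the random policy."""
    v = np.zeros(N * N)
    for _ in range(sweeps):
        v_new = np.zeros_like(v)
        for s in range(N * N):
            if s in TERMINALS:
                continue
            backups = [r + GAMMA * v[s2] for s2, r in (step(s, a) for a in ACTIONS)]
            v_new[s] = 0.25 * sum(backups)     # pi(a|s) = 0.25 for every action
        v = v_new
    return v

def greedy_policy(v):
    """Policy improvement: act greedily with respect to the current value function."""
    policy = {}
    for s in range(N * N):
        if s in TERMINALS:
            continue
        q = [r + GAMMA * v[s2] for s2, r in (step(s, a) for a in ACTIONS)]
        policy[s] = int(np.argmax(q))          # index into ACTIONS
    return policy

if __name__ == "__main__":
    for k in (2, 3):
        print(f"value function after k = {k} sweeps:")
        print(np.round(policy_evaluation(k).reshape(N, N), 2))
    # One evaluation + one improvement step of generalized policy iteration
    print("greedy policy w.r.t. the k = 3 values:", greedy_policy(policy_evaluation(3)))
```

The evaluation uses synchronous full-sweep backups, which is what the slide's k-indexed value grids show; in-place (asynchronous) updates would also converge but give slightly different intermediate values.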