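# RL

> Organization contact [name= [ierosodin](ierosodin@gmail.com)]

###### tags: `deep learning` `學習筆記`

==[Back to Catalog](https://hackmd.io/@ierosodin/Deep_Learning)==

* The policy iteration (PI) algorithm requires fewer stages than the value iteration (VI) algorithm to converge to the optimal policy. Note, however, that every stage of PI requires solving a set of linear equations (the policy evaluation stage), so a single stage of PI is computationally more expensive than a single stage of VI.
* Proof of policy iteration:
    * ![](https://i.imgur.com/bE6Smf4.png)
    * ![](https://i.imgur.com/aE5VHPn.png)
    * ![](https://i.imgur.com/3WzUSKV.png)
    * ![](https://i.imgur.com/fJ3nGeJ.png)
    * ![](https://i.imgur.com/zzefovH.png)
    * ![](https://i.imgur.com/kjRCdEN.png)
    * ![](https://i.imgur.com/YMlKZCm.png)
    * ![](https://i.imgur.com/fl258Jk.png)
    * ![](https://i.imgur.com/pyoLedH.png)
* On-policy vs. off-policy
    * The question is whether the policy or value function being estimated is the same as the policy used to generate the samples. If it is the same, the method is on-policy; otherwise it is off-policy.
* Monte Carlo Learning
    * Monte Carlo (MC) has many properties. In brief:
        * It mainly applies to learning from episodes
        * It learns from experience and does not need to know the model
        * The MC policy function is updated from the reward obtained when each episode reaches its end state; through repeated corrections, the policy function becomes more and more accurate
    * ![](https://i.imgur.com/jdD1G1G.png)
* Temporal-Difference Learning
    * TD-learning can be seen as an enhanced version of Monte Carlo. MC has to wait until the end of an episode before each update, so it is comparatively inefficient. TD-learning does not need to wait: it updates immediately after every step.
    * ![](https://i.imgur.com/O6YYYPp.png)
* Pros and cons of temporal-difference learning
    * Compared with Monte Carlo (MC), the advantages of TD-learning include:
        * TD-learning learns at every step, while MC learns only once the episode has ended
        * Because TD-learning learns at every step, it still learns something from incomplete episodes
        * TD-learning does not need episodes; it can learn in continuing tasks
        * TD-learning is more efficient than MC
    * However, because TD-learning updates at every step by bootstrapping from its own current estimates, its targets are biased. MC returns are unbiased but have higher variance, while TD targets have lower variance.
* Comparing dynamic programming (DP), Monte Carlo (MC), and temporal difference (TD)
    * DP:
        * ![](https://i.imgur.com/6KcNreK.png)
    * MC:
        * ![](https://i.imgur.com/m8u09pS.png)
    * TD:
        * ![](https://i.imgur.com/VAcBByS.png)
    * MC updates once, after the episode ends, while DP and TD can update continually
        * DP's updates are computed from the model
        * TD's updates come from sampled experience
* On-policy and off-policy
    * On-policy means learning from, and improving, the same policy that generated the experience
    * Off-policy means learning from experience generated elsewhere, that is, by a different policy, much like learning in real life
* Sarsa
    * Sarsa is an on-policy learning method. For an on-policy method, we must be able to estimate the current $s$ and $a$, that is, obtain the action-value function $q_\pi(s,a)$. The figure below shows more clearly how an episode is composed.
    * ![](https://i.imgur.com/klphOnE.png)
    * The name Sarsa comes from the fact that it describes the transition from one state-action pair to the next state-action pair: $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$
    * ![](https://i.imgur.com/p9xsrPt.png)
    * This is the Sarsa algorithm; the termination condition depends on which policy we adopt
    * ![](https://i.imgur.com/pSGwTmm.png)
* Q-Learning
    * Q-Learning is an off-policy learning method. It is very similar to Sarsa, but the advantage of being off-policy is that we can reuse old policies to keep experimenting. The difference is that Q-Learning evaluates all actions in the next state directly and greedily picks the largest action value $Q$.
    * ![](https://i.imgur.com/lgekfm8.png)
    * This is the Q-learning algorithm
    * ![](https://i.imgur.com/jISWq4q.png)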
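
The difference between the Sarsa and Q-Learning updates described above can be sketched in a few lines of Python. This is only a minimal tabular sketch, not the algorithm from the figures: the function names (`epsilon_greedy`, `sarsa_update`, `q_learning_update`), the dictionary-based Q-table keyed by `(state, action)`, and the parameters `alpha` (step size), `gamma` (discount) and `epsilon` (exploration rate) are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, n_actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    values = [Q[(state, a)] for a in range(n_actions)]
    return values.index(max(values))


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, done=False):
    """On-policy: bootstrap from a_next, the action the agent will actually take."""
    target = r if done else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def q_learning_update(Q, s, a, r, s_next, n_actions, alpha, gamma, done=False):
    """Off-policy: bootstrap from the greedy (maximum) action value in s_next."""
    best_next = max(Q[(s_next, b)] for b in range(n_actions))
    target = r if done else r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])


# The same transition (s=0, a=0, r=1.0, s_next=1) under both update rules.
# Hypothetical starting values: Q(1, 0) = 1.0 and Q(1, 1) = 2.0.
Q_sarsa = defaultdict(float, {(1, 0): 1.0, (1, 1): 2.0})
Q_qlearn = defaultdict(float, {(1, 0): 1.0, (1, 1): 2.0})

# Sarsa assumes the behaviour policy picked a_next = 0 in the next state.
sarsa_update(Q_sarsa, s=0, a=0, r=1.0, s_next=1, a_next=0, alpha=0.5, gamma=0.9)
# Q-Learning ignores which action is taken next and uses the maximum instead.
q_learning_update(Q_qlearn, s=0, a=0, r=1.0, s_next=1, n_actions=2, alpha=0.5, gamma=0.9)

print(Q_sarsa[(0, 0)], Q_qlearn[(0, 0)])
```

With these made-up numbers, the Sarsa target bootstraps from $Q(1, 0) = 1.0$, the value of the action that was actually chosen, while the Q-Learning target bootstraps from $\max_b Q(1, b) = 2.0$, so the Q-Learning estimate for $(0, 0)$ ends up larger. In a full training loop, `epsilon_greedy` would typically act as the behaviour policy for both methods.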