# Practical_RL - Lecture 1 : Introduction to Reinforcement Learning

[toc]

## Reinforcement Learning

**Branches of Machine Learning :**

![](https://i.imgur.com/aPpDOZR.png)

**Reinforcement Learning in a nutshell :**

The common idea :

- Get data by trial and error and error and error and error
- Learn (situation) → (optimal action)
- Repeat

![](https://i.imgur.com/nQdqn61.png)

**Decision process : Agent and Environment**

![](https://i.imgur.com/3gcH4Mq.png)

![](https://i.imgur.com/r1bykFA.png)

How Reinforcement Learning differs from other machine learning methods :

- There is no supervisor, only a reward signal
- Feedback is delayed, not instantaneous
- Agent's actions affect the subsequent data it receives

**Examples of Reinforcement Learning :**

1. Reality check : dynamic systems

![](https://i.imgur.com/o11qFG7.jpg)

![](https://i.imgur.com/fThlSbt.png)

2. ~~Reality~~ check : videogames

![](https://i.imgur.com/fdPRQkw.png)

**Reward :**

- The reward $R_t$ is a scalar feedback signal
- It indicates how well the agent is doing at step $t$
- The agent's job is to maximise the cumulative reward

![](https://i.imgur.com/OEs8IVD.png)

**Markov Decision Process (MDP) :**

An MDP is a mathematical model of how an agent's policy interacts with an environment and earns reward, where the state of the environment has the Markov property.

Markov property : a concept from probability theory. A stochastic process has the Markov property if, given the present state and all past states, the conditional probability distribution of its future states depends only on the present state. In other words, given the present state, the future is conditionally independent of the past states (the history of the process).

![](https://i.imgur.com/yuePmxF.png)

$P(s_{t+1} \mid s_t, a_t)$ is the probability of reaching state $s_{t+1}$ when the agent takes action $a_t$ in state $s_t$ and interacts with the environment.

MDP diagram :

![](https://i.imgur.com/pPzlKuC.png)

The state has the Markov property :

![](https://i.imgur.com/GC1kyqD.png)

Markov assumption :

![](https://i.imgur.com/Zzb2ysg.png)

**Exploration and Exploitation : choosing actions**

- Reinforcement learning is like trial-and-error learning
- The agent should discover a good policy from its experiences of the environment
- Exploration finds more information about the environment
- Exploitation exploits known information to maximise reward
- It is usually important to explore as well as exploit

Examples :

- Choosing a restaurant
    - Exploitation : go to your favourite restaurant
    - Exploration : try a new restaurant
- Playing a game
    - Exploitation : play the move you believe is best
    - Exploration : play an experimental random move

![](https://i.imgur.com/ilYoMC2.png)

![](https://i.imgur.com/uaBQXsD.png)
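
The exploration/exploitation trade-off can be illustrated with a small sketch. It assumes an epsilon-greedy rule on a hypothetical multi-armed bandit with Gaussian rewards; neither is prescribed by these notes, and the names `epsilon_greedy` and `run_bandit` as well as all parameter values are made up for illustration. The loop follows the agent/environment picture: the agent picks an action, the environment returns a scalar reward, and the agent updates its estimates to increase cumulative reward.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        # Exploration: try a random action (the "new restaurant").
        return rng.randrange(len(q_values))
    # Exploitation: take the action with the highest estimated reward.
    return max(range(len(q_values)), key=lambda a: q_values[a])


def run_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Agent-environment loop on a hypothetical Gaussian bandit (an assumption)."""
    rng = random.Random(seed)
    q_values = [0.0] * len(true_means)  # agent's reward estimate per action
    counts = [0] * len(true_means)      # how often each action was taken
    total_reward = 0.0
    for _ in range(steps):
        action = epsilon_greedy(q_values, epsilon, rng)
        # Environment step: a noisy scalar reward R_t for the chosen action.
        reward = rng.gauss(true_means[action], 1.0)
        counts[action] += 1
        # Incremental mean update of the estimate for this action.
        q_values[action] += (reward - q_values[action]) / counts[action]
        total_reward += reward
    return q_values, counts, total_reward


if __name__ == "__main__":
    q, n, total = run_bandit([0.0, 0.5, 2.0])
    print("estimates:", [round(v, 2) for v in q])
    print("pull counts:", n)
    print("cumulative reward:", round(total, 1))
```

With `epsilon=0` the agent never explores and can stay stuck on whichever action looked good first; with `epsilon=1` it never uses what it has learned. Values in between trade a little reward now for better information later.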