Tsen
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note No publishing access yet

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.

      Your account was recently created. Publishing will be available soon, allowing you to share notes on your public page and in search results.

      Your team account was recently created. Publishing will be available soon, allowing you to share notes on your public page and in search results.

      Explore these features while you wait
      Complete general settings
      Bookmark and like published notes
      Write a few more notes
      Complete general settings
      Write a few more notes
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights New
    • Engagement control
    • Make a copy
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Note Insights Versions and GitHub Sync Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Engagement control Make a copy Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note No publishing access yet

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.

    Your account was recently created. Publishing will be available soon, allowing you to share notes on your public page and in search results.

    Your team account was recently created. Publishing will be available soon, allowing you to share notes on your public page and in search results.

    Explore these features while you wait
    Complete general settings
    Bookmark and like published notes
    Write a few more notes
    Complete general settings
    Write a few more notes
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       Owned this note    Owned this note      
    Published Linked with GitHub
    3
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    --- tags: RL --- # Reinforcement Learning | Introduction and Model-Based Control 111學年第二學期選修交大資工所謝秉均老師開設之強化學習原理。 因為觀念複雜且不容易消化,所以決定在每次上完課之後,好好把課堂學到的東西整理成筆記,內化成自己的。 (希望我能堅持下去QQ) 另外我在上這堂課的同時,也有看 Stanford 的線上 RL 課程,所以筆記可能也會稍微整合一些些 Stanford 線上課的內容。 Reinforcement Learning | Stanford Online: https://www.youtube.com/watch?v=FgzM3zpZ55o&list=PLoROMvodv4rOSOPzutgyCTapiGlY2Nd8u&index=1 ## Introduction ### Overview 我們可以用一句話簡述強化學習 (Reinforcement Learning): Learn to take a sequences of actions to achieve a specific goal in an unknown environment. - Learn: 表示我們事先不知道環境是如何活動的 - Sequence of actions: 表示和環境多次的互動 - Goal: 例如獲得高 reward RL 的四個重點議題,也讓 RL 和其他機器學習方法有所分別: * Optimization: Agent 需要最佳化自己的 action 來獲得最大的 reward * Exploration: 透過與環境互動來嘗試不同的決定並學習,但可能會面臨 exploration 與 exploitation 兩者的 trade off * Delayed consequences: 某個決定可能之後才有影響 * Generalization: Agent 是否能判斷某個 action 在之前沒經歷過的 state 下的好壞? > 比較: > Supervised Learning: Optimization, Generalization > Imitation Learning: Optimization, Generalization, Delayed consequences ### Sequential Decision Making Agent 和 Environment 的互動可以用下圖來表示: ![](https://i.imgur.com/CsjbwV1.png) > 另一種 trajectory 的註記方式為:$s_{0}, a_{0}, r_{0}, s_{1}, ...$ ,兩個都可以 在 agent 和 environment 的交互之下,一個 agent 會採取一連串的 actions $\{a_{t}\}$,並觀察到一連串的 states $\{s_{t}\}$ 以及獲得一連串的 rewards $\{r_{t}\}$。 一個 RL 演算法常包含以下要素: 1. **Model**: 描述環境會如何針對 agent 的行為而改變 包含 **transition model**,是所有 states 的集合 $S$ 的機率分布: $$ P(s_{t+1}=s^\prime|s_{t}=s, a_{t}=a) $$ 以及 **reward model**: $$ r(s_{t}=s, a_{t}=a)=\mathbb{E}[r_{t}|s_{t}=s, a_{t}=a] $$ 2. **Policy**: 從 states 映射到 actions 的函數,決定 agent 會如何採取的 action 可以是 deterministic 的: $$ \pi(s)=a $$ 也可以是 stochastic 的: $$ \pi(a|s)=P(a_{t}=a|s_{t}=s) $$ 3. **Value function**: 當遵從某個 policy 時,在特定 state 或 action 之下可以獲得的 future rewards,定義為: $$ V^\pi(s_{t}=s)=\mathbb{E}_{\pi}[r_{t+1}+\gamma r_{t+2}+\gamma ^2 r_{t+3}+...|s_{t}=s] $$ 以上三點更詳細的概念會在後續陸續提到。 根據是否有明確的 model,RL algorithm 可以被分為以下兩種: * Model-based * Model-free 下圖分類了各種不同的 RL agents: ![](https://i.imgur.com/XEO0m1n.png =500x) ## Markov Decision Process 建構環境 (Model the environment) 的方法有很多種,其中最基礎且重要的方法是 Markov Decision Process (MDP)。 ### Markov chain Markov chain 是一種隨機過程,我們可以用經典的 Mars Rover Problem 作為例子: ![](https://i.imgur.com/6OZsugy.png) 一個抽樣的 trajectory 記作 $s_{0}, s_{1}, s_{2}, ..., s_{t}, s_{t+1}, ...$,例如 $x_{3}, x_{4}, x_{5}, x_{4}, x_{5}, ...$,而箭頭上的數值代表 transition probability,例如 $P(s_{t+1}=x_{2}|s_{t}=x_{1})=0.5$ Markov chain 符合 Markov property,意即給定當前的 state ,則未來和過去的狀態無關。 :::info **Markov Property** A state $s_{t}$ is Markov if and only if $$ P(s_{t+1}|s_{t})=P(s_{t+1}|s_{1}, ..., s_{t}) $$ ::: :::info **Markov Chain** A Markov chain $(S, P)$ is specified by: 1. State space $S$: a (finite) set of possible states. 2. Transition matrix $P=[P_{ss^\prime}]$ with elements $P_{ss^\prime}=P(s_{t+1}=s^\prime|s_{t}=s)$ ::: $P$ 是一個 $|S|\times|S|$ 的矩陣,即 $$ P= \left[ \begin{array}{ccc} P_{11} & P_{12} & ... & P_{1n} \\ P_{21} & P_{22} & ... & P_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ P_{n1} & P_{n2} & ... & P_{nn} \end{array} \right] $$ 其中 $P_{ij}=P(s_{t+1=j}|s_{t=i})$ $P$ 具有兩個性質: 1. 每個 entry 都是非負的 這是因為每個 entry 都是一個機率值,不會出現負數 2. Row sum = 1 例如:$P$ 的第一行是由 $s_{1}$ 轉換到 $s_{1}, ..., s_{n}$ 的機率,故全部的值相加會等於 $1$ ### Markov Reward Process 在 Markov chain 的基礎上,加上 reward function,則我們可以定義 Markov Reward Process: :::info **Markov Reward Process** An MRP $(S, P, R, \gamma)$ is specified by *Underlying dynamics*: 1. State space $S$ 2. Transition matrix $P=[P_{ss^\prime}]$ with $P_{ss^\prime}=P(s_{t+1}=s^\prime|s_{t}=s)$ *Task*: 3. Reward function $R_{s}=\mathbb{E}[r_{t+1}|s_{t}=s]$ 4. Discount factor $\gamma \in [0, 1]$ ::: Markov Reward Process 用於描述一個在不同狀態之間轉換的系統,其中每個狀態都有一個與之相關聯的即時 reward,並且轉換的過程滿足 Markov Property。 :::warning **Return** $G_{t}$ is the cumulative discounted rewards over a single trajectory from $t$ onward (random). $$ G_{t}=r_{t+1}+\gamma r_{t+2}+\gamma ^2 r_{t+3}+...=\sum_{k=0}^\infty \gamma^k r_{t+k+1} $$ ::: :::warning **State-value function** $V(s)$ is the expected return if state $s$ is the starting state. $$ \begin{aligned} V(s) &=\mathbb{E}[G_{t}|s_{t}=s] \\ &= \mathbb{E}[r_{t+1}+\gamma r_{t+2}+\gamma ^2 r_{t+3}+...|s_{t}=s] \end{aligned} $$ ::: Return 的計算十分簡單。同樣再以 Mars Rover Problem 作為例子,令在 $x_{1}$ 的 reward $R_{1}=0.05$,$R_{6}=1$,其他為 $0$。令 $\gamma = 0.9$,對於 episode $x_{5}, x_{6}, x_{6}, x_{5}, x_{4}$,return 為: $$ G_{t} = 0 + (1 \times 0.9) + (1 \times 0.9^2) + (0 \times 0.9^3) + (0 \times 0.9^4) $$ 而 state-value function $V(s)$ 該如何計算呢? 因為 $V(s)$ 是 $G_{t}$ 的期望值,一個非常直觀的方法是使用暴力解,只要我們抽樣夠多的 trajectories,則它們 $G_{t}$ 的平均即是 $V(s)$。 另一種較常見的方法是使用 dynamic programming: $$ \begin{aligned} V(s) &=\mathbb{E}[G_{t}|s_{t}=s] \\ &= \mathbb{E}[r_{t+1}+\gamma r_{t+2}+\gamma ^2 r_{t+3}+...|s_{t}=s] \\ &= \mathbb{E}[r_{t+1}+\gamma G_{t+1}|s_{t}=s] \\ &= \mathbb{E}[r_{t+1}|s_{t}=s]+ \gamma \mathbb{E}[G_{t+1}|s_{t}=s] \\ &= R_{s} + \gamma \sum_{s^\prime} P(s^\prime|s_{t}=s) \mathbb{E}[G_{t+1}|s_t=s, s_{t+1}=s^\prime] \\ &= R_{s} + \gamma \sum_{s^\prime} P(s^\prime|s_{t}=s) \mathbb{E}[G_{t+1}|s_{t+1}=s^\prime] \\ &= R_{s} + \gamma \sum_{s^\prime} P_{ss^\prime} V(s^\prime) \end{aligned} $$ > $\mathbb{E}[G_{t+1}|s_t=s, s_{t+1}=s^\prime]=V(s^\prime)$ 是因為符合 Markov Property 由此我們導出在 MRP 使用的 Bellman Expectation Equation: :::warning **Bellman Expectation Equation** for MRP $$ V(s)=R_{s} + \gamma \sum_{s^\prime} P_{ss^\prime} V(s^\prime) $$ ::: Bellman Expectation Equation 寫成矩陣型式: $$ V=R+ \gamma PV \\ => \left[ \begin{array}{ccc} V(s_1) \\ V(s_2) \\ \vdots \\ V(s_n) \end{array} \right] = \left[ \begin{array}{ccc} R_1 \\ R_2 \\ \vdots \\ R_n \end{array} \right] + \gamma \left[ \begin{array}{ccc} P_{11} & P_{12} & ... & P_{1n} \\ P_{21} & P_{22} & ... & P_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ P_{n1} & P_{n2} & ... & P_{nn} \end{array} \right] \left[ \begin{array}{ccc} V(s_1) \\ V(s_2) \\ \vdots \\ V(s_n) \end{array} \right] $$ 因為 $V=R+ \gamma PV$,經由推導我們可得 $V=(I - \gamma P)^{-1} R$,得到 $V$ 之解。 然而若直接計算反矩陣來得到解,計算的時間複雜度為$O(n^3)$,非常沒有效率,故我們需要其他的解法,這會在之後的內容中介紹。 --- 在 MRP 中我們還使用到了 discount factor $\gamma$。直觀而言,我們較不在乎離當前時間越遠的未來,故 future reward 會乘上 $\gamma$;數學上,有了 discount factor,可以避免 return 無法收斂。 $\gamma$ 的選擇: * Continuing environment: 固定的 $\gamma < 1$ * Episodic environment: $\gamma \leq 1$ ### Markov Decision Process 在 Markov Reward Process 的基礎上,加上各種可能的 actions ,則我們可以定義 Markov Decision Process: :::info **Markov Decision Process** An MDP $(S, A, P, R, \gamma)$ is specified by *Underlying dynamics*: 1. State space $S$ 2. Action space $A$ 3. Transition matrix $P=[P_{ss^\prime}^a]$ with $P_{ss^\prime}^a=P(s_{t+1}=s^\prime|s_{t}=s, a_t=a)$ *Task*: 3. Reward function $R_{s}^a=\mathbb{E}[r_{t+1}|s_{t}=s, a_t=a]$ 4. Discount factor $\gamma \in [0, 1]$ ::: > MDP 描述一個由狀態、行動、獎勵和轉移概率所構成的決策過程 有了 states 和 actions 後,我們便能夠定義 **policy**: :::warning A (randomized) **policy** $\pi$ is a conditional distribution over possible actions given state $s$, i.e., for any $s \in S$, $a \in A$, $$ \pi(a|s):=P(a_t=a|s_t=s) $$ ::: 因為我們已經假設整個過程符合 Markov Property,故這裡我們專注於討論 stationary policy,意指 $\pi$ 和時間沒有關係。 對於一個 MDP,如果我們固定 policy $\pi(a|s)$,則: $$ P_{ss^\prime}^\pi = \sum_{a \in A} \pi(a|s) P_{ss^\prime}^a $$ > 此固定 policy 之下由 $s \rightarrow s^\prime$ 的機率 = $\sum_{a \in A}$ (此 policy 下在 $s$ 下採取 $a$ 的機率) X (採取 $a$ 時 $s \rightarrow s^\prime$ 的機率) $$ R_{s}^\pi = \sum_{a \in A} \pi(a|s) R_{s}^a $$ > 此固定 policy 之下 $s$ 作為起點的 reward = $\sum_{a \in A}$ (此 policy 下在 $s$ 下採取 $a$ 的機率) X (採取 $a$ 時 $s$ 作為起點的 reward) :::danger 綜合以上兩點,我們可以發現對於一個 MDP,如果我們固定 policy $\pi(a|s)$,則我們可以得到 $\pi$-induced MRP $(S, P^\pi, R^\pi, \gamma)$ ::: 多了 actions 之後,我們便可以定義 MDP 的 return 和 state-value function: :::warning **Return** $G_{t}$ is the cumulative discounted rewards over a single trajectory from $t$ onward (random). $$ G_{t}=r_{t+1}+\gamma r_{t+2}+\gamma ^2 r_{t+3}+...=\sum_{k=0}^\infty \gamma^k r(s_{t+k+1}, a_{t+k+1}) $$ ::: :::warning **State-value function** $V^\pi(s)$ is the expected return if state $s$ is the starting state. $$ V^\pi(s)=\mathbb{E}[G_{t}|s_{t}=s;\pi] $$ ::: 值得注意的是,$V^\pi(s)$ 具有隨機性,來源包含: 1. Reward 2. State transition 3. Policy ## Model-Based Evaluation and Control ### MDP Policy Evaluation 我們該如何評價一個 policy 的好壞呢? 在此需要先介紹新的定義: :::warning **Action-value function** $Q^\pi(s, a)$ is the expected return if we start from state $s$ and take action $a$, and then follow the original policy $\pi$. $$ Q^\pi(s, a)= \mathbb{E}[G_t|s_t=s, a_t=a;\pi] $$ ::: ![](https://i.imgur.com/SgsGL7V.png) $V^\pi(s)$ 和 $Q^\pi(s, a)$ 是具有關聯性的,可以互相以對方表示,也可以寫成遞迴式: :::warning **Bellman Expectation Equations**: * $V$ written in $Q$: $$ V^\pi(s)= \sum_{a \in A} \pi(a|s)Q^\pi(s, a) $$ * $Q$ written in $V$: $$ Q^\pi(s, a)=R_{s}^a + \gamma \sum_{s^\prime \in S} P_{ss^\prime}^a V^\pi(s^\prime) $$ * $V$ written in $V$: $$ V^\pi(s)=\sum_{a \in A} \pi(a|s) \Bigr[R_{s}^a + \gamma \sum_{s^\prime \in S} P_{ss^\prime}^a V^\pi(s^\prime) \Bigr] $$ * $Q$ written in $Q$: $$ Q^\pi(s, a)=R_{s}^a + \gamma \sum_{s^\prime \in S} P_{ss^\prime}^a \Bigr[\sum_{a^\prime \in A} \pi(a^\prime|s^\prime)Q^\pi(s^\prime, a^\prime) \Bigr] $$ ::: 因為 $V^\pi$ 可以寫成自己的遞迴式,因此給定一個 policy $\pi$,我們可以找出 $V^\pi$ 的值,此即為 policy evaluation。 兩種 policy evaluation 的方法: 1. Non-iterative policy evaluation Matrx form: $$ \begin{aligned} &V^\pi=R^\pi+ \gamma P^\pi V^\pi \\ =>\ &V^\pi=(I- \gamma P^\pi)^{-1} R^\pi \end{aligned} $$ 3. Iterative policy evaluation :::info **Iterative MDP Policy Evaluation (IPE)** For a fixed policy $\pi$: Initialize $V_0^\pi(s)=0$ for all $s$ For $k = 1, 2, ...$ &emsp;for all $s$ &emsp;&emsp;$V^\pi_k(s)=\sum_{a \in A} \pi(a|s) \Bigr[R_{s}^a + \gamma \sum_{s^\prime \in S} P_{ss^\prime}^a V^\pi_{k-1}(s^\prime) \Bigr]$ ::: 代入 $V_0^\pi(s)$ 可以得到 $V_1^\pi(s)$,代入 $V_1^\pi(s)$ 可以得到 $V_2^\pi(s)$,迭代直到 converge。 > 如果初始值 $V_0^\pi(s)=V^\pi(s)$,則一次迴圈就收斂啦 ___ 我們可能會好奇,利用 IPE 計算是否真的能夠使 $V^\pi_k(s)$ 收斂至 $V^\pi(s)$? 考慮一個 matrix vector space $V_0^\pi(s), V_1^\pi(s), ..., V_k^\pi(s) \in \mathbb{R}^{|S|}$,我們可以透過證明以下兩步驟來證明 IPE 會使 $V_k^\pi(s)$ converge: 1. IPE is an contraction operator. 2. Under any contraction operator, the points converge to a fixed point. 定義 IPE Operator: :::warning **IPE Operator** (also called **Bellman Expectation Backup Operator**) $$ T^\pi(V):=R^\pi+ \gamma P^\pi V $$ ::: 我們使用 $L_\infty$-norm 來衡量兩個 value function $V, V^\prime$ 之間的距離: $$ ||V-V^\prime||_\infty := max_{s \in S} |V(s) - V(s^\prime)| $$ **IPE is an contraction operator.** *proof*: $$ \begin{aligned} ||T^\pi(V) - T^\pi(V^\prime)||_\infty &= ||(R^\pi+ \gamma P^\pi V) - (R^\pi+ \gamma P^\pi V^\prime)||_\infty \\ &= \gamma ||P^\pi(V-V^\prime)||_\infty \\ &\leq \gamma ||V-V^\prime||_\infty \end{aligned} $$ > 第二行至第三行不等式成立是因為 $P^\pi$ 的每個 entry 都 $\leq 1$ 我們稱 $T^\pi$ 為 $\gamma$-contraction operator。 **Under any contraction operator, the points converge to a fixed point.** 上述可以根據 Banach Fixed-Point Theorem 得證: :::info **Banach Fixed-Point Theorem** For any non-empty complete metric space, if $T$ is a $\gamma$-contraction operator, then $T$ has a unique fixed point. ::: > Complete: 表示極限 (limit) 是在 space 裡面 :::danger Summary: Under **IPE**, the value function $V_k^\pi$ converges to the correct $V^\pi$, for **any** initialization $V_0^\pi$. ::: ### Finding Optimal Policy 我們知道如何進行 policy evaluation 了,根據計算出來的評價,以下討論如何去找到最佳的 (optimal) policy。 定義: :::warning **Optimal State-Value Function** $$ V^*(s)=\max_\pi V^\pi(s) $$ ::: > 對任意 state $s$ 遵循最佳 policy 所能得到的 expected return :::warning **Optimal Action-Value Function** $$ Q^*(s, a)=\max_\pi Q^\pi(s, a) $$ ::: > 對任意 state $s$ 執行 $a$ 後再遵循最佳 policy 繼續執行所能得到的 expected return 什麼條件下才能說一個 policy 比另一個更好呢? :::info **Partial Ordering of Policies** $$ \pi \geq \pi^\prime \ \rm{if} \ V^\pi(s) \geq V^{\pi^\prime}(s), \forall s $$ ::: > 注意!一定要對每個狀態都成立 :::info **Optimal Policy** A policy $\pi^*$ is an **optimal policy** if it is better than or equal to all other policies, i.e. $$ \pi^* \geq \pi,\ \rm{for\ all} \ \pi $$ ::: $\pi^*$ 無論如何都會存在嗎? :::info **Theorem (Existence of Optimal Policy)** For any finite MDP, there always exists an optimal policy $\pi^*$ s.t. $$ \pi^* \geq \pi,\ \rm{for\ all} \ \pi $$ ::: ___ Optimal policy 存在,但該如何找到? 當我們計算出了每個狀態下的最佳 action-value function $Q^*(s, a)$ 時,我們便可以依此找出最優策略: $$ \pi^*(a|s) = \arg\max_{a\in A} Q^*(s,a) $$ 也就是說對於每一個狀態 $s$,我們都能夠確定最佳的動作 $a$(即最大化 $Q^*(s,a)$ 的動作) ,從而找到對應的 optimal policy。 因此,**計算出 $Q^*(s, a)$ 等同於找到 optimal policy**。 接下來將討論如何找到 $Q^*(s, a)$。 定義: :::warning **Bellman Optimality Equations**: * $V^*$ written in $Q^*$: $$ V^*(s)= \max_a Q^*(s, a) $$ * $Q^*$ written in $V^*$: $$ Q^*(s, a)=R_{s}^a + \gamma \sum_{s^\prime \in S} P_{ss^\prime}^a V^*(s^\prime) $$ * $V^*$ written in $V^*$: $$ V^*(s)=\max_a \Bigr[R_{s}^a + \gamma \sum_{s^\prime \in S} P_{ss^\prime}^a V^*(s^\prime) \Bigr] $$ * $Q^*$ written in $Q^*$: $$ Q^*(s, a)=R_{s}^a + \gamma \sum_{s^\prime \in S} P_{ss^\prime}^a (\max_a Q^*(s^\prime, a)) $$ ::: Bellman Optimality Equations 因為需要取最大值,無法像 Bellman Expectation Equations 一樣使用線性代數解得答案。以下為兩種 iterative 的求解方法: 1. Value Iteration 2. Policy Iteration #### Value Iteration 我們知道 Bellman Optimality Equation: $$ V^*(s)=\max_a \Bigr[R_{s}^a + \gamma \sum_{s^\prime \in S} P_{ss^\prime}^a V^*(s^\prime) \Bigr] $$ 因此我們可以透過 Bellman Optimality Equation 定義 Bellman Optimality Operator: :::warning **Bellman Optimality Operator** $$ T^*(V)= \max_{a \in A} (R^a + \gamma P^a V) $$ ::: 利用 Bellman Optimality Operator 進行 Value Iteration: :::info **Value Iteration** Initialize $k=0$ and set $V_0(s)=0$ for all $s$ Repeat until convergence: $V_{k+1} \leftarrow T^*(V_k)$ ::: 計算 $V_{k+1}=T^*(V_k)$ 等同於計算: for each $s$ &emsp; $V_{k+1}(s)=\max_a \Bigr[R_{s}^a + \gamma \sum_{s^\prime \in S} P_{ss^\prime}^a V_k(s^\prime) \Bigr]= \max_a Q_k(s, a)$ $\sum_{s^\prime \in S} P_{ss^\prime}^a V^*(s^\prime)$ 需花 $O(|S|)$,找所有 action 中的 max 花 $O(|A|)$,而每步驟要對所有 states 計算花 $O(|S|)$,故 value iteration 的時間複雜度為 $O(|S|^2|A|)$。 值得注意的是,在未收斂之前,我們在迭代的過程中計算出的 intermediate value function $V_k$ 可能不能對應到任何特定的 policy。 當我們找到 $V^*$ 之後,便可以透過 greedy policy 來找到最佳 policy。對於所有 $s$,選擇一個 action $a$,使得 $Q^*(s, a) = r + \gamma V^*(s')$ 最大化,即 $$ \pi^*(s) = \arg\max_a Q^*(s, a) = \arg\max_a \left( R(s, a) + \gamma \sum{s'} P(s'|s, a) V^*(s')\right) $$ 此 policy 即為 optimal policy。 --- Value Iteration 並不是很好理解,我們用馬力歐找寶藏的遊戲作為例子來解釋 Value Iteration 整個過程: ![](https://i.imgur.com/ENiFDM5.png =300x) * States: 馬力歐的位置,在九宮格上共 9 種可能 * Actions: 上下左右,若有寶箱可拿寶箱 * Reward: * 馬力歐拿到寶箱,$R^a_s=0$ * 馬力歐朝任一方向移動,$R^a_s=-1$ * 假設 $\gamma=1$,且除非撞到牆壁,否則 $P^a_{ss^\prime}=1$ 假設座標由右至左、由上至下標註,寶箱在 (2, 3)。每一次迭代中,對於每個 state,我們都根據 Bellman optimality equation 計算它們的下一步 value,即計算 $$ V_{k+1}(s)=\max_a \Bigr[R_{s}^a + \gamma \sum_{s^\prime \in S} P_{ss^\prime}^a V_k^*(s^\prime) \Bigr]= \max_a Q_k(s, a) $$ 迭代過程如下: ![](https://i.imgur.com/C5KvIkI.png) * 初始化:所有 states 的 $V_0(S)=0$ * 第一次 iteration * $V_1((2, 3))=0+0=0$ * 其他 states $V_1(s)$ 之值皆為 $-1$,例如:$V_1((1, 1))=-1+0=-1$ * 第二次 iteration * $V_2((2, 3))=0+0=0$ * 對於 $(1, 1)$ ,往下或往右可以有最大的 value,故 $V_2((1, 1))=(-1)+(-1)=-2$,$(1, 2), (2, 1), (3, 1), (3, 2)$ 同理 * 對於 $(1, 3)$ ,往右可以有最大的 value,故 $V_2((1, 3))=(-1)+0=-1$,其他位於寶箱附近的 states 同理 * 第三次 iteration * $V_3((2, 3))=0+0=0$ * 對於 $(1, 1)$ ,往下或往右可以有最大的 value,故 $V_3((1, 1))=(-1)+(-2)=-3$,其他距離寶箱 3 步的 states 同理 * 對於 $(1, 2)$ ,往下或往右可以有最大的 value,故 $V_3((1, 2))=(-1)+(-1)=-2$,其他距離寶箱 2 步的 states 同理 * 對於 $(1, 3)$ ,往右可以有最大的 value,故 $V_3((1, 3))=(-1)+0=-1$,其他位於寶箱附近的 states 同理 * 第四次 iteration 所有 $V(s)$ 皆沒有變化,故已收斂 > 參考資料: https://zhuanlan.zhihu.com/p/33229439 ___ 能夠用 Value Iteration 來找最佳 policy 的原因是 Value Iteration 收斂的性質: :::info **Value iteration converges on $V^*$** For any initial $V_0$, value iteration achieves that $V_k \rightarrow V^*$ (in $L_\infty$-norm) ::: 此性質可透過以下三步來證明: **1. $T^*$ is a contraction operator.** *proof:* $$ \begin{aligned} ||T^*(V)-T^*(\hat{V})||_\infty &= \max_s |T^*(V(s))-T^*(\hat{V(s)})| \\ &= \max_s |\max_a(R^a_s + \gamma \sum_{s^\prime} P^a_{ss^\prime} V(S^\prime)) - \max_{a^\prime}(R^{a^\prime}_s + \gamma \sum_{s^\prime} P^{a^\prime}_{ss^\prime} V(S^\prime))| \\ &\leq \max_s\max_a|(R^a_s + \gamma \sum_{s^\prime} P^a_{ss^\prime} V(S^\prime)) - (R^{a}_s + \gamma \sum_{s^\prime} P^{a}_{ss^\prime} V(S^\prime))| \\ &\leq \max_s\max_a|\gamma \sum_{s^\prime} P^a_{ss^\prime} (V(s^\prime) - \hat{V}(s^\prime))| \\ &\leq \gamma ||(V-\hat{V})||_\infty \end{aligned} $$ => $T^*$ is a $\gamma$-contraction operator. **2. Under a contraction operator, ${V_k}$ shall converge to the unique fixed point.** *proof:* Since $T^*$ is a $\gamma$-contraction operator in a complete metric space, by Banach Fixed Point Theorem, $T^*$ converges to the unique fixed point. **3. Since $V^*$ is a fixed point, $V_k \rightarrow V^*$ due to uniqueness.** > 若 $\gamma$ 很接近 $1$,則 value iteration 可能會迭代無限次而不收斂。 #### Policy Iteration Policy iteration 的目的是透過交替進行 policy evaluation 和 policy improvement,逐步優化 policy,直到找到最佳 policy。 ![](https://i.imgur.com/NoiJDQr.png =500x) 1. Policy evaluation: 固定 policy,計算出對應的 $V^\pi(s)$,可使用任意 policy evaluation 的演算法 2. Policy improvement: 找出 policy $\pi^\prime \geq \pi$,可使用任意 policy improvement 的演算法 :::info **Policy Iteration** (註:這裡針對 deterministic policy) Initialize $k=0$ and set $\pi_0(s)$ arbitrarily for all $s$ While $k=0$ or $\pi_k \neq \pi_{k-1}$: &emsp;Derive $V^{\pi_k}$ via **policy evaluation** for $\pi_k$ &emsp;Derive $\pi_{k+1}$ by greedy **one-step policy improvement** ::: :::info **One-Step Policy Improvement** Given $V^{\pi_k}$, compute $Q^{\pi_k}(s, a)$: $$ Q^{\pi_k}(s, a)=R(s, a)+\gamma \sum_s P(s^\prime|s, a) V^{\pi_k}(s^\prime) $$ Derive the new policy $\pi_{k+1}$ for all states $s$: $$ \pi_{k+1} (s) = \arg\max_a Q^{\pi_k}(s,a) $$ ::: 為何 One-step policy improvement 合理? 假設我們採取 $\pi_{k+1}$ 只進行一步動作,之後的動作維持遵循 $\pi_k$,會比從頭到尾都遵循 $\pi_k$ 來得更好。這是因為 $\pi_{k+1}$ 選擇的 action 是使得 $Q^{\pi_k}(s,a)$ 最大的 action,因此 $$ \begin{aligned} V^{\pi_{k+1}}(s) &= Q^{\pi_k}(s, \pi_{k+1}(s))\\ &= \max_a Q^{\pi_k}(s,a) \\ &\geq R(s, \pi_k(s))+\gamma \sum_s P(s^\prime|s,a) V^{\pi_k}(s^\prime) \\ &= V^{\pi_k}(s) \end{aligned} $$ 若從頭到尾都採取 $\pi_{k+1}$,也會比遵循 $\pi_k$ 好: :::info **Theorem (Monotonic Policy Improvement)** Under the one-step policy improvement step, we have $V^{\pi_{k+1}}(s) \geq V^{\pi_k}(s), \forall s$, and hence $\pi_{k+1} \geq \pi_k$ ::: *proof:* $$ \begin{aligned} V^{\pi_k}(s) &\leq \max_a Q^{\pi_k}(s, a)\\ &= \max_a R(s,a) + \gamma \sum_{s^\prime \in S} P(s^\prime|s,a)V^{\pi_k}(s^\prime) \\ &= R(s, \pi_{k+1}(s))+\gamma \sum_{s^\prime \in S} P(s^\prime|s,\pi_{k+1}(s))V^{\pi_k}(s^\prime) \\ &\leq R(s, \pi_{k+1}(s))+\gamma \sum_{s^\prime \in S} P(s^\prime|s,\pi_{k+1}(s))\max_{a^\prime} Q^{\pi_k}(s^\prime, a^\prime) \\ &= R(s, \pi_{k+1}(s))+\gamma \sum_{s^\prime \in S} P(s^\prime|s,\pi_{k+1}(s))\\ &\ \ \ \ \times(R(s^\prime,\pi_{k+1}(s)) + \gamma \sum_{s^{\prime\prime} \in S} P(s^{\prime\prime}|s^\prime,\pi_{k+1}(s^\prime))V^{\pi_k}(s^{\prime\prime})) \\ &... \\ &= V^{\pi_{k+1}}(s) \end{aligned} $$ 這個證明的想法可以用 “Peeling off" 的概念來想像: ![](https://i.imgur.com/9BzdjVu.jpg) 已知由 $s$ 遵循 $\pi_{k+1}$ 到 $s^\prime$ 後再遵循 $\pi_k$ 比從頭到尾遵循 $\pi_k$ 來得好,則同理,由 $s^\prime$ 遵循 $\pi_{k+1}$ 到 $s^{\prime\prime}$ 後再遵循 $\pi_k$ 也會比從 $s^\prime$ 到尾都遵循 $\pi_k$ 來得好,以此類推。 ___ 若我們得到 $\pi_{k+1}=\pi_k$,則代表 policy iteration 收斂了,且也表示 Bellman optimality equation 被 $\pi_k$ 滿足: $$ V^{\pi_k}(s) = \max_a Q^{\pi_k}(s,a) $$ 此時 $\pi_k$ 必為 optimal policy。 :::danger Policy Iteration 和 Value Iteration 共同證明了 optimal policy 存在 ::: 值得一提的是,在只討論 deterministic policy 的情況下,policy iteration 會在有限次數的迭代內停下來。這是因為 deterministic policy 的數量為 $|A|^{|S|}$,因此我們最多迭代 $|A|^{|S|}$ 次一定會停下。 ### Quick Summary | Problem | Bellman Equation | Algorithm | | -------- | -------- | -------- | | Prediction | Bellman Expectation Equation | IPE | | Control | Bellman Expectation Equation <br> + Greedy Policy Improvement| Policy Iteration | | Control | Bellman Optimality Equation | Value Iteration | ## Next Reinforcement Learning | Policy Gradient https://hackmd.io/@tsen159/RLNote2

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password
    or
    Sign in via Google Sign in via Facebook Sign in via X(Twitter) Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    By signing in, you agree to our terms of service.

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully