# Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning (2)

###### tags: `論文翻譯`

[原文連結](https://www.sciencedirect.com/science/article/pii/S0004370299000521)

## 4. Interrupting options

:::info
SMDP methods apply to options, but only when they are treated as opaque indivisible units. More interesting and potentially more powerful methods are possible by looking inside options or by altering their internal structure, as we do in the rest of this paper. In this section we take a first step in altering options to make them more useful. This is the area where working simultaneously in terms of MDPs and SMDPs is most relevant. We can analyze options in terms of the SMDP and then use their MDP interpretation to change them and produce a new SMDP.
:::

:::success
SMDP方法適用於options,不過也只有在它們被視為[不透明](http://terms.naer.edu.tw/detail/838833/)的[不可分割](http://terms.naer.edu.tw/detail/2908229/)單元的時候才適用。正如我們在這篇論文的其餘部份打算做的那樣,透過查看options的內部或是改變它們的內部結構,我們可以實現更有趣而且可能更強大的方法。在這一節,我們要往改變options以使其更為有用的這個方向邁出第一步。這是同時以MDPs與SMDPs兩種角度來工作最為相關的領域。我們可以根據SMDP來分析options,然後從它們的MDP的角度來切入改變,以產生新的SMDP。
:::

:::info
In particular, in this section we consider interrupting options before they would terminate naturally according to their termination conditions. Note that treating options as indivisible units, as SMDP methods do, is limiting in an unnecessary way. Once an option has been selected, such methods require that its policy be followed until the option terminates. Suppose we have determined the option-value function $Q^\mu(s, o)$ for some policy $\mu$ and for all state-option pairs $s, o$ that could be encountered while following $\mu$. This function tells us how well we do while following $\mu$, committing irrevocably to each option, but it can also be used to re-evaluate our commitment on each step. Suppose at time $t$ we are in the midst of executing option $o$. If $o$ is Markov in $s_t$, then we can compare the value of continuing with $o$, which is $Q^\mu(s_t, o)$, to the value of interrupting $o$ and selecting a new option according to $\mu$, which is $V^\mu(s_t) = \sum_q \mu(s_t, q) Q^\mu(s_t, q)$. If the latter is more highly valued, then why not interrupt $o$ and allow the switch? If these were simple actions, the classical policy improvement theorem [27] would assure us that the new way of behaving is indeed better. Here we prove the generalization to semi-Markov options. The first empirical demonstration of this effect—improved performance by interrupting a temporally extended substep based on a value function found by planning at a higher level—may have been by Kaelbling [31]. Here we formally prove the improvement in a more general setting.
:::

:::success
特別是,我們在這一節要考慮的是,在options根據它們的終止條件自然終止之前中斷(interrupting)options。注意到,就像是SMDP方法做的那樣,把options視為不可分割單元是以一種不必要的方式來限制。一旦選擇了一個option,這類的方法就會要求依循其策略(policy),一直到option終止。假設我們已經為某個policy $\mu$以及依循policy $\mu$時所有可能遇到的state-option pairs $s, o$確定了option-value function $Q^\mu(s, o)$。這個function會告訴我們,當我們依著policy $\mu$、正式地致力於每個option的時候,我們做的有多好,不過它也可以拿來在每一個step重新評估我們的承諾。假設在時間$t$我們正處於執行option $o$的過程中。如果$o$在$s_t$中是Markov,那麼我們就可以比較繼續執行option $o$的value,也就是$Q^\mu(s_t, o)$,與中斷option $o$並根據$\mu$選擇一個新的option的value,也就是$V^\mu(s_t) = \sum_q \mu(s_t, q) Q^\mu(s_t, q)$。如果後者的value比較高,那我們有什麼理由不中斷$o$,然後允許切換呢?如果這些就只是簡單的actions,那經典的策略改進定理[27]就可以向我們保證說,新的行為方式確實會比較好。這邊我們證明其對semi-Markov options的推廣。這種效果的第一個經驗證明(透過中斷基於更高層級(higher level)的planning所找到的value function的時間延伸(temporally extended)子步驟(substep)來提高效能)可能來自於Kaelbling[31]。這邊我們會在更一般的設置中正式地證明這個改進。
:::

:::warning
譯者註:committing irrevocably to each option;commit to:致力於?;而irrevocably有著不可逆的意思,不過看了一下[知乎](https://zhuanlan.zhihu.com/p/61299899)有人用著『正式』來解釋,看著看著覺得有道理,就採用了。
:::

:::info
In the following theorem we characterize the new way of behaving as following a policy $\mu'$ that is the same as the original policy, $\mu$, but over a new set of options; $\mu'(s, o) = \mu(s, o)$, for all $s \in \mathcal{S}$. Each new option $o'$ is the same as the corresponding old option $o$ except that it terminates whenever switching seems better than continuing according to $Q^\mu$. In other words, the termination condition $\beta'$ of $o'$ is the same as that of $o$ except that $\beta'(s) = 1 \text{ if } Q^\mu(s, o) < V^\mu(s)$. We call such a $\mu'$ an interrupted policy of $\mu$. The theorem is slightly more general in that it does not require interruption at each state in which it could be done. This weakens the requirement that $Q^\mu(s,o)$ be completely known. A more important generalization is that the theorem applies to semi-Markov options rather than just Markov options. This generalization may make the result less intuitively accessible on first reading. Fortunately, the result can be read as restricted to the Markov case simply by replacing every occurrence of “history” with “state” and set of histories, $\Omega$, with set of states, $\mathcal{S}$.
:::

:::success
下面的定理中,我們會把新的行為(behaving)描述成依循著policy $\mu'$,這是一個與原始policy $\mu$一樣的policy,不過是用於一個新的options的集合上;對於所有的$s \in \mathcal{S}$,$\mu'(s, o) = \mu(s, o)$。每一個新的option $o'$都與相對應的舊的option $o$一樣,只是只要根據$Q^\mu$來看切換似乎比繼續執行更好,它就會終止。換句話說,$o'$的終止條件$\beta'$跟$o$是一樣的,差別在於$\beta'(s) = 1 \text{ if } Q^\mu(s, o) < V^\mu(s)$。我們把這樣的$\mu'$稱為$\mu$的interrupted policy(中斷策略)。這個定理稍微更為一般化,因為它並不要求在每個可以中斷的狀態(state)下都進行中斷。這弱化了$Q^\mu(s,o)$必須要是完全已知的要求。一個更重要的泛化是,這個定理也適用於semi-Markov options,而非單純適用於Markov options。這種泛化可能會讓你在第一次閱讀的時候不是那麼直觀瞭解。幸運的是,只需要把每次出現的"history"替換成"state",然後把histories的集合$\Omega$替換成states的集合$\mathcal{S}$,就可以把結果解讀為僅限於馬可夫情形。
:::

:::info
**Theorem 2** (Interruption). For any MDP, any set of options $\mathcal{O}$, and any Markov policy $\mu: \mathcal{S} \times \mathcal{O} \to [0, 1]$, define a new set of options, $\mathcal{O}'$, with a one-to-one mapping between the two option sets as follows: for every $o = \langle \mathcal{I}, \pi, \beta \rangle \in \mathcal{O}$ we define a corresponding $o' = \langle \mathcal{I}, \pi, \beta' \rangle \in \mathcal{O}'$, where $\beta = \beta'$ except that for any history $h$ that ends in state $s$ and in which $Q^\mu(h, o) < V^\mu(s)$, we may choose to set $\beta'(h)=1$. Any histories whose termination conditions are changed in this way are called interrupted histories.
Let the interrupted policy $\mu'$ be such that for all $s \in \mathcal{S}$, and for all $o' \in \mathcal{O}'$, $\mu'(s, o') = \mu(s, o)$, where $o$ is the option in $\mathcal{O}$ corresponding to $o'$. Then
* (i) $V^{\mu'}(s) \geq V^\mu(s) \text{ for all } s \in \mathcal{S}$
* (ii) If from state $s \in \mathcal{S}$ there is a non-zero probability of encountering an interrupted history upon initiating $\mu'$ in $s$, then $V^{\mu'}(s) > V^\mu(s)$.
:::

:::success
**Theorem 2** (Interruption). 對於任意的MDP、options的任意集合$\mathcal{O}$、任意的Markov policy $\mu: \mathcal{S} \times \mathcal{O} \to [0, 1]$,以下面一對一對應的方式在兩個option集合之間定義一個新的options的集合$\mathcal{O}'$:對每一個$o = \langle \mathcal{I}, \pi, \beta \rangle \in \mathcal{O}$,我們定義一個相對應的$o' = \langle \mathcal{I}, \pi, \beta' \rangle \in \mathcal{O}'$,其中$\beta = \beta'$,差別在於對任意結束於state $s$且$Q^\mu(h, o) < V^\mu(s)$的history $h$,我們可以選擇設定$\beta'(h)=1$。任何以這種方式改變其終止條件的histories皆稱為interrupted histories。假設interrupted policy $\mu'$能夠滿足$\mu'(s, o') = \mu(s, o)$,$\forall s \in \mathcal{S}$、$\forall o' \in \mathcal{O}'$,其中$o$是$\mathcal{O}$裡面對應於$o'$的option。那麼
* (i) $V^{\mu'}(s) \geq V^\mu(s) \text{ for all } s \in \mathcal{S}$
* (ii) 如果從state $s \in \mathcal{S}$開始,在state $s$中啟動$\mu'$之後有一個非零的機率會遇到interrupted history,那麼$V^{\mu'}(s) > V^\mu(s)$
:::

:::warning
* upon的用法參考[英文庫](https://english.cool/upon/)
:::

:::info
**Proof.** Shortly we show that, for an arbitrary start state $s$, executing the option given by the interrupted policy $\mu'$ and then following policy $\mu$ thereafter is no worse than always following policy $\mu$. In other words, we show that the following inequality holds:
$$
\begin{align}
& \sum_{o'}\mu'(s,o') \Big[r^{o'}_s + \sum_{s'} p^{o'}_{ss'} V^\mu(s') \Big] \\
& \geq V^\mu(s) = \sum_o \mu(s, o) \Big[r^o_s + \sum_{s'} p^o_{ss'}V^\mu(s') \Big] \tag{14}
\end{align}
$$
If this is true, then we can use it to expand the left-hand side, repeatedly replacing every occurrence of $V^\mu(x)$ on the left by the corresponding $\sum_{o'}\mu'(x, o')[r^{o'}_x + \sum_{x'}p^{o'}_{xx'}V^\mu(x')]$. In the limit, the left-hand side becomes $V^{\mu'}$, proving that $V^{\mu'} \geq V^\mu$.
:::

:::success
**Proof.** 很快的我們就會證明,對於任意的起始狀態(start state)$s$,執行interrupted policy $\mu'$所給定的option,然後在那之後依著policy $\mu$,這樣的方式並不會比總是依著policy $\mu$還要來的差。換句話說,我們證明下面不等式成立:
$$
\begin{align}
& \sum_{o'}\mu'(s,o') \Big[r^{o'}_s + \sum_{s'} p^{o'}_{ss'} V^\mu(s') \Big] \\
& \geq V^\mu(s) = \sum_o \mu(s, o) \Big[r^o_s + \sum_{s'} p^o_{ss'}V^\mu(s') \Big] \tag{14}
\end{align}
$$
如果這成立,那我們就可以用這不等式來擴展左邊的部份,重複地用相對應的$\sum_{o'}\mu'(x, o')[r^{o'}_x + \sum_{x'}p^{o'}_{xx'}V^\mu(x')]$來取代左側每次出現的$V^\mu(x)$。在極限情況下,左邊的部份就會變成$V^{\mu'}$,這就證明了$V^{\mu'} \geq V^\mu$。
:::

:::info
To prove the inequality in (14), we note that for all $s$, $\mu'(s, o') = \mu(s, o)$, and show that
$$
\begin{align}
r^{o'}_s + \sum_{s'} p^{o'}_{ss'}V^\mu(s') &= E \left\{ r + \gamma^kV^\mu(s') \vert \mathcal{E}(o', s), h\notin \Gamma \right\} \\
&+ E \left\{ r + \gamma^kV^\mu(s') \vert \mathcal{E}(o', s), h\in \Gamma \right\}
\end{align}
$$
where $s', r$, and $k$ are the next state, cumulative reward, and number of elapsed steps following option $o$ from $s$, where $h$ is the history from $s$ to $s'$, and where $\Gamma$ denotes the set of interrupted histories. Trajectories that end because of encountering a history not in $\Gamma$ never encounter a history in $\Gamma$, and therefore also occur with the same probability and expected reward upon executing option $o$ in state $s$.
Therefore, if we continue the trajectories that end because of encountering a history in $\Gamma$ with option $o$ until termination and thereafter follow policy $\mu$, we get
$$
\begin{align}
& E\left\{r + \gamma^k V^\mu(s') \vert \mathcal{E}(o', s), h \notin \Gamma \right\} \\
&+ E \left\{\beta(s')[r + \gamma^k V^\mu(s')] + (1 - \beta(s'))[r + \gamma^k Q^\mu(h, o)] \vert \mathcal{E}(o', s), h\in\Gamma \right\} \\
&=r^o_s + \sum_{s'} p^o_{ss'}V^\mu(s')
\end{align}
$$
because option $o$ is semi-Markov. This proves (14) because for all $h \in \Gamma$,
$$
Q^\mu(h, o) \leq V^\mu(s') \tag{15}
$$
Note that strict inequality holds in (15) if $Q^\mu(h, o) < V^\mu(s')$ for at least one history $h \in \Gamma$ that ends a trajectory generated by $o'$ with non-zero probability.
:::

:::success
為了證明公式14中的不等式,我們注意到,對所有的$s$,$\mu'(s, o') = \mu(s, o)$,並證明:
$$
\begin{align}
r^{o'}_s + \sum_{s'} p^{o'}_{ss'}V^\mu(s') &= E \left\{ r + \gamma^kV^\mu(s') \vert \mathcal{E}(o', s), h\notin \Gamma \right\} \\
&+ E \left\{ r + \gamma^kV^\mu(s') \vert \mathcal{E}(o', s), h\in \Gamma \right\}
\end{align}
$$
其中$s'$、$r$與$k$為依循option $o$從$s$開始所得到的next state、cumulative reward與經過的步數,$h$為從$s$到$s'$的history,而$\Gamma$表示interrupted histories的集合。因為遇到不在$\Gamma$內的history而結束的trajectories永遠都不會遇到在$\Gamma$內的history,也因此,在state $s$內執行option $o$的時候,它們也會以相同的機率與期望報酬(expected reward)發生。因此,如果我們讓那些因為遇到$\Gamma$內的history而結束的trajectories繼續依著option $o$執行直到其終止,並在那之後依循policy $\mu$,那我們就會得到:
$$
\begin{align}
& E\left\{r + \gamma^k V^\mu(s') \vert \mathcal{E}(o', s), h \notin \Gamma \right\} \\
&+ E \left\{\beta(s')[r + \gamma^k V^\mu(s')] + (1 - \beta(s'))[r + \gamma^k Q^\mu(h, o)] \vert \mathcal{E}(o', s), h\in\Gamma \right\} \\
&=r^o_s + \sum_{s'} p^o_{ss'}V^\mu(s')
\end{align}
$$
因為option $o$為semi-Markov。這證明了公式14,因為對於所有的$h \in \Gamma$,
$$
Q^\mu(h, o) \leq V^\mu(s') \tag{15}
$$
注意到,如果對於至少一個以非零機率結束由$o'$所生成的trajectory的history $h \in \Gamma$,其$Q^\mu(h, o) < V^\mu(s')$,則[嚴格的不等式](http://terms.naer.edu.tw/detail/958156/)在公式15中成立。
:::

:::info
As one application of this result, consider the case in which $\mu$ is an optimal policy for some given set of Markov options $\mathcal{O}$. We have already discussed how we can, by planning or learning, determine the optimal value functions $V^*_\mathcal{O}$ and $Q^*_\mathcal{O}$ and from them the optimal policy $\mu^*_\mathcal{O}$ that achieves them. This is indeed the best that can be done without changing $\mathcal{O}$, that is, in the SMDP defined by $\mathcal{O}$, but less than the best possible achievable in the MDP, which is $V^*=V^*_\mathcal{A}$. But of course we typically do not wish to work directly with the (primitive) actions $\mathcal{A}$ because of the computational expense. The interruption theorem gives us a way of improving over $\mu^*_\mathcal{O}$ with little additional computation by stepping outside $\mathcal{O}$. That is, at each step we interrupt the current option and switch to any new option that is valued more highly according to $Q^*_\mathcal{O}$. Checking for such options can typically be done at vastly less expense per time step than is involved in the combinatorial process of computing $Q^*_\mathcal{O}$. In this sense, interruption gives us a nearly free improvement over any SMDP planning or learning method that computes $Q^*_\mathcal{O}$ as an intermediate step.
:::
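:::warning
As a rough illustration of the interruption idea above, the sketch below executes a policy over options but re-evaluates the commitment to the current option on every step, switching whenever continuing is valued less than re-selecting according to the policy (the condition $Q(s, o) < V(s)$ of Theorem 2). This is only a minimal sketch, not the paper's code: the `env.step`/`env.is_terminal` interface and the `option.policy`/`option.beta` attributes are assumptions made here for illustration.

```python
import random

def execute_with_interruption(env, mu, Q, options, s, gamma=1.0):
    """Follow mu over options, but interrupt the current option whenever
    switching is valued more highly than continuing: beta'(s) = 1 if
    Q(s, o) < V(s), where V(s) = sum_q mu(s, q) Q(s, q)."""
    def V(state):
        return sum(mu(state, q) * Q[state][q] for q in options)

    def sample_option(state):
        # draw an option according to the distribution mu(state, .)
        weights = [mu(state, q) for q in options]
        return random.choices(options, weights=weights, k=1)[0]

    ret, discount = 0.0, 1.0
    o = sample_option(s)
    while not env.is_terminal(s):
        a = o.policy(s)          # the option's internal (Markov) policy
        s, r = env.step(s, a)    # assumed to return (next state, reward)
        ret += discount * r
        discount *= gamma
        # natural termination of o, or interruption because switching looks better
        if random.random() < o.beta(s) or Q[s][o] < V(s):
            o = sample_option(s)
    return ret
```
:::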
:::success
做為這個結果的一個應用,我們考慮一種情況,那就是$\mu$是某個給定的Markov options集合$\mathcal{O}$之下的optimal policy。我們已經討論過可以如何透過planning或learning來確定optimal value functions $V^*_\mathcal{O}$與$Q^*_\mathcal{O}$,並從中得到能夠達成它們的optimal policy $\mu^*_\mathcal{O}$。這確實是在不改變$\mathcal{O}$的情況下(也就是在由$\mathcal{O}$所定義的SMDP中)能做到的最好結果,只不過仍低於MDP中可能達到的最佳,也就是$V^*=V^*_\mathcal{A}$。想當然,因為計算成本的問題,我們通常不希望直接使用(primitive)actions $\mathcal{A}$。interruption theorem給了我們一個改進$\mu^*_\mathcal{O}$的方法,只需要跨出$\mathcal{O}$之外,再加上一點點額外的計算就行。也就是說,在每個step,我們會中斷(interrupt)當下的option,然後切換到任何一個根據$Q^*_\mathcal{O}$其value更高的option。跟計算$Q^*_\mathcal{O}$的組合過程相比,每個time step去檢查這類options的花費通常要少得多。某方面來說,對於任何將計算$Q^*_\mathcal{O}$做為中間步驟的SMDP planning或learning方法,interruption都給了我們幾乎免費的改進。
:::

:::info
In the extreme case, we might interrupt on every step and switch to the greedy option—the option in that state that is most highly valued according to $Q^*_\mathcal{O}$ (as in polling execution [16]). In this case, options are never followed for more than one step, and they might seem superfluous. However, the options still play a role in determining $Q^*_\mathcal{O}$, the basis on which the greedy switches are made, and recall that multi-step options may enable $Q^*_\mathcal{O}$ to be found much more quickly than $Q^*$ could (Section 3). Thus, even if multi-step options are never actually followed for more than one step they can still provide substantial advantages in computation and in our theoretical understanding.
:::

:::success
在極端案例中,我們可能會在每個step都中斷(interrupt),然後切換到greedy option,也就是在該狀態(state)中,根據$Q^*_\mathcal{O}$估出最好的那一個option(如polling execution [16])。這種情況下,options永遠都不會執行超過一個step,它們看起來像是多餘的。盡管如此,options在確定$Q^*_\mathcal{O}$中仍扮演它的角色(發揮它的作用),這是greedy switches的基礎,我們可以回想一下,比起找到$Q^*$,multi-step options可以讓你更快的找到$Q^*_\mathcal{O}$(Section 3)。因此,即使multi-step options從來沒有實際執行超過一個step,它仍然可以在計算與我們的理論理解方面提供實質的優勢。
:::

:::info
Fig. 7 shows a simple example. Here the task is to navigate from a start location to a goal location within a continuous two-dimensional state space. The actions are movements of 0.01 in any direction from the current state. Rather than work with these low-level actions, infinite in number, we introduce seven landmark locations in the space. For each landmark we define a controller that takes us to the landmark in a direct path (cf. [48]). Each controller is only applicable within a limited range of states, in this case within a certain distance of the corresponding landmark. Each controller then defines an option: the circular region around the controller’s landmark is the option’s initiation set, the controller itself is the policy, and arrival at the target landmark is the termination condition. We denote the set of seven landmark options by $\mathcal{O}$. Any action within 0.01 of the goal location transitions to the terminal state, the discount rate is $\gamma=1$, and the reward is −1 on all transitions, which makes this a minimum-time task.
:::

:::success
Fig. 7給出一個簡單的範例。這邊的任務是在一個連續的二維狀態空間中,從起始位置導航到目標位置。actions是從當下的狀態(current state)往任意方向移動0.01。與其使用這些數量無限的低階動作(low-level actions),我們在這個空間中引入七個[陸標](http://terms.naer.edu.tw/detail/264942/)位置。對每個陸標我們都會定義一個控制器,這控制器可以以直接路徑帶我們到陸標去(參見[48])。每個控制器都只在狀態(states)的有限範圍內可用,這種情況下就是在與相對應陸標的一定距離內。然後每個控制器都會定義一個option:控制器的陸標周圍的圓形區域就是option的啟動集合(initiation set),控制器本身就是policy,而到達目標陸標就是終止條件。我們用$\mathcal{O}$來表示七個陸標選項(landmark options)的集合。在目標位置0.01範圍內的任何action都會轉移至終止狀態(terminal state),discount rate為$\gamma=1$,所有的轉移得到的reward皆為-1,所以這會是一個最短時間的任務。
:::

:::info
![](https://hackmd.io/_uploads/ryafFP0TY.png)
Fig. 7.
Using interruption to improve navigation with landmark-directed controllers. The task (top) is to navigate from S to G in minimum time using options based on controllers that run each to one of seven landmarks (the black dots). The circles show the region around each landmark within which the controllers operate. The thin line shows the SMDP solution, the optimal behavior that uses only these controllers without interrupting them, and the thick line shows the corresponding solution with interruption, which cuts the corners. The lower two panels show the state-value functions for the SMDP and interrupted solutions.

Fig. 7. 使用interruption來改進具陸標導向控制器(landmark-directed controllers)的導航。任務(上圖)是使用基於控制器的options在最短時間內從S導航至G,每個控制器都會行駛到七個陸標的其中一個(圖上的黑點)。圓圈說明的是每個陸標周圍控制器所能運作的區域。細線的部份是SMDP的解,也就是只使用這些控制器而不中斷(interrupting)它們的最佳行為,粗線的部份則是相對應之採取中斷(interruption)的解,它會截彎取直(cuts the corners)。下面兩個面板則呈現SMDP與採取中斷這兩種解的state-value functions。
:::

:::info
One of the landmarks coincides with the goal, so it is possible to reach the goal while picking only from $\mathcal{O}$. The optimal policy within $\mathcal{O}$ runs from landmark to landmark, as shown by the thin line in the upper panel of Fig. 7. This is the optimal solution to the SMDP defined by $\mathcal{O}$ and is indeed the best that one can do while picking only from these options. But of course one can do better if the options are not followed all the way to each landmark. The trajectory shown by the thick line in Fig. 7 cuts the corners and is shorter. This is the interrupted policy with respect to the SMDP-optimal policy. The interrupted policy takes 474 steps from start to goal which, while not as good as the optimal policy in primitive actions (425 steps), is much better, for nominal additional cost, than the SMDP-optimal policy, which takes 600 steps. The state-value functions, $V^\mu = V^*_\mathcal{O}$ and $V^{\mu'}$, for the two policies are shown in the lower part of Fig. 7. Note how the values for the interrupted policy are everywhere greater than the values of the original policy. A related but larger application of the interruption idea to mission planning for uninhabited air vehicles is given in [75].
:::

:::success
其中一個陸標與目標是相吻合的,因此即使只從$\mathcal{O}$中選擇也有可能到達目標。$\mathcal{O}$內的optimal policy會從一個陸標行駛到另一個陸標,如Fig. 7.上圖細線所示。這是由$\mathcal{O}$所定義的SMDP的最佳解,也確實是只從這些options中選擇時所能做到的最好結果。不過,如果options不必一路走到每個陸標,當然可以做得更好。Fig. 7.中粗線所示的軌跡截彎取直,而且路徑較短。這就是相對於SMDP-optimal policy的interrupted policy。interrupted policy從開始到目標走了474個steps,雖然不如primitive actions的optimal policy(425個steps),不過就名目上的額外成本而言,還是比需要600個steps的SMDP-optimal policy好太多了。兩個policies的state-value functions,$V^\mu = V^*_\mathcal{O}$與$V^{\mu'}$,如Fig. 7.下部所示。注意到,interrupted policy的values處處都大於original policy的values。參考[75]給出一個相關但規模更大的應用,將interruption的概念用於無人飛行載具(uninhabited air vehicles)的任務規劃。
:::

:::info
Fig. 8 shows results for an example using controllers/options with dynamics. The task here is to move a mass along one dimension from rest at position 0 to rest at position 2, again in minimum time. There is no option that takes the system all the way from 0 to 2, but we do have an option that takes it from 0 to 1 and another option that takes it from any position greater than 0.5 to 2. Both options control the system precisely to its target position and to zero velocity, terminating only when both of these are correct to within $\epsilon=0.0001$. Using just these options, the best that can be done is to first move precisely to rest at 1, using the first option, then re-accelerate and move to 2 using the second option. This SMDP-optimal solution is much slower than the corresponding interrupted solution, as shown in Fig. 8.
Because of the need to slow down to near-zero velocity at 1, it takes over 200 time steps, whereas the interrupted solution takes only 121 steps.
:::

:::success
Fig. 8.說明了使用具動態特性的控制器/options的範例結果。這個任務是將一個質量(mass)沿著一個維度從靜止的位置0移動到靜止的位置2,一樣的,目標是在最短時間內完成。這次沒有option可以直接把系統從0帶到2,不過我們有一個option可以從0到1,另一個option則是從大於0.5的任何位置到2。兩個options都能將系統精確地控制到其目標位置與[零速度](http://terms.naer.edu.tw/detail/548872/),只有當位置與速度兩者都正確到$\epsilon=0.0001$以內才會終止。僅僅使用這些options,能做到最好的就是,首先使用第一個option精確地移動到1並靜止,然後使用第二個option重新加速移動到2。如Fig. 8.所示,SMDP-optimal solution比起相對應的interrupted solution慢了許多。這是因為在1這個位置必須要減到接近零的速度,所以需要超過200個time steps,而interrupted solution就只需要121個steps。
:::

:::info
![](https://hackmd.io/_uploads/S149TO06Y.png)
Fig. 8. Phase-space plot of the SMDP and interrupted policies in a simple dynamical task. The system is a mass moving in one dimension: $x_{t+1} = x_t + \dot{x}_{t+1}, \dot{x}_{t+1} = \dot{x}_t + a_t - 0.175\dot{x}_t$ where $x_t$ is the position, $\dot{x}_t$ the velocity, 0.175 a coefficient of friction, and the action $a_t$ an applied force. Two controllers are provided as options, one that drives the position to zero velocity at $x^*=1$ and the other to $x^*=2$. Whichever option is being followed at time $t$, its target position $x^*$ determines the action taken, according to $a_t = 0.01(x^* - x_t)$.

Fig. 8. 在簡單動態任務中,SMDP與interrupted policies的[相空間](http://terms.naer.edu.tw/detail/261771/)圖。系統是一個在一維中運動的質量(mass):$x_{t+1} = x_t + \dot{x}_{t+1}, \dot{x}_{t+1} = \dot{x}_t + a_t - 0.175\dot{x}_t$,其中$x_t$是位置、$\dot{x}_t$是速度、0.175是摩擦係數,而action $a_t$為[作用力](http://terms.naer.edu.tw/detail/722961/)。兩個控制器做為options,一個是將位置驅動到$x^*=1$並達到零速度,另一個則驅動至$x^*=2$。無論在時間$t$依循的是哪一個option,其目標位置$x^*$都會根據$a_t = 0.01(x^* - x_t)$決定所採取的action。
:::

## 5. Intra-option model learning

:::info
In this section we introduce a new method for learning the model, $r^o_s$ and $p^o_{ss'}$, of an option $o$, given experience and knowledge of $o$ (i.e., of its $\mathcal{I}, \pi, \beta$). Our method requires that $\pi$ be deterministic and that the option be Markov. For a semi-Markov option, the only general approach is to execute the option to termination many times in each state $s$, recording in each case the resultant next state $s'$, cumulative discounted reward $r$, and elapsed time $k$. These outcomes are then averaged to approximate the expected values for $r^o_s$ and $p^o_{ss'}$ given by (6) and (7). For example, an incremental learning rule for this could update its model after each execution of $o$ by
$$
\widehat{r}^o_s = \widehat{r}^o_s + \alpha[r - \widehat{r}^o_s] \tag{16}
$$
and
$$
\widehat{p}^o_{sx} = \widehat{p}^o_{sx} + \alpha[\gamma^k \delta_{s'x} - \widehat{p}^o_{sx}] \tag{17}
$$
for all $x \in \mathcal{S}^+$, where $\delta_{s'x}=1$ if $s' = x$ and is 0 else, and where the step-size parameter, $\alpha$, may be constant or may depend on the state, option, and time. For example, if $\alpha$ is 1 divided by the number of times that $o$ has been experienced in $s$, then these updates maintain the estimates as sample averages of the experienced outcomes. However the averaging is done, we call these SMDP model-learning methods because, like SMDP value-learning methods, they are based on jumping from initiation to termination of each option, ignoring what happens along the way. In the special case in which $o$ is a primitive option, SMDP model-learning methods reduce to those used to learn conventional one-step models of actions.
:::
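:::warning
A small sketch of the SMDP model-learning updates (16) and (17) above, applied once per completed execution of an option. This is only an illustrative reading of the equations, assuming the estimates are stored in nested dictionaries `r_hat[s][o]` and `p_hat[s][o][x]` (names chosen here, not from the paper).

```python
def smdp_model_update(r_hat, p_hat, s, o, r, k, s_next, states, alpha=0.25, gamma=0.9):
    """After option o, initiated in s, terminates in s_next with cumulative
    discounted reward r after k steps, move the model estimates toward the
    observed outcome (Eqs. (16) and (17))."""
    r_hat[s][o] += alpha * (r - r_hat[s][o])                             # Eq. (16)
    for x in states:                                                     # x ranges over S^+
        delta = 1.0 if s_next == x else 0.0
        p_hat[s][o][x] += alpha * (gamma ** k * delta - p_hat[s][o][x])  # Eq. (17)
```

With `alpha` set to 1 over the number of times $o$ has been executed from $s$, the estimates become the sample averages mentioned in the text.
:::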
:::success
在這一節中,我們要介紹一個在給定option $o$的經驗與知識(即它的$\mathcal{I}, \pi, \beta$)之下,學習$o$的模型($r^o_s$與$p^o_{ss'}$)的新方法。這個方法的要求是$\pi$要是確定性的,而且option要是Markov。對於semi-Markov option,唯一通用的方法就是在每個state $s$多次地執行option直到終止,並在每次執行時記錄產生的下一個state $s'$、累積折扣報酬$r$,與經過的時間$k$。然後把這些結果拿來計算平均,以此近似由公式6、7所給出的$r^o_s$與$p^o_{ss'}$的期望值。舉例來說,可以用下面增量學習的規則在每次執行$o$之後更新其模型:
$$
\widehat{r}^o_s = \widehat{r}^o_s + \alpha[r - \widehat{r}^o_s] \tag{16}
$$
與
$$
\widehat{p}^o_{sx} = \widehat{p}^o_{sx} + \alpha[\gamma^k \delta_{s'x} - \widehat{p}^o_{sx}] \tag{17}
$$
這是針對所有的$x \in \mathcal{S}^+$,其中,當$s' = x$,則$\delta_{s'x}=1$,反之為0,然後,step-size這個參數,也就是$\alpha$,也許會是一個常數,或者取決於state、option、time。舉例來說,如果$\alpha$等於1除上$o$在$s$中經歷過的次數,那這些更新就會將估測值維持為經驗結果的樣本平均。不管怎麼平均,我們把這些都稱為SMDP model-learning methods,因為,就像SMDP value-learning methods一樣,它們是基於從每個option的啟動到終止的[跳動](http://terms.naer.edu.tw/detail/3484078/),忽略了過程中發生的事情。在option $o$是primitive option的特殊情況下,SMDP model-learning methods則會簡化為用於學習傳統的actions的one-step models的方法。
:::

:::info
One disadvantage of SMDP model-learning methods is that they improve the model of an option only when the option terminates. Because of this, they cannot be used for nonterminating options and can only be applied to one option at a time—the one option that is executing at that time. For Markov options, special temporal-difference methods can be used to learn usefully about the model of an option before the option terminates. We call these intra-option methods because they learn about an option from a fragment of experience “within” the option. Intra-option methods can even be used to learn about an option without ever executing it, as long as some selections are made that are consistent with the option. Intra-option methods are examples of off-policy learning methods [72] because they learn about the consequences of one policy while actually behaving according to another. Intra-option methods can be used to simultaneously learn models of many different options from the same experience. Intra-option methods were introduced in [71], but only for a prediction problem with a single unchanging policy, not for the full control case we consider here and in [74].
:::

:::success
SMDP model-learning methods的一個缺點就是,只有在option終止的時候才有辦法改進option的模型。因此,它們不能拿來用在nonterminating options(非終止的選項),而且一次只能用在一個option上,也就是當下正在執行的那一個option。對於Markov options,可以用特殊的temporal-difference methods(時差方法)在option終止之前有效地學習option的模型。這樣的方法我們稱之為intra-option methods,因為它們是從option內部("within")的經驗片段中來學習option。Intra-option methods甚至可以在沒有執行option的情況下瞭解option,只要所做的選擇與這個option是一致的就可以。Intra-option methods是一種off-policy learning methods[72],因為它們學習一個策略(policy)的後果,同時實際根據另一個策略(policy)行動。Intra-option methods可以從相同的經驗中同時學習許多不同options的模型。參考[71]中引入了intra-option methods,不過只是用在一個使用單一不變策略的預測問題,而不是我們這邊與參考[74]裡面所考慮的完全控制(full control)情況。
:::

:::info
Just as there are Bellman equations for value functions, there are also Bellman equations for models of options. Consider the intra-option learning of the model of a Markov option $o= \langle \mathcal{I}, \pi, \beta \rangle$.
The correct model of $o$ is related to itself by
$$
\begin{align}
r^o_s &= \sum_{a\in\mathcal{A}_s} \pi(s, a)E\left\{r+\gamma(1-\beta(s'))r^o_{s'} \right\} \\
& \quad \text{where $r$ and $s'$ are the reward and next state given that action $a$ is taken in state $s$} \\
&= \sum_{a\in\mathcal{A}_s} \pi(s,a)\Big[r^a_s + \sum_{s'} p^a_{ss'}(1-\beta(s'))r^o_{s'} \Big]
\end{align}
$$
and
$$
\begin{align}
p^o_{sx} &= \sum_{a\in\mathcal{A}_s}\pi(s,a)\gamma E\left\{(1-\beta(s'))p^o_{s'x} + \beta(s')\delta_{s'x} \right\} \\
&= \sum_{a\in\mathcal{A}_s}\pi(s,a) \sum_{s'}p^a_{ss'}[(1-\beta(s'))p^o_{s'x} + \beta(s')\delta_{s'x}]
\end{align}
$$
for all $s,x \in\mathcal{S}$. How can we turn these Bellman-like equations into update rules for learning the model? First consider that action $a_t$ is taken in $s_t$, and that the way it was selected is consistent with $o= \langle \mathcal{I}, \pi, \beta \rangle$, that is, that $a_t$ was selected with the distribution $\pi(s_t, \cdot)$. Then the Bellman equations above suggest the temporal-difference update rules
$$
\widehat{r}^o_{s_t} \leftarrow \widehat{r}^o_{s_t} + \alpha[r_{t+1} + \gamma(1 - \beta(s_{t+1})) \widehat{r}^o_{s_{t+1}} - \widehat{r}^o_{s_t}] \tag{18}
$$
and
$$
\widehat{p}^o_{s_tx} \leftarrow \widehat{p}^o_{s_tx} + \alpha[ \gamma(1-\beta(s_{t+1})) \widehat{p}^o_{s_{t+1}x} + \gamma\beta(s_{t+1})\delta_{s_{t+1}x} - \widehat{p}^o_{s_tx}] \tag{19}
$$
for all $x \in \mathcal{S}^+$, where $\widehat{p}^o_{ss'}$ and $\widehat{r}^o_s$ are the estimates of $p^o_{ss'}$ and $r^o_s$, respectively, and $\alpha$ is a positive step-size parameter. The method we call one-step intra-option model learning applies these updates to every option consistent with every action taken, $a_t$. Of course, this is just the simplest intra-option model-learning method. Others may be possible using eligibility traces and standard tricks for off-policy learning (as in [71]).
:::

:::success
正如value functions有著Bellman equations那樣,options的模型也有它的Bellman equations。考慮一個Markov option $o= \langle \mathcal{I}, \pi, \beta \rangle$模型的intra-option learning。$o$的正確模型滿足下面與自身相關的關係式:
$$
\begin{align}
r^o_s &= \sum_{a\in\mathcal{A}_s} \pi(s, a)E\left\{r+\gamma(1-\beta(s'))r^o_{s'} \right\} \\
& \quad \text{其中$r$與$s'$是在state $s$採取action $a$得到的reward與next state} \\
&= \sum_{a\in\mathcal{A}_s} \pi(s,a)\Big[r^a_s + \sum_{s'} p^a_{ss'}(1-\beta(s'))r^o_{s'} \Big]
\end{align}
$$
與
$$
\begin{align}
p^o_{sx} &= \sum_{a\in\mathcal{A}_s}\pi(s,a)\gamma E\left\{(1-\beta(s'))p^o_{s'x} + \beta(s')\delta_{s'x} \right\} \\
&= \sum_{a\in\mathcal{A}_s}\pi(s,a) \sum_{s'}p^a_{ss'}[(1-\beta(s'))p^o_{s'x} + \beta(s')\delta_{s'x}]
\end{align}
$$
這適用於所有的$s,x \in\mathcal{S}$。我們要如何將這些Bellman-like equations轉為學習模型的更新規則?首先考慮的是,在state $s_t$採取action $a_t$,並且它的選擇方式是與$o= \langle \mathcal{I}, \pi, \beta \rangle$一致的,也就是說,$a_t$的選擇是根據$\pi(s_t, \cdot)$這個分佈。這樣子,上面的Bellman equations所提出的temporal-difference update rules就會是:
$$
\widehat{r}^o_{s_t} \leftarrow \widehat{r}^o_{s_t} + \alpha[r_{t+1} + \gamma(1 - \beta(s_{t+1})) \widehat{r}^o_{s_{t+1}} - \widehat{r}^o_{s_t}] \tag{18}
$$
與
$$
\widehat{p}^o_{s_tx} \leftarrow \widehat{p}^o_{s_tx} + \alpha[ \gamma(1-\beta(s_{t+1})) \widehat{p}^o_{s_{t+1}x} + \gamma\beta(s_{t+1})\delta_{s_{t+1}x} - \widehat{p}^o_{s_tx}] \tag{19}
$$
上面公式是用於所有的$x \in \mathcal{S}^+$,其中$\widehat{p}^o_{ss'}$與$\widehat{r}^o_s$分別為$p^o_{ss'}$與$r^o_s$的估測值,$\alpha$為正數的step-size。這個方法我們將之稱為one-step intra-option model learning,把這些更新用在跟每次所採取的action $a_t$一致的每一個option上。當然,這只是最簡單的intra-option model-learning method。其它也許還會有搭配使用eligibility traces與off-policy learning的標準技巧的方法(如參考[71])。
:::
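:::warning
A minimal sketch of the one-step intra-option model-learning updates (18) and (19): after a single primitive transition, every Markov option whose (deterministic) policy would also have taken the observed action gets its model updated. The `option.policy`/`option.beta` interface and the nested-dictionary storage are assumptions made here for illustration, not part of the paper.

```python
def intra_option_model_update(r_hat, p_hat, s, a, r, s_next, options, states,
                              alpha=0.25, gamma=0.9):
    """Apply Eqs. (18) and (19) to every option consistent with the primitive
    transition (s, a, r, s_next)."""
    for o in options:
        if o.policy(s) != a:                 # only options that would have chosen a
            continue
        beta = o.beta(s_next)                # termination probability at s_next
        # Eq. (18): the reward model bootstraps from its own estimate unless o terminates
        target_r = r + gamma * (1.0 - beta) * r_hat[s_next][o]
        r_hat[s][o] += alpha * (target_r - r_hat[s][o])
        # Eq. (19): the (discounted) state-prediction model, for every x in S^+
        for x in states:
            delta = 1.0 if s_next == x else 0.0
            target_p = gamma * (1.0 - beta) * p_hat[s_next][o][x] + gamma * beta * delta
            p_hat[s][o][x] += alpha * (target_p - p_hat[s][o][x])
```
:::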
:::info
As an illustration, consider model learning in the rooms example using SMDP and intra-option methods. As before, we assume that the eight hallway options are given, but now we assume that their models are not given and must be learned. In this experiment, the rewards were selected according to a normal probability distribution with a standard deviation of 0.1 and a mean that was different for each state–action pair. The means were selected randomly at the beginning of each run uniformly from the [−1, 0] interval. Experience was generated by selecting randomly in each state among the two possible options and four possible actions, with no goal state. In the SMDP model-learning method, Eqs. (16) and (17) were applied whenever an option terminated, whereas, in the intra-option model-learning method, Eqs. (18) and (19) were applied on every step to all options that were consistent with the action taken on that step. In this example, all options are deterministic, so consistency with the action selected means simply that the option would have selected that action.
:::

:::success
以此為例,我們來看看使用SMDP與intra-option methods在rooms example的模型學習狀況。跟先前一樣,我們假設給定八個hallway options,但現在我們假設並沒有給出它們的模型,我們必須要去學習。在這個實驗中,rewards是根據標準差為0.1的normal probability distribution來選擇,但是要注意的是,這個分佈的均值會隨state-action pair的差異而有所不同。每次執行開始時,各個均值會從[-1, 0]區間中均勻地隨機選擇。經驗的產生是在每個state中,從兩個可能的options與四個可能的actions裡隨機選擇,沒有目標狀態(goal state)。SMDP model-learning method中,公式16、17是應用在option終止的時候,而在intra-option model-learning method中,公式18、19是在每一個step應用在與該step所採取的action一致的所有options上。在這個範例中,所有的options都是確定性的,因此與所選的action一致的意思就是說,該option會選擇該action。
:::

:::info
For each method, we tried a range of values for the step-size parameter, $\alpha=1/2, 1/4, 1/8$, and $1/16$. Results are shown in Fig. 9 for the value that seemed to be best for each method, which happened to be $\alpha=1/4$ in all cases. For the SMDP method, we also show results with the step-size parameter set such that the model estimates were sample averages, which should give the best possible performance of this method (these lines are labeled $1/t$). The figure shows the average and maximum errors over the state–option space for each method, averaged over the eight options and 30 repetitions of the experiment. As expected, the intra-option method was able to learn significantly faster than the SMDP methods.
:::

:::success
對於每一種方法,我們試了一系列的step-size,$\alpha=1/2, 1/4, 1/8$與$1/16$。Fig. 9.顯示的是每種方法看似最佳的那個值的結果,在所有情況下剛好都是$\alpha=1/4$。SMDP的部份,我們還呈現了將step-size參數設定成使模型估測值為樣本平均的結果,這應該會給出該方法的最佳效能(線標記為$1/t$)。該圖呈現的是每個方法在state-option space上的平均誤差與最大誤差,是根據八個options與30次重複實驗計算平均而得。如所料,intra-option method的學習速度明顯快過SMDP methods。
:::

:::info
![](https://hackmd.io/_uploads/ByLelKkRF.png)
Fig. 9. Model learning by SMDP and intra-option methods. Shown are the average and maximum over $\mathcal{I}$ of the absolute errors between the learned and true models, averaged over the eight hallway options and 30 repetitions of the whole experiment. The lines labeled ‘SMDP 1/t’ are for the SMDP method using sample averages; all the others used $\alpha=1/4$.

Fig. 9. SMDP與intra-option methods的模型學習結果。呈現的是學習到的模型與實際模型之間在$\mathcal{I}$上絕對誤差的平均與最大值,是根據八個hallway options與整個實驗的30次重複計算平均而得。標記為'SMDP 1/t'的線是使用樣本平均的SMDP方法;其它皆使用$\alpha=1/4$。
:::

## 6. Intra-option value learning

:::info
We turn now to the intra-option learning of option values and thus of optimal policies over options.
If the options are semi-Markov, then again the SMDP methods described in Section 3.2 may be the only feasible methods; a semi-Markov option must be completed before it can be evaluated. But if the options are Markov and we are willing to look inside them, then we can consider intra-option methods. Just as in the case of model learning, intra-option methods for value learning are potentially more efficient than SMDP methods because they extract more training examples from the same experience.
:::

:::success
我們現在轉來看option values的intra-option learning,並進而學習options之上的optimal policies。如果options為semi-Markov,那麼Section 3.2所描述的SMDP方法可能仍是唯一可行的方法;semi-Markov option必須先執行完成才能對其進行評估。不過,如果options是Markov,而且我們願意深入研究一下它們的話,那就可以考慮使用intra-option methods。就跟模型學習的情況一樣,用於value learning的intra-option methods可能會比SMDP methods更有效率,因為它們會從相同的經驗中提取更多的訓練樣本。
:::

:::info
For example, suppose we are learning to approximate $Q^*_\mathcal{O}(s, o)$ and that $o$ is Markov. Based on an execution of $o$ from $t$ to $t+k$, SMDP methods extract a single training example for $Q^*_\mathcal{O}(s, o)$. But because $o$ is Markov, it is, in a sense, also initiated at each of the steps between $t$ and $t+k$. The jumps from each intermediate $s_i$ to $s_{t+k}$ are also valid experiences with $o$, experiences that can be used to improve estimates of $Q^*_\mathcal{O}(s_i, o)$. Or consider an option that is very similar to $o$ and which would have selected the same actions, but which would have terminated one step later, at $t + k + 1$ rather than at $t + k$. Formally this is a different option, and formally it was not executed, yet all this experience could be used for learning relevant to it. In fact, an option can often learn something from experience that is only slightly related (occasionally selecting the same actions) to what would be generated by executing the option. This is the idea of off-policy training—to make full use of whatever experience occurs to learn as much as possible about all options irrespective of their role in generating the experience. To make the best use of experience we would like off-policy and intra-option versions of value-learning methods such as Q-learning.
:::

:::success
舉例來說,假設我們正學習近似$Q^*_\mathcal{O}(s, o)$,並且這個$o$是Markov。基於這個$o$從$t$到$t+k$的執行,SMDP methods會為$Q^*_\mathcal{O}(s, o)$提取一個訓練樣本。但是因為$o$是Markov,所以某種意義上,它也是在$t$與$t+k$之間的每一個step啟動的。從每一個中間狀態$s_i$到$s_{t+k}$的[跳動](http://terms.naer.edu.tw/detail/3484078/)也都是$o$的有效經驗,這些經驗就可以拿來改進$Q^*_\mathcal{O}(s_i, o)$的估測值。或者我們可以考慮一個跟$o$很像的option,它會選擇相同的actions,不過它會晚一個step才終止,也就是在$t+k+1$,而不是$t+k$的時候終止。形式上,這是一個不同的option,而且形式上它並沒有被執行,但所有的這些經驗都可以拿來進行與它相關的學習。事實上,一個option通常可以從那些與執行這個option所會產生的經驗只有些微相關(偶爾選擇相同的actions)的經驗中學習到一些東西。這就是off-policy training的概念,充份利用發生的任何經驗,盡可能地學習所有的options,無論它們在產生這些經驗中扮演什麼角色。為了可以充份利用經驗,我們會需要像Q-learning這類value-learning methods的off-policy與intra-option版本。
:::

:::info
It is convenient to introduce new notation for the value of a state–option pair given that the option is Markov and executing upon arrival in the state:
$$
U^*_\mathcal{O}(s, o) = (1 - \beta(s)) Q^*_\mathcal{O}(s, o) + \beta(s) \max_{o'\in\mathcal{O}} Q^*_\mathcal{O}(s, o')
$$
Then we can write Bellman-like equations that relate $Q^*_\mathcal{O}(s, o)$ to expected values of $U^*_\mathcal{O}(s', o)$, where $s'$ is the immediate successor to $s$ after initiating Markov option $o=\langle \mathcal{I}, \pi, \beta \rangle$:
$$
\begin{align}
Q^*_\mathcal{O}(s, o) &= \sum_{a \in \mathcal{A}_s} \pi(s, a) E \left\{ r + \gamma U^*_\mathcal{O}(s', o) \vert s, a \right\} \\
&= \sum_{a \in \mathcal{A}_s} \pi(s, a) \Big[r^a_s + \sum_{s'}p^a_{ss'} U^*_\mathcal{O}(s', o) \Big] \tag{20}
\end{align}
$$
where $r$ is the immediate reward upon arrival in $s'$. Now consider learning methods based on this Bellman equation. Suppose action $a_t$ is taken in state $s_t$ to produce next state $s_{t+1}$ and reward $r_{t+1}$, and that $a_t$ was selected in a way consistent with the Markov policy $\pi$ of an option $o=\langle \mathcal{I}, \pi, \beta \rangle$. That is, suppose that $a_t$ was selected according to the distribution $\pi(s_t, \cdot)$. Then the Bellman equation above suggests applying the off-policy one-step temporal-difference update:
$$
Q(s_t, o) \leftarrow Q(s_t, o) + \alpha [r_{t+1} + \gamma U(s_{t+1}, o) - Q(s_t, o)] \tag{21}
$$
where
$$
U(s, o) = (1 - \beta(s))Q(s, o) + \beta(s) \max_{o' \in \mathcal{O}} Q(s, o')
$$
:::

:::success
考慮到option是Markov,而且在到達一個state時正在執行,幫這種state-option pair的value引入新的符號是蠻方便的:
$$
U^*_\mathcal{O}(s, o) = (1 - \beta(s)) Q^*_\mathcal{O}(s, o) + \beta(s) \max_{o'\in\mathcal{O}} Q^*_\mathcal{O}(s, o')
$$
這樣我們就可以寫出Bellman-like equations,把$Q^*_\mathcal{O}(s, o)$跟$U^*_\mathcal{O}(s', o)$的期望值關聯起來,其中$s'$是在state $s$啟動Markov option $o=\langle \mathcal{I}, \pi, \beta \rangle$之後立即的後續狀態:
$$
\begin{align}
Q^*_\mathcal{O}(s, o) &= \sum_{a \in \mathcal{A}_s} \pi(s, a) E \left\{ r + \gamma U^*_\mathcal{O}(s', o) \vert s, a \right\} \\
&= \sum_{a \in \mathcal{A}_s} \pi(s, a) \Big[r^a_s + \sum_{s'}p^a_{ss'} U^*_\mathcal{O}(s', o) \Big] \tag{20}
\end{align}
$$
其中$r$是在到達$s'$之後得到的立即性的reward。現在我們就可以來考慮一個基於這個貝爾曼方程式的學習方法。假設,在state $s_t$採取action $a_t$得到next state $s_{t+1}$與reward $r_{t+1}$,並且選擇$a_t$的方式與option $o=\langle \mathcal{I}, \pi, \beta \rangle$的Markov policy $\pi$一致。也就是說,假設$a_t$是根據分佈$\pi(s_t, \cdot)$來選擇。這樣,上面的貝爾曼方程式所提出的一個off-policy one-step temporal-difference更新為:
$$
Q(s_t, o) \leftarrow Q(s_t, o) + \alpha [r_{t+1} + \gamma U(s_{t+1}, o) - Q(s_t, o)] \tag{21}
$$
其中
$$
U(s, o) = (1 - \beta(s))Q(s, o) + \beta(s) \max_{o' \in \mathcal{O}} Q(s, o')
$$
:::

:::info
The method we call one-step intra-option Q-learning applies this update rule to every option $o$ consistent with every action taken, $a_t$.
Note that the algorithm is potentially dependent on the order in which options are updated because, in each update, $U(s, o)$ depends on the current values of $Q(s, o')$ for other options $o'$. If the options’ policies are deterministic, then the concept of consistency above is clear, and for this case we can prove convergence. Extensions to stochastic options are a topic of current research.
:::

:::success
這個方法我們稱為one-step intra-option Q-learning,這個更新規則會用在與每個所採取的action $a_t$一致的每個option $o$上。注意到,這個演算法可能會取決於options更新的順序,因為,在每次的更新中,$U(s, o)$取決於其它options $o'$的$Q(s, o')$的當前值(current values)。如果options的policies是確定性的,那麼上面的一致性的概念就很清楚了,對於這種情況,我們可以證明其收斂性。把它擴展到stochastic options則是當前的研究主題。
:::

:::info
**Theorem 3** (Convergence of intra-option Q-learning). For any set of Markov options, $\mathcal{O}$, with deterministic policies, one-step intra-option Q-learning converges with probability 1 to the optimal Q-values, $Q^*_\mathcal{O}$, for every option regardless of what options are executed during learning, provided that every action gets executed in every state infinitely often.
:::

:::success
**Theorem 3** (Convergence of intra-option Q-learning)。對於任意具有確定性策略(policies)的Markov options集合$\mathcal{O}$,one-step intra-option Q-learning會以1的機率收斂到最佳的Q-values,也就是$Q^*_\mathcal{O}$,每個option都可以,無論學習期間執行的是哪些options,只要每個action在每個state都被無限次地執行過就可以。
:::

:::info
**Proof** (Sketch). On experiencing the transition, $(s, a, r', s')$, for every option $o$ that picks action $a$ in state $s$, intra-option Q-learning performs the following update:
$$
Q(s, o) \leftarrow Q(s, o) + \alpha(s, o)[r' + \gamma U(s', o) - Q(s, o)]
$$
Our result follows directly from Theorem 1 of [30] and the observation that the expected value of the update operator $r' + \gamma U(s', o)$ yields a contraction, proved below:
$$
\begin{align}
& \vert E\left\{r' + \gamma U(s', o) \right\} - Q^*_\mathcal{O}(s, o) \vert \\
&= \vert r^a_s + \sum_{s'}p^a_{ss'} U(s', o) - Q^*_\mathcal{O}(s, o) \vert \\
&= \vert r^a_s + \sum_{s'}p^a_{ss'} U(s', o) - r^a_s - \sum_{s'} p^a_{ss'}U^*_\mathcal{O}(s', o) \vert \\
& \leq \vert \sum_{s'} p^a_{ss'}\big[(1 - \beta(s'))(Q(s', o) - Q^*_\mathcal{O}(s', o)) + \beta(s')\big(\max_{o'\in\mathcal{O}}Q(s', o') - \max_{o'\in\mathcal{O}}Q^*_\mathcal{O}(s', o')\big) \big] \vert \\
& \leq \sum_{s'} p^a_{ss'} \max_{s'', o''} \vert Q(s'', o'') - Q^*_\mathcal{O}(s'', o'')\vert \\
& \leq \gamma \max_{s'',o''} \vert Q(s'', o'') - Q^*_\mathcal{O}(s'', o'') \vert
\end{align}
$$
:::

:::success
**Proof** (Sketch). 在經歷轉移$(s, a, r', s')$的時候,對於在state $s$採取action $a$的每個option $o$,intra-option Q-learning會執行下面更新:
$$
Q(s, o) \leftarrow Q(s, o) + \alpha(s, o)[r' + \gamma U(s', o) - Q(s, o)]
$$
我們的結果直接來自參考[30]的Theorem 1,以及[更新運算子](http://terms.naer.edu.tw/detail/3189594/)(update operator)$r' + \gamma U(s', o)$的期望值產生[收縮](http://terms.naer.edu.tw/detail/2113634/)(contraction)的觀測結果,證明如下:
$$
\begin{align}
& \vert E\left\{r' + \gamma U(s', o) \right\} - Q^*_\mathcal{O}(s, o) \vert \\
&= \vert r^a_s + \sum_{s'}p^a_{ss'} U(s', o) - Q^*_\mathcal{O}(s, o) \vert \\
&= \vert r^a_s + \sum_{s'}p^a_{ss'} U(s', o) - r^a_s - \sum_{s'} p^a_{ss'}U^*_\mathcal{O}(s', o) \vert \\
& \leq \vert \sum_{s'} p^a_{ss'}\big[(1 - \beta(s'))(Q(s', o) - Q^*_\mathcal{O}(s', o)) + \beta(s')\big(\max_{o'\in\mathcal{O}}Q(s', o') - \max_{o'\in\mathcal{O}}Q^*_\mathcal{O}(s', o')\big) \big] \vert \\
& \leq \sum_{s'} p^a_{ss'} \max_{s'', o''} \vert Q(s'', o'') - Q^*_\mathcal{O}(s'', o'')\vert \\
& \leq \gamma \max_{s'',o''} \vert Q(s'', o'') - Q^*_\mathcal{O}(s'', o'') \vert
\end{align}
$$
:::
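:::warning
A minimal sketch of the one-step intra-option Q-learning update (21): after one primitive transition, every Markov option with a deterministic policy that would also have chosen the observed action gets its value updated, using $U(s', o) = (1-\beta(s'))Q(s', o) + \beta(s')\max_{o'}Q(s', o')$. The `option.policy`/`option.beta` interface and the dictionary layout `Q[s][o]` are assumptions for illustration only.

```python
def intra_option_q_update(Q, s, a, r, s_next, options, alpha=0.1, gamma=0.9):
    """Apply Eq. (21) to every option consistent with the transition (s, a, r, s_next)."""
    max_next = max(Q[s_next][q] for q in options)      # max_{o'} Q(s', o')
    for o in options:
        if o.policy(s) != a:                           # only options that would have chosen a
            continue
        beta = o.beta(s_next)
        U = (1.0 - beta) * Q[s_next][o] + beta * max_next
        Q[s][o] += alpha * (r + gamma * U - Q[s][o])
```

Note that `max_next` is computed once from the current estimates, so, as the text above points out, the result can depend on the order in which the options are updated within a step.
:::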
:::info
As an illustration, we applied this intra-option method to the rooms example, this time with the goal in the rightmost hallway, cell $G_1$ in Fig. 2. Actions were selected randomly with equal probability from the four primitives. The update (21) was applied first to the primitive options, then to any of the hallway options that were consistent with the action. The hallway options were updated in clockwise order, starting from any hallways that faced up from the current state. The rewards were the same as in the experiment in the previous section. Fig. 10 shows learning curves demonstrating the effective learning of option values without ever selecting the corresponding options.
:::

:::success
以此為例,我們把這個intra-option method用到rooms example,這次的目標是最右邊的走廊(hallway),也就是Fig. 2.中的$G_1$。actions的部份會從四個primitives以等機率的方式隨機選擇。公式21的更新會先用於primitive options,然後再用於與action一致的任一個hallway options。hallway options會以順時針的順序更新,從任何一個從當下狀態(current state)朝上的走廊(hallway)開始。rewards跟上一節的實驗相同。Fig. 10.顯示其學習曲線,說明了在從未選擇相對應options的情況下對option values的有效學習。
:::

:::info
Intra-option versions of other reinforcement learning methods such as Sarsa, TD($\lambda$), and eligibility-trace versions of Sarsa and Q-learning should be straightforward, although there has been no experience with them. The intra-option Bellman equation (20) could also be used for intra-option sample-based planning.
:::

:::success
Intra-option版本的其它強化學習方法,像是Sarsa、TD($\lambda$)、以及eligibility-trace版本的Sarsa與Q-learning,應該都是很直接的,儘管還沒有針對這幾個方法做實驗。公式20的intra-option Bellman equation也可以用在intra-option sample-based的planning。
:::

:::info
![](https://hackmd.io/_uploads/BJjrM4eCt.png)
Fig. 10. The learning of option values by intra-option methods without ever selecting the options. Experience was generated by selecting randomly among actions, with the goal at $G_1$. Shown on the left is the value of the greedy policy, averaged over all states and 30 repetitions of the experiment, as compared with the value of the optimal policy. The right panel shows the learned option values for state $G_2$ approaching their correct values.

Fig. 10. 利用intra-option methods學習option values,無需選擇options。透過在actions裡面隨機選擇來產生學習經驗,目標為$G_1$。左邊的圖說明的是greedy policy的value,在所有的states和30次重複實驗中取平均值,並跟optimal policy的value做比較。右邊的部份說明的是針對state $G_2$所學習到的option values,看得出來是逐漸接近其正確值(correct values)。
:::

## 7. Subgoals for learning options

:::info
Perhaps the most important aspect of working between MDPs and SMDPs is that the options making up the SMDP actions may be changed. We have seen one way in which this can be done by changing their termination conditions. Perhaps more fundamental than that is changing their policies, which we consider briefly in this section. It is natural to think of options as achieving subgoals of some kind, and to adapt each option’s policy to better achieve its subgoal. For example, if the option is open-the-door, then it is natural to adapt its policy over time to make it more effective and efficient at opening the door, which may make it more generally useful. It is possible to have many such subgoals and learn about them each independently using an off-policy learning method such as Q-learning, as in [17,31,38,66,78]. In this section we develop this idea within the options framework and illustrate it by learning the hallway options in the rooms example. We assume the subgoals are given and do not address the larger question of the source of the subgoals.
:::
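:::warning
As a rough sketch of the idea just described (learning each option's policy toward its own subgoal with off-policy Q-learning), the function below performs a one-step tabular update on a subgoal value function $Q_g$; the formal definitions and the corresponding update rules are given in the remainder of this section. The dictionary layout `Qg[s][a]` and the `subgoal_states`/`g` arguments are naming assumptions for illustration.

```python
def subgoal_q_update(Qg, g, subgoal_states, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One-step tabular Q-learning toward a subgoal: transitions into the subgoal
    set G bootstrap from the terminal subgoal value g(s_next) instead of from
    max_a Q_g(s_next, a)."""
    if s_next in subgoal_states:
        target = r + gamma * g[s_next]
    else:
        target = r + gamma * max(Qg[s_next][b] for b in actions)
    Qg[s][a] += alpha * (target - Qg[s][a])
```

The same stream of experience can drive this update for many different subgoals at once, one $Q_g$ table per subgoal, which is how the eight hallway options are learned in the example later in this section.
:::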
:::success
在MDPs與SMDPs之間工作,最重要的部份也許就是構成SMDP動作的options本身是可以改變的。我們已經看到一種做法,即透過改變它們的終止條件來做到這一點。也許比這更根本的是改變它們的policies,我們將在這一節中簡要地討論。很自然地,我們可以把options想成是在實現某種子目標,然後調整每個option的policy,以更好地實現其子目標。舉例來說,如果option是open-the-door,那很自然地,我們會隨著時間調整它的policy,讓它在開門這件事上更有效且更有效率,這可能會使其更普遍地有用。我們可以有許多這樣的子目標,而且可以使用像是Q-learning這類off-policy的學習方法來各別獨立地學習,如[17,31,38,66,78]。在這一節裡面,我們會在options framework之內發展這個想法,然後用學習rooms example中的hallway options來說明這個概念。我們假設這些子目標是給定的,並不去處理子目標來源這個更大的問題。
:::

:::info
A simple way to formulate a subgoal for an option is to assign a terminal subgoal value, $g(s)$, to each state $s$ in a subset of states $\mathcal{G} \subseteq \mathcal{S}$. These values indicate how desirable it is for the option to terminate in each state in $\mathcal{G}$. For example, to learn a hallway option in the rooms task, the target hallway might be assigned a subgoal value of +1 while the other hallway and all states outside the room might be assigned a subgoal value of 0. Let $\mathcal{O}_g$ denote the set of options that terminate only and always in $\mathcal{G}$ (i.e., for which $\beta(s)=0$ for $s \notin \mathcal{G}$ and $\beta(s)=1$ for $s \in \mathcal{G}$). Given a subgoal-value function $g: \mathcal{G} \to \mathfrak{R}$, one can define a new state-value function, denoted $V^o_g(s)$, for options $o \in \mathcal{O}_g$, as the expected value of the cumulative reward if option $o$ is initiated in state $s$, plus the subgoal value $g(s')$ of the state $s'$ in which it terminates, both discounted appropriately. Similarly, we can define a new action-value function $Q^o_g(s, a) = V^{ao}_g(s)$ for actions $a \in \mathcal{A}_s$ and options $o \in \mathcal{O}_g$.
:::

:::success
有個簡單的方法來幫option制定一個子目標,就是把一個最終子目標值(subgoal value)$g(s)$分配給狀態子集$\mathcal{G} \subseteq \mathcal{S}$裡面的每個狀態(state)$s$。這些值(values)指出option在$\mathcal{G}$裡面的每個state終止有多麼理想。舉例來說,要學習rooms task的一個hallway option,我們可以把target hallway分派一個+1的subgoal value,然後其它的hallway與房間外的所有states就給個0的subgoal value。假設$\mathcal{O}_g$表示只會而且總是在$\mathcal{G}$終止的那些options的集合(即,對於$s \notin \mathcal{G}$的時候$\beta(s)=0$,且對於$s \in \mathcal{G}$的時候$\beta(s)=1$)。給定一個subgoal-value function $g: \mathcal{G} \to \mathfrak{R}$,我們可以為options $o \in \mathcal{O}_g$定義一個新的state-value function,以$V^o_g(s)$來表示,定義為如果option $o$在state $s$啟動所得到的累積報酬(cumulative reward)的期望值,加上其終止時所在的state $s'$的子目標值(subgoal value)$g(s')$,兩者都會經過適當的折扣(discounted)。類似地,我們可以為actions $a \in \mathcal{A}_s$與options $o \in \mathcal{O}_g$定義一個新的action-value function $Q^o_g(s, a) = V^{ao}_g(s)$。
:::

:::info
Finally, we can define optimal value functions for any subgoal $g$:
$$
V^*_g(s) = \max_{o \in \mathcal{O}_g}V^o_g(s) \text{ and } Q^*_g(s, a) = \max_{o \in \mathcal{O}_g}Q^o_g(s,a)
$$
Finding an option that achieves these maximums (an optimal option for the subgoal) is then a well-defined subtask. For Markov options, this subtask has Bellman equations and methods for learning and planning just as in the original task.
For example, the one-step tabular Q-learning method for updating an estimate $Q_g(s_t, a_t)$ of $Q^*_g(s_t, a_t)$ is
$$
\begin{align}
& Q_g(s_t, a_t) \leftarrow Q_g(s_t, a_t) + \alpha \big[r_{t+1} + \gamma \max_aQ_g(s_{t+1}, a) - Q_g(s_t, a_t) \big] , \\
& \text{ if } s_{t+1}\notin \mathcal{G}, \\
& Q_g(s_t, a_t) \leftarrow Q_g(s_t, a_t) + \alpha \big[r_{t+1} + \gamma g(s_{t+1}) - Q_g(s_t, a_t) \big] , \\
& \text{ if } s_{t+1}\in \mathcal{G}.
\end{align}
$$
:::

:::success
最終,我們可以為任意的子目標$g$定義其optimal value functions:
$$
V^*_g(s) = \max_{o \in \mathcal{O}_g}V^o_g(s) \text{ and } Q^*_g(s, a) = \max_{o \in \mathcal{O}_g}Q^o_g(s,a)
$$
找出一個可以達到這些最大值的option(子目標的optimal option)就是一個定義明確的子任務。對Markov options而言,這個子任務有著Bellman equations以及用於learning與planning的方法,就像在原始任務中那樣。舉例來說,用於更新$Q^*_g(s_t, a_t)$的估測值$Q_g(s_t, a_t)$的one-step tabular Q-learning為:
$$
\begin{align}
& Q_g(s_t, a_t) \leftarrow Q_g(s_t, a_t) + \alpha \big[r_{t+1} + \gamma \max_aQ_g(s_{t+1}, a) - Q_g(s_t, a_t) \big] , \\
& \text{ if } s_{t+1}\notin \mathcal{G}, \\
& Q_g(s_t, a_t) \leftarrow Q_g(s_t, a_t) + \alpha \big[r_{t+1} + \gamma g(s_{t+1}) - Q_g(s_t, a_t) \big] , \\
& \text{ if } s_{t+1}\in \mathcal{G}.
\end{align}
$$
:::

:::info
As a simple example, we applied this method to learn the policies of the eight hallway options in the rooms example. Each option was assigned subgoal values of +1 for the target hallway and 0 for all states outside the option’s room, including the off-target hallway. The initial state was that in the upper left corner, actions were selected randomly with equal probability, and there was no goal state. The parameters were $\gamma=0.9$ and $\alpha=0.1$. All rewards were zero. Fig. 11 shows the learned values for the hallway subgoals reliably approaching their ideal values.
:::

:::success
做為一個簡單的範例,我們把這個方法用在學習rooms example的八個hallway options的policies。每個option的target hallway都被分配+1的subgoal value,option所在房間外的所有狀態(包含off-target hallway)的subgoal values皆為0。初始狀態在左上角,以等機率的方式隨機選擇actions,沒有目標狀態(goal state)。參數的部份,$\gamma=0.9$、$\alpha=0.1$。所有的rewards皆為0。Fig. 11說明的是,hallway subgoals學習到的values可以可靠地接近理想值。
:::

:::info
![](https://hackmd.io/_uploads/S1SfiVMRt.png)
Fig. 11. Learning subgoal-achieving hallway options under random behavior. Shown on the left is the error between $Q_g(s, a)$ and $Q^*_g(s, a)$ averaged over $s\in\mathcal{I}, a\in\mathcal{A}$ and 30 repetitions. The right panel shows the learned state values (maximum over action values) for two options at state $G_2$ approaching their correct values.

Fig. 11. 在隨機行為之下學習subgoal-achieving hallway options。左圖說明的是$Q_g(s, a)$與$Q^*_g(s, a)$之間的誤差,取$s\in\mathcal{I}, a\in\mathcal{A}$與30次重複實驗的平均。右圖說明的是,兩個options在state $G_2$學習到的state values(對action values取最大值)逐漸接近其正確值。
:::

## 8. Conclusion

:::info
Representing knowledge flexibly at multiple levels of temporal abstraction has the potential to greatly speed planning and learning on large problems. We have introduced a framework for doing this within the context of reinforcement learning and MDPs. This context enables us to handle stochastic environments, closed-loop policies, and goals in a more general way than has been possible in classical AI approaches to temporal abstraction. Our framework is also clear enough to be learned, used, and interpreted mechanically, as we have shown by exhibiting simple procedures for learning and planning with options, for learning models of options, and for creating new options from subgoals.
:::

:::success
在多個時間抽象層級上靈活地表示知識,有可能大大地加速大型問題上的planning與learning。我們在強化學習與MDPs的背景下引入了一個做這件事的框架。這種情境讓我們能以比經典AI的時間抽象方法更一般性的方式來處理隨機環境、closed-loop policies(閉環策略)與目標。我們的框架也夠清楚,可以被機械式地學習、使用與解釋,正如我們透過展示使用options進行learning與planning、學習options的模型、以及從子目標建立新options的簡單程序所說明的。
:::

:::info
The foundation of the theory of options is provided by the existing theory of SMDPs and associated learning methods. The fact that each set of options defines an SMDP provides a rich set of planning and learning methods, convergence theory, and an immediate, natural, and general way of analyzing mixtures of actions at different time scales. This theory offers a lot, but still the most interesting cases are beyond it because they involve interrupting, constructing, or otherwise decomposing options into their constituent parts. It is the intermediate ground between MDPs and SMDPs that seems richest in possibilities for new algorithms and results. In this paper we have broken this ground and touched on many of the issues, but there is far more left to be done. Key issues such as transfer between subtasks, the source of subgoals, and integration with state abstraction remain incompletely understood. The connection between options and SMDPs provides only a foundation for addressing these and other issues.
:::

:::success
options的理論基礎來自於現有的SMDPs理論與相關學習方法。每個options的集合都定義一個SMDP,這件事提供了豐富的planning與learning方法、收斂理論、以及一種直接、自然且通用的方法來分析不同時間尺度下的混合動作(actions)。這個理論提供不少東西,不過最有趣的部份仍然超出它的範圍,因為它們涉及到中斷(interrupting)、建構(constructing)、或是以其它方式將options分解為它們的組成部份。正是MDPs與SMDPs之間的這塊中間地帶,似乎在新的演算法與結果方面最具可能性。在這篇論文中,我們開拓了這塊領域,也觸及了許多問題,不過還有更多工作有待完成。像是子任務之間的遷移(transfer)、子目標的來源、以及與狀態抽象(state abstraction)的整合等關鍵問題仍未被完全理解。options與SMDPs之間的連結也只是為解決這些問題與其它問題提供一個基礎而已。
:::

:::info
Finally, although this paper has emphasized temporally extended action, it is interesting to note that there may be implications for temporally extended perception as well. It is now common to recognize that action and perception are intimately linked. To see the objects in a room is not so much to label or locate them as it is to know what opportunities they afford for action: a door to open, a chair to sit on, a book to read, a person to talk to. If the temporally extended actions are modeled as options, then perhaps the models of the options correspond well to these perceptions. Consider a robot learning to recognize its battery charger. The most useful concept for it is the set of states from which it can successfully dock with the charger, and this is exactly what would be produced by the model of a docking option. These kinds of action-oriented concepts are appealing because they can be tested and learned by the robot without external supervision, as we have shown in this paper.
:::

:::success
最後,雖然這篇論文強調的是時間延伸的動作(temporally extended action),不過比較有趣的一點是,這對時間延伸的感知(temporally extended perception)可能也有所影響。現在普遍認為action與perception(感知)息息相關。看到房間裡的物件,與其說是去標記或定位它們,不如說是知道它們提供了什麼樣的行動機會:一扇可以打開的門、一張可以坐的椅子、一本可以讀的書、一個可以交談的人。如果這些時間延伸的動作(actions)被建模為options,那麼options的模型(models)或許就能很好地對應這些感知(perceptions)。考慮一個機器人學習辨識其電池充電器。對它而言最有用的概念就是它可以成功地與充電器對接的狀態集合,而這也正是對接選項(docking option)的模型所會產生的。如我們在論文中所說的,這些類型的action-oriented concepts是很有吸引力的,因為它們可以在沒有外部監督的情況下由機器人自己來測試、學習。
:::