# Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning (1)

###### tags: `論文翻譯`

[原文連結](https://www.sciencedirect.com/science/article/pii/S0004370299000521)

## Abstract

:::info
Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key, long-standing challenges for AI. In this paper we consider how these challenges can be addressed within the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We extend the usual notion of action in this framework to include options—closed-loop policies for taking action over a period of time. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches and joint torques. Overall, we show that options enable temporally abstract knowledge and action to be included in the reinforcement learning framework in a natural and general way. In particular, we show that options may be used interchangeably with primitive actions in planning methods such as dynamic programming and in learning methods such as Q-learning. Formally, a set of options defined over an MDP constitutes a semi-Markov decision process (SMDP), and the theory of SMDPs provides the foundation for the theory of options. However, the most interesting issues concern the interplay between the underlying MDP and the SMDP and are thus beyond SMDP theory. We present results for three such cases: (1) we show that the results of planning with options can be used during execution to interrupt options and thereby perform even better than planned, (2) we introduce new intra-option methods that are able to learn about an option from fragments of its execution, and (3) we propose a notion of subgoal that can be used to improve the options themselves. All of these results have precursors in the existing literature; the contribution of this paper is to establish them in a simpler and more general setting with fewer changes to the existing reinforcement learning framework. In particular, we show that these results can be obtained without committing to (or ruling out) any particular approach to state abstraction, hierarchy, function approximation, or the macro-utility problem. 1999 Published by Elsevier Science B.V. All rights reserved.
:::

:::success
在多個時間抽象層級上學習、規劃與表示知識,是AI長期存在的關鍵挑戰。在這篇論文中,我們會考慮如何在強化學習與馬可夫決策過程(MDPs)的數學框架中處理這些挑戰。我們擴展了這個框架中action的一般概念,以包含options - 在一段時間內採取action的closed-loop policy([閉環](http://terms.naer.edu.tw/detail/12871373/)策略)。options的範例包含撿起物件,吃午餐,去一個遙遠的城市旅行,而primitive action像是肌肉抽動與關節力矩(joint torques)。總體來說,我們說明了options能夠以自然、通用的方式將時間抽象的知識與action納入強化學習框架中。特別是,我們說明options也許可以在規劃方法(動態規劃(DP))與學習方法(Q-learning)中與primitive actions交互使用。形式上來說,在MDP上定義一組options被視為semi-Markov decision process(SMDP),而SMDPs的理論為options的理論奠定基礎。但是,最有趣的問題涉及了底層MDP與SMDP之間的交互作用,因此超出SMDP理論的範圍。我們提供三種情況的結果:(1)我們說明在執行過程中以options規劃的結果可以用來中斷options,因此執行結果較規劃(plan)的更好;(2)我們引入新的intra-option方法,這些方法能從option的執行片段中學習關於該option的知識;(3)我們提出一個子目標(subgoal)的概念,這可以用來改進option本身。所有這些結果在已存在的文獻中都已經有著前導;這篇論文的貢獻在於在更簡單、更通用、且對現有強化學習框架改動更少的設定下建立這些結果。特別是,我們證明這些結果不需要承諾(或排除)任何特定的狀態抽象、層次結構、函數逼近或macro-utility問題的處理方法就可以得到。
1999 Published by Elsevier Science B.V. All rights reserved.
:::

## 0. Introduction

:::info
Human decision making routinely involves choice among temporally extended courses of action over a broad range of time scales. Consider a traveler deciding to undertake a journey to a distant city. 
To decide whether or not to go, the benefits of the trip must be weighed against the expense. Having decided to go, choices must be made at each leg, e.g. ,whether to fly or to drive, whether to take a taxi or to arrange a ride. Each of these steps involves foresight and decision, all the way down to the smallest of actions. For example, just to call a taxi may involve finding a telephone, dialing each digit, and the individual muscle contractions to lift the receiver to the ear. How can we understand and automate this ability to work flexibly with multiple overlapping time scales? ::: :::success 人工決策通常涉及在一個廣泛的[時間尺度](http://terms.naer.edu.tw/detail/3533279/)上從時間延伸的動作行為中進行選擇。考慮一個旅行家決定去一個遙遠的城市旅行。為了決定是否出發,旅途的利益與費用必需有所權衡。決定出發後,每一站都必需做出選擇,像是要搭飛機或是開車,搭計程車或是安排乘車。每一個步驟都涉及先見之名與決策,一直到最小的actions。舉例來說,就只是打個電話叫車就涉及找電話、撥號、以及各個肌肉的收縮(將接收器抬到耳朵)。我們如何理解並自動化這些有著多個重疊時間尺度的靈活工作的能力。 ::: :::info Temporal abstraction has been explored in AI at least since the early 1970’s, primarily within the context of STRIPS-style planning [18,20,21,29,34,37,46,49,51,60,76]. Temporal abstraction has also been a focus and an appealing aspect of qualitative modeling approaches to AI [6,15,33,36,62] and has been explored in robotics and control engineering [1,7,9,25,39,61]. In this paper we consider temporal abstraction within the framework of reinforcement learning and Markov decision processes (MDPs). This framework has become popular in AI because of its ability to deal naturally with stochastic environments and with the integration of learning and planning [3,4,13,22,64]. Reinforcement learning methods have also proven effective in a number of significant applications [10,42,50,70,77]. ::: :::success 時間抽象至少從1970年初開始在人工智慧領域中被探索,主要是在STRIPS-style planning [18,20,21,29,34,37,46,49,51,60,76]的背景下。時間抽象也一直是人工智慧定性建模方法的關注點與吸引人的方法 [6,15,33,36,62],而且已經在機器人與控制工程[1,7,9,25,39,61]中進行索引。這篇論文中,我們考慮強化學習框架與馬可夫決策過程(MPDs)的時間抽象。這個框架在人工智慧領域中能夠變的受歡迎是因為它能夠自然處理隨機環境並整合學習與規劃(learning and planning)[3,4,13,22,64]。強化學習方法在很多重要的應用上也被證明是有效的[10,42,50,70,77]。 ::: :::info MDPs as they are conventionally conceived do not involve temporal abstraction or tem-porally extended action. They are based on a discrete time step: the unitary action taken at time $t$ affects the state and reward at time $t + 1$. There is no notion of a course of action persisting over a variable period of time. As a consequence, conventional MDP methods are unable to take advantage of the simplicities and efficiencies sometimes available at higher levels of temporal abstraction. On the other hand, temporal abstraction can be introduced into reinforcement learning in a variety of ways [2,8,11,12,14,16,19,26,28,31,32,38,40,44,45,53,56,57,59,63,68,69,71,73,78–82]. In the present paper we generalize and simplify many of these previous and co-temporaneous works to form a compact, unified framework for temporal abstraction in reinforcement learning and MDPs. We answer the question “What is the minimal extension of the reinforcement learning framework that allows a general treatment of temporally abstract knowledge and action?” In the second part of the paper we use the new framework to develop new results and generalizations of previous results. 
::: :::success 傳統上MDPs是不涉及時間抽象或時間延伸的action。它們是基於discrete time step(離散的time step還是不連續的時步):在時間$t$採取單一個action影響時間$t + 1$的state與reward。它並沒有動作行為在一段可變的時間內持續的概念。這樣的結果,傳統上的MPD方法無法利用有時候在更高的時間抽象級別上可用的簡單性與效率的優點。另一方面,時間抽象能夠用很多方法引入強化學習[2,8,11,12,14,16,19,26,28,31,32,38,40,44,45,53,56,57,59,63,68,69,71,73,78–82]。本篇論文中,我們概括並簡化許多這些先前與同期的研究,以形成一個[緊緻](http://terms.naer.edu.tw/detail/2112843/)、統一用於強化學習與MDPs的時間抽象的框架。我們回答下面問題"強化學習框架的最小延伸是什麼,它允許對時間抽象的知識與行為(action)做一般處理?"。論文的第二部份我們會使用新的框架來開生成新的結果並總結先前的結果。 ::: :::info One of the keys to treating temporal abstraction as a minimal extension of the reinforcement learning framework is to build on the theory of semi-Markov decision processes(SMDPs), as pioneered by Bradtke and Duff [5], Mahadevan et al. [41], and Parr [52]. SMDPs are a special kind of MDP appropriate for modeling continuous-time discrete-event systems. The actions in SMDPs take variable amounts of time and are intended to model temporally-extended courses of action. The existing theory of SMDPs specifies how to model the results of these actions and how to plan with them. However, existing SMDP work is limited because the temporally extended actions are treated as indivisible and unknown units. There is no attempt in SMDP theory to look inside the temporally extended actions, to examine or modify their structure in terms of lower-level actions. As we have tried to suggest above, this is the essence of analyzing temporally abstract actions in AI applications: goal directed behavior involves multiple overlapping scales at which decisions are made and modified. ::: :::success 處理將時間抽象視為強化學習框架的最小延伸的關鍵之一是建立在semi-Markov decision processes(SMDPs)的理論基礎上(Bradtke and Duff [5], Mahadevan et al. [41], and Parr [52])。SMDPs是一種特殊的MDP,適用於建構連續時間、離散事件的系統。SMDPs中的action採用可變的時間量並且預期建構時間延伸的動作行為。現在的SMDPs理論具體說明了如何對這些action的結果做建模,以及如何用它們來規劃(plan)。然而,現有的SMDP的工作是受限的,因為時間延伸動作(temporally extended actions)被視為不可分割、未知的單位。SMDP理論中並沒有嚐試從時間延伸動作就較低級別的動作(lower-level actions)來檢查或修正其架構。正如我們上面嚐試建議的部份,這是分析人工智慧應用程式中的時間抽象動作的本質:目標導向的行為涉及多個重疊尺度(決策的制定與調整)。 ::: :::info In this paper we explore the interplay between MDPs and SMDPs. The base problemwe consider is that of a conventional discrete-time MDP^1^, but we also consider courses ofaction within the MDP whose results are state transitions of extended and variable duration. We use the term options^2^ for these courses of action, which include primitive actions as a special case. Any fixed set of options defines a discrete-time SMDP embedded within the original MDP, as suggested by Fig. 1. The top panel shows the state trajectory over discrete time of an MDP, the middle panel shows the larger state changes over continuous time of an SMDP, and the last panel shows how these two levels of analysis can be superimposed through the use of options. In this case the underlying base system is an MDP, with regular,single-step transitions, while the options define potentially larger transitions, like those of an SMDP, that may last for a number of discrete steps. All the usual SMDP theory applies to the superimposed SMDP defined by the options but, in addition, we have an explicit interpretation of them in terms of the underlying MDP. The SMDP actions (the options) are no longer black boxes, but policies in the base MDP which can be examined, changed, learned, and planned in their own right. 
::: :::success 這篇論文中我們探討MDPs與SMDPs之間的交互作用。我們考慮的基本問題是傳統離散時間的MDP(discriete-time MDP),我們還考慮MDP內的行動過程(courses of action),其結果是延伸狀態轉移(state transition)與variable duration(可變的持續時間?)。對於這些行動過程,我們使用options這個術語,其包含做為特殊情況的primitive actions。如Figure 1所示,任何固定的options的集合都定義嵌入在原始MDP中的離散時間的SMDP。最上面的圖說明著MDP離散時間的狀態軌跡(state trajectory),中間則說明著SMDP連續時間上較大的狀態變化,而最下面的圖則說明著如何透過使用options來疊加這兩個層級的分析。這種情況下底層基礎的系統是MDP,具有規則、[單步](http://terms.naer.edu.tw/detail/1286612/)的轉移,而options則定義可能較大的轉移,像是SMDP的轉移,也許會持續多個離散步驟(discrete steps)。所有常用的SMDP理論都適用這些options定義的疊加的SMDP(superimposed SMDP),但是,除此之外,我們會根據基本的MDP做明確的解釋。SMDP actions(options)不再是黑盒子,而是基本的MDP中的策略(policies),可以自行檢查、改變、學習、規劃(planned)。 ::: :::info ^1^ In fact, the base system could itself be an SMDP with only technical changes in our framework, but this would be a larger step away from the standard framework. ^1^ 事實上,基礎系統可以是SMDP,只需要在我們的框架中做一些技術上的修正,但這就跟標準框架的差異更多了。(?) ::: :::info ^2^ This term may deserve some explanation. In previous work we have used other terms including “macro-actions”, “behaviors”, “abstract actions”, and “subcontrollers” for structures closely related to options. We introduce a new term to avoid confusion with previous formulations and with informal terms. The term “options” is meant as a generalization of “actions”, which we use formally only for primitive choices. It might at first seem inappropriate that “option” does not connote a course of action that is non-primitive, but this is exactly our intention. We wish to treat primitive and temporally extended actions similarly, and thus we prefer one name for both. ^2^ 這術語也許值得解釋。在先前的研究作業中,我們使用其它的術語包含“macro-actions”、“behaviors”、“abstract actions”與“subcontrollers”來表示與options密切相關的結構。我們引入一個新的術語來避免與先前公式(表述)、非正式術語混淆。"options"意謂著"actions"的概括(泛化?),我們正式使用actions只用於原始的選擇。"option"並不意味著一個非原始(非原生?non-primitive)的行為過程,乍看之下似乎不適合,但這正是我們的意圖。我們希望處理原始與時空延伸的actions,因此我們希望用一個名稱來表示這兩個。 ::: :::info Fig. 1. The state trajectory of an MDP is made up of small, discrete-time transitions, whereas that of an SMDP comprises larger, continuous-time transitions. Options enable an MDP trajectory to be analyzed in either way. Fig 1:MDP的狀態軌跡由較小的離散時間轉移所組成,而SMDP的狀態軌跡則是由較大的連續時間轉移所組成。Options讓MDP軌距能夠在這兩種情況下任一種方式進行分析。 ![](https://i.imgur.com/rk2wIJw.png) ::: :::info The first part of this paper (Sections 1–3) develops these ideas formally and more fully. The first two sections review the reinforcement learning framework and present its generalization to temporally extended action. Section 3 focuses on the link to SMDP theory and illustrates the speedups in planning and learning that are possible through the use of temporal abstraction. The rest of the paper concerns ways of going beyond an SMDP analysis of options to change or learn their internal structure in terms of the MDP. Section 4 considers the problem of effectively combining a given set of options into a single overall policy. For example, a robot may have pre-designed controllers for servoing joints to positions, picking up objects, and visual search, but still face a difficult problem of how to coordinate and switch between these behaviors [17,32,35,39,40,43,61,79]. Sections 5 and 6 concern intra-option learning—looking inside options to learn simultaneously about all options consistent with each fragment of experience. Finally, in Section 7 we illustrate a notion of subgoal that can be used to improve existing options and learn new ones. 
:::

:::success
論文第一部份(Section 1-3)正式且更全面的提出這些想法。前兩章回顧強化學習框架並說明其對時間延伸行為(temporally extended action)的泛化。Section 3專注在SMDP理論的連結,並說明通過使用時間抽象可能可以加快規劃與學習的速度。其餘的部份涉及到如何超越對options的SMDP分析,就MDP的層面來改變或學習options的內部結構。Section 4考慮的問題是如何有效地將一組給定的options組合成一個單一的總體策略。舉例來說,機器人也許已有預先設計好的控制器,用於將關節伺服到指定位置、撿取物件與視覺搜尋,但仍然面臨如何在這些行為之間協調與切換的難題[17,32,35,39,40,43,61,79]。Section 5、6則涉及intra-option的學習,深入options內部,從每個經驗片段中同時學習所有與其一致的options。最後,在Section 7中,我們說明一個subgoal的概念,可以用來改進現在的options並學習新的options。
:::

## 1. The reinforcement learning (MDP) framework

MDP的部份可以參考論文作者所出的書,或[連結](https://hackmd.io/@shaoeChen/Hkm3mMjL_)個人讀書記錄,這邊只簡單的記錄數學式。

:::info
$$
p^a_{ss'}=Pr\left\{ s_{t+1}=s' \vert s_t=s, a_t = a\right\}
$$
:::

:::warning
上面說明的是在時間$t$給定state $s$與action $a$得到state $s'$的機率。
:::

:::info
$$
r_s^a = E\left\{ r_{t+1} \vert s_t=s, a_t = a\right\}
$$
:::

:::warning
上面說明的是給定state $s$與action $a$得到expected reward $r$,值得注意的是,Sutton的reward是$t+1$,而不是$t$。
:::

:::info
$$
\begin{align}
V^\pi(s) &= E \left\{r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \vert s_t=s, \pi \right\} \tag{1} \\
&= E \left\{r_{t+1} + \gamma V^\pi(s_{t+1}) \vert s_t =s ,\pi \right\} \\
&= \sum_{a \in \mathcal{A}_s} \pi(s, a) \left[r^a_s + \gamma \sum_{s'}p^a_{ss'}V^\pi(s') \right] \tag{2}
\end{align}
$$
:::

:::warning
上面說明的是,使用policy $\pi$遇到state $s$會得到的expected return,也就是state value function,其中$\gamma$是discount factor。
:::

:::info
$$
\begin{align}
V^*(s) &= \max_\pi V^\pi(s) \tag{3} \\
&= \max_{a \in \mathcal{A}_s} E \left\{r_{t+1} + \gamma V^*(s_{t+1}) \vert s_t=s, a_t=a \right\} \\
&= \max_{a \in \mathcal{A}_s} \left[r^a_s + \gamma \sum_{s'}p^a_{ss'} V^*(s') \right] \tag{4}
\end{align}
$$
:::

:::warning
$V^*(s)$說明的是optimal state value function,簡單說就是,你有一個optimal policy,這個optimal policy搭配每一個state給你的都是最好的結果,也就是optimal state value function。
:::

:::info
$$
\begin{align}
Q^\pi(s, a) &= E \left\{r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \vert s_t=s, a_t=a, \pi \right\} \\
&= r^a_s + \gamma \sum_{s'} p^a_{ss'} V^\pi(s') \\
&= r^a_s + \gamma \sum_{s'} p^a_{ss'} \sum_{a'} \pi(s', a') Q^\pi(s', a')
\end{align}
$$
:::

:::warning
類似地,你可以把它泛化到Q function,也就是state-action value function,或是action-value function。
:::

:::info
$$
\begin{align}
Q^*(s, a) &= \max_{\pi}Q^\pi(s, a) \\
&= r^a_s + \gamma \sum_{s'} p^a_{ss'} \max_{a'} Q^*(s', a')
\end{align}
$$
:::

:::warning
也會有一個最好的action-value function,$Q^*(s, a)$
:::

## 2. Options

:::info
As mentioned earlier, we use the term options for our generalization of primitive actions to include temporally extended courses of action. Options consist of three components: a policy $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$, a termination condition $\beta: \mathcal{S}^+ \to [0, 1]$, and an initiation set $\mathcal{I} \subseteq \mathcal{S}$. An option $(\mathcal{I}, \pi, \beta)$ is available in state $s_t$ if and only if $s_t \in \mathcal{I}$. If the option is taken, then actions are selected according to $\pi$ until the option terminates stochastically according to $\beta$. In particular, a Markov option executes as follows. First, the next action $a_t$ is selected according to probability distribution $\pi(s_t, \cdot)$. The environment then makes a transition to state $s_{t+1}$, where the option either terminates, with probability $\beta(s_{t+1})$, or else continues, determining $a_{t+1}$ according to $\pi(s_{t+1}, \cdot)$, possibly terminating in $s_{t+2}$ according to $\beta(s_{t+2})$, and so on.^3^ When the option terminates, the agent has the opportunity to select another option. 
For example, an option named open-the-door might consist of a policy for reaching, grasping and turning the door knob, a termination condition for recognizing that the door has been opened, and an initiation set restricting consideration of openthe-door to states in which a door is present. In episodic tasks, termination of an episode also terminates the current option (i.e., $\beta$ maps the terminal state to 1 in all options). ::: :::success 如先前所說明的,我們用"options"這個術語來泛化包含時間延伸的行為過程的primitive actions。options包含三個元件:policy $\pi: \mathcal{S} \times \mathcal{A}$,終止的條件 $\beta: \mathcal{S}^+ \to [0, 1]$,以及一個initiation set(啟動集合?)$\mathcal{I} \subseteq \mathcal{S}$。只有在$s_t \in \mathcal{I}$的情況,option $(\mathcal{I}, \pi, \beta)$在state $s_t$的時候才是可用的。如果option被選中,那就會根據$\pi$選擇action,一直到這個option根據$\beta$隨機終止。特別是,Markov option的執行如下。首先,下一個action $a_t$是根據機率分佈$\pi(s, \cdot)$所選擇。然後,environment會轉變成state $s_{t+1}$,那option會有$\beta(s_{t+1})$的機率終止,或者繼續,根據$\pi(s_{t+1}, \cdot)$確定$a_{t+1}$,然後在$s_{t+2}$會有$\beta(s_{t+2})$的機率終止,等等,以此類推^3^。當option終止,agent就有機會選擇另外的option。舉例來說,有一個名為open-the-door的option也許包含一個有伸手、抓住、轉開門把的policy,辨別門已經被打開的終止條件,以及考慮限制open-the-door這個option要存在於有門這個state的initiation set。在情節式任務中(episodic tasks,也有人說分節式任務),終止一個episode同時也會終止當前的option,也就是$\beta$會把所有options的terminal state映射到1。 ::: :::info The initiation set and termination condition of an option together restrict its range of application in a potentially useful way. In particular, they limit the range over which the option’s policy needs to be defined. For example, a handcrafted policy $\pi$ for a mobile robot to dock with its battery charger might be defined only for states $\mathcal{I}$ in which the battery charger is within sight. The termination condition $\beta$ could be defined to be 1 outside of $\mathcal{I}$ and when the robot is successfully docked. A subpolicy for servoing a robot arm to a particular joint configuration could similarly have a set of allowed starting states, a controller to be applied to them, and a termination condition indicating that either the target configuration has been reached within some tolerance or that some unexpected event has taken the subpolicy outside its domain of application. For Markov options it is natural to assume that all states where an option might continue are also states where the option might be taken (i.e., that $\left\{s: \beta{(s)} < 1 \right\} \subseteq \mathcal{I}$). In this case, $\pi$ need only be defined over $\mathcal{I}$ rather than over all of $\mathcal{S}$. ::: :::success option的initiation set與終止條件以一種潛在有效的方法共同限制其應用範圍。特別是,它們限制了options的policy需要被定義的範圍。舉例來說,手工刻的一個用於移動機器人對接其電池充電器的policy $\pi$,也許就定義為只有在state $\mathcal{I}$(電池充電器在視線內)。當機器人成功對接的時候,終止條件$\beta$可以定義為$\mathcal{I}$以外的1。將機械手臂伺服到特定關節配置的subpolicy可以有一組允許的啟動狀態(starting states),一組應用它們身上的控制器,以及一個終止條件指示著已經到達目標配置(在某個允許公差內)或者某個非預期事件已經讓subpolicy超出應用範圍。對於Markov options,很自然地假設所有的states,其option也許會是持續性的,也有states其option也許會被選到(~~這邊真的翻不出來~~)。這種情況下就只需要再$\mathcal{I}$上定義$\pi$,而不是所有的$\mathcal{S}$。 ::: :::warning 上面一段部份不確定,再請教人 ::: :::info ^3^ The termination condition β plays a role similar to the β in β-models [71], but with an opposite sense. That is, β(s) in this paper corresponds to 1 − β(s) in [71]. ^3^ 終止條件$\beta$扮演著類似於$\beta-$models[71]中的$\beta$,只不過是反過來的。也就是$\beta(s)$在這篇論文是對應於[71]內的$1-\beta(s)$ ::: :::info Sometimes it is useful for options to “timeout”, to terminate after some period of time has elapsed even if they have failed to reach any particular state. 
This is not possible with Markov options because their termination decisions are made solely on the basis of the current state, not on how long the option has been executing. To handle this and other cases of interest we allow semi-Markov options, in which policies and termination conditions may make their choices dependent on all prior events since the option was initiated. In general, an option is initiated at some time, say $t$, determines the actions selected for some number of steps, say $k$, and then terminates in $s_{t+k}$. At each intermediate time $\tau$, $t \leq \tau < t + k$, the decisions of a Markov option may depend only on $s_\tau$, whereas the decisions of a semi-Markov option may depend on the entire preceding sequence $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \cdots, r_\tau, s_\tau$, but not on events prior to $s_t$ (or after $s_\tau$). We call this sequence the history from $t$ to $\tau$ and denote it by $h_{t\tau}$. We denote the set of all histories by $\Omega$. In semi-Markov options, the policy and termination condition are functions of possible histories, that is, they are $\pi: \Omega \times \mathcal{A} \to [0, 1]$ and $\beta: \Omega \to [0, 1]$. Semi-Markov options also arise if options use a more detailed state representation than is available to the policy that selects the options, as in hierarchical abstract machines [52,53] and MAXQ [16]. Finally, note that hierarchical structures, such as options that select other options, can also give rise to higher-level options that are semi-Markov (even if all the lower-level options are Markov). Semi-Markov options include a very general range of possibilities.
:::

:::success
有時候讓options "timeout"是很有用的,即使它們未能到達任何特定狀態,也要能夠在一段時間之後終止。這對於Markov options是不可能的,因為它們的終止決策僅僅是基於當前的state,而不是根據這個option已經被執行多久了。為了處理這種情況與其它有趣的情況,我們允許semi-Markov options,這種情況下,policies與終止條件可以讓它們的選擇取決於該option啟動以來的所有先前事件。一般來說,option會在某一個時間啟動(initiated),假設是$t$,決定接下來若干個steps(假設為$k$步)所選擇的actions,並且在$s_{t+k}$的時候終止。在每個中間時間$\tau$($t \leq \tau < t + k$),Markov option的決策也許只取決於$s_\tau$,而semi-Markov option的決策則可能取決於整個先前的序列,$s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \cdots, r_\tau, s_\tau$,但不會取決於$s_t$之前(或$s_\tau$之後)的事件。我們把這個序列稱為$t$到$\tau$的歷史,以$h_{t\tau}$來表示。我們用$\Omega$來表示所有歷史的集合。在semi-Markov options中,policy與termination condition(終止條件)是取決於可能的歷史記錄,也就是$\pi: \Omega \times \mathcal{A} \to [0, 1]$與$\beta: \Omega \to [0, 1]$。如果options使用的比選擇option的policy更詳細的state representation(狀態表示),那也會出現semi-Markov options,像是在hierarchical abstract machines[52, 53]與MAXQ[16]。最後,注意到,[階層式的結構](http://terms.naer.edu.tw/detail/263354/),像是選擇其它options的options,也會產生較高層次的semi-Markov options(即使所有較低層次的options都是Markov)。semi-Markov options包含非常廣泛的可能性。
:::

:::info
Given a set of options, their initiation sets implicitly define a set of available options $\mathcal{O}_s$ for each state $s \in \mathcal{S}$. These $\mathcal{O}_s$ are much like the sets of available actions, $\mathcal{A}_s$. We can unify these two kinds of sets by noting that actions can be considered a special case of options. Each action $a$ corresponds to an option that is available whenever $a$ is available ($\mathcal{I}=\left\{s: a \in \mathcal{A}_s \right\}$), that always lasts exactly one step ($\beta(s)=1,\forall{s} \in \mathcal{S}$), and that selects $a$ everywhere ($\pi(s, a)=1, \forall{s} \in \mathcal{I}$). Thus, we can consider the agent’s choice at each time to be entirely among options, some of which persist for a single time step, others of which are temporally extended. 
The former we refer to as single-step or primitive options and the latter as multi-step options. Just as in the case of actions, it is convenient to suppress the differences in available options across states. We let $\mathcal{O}=\bigcup_{s\in\mathcal{S}}\mathcal{O}_s$ denote the set of all available options. ::: :::success 給定一組options,它們的啟動集合(initiation sets)顯示的定義了一組可用的options $\mathcal{O}_s \space \forall s \in \mathcal{S}$。這些$\mathcal{O}_s$比較像是可用的actions的集合,$\mathcal{A}_s$。我們可以通過指出actions可以被視為是一種特殊的options來統一這兩種類型的集合。每個action $a$對應一個option,只要$a$可用,那option就可用($\mathcal{I}=\left\{s: a \in \mathcal{A}_s \right\}$),而且總是只持續一個step($\beta(s)=1,\forall{s} \in \mathcal{S}$),而且選擇是四處都有,$(\pi(s, a)=1, \forall{s} \in \mathcal{I})$。因此,我們可以想成agent在每一個時間的選擇都是從options裡面選的,那一些可能是只持續單一個時步(single time step),其它可能是時間延伸的。single time step的情況稱為single-step或是primitive options,時間延伸的則稱為multi-step options。就跟actions的情況一樣,這對於抑制各state之間可用的options的差異也是很方便的。我們用$\mathcal{O}=\bigcup_{s\in\mathcal{S}}\mathcal{O}_s$來表示所有可用的options的集合。 ::: :::info Our definition of options is crafted to make them as much like actions as possible while adding the possibility that they are temporally extended. Because options terminate in a well defined way, we can consider sequences of them in much the same way as we consider sequences of actions. We can also consider policies that select options instead of actions, and we can model the consequences of selecting an option much as we model the results of an action. Let us consider each of these in turn. ::: :::success 我們對options的定義聚焦在讓它們盡可能的像是actions,同時增加它們時間延伸的可能性。因為options的終止是有明確定義的,所以我們把options的序列像actions的序列一樣的考慮。我們還可以考慮選擇options的policy而不是actions的policy,而且我們可以模擬選擇一個option的結果,像我們選擇action一樣。讓我們依序來考慮這些事。 ::: :::info Given any two options a and b, we can consider taking them in sequence, that is, we can consider first taking a until it terminates, and then b until it terminates (or omitting b altogether if a terminates in a state outside of b’s initiation set). We say that the two options are composed to yield a new option, denoted ab, corresponding to this way of behaving. The composition of two Markov options will in general be semi-Markov, not Markov, because actions are chosen differently before and after the first option terminates. The composition of two semi-Markov options is always another semi-Markov option. Because actions are special cases of options, we can also compose them to produce a deterministic action sequence, in other words, a classical macro-operator. ::: :::success 給定任意兩個options $a$、$b$,我們可以考慮依序使用它們,也就是,我們可以想成首先採用$a$,然後直到它終止,接著採用$b$,然後直到它終止(或者如果$a$是在$b$的啟動集合(initiation set)之外的state終止,那就忽略$b$)。我們假設這兩個options結合產生一個新的option,以$ab$來表示,對應這樣的行為模式。兩個Markov options的組成通常會是semi-Markov,而不是Markov,因為在第一個option終止的前後所選擇的actions是不同的。兩個semi-Markov options組成的option始終為semi-Markov option。因為actions是options的特殊情況,我們也可以把它們組合起來產生一個deterministic action sequence,換句話說,就是一個經典的[巨集運算子](http://terms.naer.edu.tw/detail/13819690/)。 ::: :::info More interesting for our purposes are policies over options. When initiated in a state $s_t$, the Markov policy over options $\mu: \mathcal{S} \times \mathcal{O} \to [0,1]$ selects an option $o \in \mathcal{O}_{s_t}$ according to probability distribution $\mu(s_t, \dot)$. The option $o$ is then taken in $s_t$ , determining actions until it terminates in $s_{t+k}$, at which time a new option is selected, according to $\mu(s_{t+k}, \dot)$, and so on. In this way a policy over options, $\mu$, determines a conventional policy over actions, or flat policy, $\pi=flat(\mu)$. 
Henceforth we use the unqualified term policy for policies over options, which include flat policies as a special case. Note that even if a policy is Markov and all of the options it selects are Markov, the corresponding flat policy is unlikely to be Markov if any of the options are multi-step (temporally extended). The action selected by the flat policy in state $s_\tau$ depends not just on $s_\tau$ but on the option being followed at that time, and this depends stochastically on the entire history $h_{t\tau}$ since the policy was initiated at time $t$.^4^ By analogy to semi-Markov options, we call policies that depend on histories in this way semi-Markov policies. Note that semi-Markov policies are more specialized than nonstationary policies. Whereas nonstationary policies may depend arbitrarily on all preceding events, semi-Markov policies may depend only on events back to some particular time. Their decisions must be determined solely by the event subsequence from that time to the present, independent of the events preceding that time.
:::

:::success
對我們的目的而言,更有趣的是對options的policies。當在state $s_t$啟動的時候,options上的Markov policy $\mu: \mathcal{S} \times \mathcal{O} \to [0,1]$根據機率分佈$\mu(s_t, \cdot)$選擇一個option $o \in \mathcal{O}_{s_t}$。然後在$s_t$的時候選中option $o$,確定action一直到$s_{t+k}$的時候終止,在這的時候,再根據$\mu(s_{t+k}, \cdot)$選擇一個新的option,依此類推。這樣,options的policy,$\mu$,確定了actions的conventional policy(常規策略),或是flat policy,$\pi=flat(\mu)$。此後,我們不加限定地使用policy這個術語來表示options上的policies,其中包含屬於特例的flat policies。注意到,即使某個policy是Markov,而且所有它所選擇的options也都是Markov,但如果任一個option為multi-step(時間延伸,temporally extended),那相對應的flat policy不大可能會是Markov。在state $s_\tau$利用flat policy所選擇的action並不只取決於$s_\tau$,還取決於那個時間下所依循的option,而這是隨機地取決於整個history $h_{t\tau}$,因為這個policy是在時間$t$所啟動的^4^。以此類推至semi-Markov options,我們稱這種取決於histories的policy為semi-Markov policies。注意到,semi-Markov policies比起nonstationary policies(非平穩策略)更為特殊(specialized)。nonstationary policies也許取決於前面任意的events(事件),但是semi-Markov policies只能取決於回溯到某個特定時間點為止的events。它們的決策必須完全由該時間至現在的事件子序列決定,跟該時間點之前的事件完全無關。
:::

:::info
^4^ For example, the options for picking up an object and putting down an object may specify different actions in the same intermediate state; which action is taken depends on which option is being followed.

^4^ 舉例來說,用於撿起一個物件與放下一個物件的options可以在相同的中間狀態內指定不同的actions;選擇哪一個action取決於依循哪一個option。
:::

:::info
These ideas lead to natural generalizations of the conventional value functions for a given policy. We define the value of a state $s \in \mathcal{S}$ under a semi-Markov flat policy $\pi$ as the expected return given that $\pi$ is initiated in $s$:

$$
V^\pi(s) := E \left\{r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \vert \varepsilon (\pi,s,t) \right\}
$$

where $\varepsilon(\pi,s,t)$ denotes the event of $\pi$ being initiated in $s$ at time $t$. The value of a state under a general policy $\mu$ can then be defined as the value of the state under the corresponding flat policy: $V^\mu(s) := V^{flat(\mu)}(s), \forall s \in \mathcal{S}$. Action-value functions generalize to option-value functions. We define $Q^{\mu}(s, o)$, the value of taking option $o$ in state $s \in \mathcal{I}$ under policy $\mu$, as

$$
Q^\mu(s, o) := E \left\{r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \vert \varepsilon (o\mu,s,t) \right\} \tag{5}
$$

where $o\mu$, the composition of $o$ and $\mu$, denotes the semi-Markov policy that first follows $o$ until it terminates and then starts choosing according to $\mu$ in the resultant state. For semi-Markov options, it is useful to define $\varepsilon(o,h,t)$, the event of $o$ continuing from $h$ at time $t$, where $h$ is a history ending with $s_t$. In continuing, actions are selected as if the history had preceded $s_t$. That is, $a_t$ is selected according to $o(h, \cdot)$, and $o$ terminates at $t+1$ with probability $\beta(ha_t r_{t+1} s_{t+1})$; if $o$ does not terminate, then $a_{t+1}$ is selected according to $o(ha_tr_{t+1}s_{t+1}, \cdot)$, and so on. With this definition, (5) also holds where $s$ is a history rather than a state.
:::

:::success
這樣的想法導致對給定policy的conventional value functions的自然概括。我們定義在semi-Markov flat policy $\pi$下,state $s \in \mathcal{S}$的value為給定$\pi$啟動於$s$時的期望收益(expected return):

$$
V^\pi(s) := E \left\{r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \vert \varepsilon (\pi,s,t) \right\}
$$

其中$\varepsilon(\pi,s,t)$表示$\pi$在時間$t$啟動於$s$的事件。然後,在一般的策略(general policy)$\mu$下的state的value可以被定義為相對應的flat policy下的state的value:$V^\mu(s) := V^{flat(\mu)}(s), \forall s \in \mathcal{S}$。action-value functions也可以泛化為option-value functions。我們定義$Q^{\mu}(s, o)$,也就是在policy $\mu$下,於state $s \in \mathcal{I}$選擇option $o$的value為:

$$
Q^\mu(s, o) := E \left\{r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \vert \varepsilon (o\mu,s,t) \right\} \tag{5}
$$

其中$o\mu$,這個組合表示semi-Markov policy首先依著option $o$直到這個option終止,然後根據$\mu$在結果狀態下進行選擇。對於semi-Markov options,定義$\varepsilon(o,h,t)$是有幫助的,它是option $o$在時間$t$從$h$繼續(continuing)的事件,其中$h$是以$s_t$結束的history。在繼續執行時,actions的選擇就如同這段history發生在$s_t$之前一樣。也就是,$a_t$的選擇是根據$o(h, \cdot)$,而$o$在$t+1$的時候以$\beta(ha_t r_{t+1} s_{t+1})$的機率終止;如果$o$沒有終止,那$a_{t+1}$就會根據$o(ha_tr_{t+1}s_{t+1}, \cdot)$選擇,依此類推。用這個定義,方程式(5)在$s$是個history的時候依然會成立。
:::

:::info
This completes our generalization to temporal abstraction of the concept of value functions for a given policy. In the next section we similarly generalize the concept of optimal value functions.
:::

:::success
這樣就完成對於給定policy的value functions的概念在時間抽象上的泛化。下一節我們會類似地概括optimal value functions。
:::
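在進到SMDP理論之前,這邊補上一段簡單的Python草稿(sketch),示意上述option $(\mathcal{I}, \pi, \beta)$的call-and-return執行方式,以及公式(5)、(6)中會用到的累積折扣報酬與持續步數。這只是個人為了幫助理解所加的示意,並非論文內容;其中`env_step(s, a)`是假設的環境介面。

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

State = int          # 這裡簡單用整數代表狀態(假設)
Action = str

@dataclass
class Option:
    """一個option由三個元件組成:initiation set I、內部policy pi、termination condition beta。"""
    init_set: Set[State]                  # I ⊆ S:可以啟動此option的狀態集合
    policy: Callable[[State], Action]     # pi(s):option內部選擇primitive action
    beta: Callable[[State], float]        # beta(s):在狀態s終止的機率

def execute_option(env_step, s: State, o: Option, gamma: float = 0.9):
    """從狀態s以call-and-return方式執行option o直到終止,
    回傳(累積折扣報酬, 終止狀態, 經過步數k),即公式(5)、(6)中會用到的量。"""
    assert s in o.init_set, "option只能在其initiation set內啟動"
    total_r, discount, k = 0.0, 1.0, 0
    while True:
        a = o.policy(s)
        s, r = env_step(s, a)             # env_step(s, a) -> (s', r):假設的環境介面
        total_r += discount * r
        discount *= gamma
        k += 1
        if random.random() < o.beta(s):   # 以beta(s')的機率隨機終止
            return total_r, s, k

def primitive(a: Action, states: Set[State]) -> Option:
    """primitive action是beta恆為1、每一步都選同一個action的特例option。"""
    return Option(init_set=states, policy=lambda s: a, beta=lambda s: 1.0)
```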
## 3. SMDP (option-to-option) methods

:::info
Options are closely related to the actions in a special kind of decision problem known as a semi-Markov decision process, or SMDP (e.g., see [58]). In fact, any MDP with a fixed set of options is an SMDP, as we state formally below. Although this fact follows more or less immediately from definitions, we present it as a theorem to highlight it and state explicitly its conditions and consequences.
:::

:::success
option與被稱為semi-Markov decision process(SMDP)的特殊決策問題中的action密切相關(e.g., see [58])。事實上,任何擁有固定的options集合的MDP都是SMDP,一如我們下面正式說明。儘管這事實或多或少是從定義直接得出,但我們將其做為一個定理來強調,並明確的說明其條件與重要性。
:::

:::info
**Theorem 1** (MDP + Options = SMDP). For any MDP, and any set of options defined on that MDP, the decision process that selects only among those options, executing each to termination, is an SMDP.

Proof (Sketch). An SMDP consists of (1) a set of states, (2) a set of actions, (3) for each pair of state and action, an expected cumulative discounted reward, and (4) a well-defined joint distribution of the next state and transit time.
:::

:::success
**Theorem 1** (MDP + Options = SMDP)。對於任意的MDP,以及任意定義於該MDP上的options集合,僅於這些options中選擇、且每個option都執行到終止的決策過程,是一個SMDP。

Proof(Sketch)。SMDP包含:
(1) states的集合,
(2) actions的集合,
(3) 每一個state-action pair所預期的累積折扣報酬,與
(4) 定義明確的next state與transit time的聯合分佈
:::

:::info
In our case, the set of states is $\mathcal{S}$, and the set of actions is the set of options. 
The expected reward and the next-state and transit-time distributions are defined for each state and option by the MDP and by the option’s policy and termination condition, $\pi$ and $\beta$. These expectations and distributions are well defined because MDPs are Markov and the options are semi-Markov; thus the next state, reward, and time are dependent only on the option and the state in which it was initiated. The transit times of options are always discrete, but this is simply a special case of the arbitrary real intervals permitted in SMDPs.
:::

:::success
在我們的情況中,states的集合為$\mathcal{S}$,actions的集合就是options的集合。每一個state、option的期望報酬(expected reward)與next-state、transit-time的分佈由MDP與option的policy、termination condition(終止條件),$\pi, \beta$所定義。這些期望與分佈定義明確,因為MDPs是Markov,而options為semi-Markov;因此,next state、reward與時間僅取決於option及其啟動的state。options的transit times([中轉時間](http://terms.naer.edu.tw/detail/13872032/))始終為離散的,但是這只是SMDPs所允許的任意[實區間](http://terms.naer.edu.tw/detail/13849293/)的一種特殊情況。
:::

:::info
This relationship among MDPs, options, and SMDPs provides a basis for the theory of planning and learning methods with options. In later sections we discuss the limitations of this theory due to its treatment of options as indivisible units without internal structure, but in this section we focus on establishing the benefits and assurances that it provides. We establish theoretical foundations and then survey SMDP methods for planning and learning with options. Although our formalism is slightly different, these results are in essence taken or adapted from prior work (including classical SMDP work and [5,44,52–57,65–68,71,74,75]). A result very similar to Theorem 1 was proved in detail by Parr [52]. In Sections 4–7 we present new methods that improve over SMDP methods.
:::

:::success
MDPs、options與SMDPs之間的這種關聯為使用options的planning與learning方法提供一個理論基礎。後續的章節中會討論到這個理論的侷限性,這是因為options被視為沒有內部結構的不可分割單元,但是這一章我們會專注在確定option所提供的好處跟保證。我們建立理論基礎,然後檢視使用options的planning與learning的SMDP方法。儘管形式有些許的不同,但這些結果本質上是取自或改編自先前的研究作業(包含經典的SMDP研究作業與[5,44,52–57,65–68,71,74,75])。Parr [52]詳細證明了一個與Theorem 1非常相似的結果。在4-7章我們提到優於SMDP的新方法。
:::

:::info
Planning with options requires a model of their consequences. Fortunately, the appropriate form of model for options, analogous to the $r_s^a$ and $p^a_{ss'}$ defined earlier for actions, is known from existing SMDP theory. For each state in which an option may be started, this kind of model predicts the state in which the option will terminate and the total reward received along the way. These quantities are discounted in a particular way. For any option $o$, let $\mathcal{E}(o, s, t)$ denote the event of $o$ being initiated in state $s$ at time $t$. Then the reward part of the model of $o$ for any state $s \in \mathcal{S}$ is

$$
r^o_s = E\left\{ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \vert \mathcal{E}(o, s, t) \right\} \tag{6}
$$

where $t+k$ is the random time at which $o$ terminates. The state-prediction part of the model of $o$ for state $s$ is:

$$
p^o_{ss'} = \sum_{k=1}^{\infty} p(s', k)\gamma^k \tag{7}
$$

for all $s' \in \mathcal{S}$, where $p(s', k)$ is the probability that the option terminates in $s'$ after $k$ steps. Thus, $p^o_{ss'}$ is a combination of the likelihood that $s'$ is the state in which $o$ terminates together with a measure of how delayed that outcome is relative to $\gamma$. We call this kind of model a multi-time model \[54,55\] because it describes the outcome of an option not at a single time but at potentially many different times, appropriately combined.^5^
:::

:::success
使用options的planning需要其結果(consequences)的模型。很幸運的,從現有已知的SMDP理論可以知道,用於options的那種適當的模型形式是類似於前面為actions所定義的$r_s^a$與$p^a_{ss'}$。對於每一個可能是啟動option的state來說,這種類型的模型會預測(predict)終止option的state以及沿途接收的total rewards。這些數量會以特定的方式計算折扣。對於任意的option $o$,假設$\mathcal{E}(o, s, t)$表示在time $t$,$o$這個事件啟動於state $s$。那麼,對於任意state $s \in \mathcal{S}$,其$o$的模型(model)的reward部份為:

$$
r^o_s = E\left\{ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \vert \mathcal{E}(o, s, t) \right\} \tag{6}
$$

其中$t+k$是option $o$終止的一個隨機時間。對於state $s$,其option $o$的model的state-prediction(狀態預測)的部份為:

$$
p^o_{ss'} = \sum_{k=1}^{\infty} p(s', k)\gamma^k \tag{7}
$$

上面的式子是$\forall s' \in \mathcal{S}$,其中$p(s', k)$是option $o$在$s'$經過$k$個steps之後終止的機率。因此,$p^o_{ss'}$是一種關於state $s'$為option $o$的終止狀態的[可能性](http://terms.naer.edu.tw/detail/366257/)與其結果(outcome)相對於$\gamma$的延遲程度的度量,這兩者之間的組合。我們稱這種類型的模型為multi-time model,因為它並不是在單一個時間點上去說明一個option的結果,而是在很多可能的不同的時間點上的適當組合。^5^
:::

:::info
^5^ Note that this definition of state predictions for options differs slightly from that given earlier for actions. Under the new definition, the model of transition from state $s$ to $s'$ for an action $a$ is not simply the corresponding transition probability, but the transition probability times $\gamma$. Henceforth we use the new definition given by (7).

^5^ 注意到,對於options的狀態預測的定義跟先前給出的actions的定義有些許的不同。在這個新的定義之下,對於action $a$從state $s$到$s'$的轉移模型不單是給出相對應的轉移機率,而是轉移機率再乘上$\gamma$。後續,我們會使用(7)所給出的新的定義。
:::
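這邊同樣補一段簡單的Python草稿,示意如何用Monte Carlo rollouts粗略估計公式(6)、(7)所定義的multi-time model;沿用前面sketch中假設的`execute_option`與`env_step`介面。這只是示意,不是論文所附的方法。

```python
from collections import defaultdict

def estimate_option_model(env_step, s0, option, gamma=0.9, n_rollouts=1000):
    """用Monte Carlo rollouts粗略估計option在狀態s0的multi-time model:
    r^o_s ≈ 平均累積折扣報酬(公式6);
    p^o_{ss'} ≈ 平均 gamma^k * 1{終止於s'}(公式7,注意模型本身已含折扣)。"""
    r_sum, p_sum = 0.0, defaultdict(float)
    for _ in range(n_rollouts):
        r, s_term, k = execute_option(env_step, s0, option, gamma)
        r_sum += r
        p_sum[s_term] += gamma ** k
    return r_sum / n_rollouts, {s: v / n_rollouts for s, v in p_sum.items()}
```

注意估出來的`p`各項加總會小於等於1,因為公式(7)的定義本身就把$\gamma^k$的折扣吸收進去了。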
:::info
Using multi-time models we can write Bellman equations for general policies and options. For any Markov policy $\mu$, the state-value function can be written:

$$
\begin{align}
V^\mu(s) &= E \left\{r_{t+1} + \cdots + \gamma^{k-1}r_{t+k} + \gamma^kV^\mu(s_{t+k}) \vert \mathcal{E}(\mu, s, t) \right\} \\
&= \sum_{o \in \mathcal{O}_s} \mu(s, o) [r^o_s + \sum_{s'} p^o_{ss'}V^\mu(s')] \tag{8}
\end{align}
$$

which is a Bellman equation analogous to (2). The corresponding Bellman equation for the value of an option $o$ in state $s \in \mathcal{I}$ is:

$$
\begin{align}
Q^\mu(s, o) &= E \left\{r_{t+1} + \cdots + \gamma^{k-1}r_{t+k} + \gamma^k V^\mu(s_{t+k}) \vert \mathcal{E}(o, s, t) \right\}, \\
&= E\left\{r_{t+1} + \cdots + \gamma^{k-1}r_{t+k} + \gamma^k \sum_{o' \in \mathcal{O}_s} \mu(s_{t+k}, o') Q^\mu(s_{t+k}, o') \vert \mathcal{E}(o, s, t) \right\} \\
&= r^o_s + \sum_{s'} p^o_{ss'} \sum_{o' \in \mathcal{O}_{s'}} \mu(s', o') Q^\mu(s', o') \tag{9}
\end{align}
$$

Note that all these equations specialize to those given earlier in the special case in which $\mu$ is a conventional policy and $o$ is a conventional action. Also note that $Q^\mu(s, o) = V^{o\mu}(s)$.
:::

:::success
使用multi-time models,我們可以為general policies與options寫下Bellman equations。對於任意的Markov policy $\mu$,其state-value function可以寫為:

$$
\begin{align}
V^\mu(s) &= E \left\{r_{t+1} + \cdots + \gamma^{k-1}r_{t+k} + \gamma^kV^\mu(s_{t+k}) \vert \mathcal{E}(\mu, s, t) \right\} \\
&= \sum_{o \in \mathcal{O}_s} \mu(s, o) [r^o_s + \sum_{s'} p^o_{ss'}V^\mu(s')] \tag{8}
\end{align}
$$

這是一個跟公式2很像的Bellman equation。對於option $o$於state $s \in \mathcal{I}$的value的相對應的Bellman equation為:

$$
\begin{align}
Q^\mu(s, o) &= E \left\{r_{t+1} + \cdots + \gamma^{k-1}r_{t+k} + \gamma^k V^\mu(s_{t+k}) \vert \mathcal{E}(o, s, t) \right\}, \\
&= E\left\{r_{t+1} + \cdots + \gamma^{k-1}r_{t+k} + \gamma^k \sum_{o' \in \mathcal{O}_s} \mu(s_{t+k}, o') Q^\mu(s_{t+k}, o') \vert \mathcal{E}(o, s, t) \right\} \\
&= r^o_s + \sum_{s'} p^o_{ss'} \sum_{o' \in \mathcal{O}_{s'}} \mu(s', o') Q^\mu(s', o') \tag{9}
\end{align}
$$

注意到,在$\mu$為conventional policy且$o$為conventional action的特殊情況下,這些方程式都會退化為前面給出的那些方程式。另外,$Q^\mu(s, o) = V^{o\mu}(s)$。
:::
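下面補一段小小的Python示意,對應公式(8)的synchronous backup:給定表格式的option模型`r[(s, o)]`($r^o_s$)、`p[(s, o)]`($\{s': p^o_{ss'}\}$)與`mu[(s, o)]`($\mu(s, o)$)(皆為假設的資料結構),反覆掃過所有state來評估一個在options上的policy。這只是示意草稿,並非論文的演算法實作。

```python
def evaluate_policy_over_options(states, options_in, mu, r, p, n_sweeps=100):
    """以公式(8)做同步backup,評估options上的Markov policy mu。
    注意p^o_{ss'}本身已含gamma^k的折扣,所以backup中不需再乘gamma。"""
    V = {s: 0.0 for s in states}                      # 任意初始值
    for _ in range(n_sweeps):
        V = {
            s: sum(
                mu[(s, o)] * (r[(s, o)]
                              + sum(prob * V[s2] for s2, prob in p[(s, o)].items()))
                for o in options_in[s]
            )
            for s in states
        }
    return V
```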
:::info
Finally, there are generalizations of optimal value functions and optimal Bellman equations to options and to policies over options. Of course the conventional optimal value functions $V^*$ and $Q^*$ are not affected by the introduction of options; one can ultimately do just as well with primitive actions as one can with options. Nevertheless, it is interesting to know how well one can do with a restricted set of options that does not include all the actions. For example, in planning one might first consider only high-level options in order to find an approximate plan quickly. Let us denote the restricted set of options by $\mathcal{O}$ and the set of all policies selecting only from options in $\mathcal{O}$ by $\Pi(\mathcal{O})$. Then the optimal value function given that we can select only from $\mathcal{O}$ is

$$
\begin{align}
V^*_\mathcal{O}(s) & \overset{def}= \max_{\mu\in\Pi(\mathcal{O})} V^\mu(s) \\
&= \max_{o \in \mathcal{O}_s} E \left\{r_{t+1} + \cdots + \gamma^{k-1}r_{t+k} + \gamma^k V^*_{\mathcal{O}}(s_{t+k}) \vert \mathcal{E}(o, s, t) \right\} && \text{(where $k$ is the duration of $o$ when taken in $s$)} \\
&= \max_{o \in \mathcal{O}_s} \Big[ r^o_s + \sum_{s'}p^o_{ss'} V^*_\mathcal{O}(s')\Big] \tag{10} \\
&= \max_{o \in \mathcal{O}_s} E \left\{r + \gamma^k V^*_\mathcal{O}(s') \vert \mathcal{E}(o, s) \right\} \tag{11}
\end{align}
$$

where $\mathcal{E}(o, s)$ denotes option $o$ being initiated in state $s$. Conditional on this event are the usual random variables: $s'$ is the state in which $o$ terminates, $r$ is the cumulative discounted reward along the way, and $k$ is the number of time steps elapsing between $s$ and $s'$. The value functions and Bellman equations for optimal option values are:

$$
\begin{align}
Q^*_\mathcal{O}(s, o) & \overset{def}= \max_{\mu \in \Pi(\mathcal{O})} Q^\mu(s, o) \\
&= E \left\{r_{t+1} + \cdots + \gamma^{k-1}r_{t+k} + \gamma^k V^*_\mathcal{O}(s_{t+k}) \vert \mathcal{E}(o, s, t) \right\} && \text{(where $k$ is the duration of $o$ from $s$)} \\
&= E \left\{r_{t+1} + \cdots + \gamma^{k-1}r_{t+k} + \gamma^k \max_{o' \in \mathcal{O}_{s_{t+k}}} Q^*_\mathcal{O}(s_{t+k}, o') \vert \mathcal{E}(o, s, t) \right\} \\
&= r^o_s + \sum_{s'} p^o_{ss'} \max_{o' \in \mathcal{O}_{s'}} Q^*_\mathcal{O}(s', o') \\
&=E\left\{r + \gamma^k \max_{o' \in \mathcal{O}_{s'}} Q^*_\mathcal{O}(s', o')\vert \mathcal{E}(o, s) \right\} \tag{12}
\end{align}
$$

where $r$, $k$, and $s'$ are again the reward, number of steps, and next state due to taking $o \in \mathcal{O}_s$. 
::: :::success 最後,我們把optimal value functions與optimal Bellman equation泛化至options與options的policies。想當然,conventional optimal value functions $V^*$與$Q^*$並不會受到我們引入options的影響;你最終還是可以像使用options一樣的使用actions。儘管如此,瞭解可以怎麼樣子的去處理一組受限的options(這組option不包括所有的actions)是很有趣的。舉例來說,在planning的時候,為了能夠找到快速的找到一個大致的規劃(approximate plan),你可能只會先考慮那些high-level options。我們把那些受限的options的集合以$\mathcal{O}$來表示,然後用$\Pi(\mathcal{O})$來表示這些只會從$\mathcal{O}$來選擇options的策略(policy)集合。然後,只能從$\mathcal{O}$選擇的情況下所給出的optimal value function為: $$ \begin{align} V^*_\mathcal{O}(s) & \overset{def}= \max_{\mu\in\Pi(\mathcal{O})} V^\mu(s) \\ &= \max_{o \in \mathcal{O}_s} E \left\{r_{t+1} + \cdots + \gamma^{k-1}r_{t+k} + \gamma^k V^*_{\mathcal{O}}(s_{t+k}) \vert \mathcal{E}(o, s, t) \right\} && \text{(where $k$ is the duration of $o$ when taken is $s$)} \\ &= \max_{o \in \mathcal{O}_s} \Big[ r^o_s + \sum_{s'}p^o_{ss'} V^*_\mathcal{O}(s')\Big] \tag{10} \\ &= \max_{o \in \mathcal{O}_s} E \left\{r + \gamma^k V^*_\mathcal{O}(s') \vert \mathcal{E}(o, s) \right\} \tag{11} \end{align} $$ 其中$\mathcal{E}(o, s)$表示在state $s$所啟動的的那個option $o$。這個事件的條件通常是一個隨機變數:$s'$是option $o$終止的狀態,$r$為沿途所得的累積折扣報酬(discounted reward),$k$則是從$s$到$s'$所經過的time steps。這個optimal option values的value functions與Bellman equations為: $$ \begin{align} Q^*_\mathcal{O} & \overset{def}= \max_{\mu \in \Pi(\mathcal{O})} Q^\mu(s, o) \\ &= E \left\{r_{t+1} + \cdots + \gamma^{k-1}r_{t+k} + \gamma^k V^*_\mathcal{O}(s_{t+k}) \vert \mathcal{E}(o, s, t) \right\} && \text{(where $k$ is the duration of $o$ from $s$)} \\ &= E \left\{r_{t+1} + \cdots + \gamma^{k-1}r_{t+k} + \gamma^k \max_{o' \in \mathcal{O}_{s_{t+k}}} Q^*_\mathcal{O}(s_{t+k}, o') \vert \mathcal{E}(o, s, t) \right\} \\ &= r^o_s + \sum_{s'} p^o_{ss'} \max_{o' \in \mathcal{O}_{s'}} Q^*_\mathcal{O}(s', o') \\ &=E\left\{r + \gamma^k \max_{o' \in \mathcal{O}_{s'}} Q^*_\mathcal{O}(s', o')\vert \mathcal{E}(o, s) \right\} \tag{12} \end{align} $$ 其中$r, k$與$s'$就是reward、經過的幾個step、因為選擇option $o \in \mathcal{O}_s$所產生的next state。 ::: :::info Given a set of options, $\mathcal{O}$, a corresponding optimal policy, denoted $\mu^*_\mathcal{O}$, is any policy that achieves $V^*_\mathcal{O}$, i.e., for which $V^{\mu^*_\mathcal{O}}(s) = V^*_{\mathcal{O}}(s)$ in all states $s \in \mathcal{S}$. If $V^*_\mathcal{O}$ and models of the options are known, then optimal policies can be formed by choosing in any proposition among the maximizing options in (10) or (11). Or, if $Q^*_\mathcal{O}$ is known, then optimal policies can be found without a model by choosing in each state $s$ in any proportion among the options o for which $Q^*_\mathcal{O}(s, o) = \max_{o'} Q^*_\mathcal{O}(s, o')$. In this way, computing approximations to $V^*_\mathcal{O}$ or $Q^*_\mathcal{O}$ become key goals of planning and learning methods with options. 
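:::

呼應上面這段:如果已知(或已學到)表格式的$Q^*_\mathcal{O}$,不需要模型就能貪婪地取出一個optimal policy。下面是一個假設性的小示意(`Q`以dict儲存、`options_in[s]`為各state可用的options),並非論文的實作。

```python
def greedy_policy_from_q(Q, options_in):
    """由option-value表Q[(s, o)]貪婪地取出一個policy:每個state選使Q最大的option。
    文中允許在並列最大的options之間以任意比例選擇,這裡簡單取其中一個。"""
    return {s: max(options_in[s], key=lambda o: Q[(s, o)]) for s in options_in}
```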
:::success
給定一個options的集合$\mathcal{O}$,一個相對應的optimal policy(以$\mu^*_\mathcal{O}$來表示)是任何能達到$V^*_\mathcal{O}$的policy,即,在所有的states $s \in \mathcal{S}$,其$V^{\mu^*_\mathcal{O}}(s) = V^*_{\mathcal{O}}(s)$。如果$V^*_\mathcal{O}$與options的模型(models)為已知,那麼,你就可以以任意比例(proportion)在公式10、11中的那些maximizing options之間做選擇,來形成optimal policies。或者,如果$Q^*_\mathcal{O}$為已知,那你不需要模型就可以找到optimal policies,只要透過在每個state $s$以任意的比例從那些可以達到$Q^*_\mathcal{O}(s, o) = \max_{o'} Q^*_\mathcal{O}(s, o')$的option $o$做選擇即可。如此一來,計算$V^*_\mathcal{O}$或是$Q^*_\mathcal{O}$的近似值就變成使用options的planning與learning方法的關鍵目標。
:::

### 3.1 SMDP planning

:::info
With these definitions, an MDP together with the set of options $\mathcal{O}$ formally comprises an SMDP, and standard SMDP methods and results apply. Each of the Bellman equations for options, (8), (9), (10), and (12), defines a system of equations whose unique solution is the corresponding value function. These Bellman equations can be used as update rules in dynamic-programming-like planning methods for finding the value functions. Typically, solution methods for this problem maintain an approximation of $V^*_\mathcal{O}(s)$ or $Q^*_\mathcal{O}(s, o)$ for all states $s \in \mathcal{S}$ and all options $o \in \mathcal{O}_s$. For example, synchronous value iteration (SVI) with options starts with an arbitrary approximation $V_0$ to $V^*_\mathcal{O}$ and then computes a sequence of new approximations $\left\{V_k\right\}$ by

$$
V_k(s) = \max_{o \in \mathcal{O}_s} \Big[r^o_s + \sum_{s' \in \mathcal{S}} p^o_{ss'}V_{k-1}(s') \Big] \tag{13}
$$

for all $s \in \mathcal{S}$. The option-value form of SVI starts with an arbitrary approximation $Q_0$ to $Q^*_\mathcal{O}$ and then computes a sequence of new approximations $\left\{Q_k\right\}$ by

$$
Q_k(s, o) = r^o_s + \sum_{s' \in \mathcal{S}} p^o_{ss'}\max_{o' \in \mathcal{O}_{s'}} Q_{k-1}(s', o')
$$

for all $s \in \mathcal{S}$ and $o \in \mathcal{O}_s$. Note that these algorithms reduce to the conventional value iteration algorithms in the special case that $\mathcal{O}=\mathcal{A}$. Standard results from SMDP theory guarantee that these processes converge for general semi-Markov options: $\lim_{k \to \infty} V_k=V^*_{\mathcal{O}}$ and $\lim_{k \to \infty} Q_k=Q^*_\mathcal{O}$, for any $\mathcal{O}$.
:::

:::success
有這些定義之後,MDP與一組options $\mathcal{O}$就正式構成一個SMDP,且標準的SMDP方法與結果皆適用。公式8、9、10、12,每個用於options的Bellman equations,都定義了一個方程組,其唯一解就是對應的value function。這些Bellman equations都可以用來做為在dynamic-programming-like planning methods中找value functions的更新規則(update rules)。通常來說,該問題的求解方法會對所有的states $s \in \mathcal{S}$與所有的options $o \in \mathcal{O}_s$都維持$V^*_\mathcal{O}(s)$或$Q^*_\mathcal{O}(s, o)$的近似值。舉例來說,使用options的synchronous value iteration (SVI)會從$V^*_\mathcal{O}$的任意近似$V_0$開始,然後利用下面公式計算一系列新的近似值$\left\{V_k\right\}$:

$$
V_k(s) = \max_{o \in \mathcal{O}_s} \Big[r^o_s + \sum_{s' \in \mathcal{S}} p^o_{ss'}V_{k-1}(s') \Big] \tag{13}
$$

上面公式適用於所有的state $s \in \mathcal{S}$。SVI的option-value形式也是一樣,從$Q^*_\mathcal{O}$的任意近似$Q_0$開始,然後利用下面公式計算一系列新的近似值$\left\{Q_k\right\}$:

$$
Q_k(s, o) = r^o_s + \sum_{s' \in \mathcal{S}} p^o_{ss'}\max_{o' \in \mathcal{O}_{s'}} Q_{k-1}(s', o')
$$

上面公式適用於所有的$s \in \mathcal{S}$與$o \in \mathcal{O}_s$。注意到,這些演算法在$\mathcal{O}=\mathcal{A}$的這種特例情況下就會簡化成一般的value iteration algorithms。SMDP理論的標準結果會保證這些過程對general semi-Markov options的收斂:對於任意的$\mathcal{O}$,其$\lim_{k \to \infty} V_k=V^*_{\mathcal{O}}$以及$\lim_{k \to \infty} Q_k=Q^*_\mathcal{O}$。
:::
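下面是公式(13)的一個最小Python草稿:同樣假設有表格式的option模型`r[(s, o)]`與`p[(s, o)]`(後者已含折扣),且每個state至少有一個可用的option;當options恰好就是primitive actions時,它就是一般的value iteration。僅為示意,並非論文所附的程式。

```python
def svi_with_options(states, options_in, r, p, n_iters=50, v_init=0.0):
    """Synchronous value iteration with options(公式13)。
    options_in[s]是狀態s可用的options;回傳對V*_O的近似。"""
    V = {s: v_init for s in states}                   # 任意初始近似V_0
    for _ in range(n_iters):
        V = {
            s: max(
                r[(s, o)] + sum(prob * V[s2] for s2, prob in p[(s, o)].items())
                for o in options_in[s]
            )
            for s in states
        }
    return V
```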
:::info
![](https://hackmd.io/_uploads/H1lUDMrTY.png)

Fig. 2. The rooms example is a gridworld environment with stochastic cell-to-cell actions and room-to-room hallway options. Two of the hallway options are suggested by the arrows labeled $o_1$ and $o_2$. The labels $G_1$ and $G_2$ indicate two locations used as goals in experiments described in the text.

Fig. 2. 上面的rooms範例是一個gridworld的環境,有著隨機的cell-to-cell的actions以及room-to-room的走廊選項(hallway options)。標記為$o_1$、$o_2$的箭頭標示出其中兩個hallway options。$G_1$、$G_2$則為文中描述的實驗中做為目標的兩個位置。
:::

:::info
The plans (policies) found using temporally abstract options are approximate in the sense that they achieve only $V^*_\mathcal{O}$, which may be less than the maximum possible, $V^*$. On the other hand, if the models used to find them are correct, then they are guaranteed to achieve $V^*_\mathcal{O}$. We call this the value achievement property of planning with options. This contrasts with planning methods that abstract over state space, which generally cannot be guaranteed to achieve their planned values even if their models are correct.
:::

:::success
使用時間抽象選項(temporally abstract options)所找出來的plans(policies)是近似的,意思就是說,可能得到的就是$V^*_\mathcal{O}$,而這可能會小於可能的最大值,也就是$V^*$。另一方面,如果用來找到它們的模型(model)是正確的,那就可以保證會得到$V^*_\mathcal{O}$。我們把這現象稱之為value achievement property of planning with options。這跟在狀態空間(state space)上抽象的planning methods形成對比,即使state space的模型正確,通常也無法保證能夠實現它們的planned values。
:::

:::info
As a simple illustration of planning with options, consider the rooms example, a gridworld environment of four rooms as shown in Fig. 2. The cells of the grid correspond to the states of the environment. From any state the agent can perform one of four actions, up, down, left or right, which have a stochastic effect. With probability 2/3, the actions cause the agent to move one cell in the corresponding direction, and with probability 1/3, the agent moves instead in one of the other three directions, each with probability 1/9. In either case, if the movement would take the agent into a wall then the agent remains in the same cell. For now we consider a case in which rewards are zero on all state transitions.
:::

:::success
做為一個用來說明使用options的planning的範例,我們來看看上面Fig. 2.的rooms example,一個有四個房間的gridworld環境。網格(grid)中的單元格(cell)對應環境的狀態(states)。在任意的狀態下,agent都可以執行四個actions的其中之一,上、下、左或右,這些actions具有隨機效應。有2/3的機率,其actions會讓agent以相對應的方向移動一個單元格(cell),有1/3的機率,agent會從另外三個方向去移動,每個方向的機率為1/9。任一個情況,只要移動會撞牆,那就停留在原地。我們現在考慮一種情況,所有的狀態轉移(state transitions)得到的reward皆為0。
:::

:::info
In each of the four rooms we provide two built-in hallway options designed to take the agent from anywhere within the room to one of the two hallway cells leading out of the room. A hallway option’s policy $\pi$ follows a shortest path within the room to its target hallway while minimizing the chance of stumbling into the other hallway. For example, the policy for one hallway option is shown in Fig. 3. The termination condition $\beta(s)$ for each hallway option is zero for states $s$ within the room and 1 for states outside the room, including the hallway states. The initiation set $\mathcal{I}$ comprises the states within the room plus the non-target hallway state leading into the room. Note that these options are deterministic and Markov, and that an option’s policy is not defined outside of its initiation set. We denote the set of eight hallway options by $\mathcal{H}$. For each option $o \in \mathcal{H}$, we also provide a priori its accurate model, $r^o_s$ and $p^o_{ss'}$, for all $s \in \mathcal{I}$ and $s' \in \mathcal{S}$ (assuming there is no goal state, see below). Note that although the transition models $p^o_{ss'}$ are nominally large (order $\vert\mathcal{I}\vert \times \vert\mathcal{S}\vert$), in fact they are sparse, and relatively little memory (order $\vert\mathcal{I}\vert \times 2$) is actually needed to hold the nonzero transitions from each state to the two adjacent hallway states^6^.
:::

:::success
在這四個房間裡面,我們各提供兩個內建的走廊選項(hallway options),主要是讓agent可以從房間內的任意位置往其中一個通往房間外的走廊單元格離開房間。hallway option的policy $\pi$會依循著從所在房間到目標走廊的最短路徑,同時最小化進到另一個走廊的機會。舉例來說,一個hallway option的policy如Fig. 3.所示。每一個hallway option的終止條件$\beta(s)$,如果是房間內的states $s$那就為0,如果是房間外的states(包含hallway states)的話那就為1。initiation set $\mathcal{I}$由房間內的狀態(states)以及通向房間的非目標走廊的狀態(non-target hallway state)所組成。注意到,這些options是確定性的,而且為馬可夫(Markov),且option的policy並不在其initiation set之外定義。我們用$\mathcal{H}$來表示這八個hallway options的集合。對於每個option $o \in \mathcal{H}$,我們還為所有的$s \in \mathcal{I}$與$s' \in \mathcal{S}$提供了事先給定的精確模型,$r^o_s$與$p^o_{ss'}$(假設沒有goal state,見下說明)。另外注意的是,雖然這個transition models $p^o_{ss'}$看起來很大($\vert \mathcal{I} \vert \times \vert \mathcal{S} \vert$),不過事實上它們是很稀疏的,只需要相對較小的記憶體($\vert \mathcal{I} \vert \times 2$)來維持每個state到兩個相鄰的hallway states的非零轉移(nonzero transitions)^6^。
:::
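這邊補一段小小的Python示意,對應上面描述的cell-to-cell隨機動態(2/3機率照指定方向、1/3機率平均分給其它三個方向、撞牆留在原地、reward為0);`walls`是假設的牆壁格子集合,僅供理解用,並非論文附的程式。

```python
import random

# rooms example中cell-to-cell隨機動態的一個最小示意(假設性的實作):
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def rooms_step(state, action, walls, rng=random):
    """state為(row, col);2/3機率走指定方向,1/3機率均分給其餘三個方向(各1/9)。"""
    if rng.random() < 2 / 3:
        dr, dc = ACTIONS[action]
    else:
        dr, dc = ACTIONS[rng.choice([a for a in ACTIONS if a != action])]
    nxt = (state[0] + dr, state[1] + dc)
    return (state if nxt in walls else nxt), 0.0   # 撞牆留在原地;所有轉移reward為0
```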
:::info
![](https://hackmd.io/_uploads/Syd_FPU6t.png)

Fig. 3. The policy underlying one of the eight hallway options

Fig. 3. 八個hallway options其中之一所依循的policy
:::

:::info
^6^ The off-target hallway states are exceptions in that they have three possible outcomes: the target hallway, themselves, and the neighboring state in the off-target room.

^6^ off-target hallway states是例外,因為它們有三個可能的結果:target hallway、它們本身、與off-target room中的相鄰狀態(state)。
:::

:::info
![](https://hackmd.io/_uploads/Bk1zWuLTK.png)

Fig. 4. Value functions formed over iterations of planning by synchronous value iteration with primitive options (above) and with multi-step hallway options (below). The hallway options enabled planning to proceed room-by-room rather than cell-by-cell. The area of the disk in each cell is proportional to the estimated value of the state, where a disk that just fills a cell represents a value of 1.0.

Fig. 4. 透過使用primitive options(上面)與使用multi-step hallway options(下面)搭配synchronous value iteration在planning的迭代過程中所形成的value function。hallway options讓planning能夠以room-by-room,而不是cell-by-cell來進行。每個cell裡面那個黑色圓盤的面積會與狀態(state)的估測值成正比,而剛好填滿整個單元格(cell)的圓盤代表其值為1.0。
:::

:::info
Now consider a sequence of planning tasks for navigating within the grid to a designated goal state, in particular, to the hallway state labeled $G_1$ in Fig. 2. Formally, the goal state is a state from which all actions lead to the terminal state with a reward of +1. Throughout this paper we discount with $\gamma=0.9$ in the rooms example.
:::

:::success
現在我們來考慮一系列的用於在網格內導航到指定的目標狀態的planning tasks,特別是導航到Fig. 2.中那個標記為$G_1$的hallway state。形式上,目標狀態(goal state)是一個所有actions都會通往terminal state並獲得reward +1的狀態。論文中,這個rooms example內的discount,$\gamma=0.9$。
:::

:::info
As a planning method, we used SVI as given by (13), with various sets of options $\mathcal{O}$. The initial value function $V_0$ was 0 everywhere except the goal state, which was initialized to its correct value, $V_0(G_1)=1$, as shown in the leftmost panels of Fig. 4. This figure contrasts planning with the original actions ($\mathcal{O} = \mathcal{A}$) and planning with the hallway options and not the original actions ($\mathcal{O} = \mathcal{H}$). The upper part of the figure shows the value function after the first two iterations of SVI using just primitive actions. 
:::info
![](https://hackmd.io/_uploads/Syd_FPU6t.png)

Fig. 3. The policy underlying one of the eight hallway options

Fig. 3. 八個hallway options其中之一所依據的policy
:::

:::info
^6^ The off-target hallway states are exceptions in that they have three possible outcomes: the target hallway, themselves, and the neighboring state in the off-target room.

^6^ off-target hallway states是例外,因為它們有三個可能的結果:target hallway、它們本身、與off-target room中的相鄰狀態(state)。
:::

:::info
![](https://hackmd.io/_uploads/Bk1zWuLTK.png)

Fig. 4. Value functions formed over iterations of planning by synchronous value iteration with primitive options (above) and with multi-step hallway options (below). The hallway options enabled planning to proceed room-by-room rather than cell-by-cell. The area of the disk in each cell is proportional to the estimated value of the state, where a disk that just fills a cell represents a value of 1.0.

Fig. 4. 透過使用primitive options(上面)與使用multi-step hallway options(下面)搭配synchronous value iteration在planning的迭代過程中所形成的value function。hallway options讓planning能夠以room-by-room,而不是cell-by-cell來進行。每個cell裡面黑色圓盤的面積與狀態(state)的估測值成正比,圓盤剛好填滿一個單元格(cell)就代表其估值為1.0。
:::

:::info
Now consider a sequence of planning tasks for navigating within the grid to a designated goal state, in particular, to the hallway state labeled $G_1$ in Fig. 2. Formally, the goal state is a state from which all actions lead to the terminal state with a reward of +1. Throughout this paper we discount with $\gamma=0.9$ in the rooms example.
:::

:::success
現在我們來考慮一系列的用於在網格內導航到指定的目標狀態的planning tasks,特別是Fig. 2.那個標記為$G_1$的hallway state。形式上,目標狀態(goal state)是一個狀態,從它出發的所有actions都會以+1的reward進入terminal state。論文中,這個rooms example內的discount,$\gamma=0.9$。
:::

:::info
As a planning method, we used SVI as given by (13), with various sets of options $\mathcal{O}$. The initial value function $V_0$ was 0 everywhere except the goal state, which was initialized to its correct value, $V_0(G_1)=1$, as shown in the leftmost panels of Fig. 4. This figure contrasts planning with the original actions $\mathcal{O} = \mathcal{A}$ and planning with the hallway options and not the original actions $\mathcal{O} = \mathcal{H}$. The upper part of the figure shows the value function after the first two iterations of SVI using just primitive actions. The region of accurately valued states moved out by one cell on each iteration, but after two iterations most states still had their initial arbitrary value of zero. In the lower part of the figure are shown the corresponding value functions for SVI with the hallway options. In the first iteration all states in the rooms adjacent to the goal state became accurately valued, and in the second iteration all the states became accurately valued. Although the values continued to change by small amounts over subsequent iterations, a complete and optimal policy was known by this time. Rather than planning step-by-step, the hallway options enabled the planning to proceed at a higher level, room-by-room, and thus be much faster.
:::

:::success
做為一個planning method,我們用公式(13)所給出的SVI搭配各種options的集合$\mathcal{O}$。初始的value function $V_0$全部都是0,除了goal state的初始值為它的正確值,$V_0(G_1)=1$,如Fig. 4.的最左圖所示。這個圖比較了兩種planning的方法:使用原始的actions,即$\mathcal{O} = \mathcal{A}$,以及只用hallway options而不用原始actions,即$\mathcal{O} = \mathcal{H}$。上半部指出的是只使用primitive actions在SVI前兩次迭代之後得到的value function。有著準確估值的狀態區域在每次迭代會向外擴展一個單元格(cell),不過在兩次迭代之後,多數的狀態仍然是初始值,也就是0。下半部份說明的則是使用hallway options在SVI的相對應的value functions。在第一次的迭代中,與目標狀態相鄰的房間內的所有狀態都有了準確估值,而在第二次迭代之後,所有的狀態都得到準確的估值。儘管這些狀態的值在後續的迭代中還是持續的少量變化,不過這時候也已經知道完整的optimal policy。比起step-by-step的planning,hallway options能夠讓整個planning在比較高的層級上以room-by-room的方式來執行,因此速度快很多。
:::
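Eq. (13) is not reproduced in this excerpt, but assuming the usual SMDP-style backup over option models, $V(s) \leftarrow \max_{o \in \mathcal{O}_s}\big[r^o_s + \sum_{s'} p^o_{ss'} V(s')\big]$, with the discount already folded into $p^o_{ss'}$, one synchronous sweep might be sketched as follows; the dictionary encodings match the hypothetical `OptionModel` above.

```python
def svi_sweep(V, models, option_sets):
    """One synchronous value-iteration sweep over option models (a sketch).

    V           : dict state -> current value estimate
    models      : dict option -> {s: (r, {s2: p})}, discounting folded into p
    option_sets : dict state -> iterable of options whose initiation set contains s
    """
    new_V = dict(V)                       # synchronous: back up from the old values
    for s, opts in option_sets.items():
        backups = []
        for o in opts:
            r, trans = models[o][s]       # expected reward and sparse terminal-state probs
            backups.append(r + sum(p * V[s2] for s2, p in trans.items()))
        if backups:
            new_V[s] = max(backups)       # greedy over the available options
    return new_V
```

Repeating `svi_sweep` until the values stop changing would, under these assumptions, converge toward $V^*_\mathcal{O}$; with $\mathcal{O}=\mathcal{H}$ each backup touches at most two successor states, which is the sparsity advantage noted above.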
:::info
This example is a particularly favorable case for the use of multi-step options because the goal state is a hallway, the target state of some of the options. Next we consider a case in which there is no such coincidence, in which the goal lies in the middle of a room, in the state labeled $G_2$ in Fig. 2. The hallway options and their models were just as in the previous experiment. In this case, planning with (models of) the hallway options alone could never completely solve the task, because these take the agent only to hallways and thus never to the goal state. Fig. 5 shows the value functions found over five iterations of SVI using both the hallway options and the primitive options corresponding to the actions (i.e., using $\mathcal{O} = \mathcal{A} \bigcup \mathcal{H}$). In the first two iterations, accurate values were propagated from $G_2$ by one cell per iteration by the models corresponding to the primitive options. After two iterations, however, the first hallway state was reached, and subsequently room-to-room planning using the multi-step hallway options dominated. Note how the state in the lower right corner was given a nonzero value during iteration three. This value corresponds to the plan of first going to the hallway state above and then down to the goal; it was overwritten by a larger value corresponding to a more direct route to the goal in the next iteration. Because of the multi-step options, a close approximation to the correct value function was found everywhere by the fourth iteration; without them only the states within three steps of the goal would have been given non-zero values by this time.
:::

:::success
這個例子對使用multi-step options特別有利,因為目標狀態是hallway,也是某些options的target state。接下來我們考慮一個沒那麼剛好的範例,目標狀態(goal state)會在房間的中間,也就是Fig. 2.中那個標記為$G_2$的狀態(state)。hallway options與它們的models都跟前一個實驗一樣。這種情況下,如果只用hallway options(的models)來planning,永遠無法完全解決這個任務,因為這些options只會帶agent去走廊(hallway),永遠到不了目標狀態。Fig. 5.說明的是,同時使用hallway options與對應actions的primitive options,即$\mathcal{O} = \mathcal{A} \bigcup \mathcal{H}$,在SVI五次迭代之後找到的value functions。在前兩次的迭代中,精確的估值透過與primitive options相對應的models,從$G_2$在每次迭代中傳播一個單元格(cell)。不過在兩次迭代之後就到達了第一個hallway state,接著便由使用multi-step hallway options的room-to-room planning主導。可以注意到右下角的那個狀態(state)在第三次的迭代中是如何被賦予非零值的。這個值對應的plan是先走到上方的hallway state,然後再往下到達目標;它在下一次迭代中被一個更大的值覆寫,這個值對應一條能夠更直接到達目標的路徑。因為我們採用multi-step options,到第四次迭代時,各處都已經找到接近正確value function的近似值;如果沒有它們,那這時候就只有距離目標三步之內的狀態(states)才會有非零值。
:::

:::info
![](https://hackmd.io/_uploads/ryUoKjLpK.png)

Fig. 5. An example in which the goal is different from the subgoal of the hallway options. Planning here was by SVI with options $\mathcal{O} = \mathcal{A} \bigcup \mathcal{H}$. Initial progress was due to the models of the primitive options (the actions), but by the third iteration room-to-room planning dominated and greatly accelerated planning.

Fig. 5. 目標(goal)與hallway options的subgoal不同的一個例子。這邊的planning是SVI搭配$\mathcal{O} = \mathcal{A} \bigcup \mathcal{H}$。一開始的進展是來自primitive options(actions)的models,不過到第三次迭代就由room-to-room planning主導,大大加快了planning的速度。
:::

:::info
We have used SVI in this example because it is a particularly simple planning method which makes the potential advantage of multi-step options clear. In large problems, SVI is impractical because the number of states is too large to complete many iterations, often not even one. In practice it is often necessary to be very selective about the states updated, the options considered, and even the next states considered. These issues are not resolved by multi-step options, but neither are they greatly aggravated. Options provide a tool for dealing with them more flexibly.
:::

:::success
我們在這個例子中使用SVI,因為它是一種特別簡單的planning method,能夠清楚帶出multi-step options的潛在優點。在大型問題中,SVI是不切實際的,因為states的數量太大,以致於無法完成很多次迭代,往往連一次都完成不了。實務上,通常需要對更新的狀態(states updated)、考慮的選項(options considered)、甚至是考慮的下一個狀態(next states)非常有選擇性。這些問題並沒有辦法透過multi-step options來解決,不過也不會因此嚴重惡化。options提供一個能更靈活處理這類問題的工具。
:::

:::info
Planning with options is not necessarily more complex than planning with actions. For example, in the first experiment described above there were four primitive options and eight hallway options, but in each state only two hallway options needed to be considered. In addition, the models of the primitive options generated four possible successors with non-zero probability whereas the multi-step options generated only two. Thus planning with the multi-step options was actually computationally cheaper than conventional SVI in this case. In the second experiment this was not the case, but the use of multi-step options did not greatly increase the computational costs. In general, of course, there is no guarantee that multi-step options will reduce the overall expense of planning. For example, Hauskrecht et al. [26] have shown that adding multi-step options may actually slow SVI if the initial value function is optimistic. Research with deterministic macro-operators has identified a related “utility problem” when too many macros are used (e.g., see [20,23,24,47,76]). Temporal abstraction provides the flexibility to greatly reduce computational complexity, but can also have the opposite effect if used indiscriminately. Nevertheless, these issues are beyond the scope of this paper and we do not consider them further.
:::

:::success
用planning + options不見得會比planning + actions來的複雜。舉例來說,在上面描述的第一個實驗中,有四個primitive options(也就是actions)與八個hallway options,不過在每個state其實需要考慮的就只有兩個hallway options。此外,primitive options的models會以非零的機率產生四個可能的後繼狀態,但multi-step options只會有兩個。所以在這種情況下,planning + multi-step options的計算花費其實比傳統的SVI還要便宜。在第二個實驗中情況並非如此,不過使用multi-step options並沒有造成計算成本多大的增加。當然,通常來說,並沒有辦法保證使用multi-step options能夠降低planning的總體花費。舉例來說,Hauskrecht et al. [26]就已經說明,如果initial value function是[樂觀(optimistic)](http://terms.naer.edu.tw/detail/17293335/)的,那加入multi-step options實際上反而可能拖慢SVI。過去對deterministic macro-operators的研究已經發現,當使用過多的macros的時候,就會產生相關的"utility problem"([效用](http://terms.naer.edu.tw/detail/275499/)問題)(例如,見[20,23,24,47,76])。時間抽象提供了大幅降低計算複雜度的靈活性,不過如果不加選擇地亂用,也可能會有反效果。儘管如此,這些問題已經超過這篇論文的範圍,我們不再進一步討論。
:::

### 3.2 SMDP value learning

:::info
The problem of finding an optimal policy over a set of options $\mathcal{O}$ can also be addressed by learning methods. Because the MDP augmented by the options is an SMDP, we can apply SMDP learning methods [5,41,44,52,53]. Much as in the planning methods discussed above, each option is viewed as an indivisible, opaque unit. When the execution of option $o$ is started in state $s$, we next jump to the state $s'$ in which $o$ terminates. Based on this experience, an approximate option-value function $Q(s, o)$ is updated. For example, the SMDP version of one-step Q-learning [5], which we call SMDP Q-learning, updates after each option termination by

$$
Q(s, o) \leftarrow Q(s, o) + \alpha \big[r + \gamma^k \max_{o' \in \mathcal{O}_{s'}} Q(s', o') - Q(s, o) \big]
$$

where $k$ denotes the number of time steps elapsing between $s$ and $s'$, $r$ denotes the cumulative discounted reward over this time, and it is implicit that the step-size parameter $\alpha$ may depend arbitrarily on the states, option, and time steps. The estimate $Q(s, o)$ converges to $Q^*(s, o)$ for all $s \in \mathcal{S}$ and $o \in \mathcal{O}$ under conditions similar to those for conventional Q-learning [52], from which it is easy to determine an optimal policy as described earlier.
:::

:::success
在一組options $\mathcal{O}$上找出optimal policy這個問題也可以透過學習方法來解決。因為用options增強過的MDP就是一個SMDP,所以我們可以套用SMDP的學習方法[5,41,44,52,53]。就像上面討論過的planning methods那般,每個option都被視為[不可分割](http://terms.naer.edu.tw/detail/2908229/)(indivisible)、[不透明](http://terms.naer.edu.tw/detail/17292027/)(opaque)的單元。當option $o$在state $s$開始執行,我們就直接跳到$o$終止時所在的state $s'$。基於這個經驗,我們就可以更新一個近似的option-value function $Q(s, o)$。舉例來說,SMDP版本的one-step Q-learning[5],我們稱之為SMDP Q-learning,在每個option終止的時候透過下面公式更新:

$$
Q(s, o) \leftarrow Q(s, o) + \alpha \big[r + \gamma^k \max_{o' \in \mathcal{O}_{s'}} Q(s', o') - Q(s, o) \big]
$$

其中$k$表示從$s$到$s'$經過的time steps數量,$r$表示這段時間內的累積折扣報酬(cumulative discounted reward),而且這裡隱含著,step-size參數$\alpha$可以任意地取決於states、option與time steps。在類似傳統Q-learning[52]的條件下,對於所有的$s \in \mathcal{S}$與$o \in \mathcal{O}$,估測值$Q(s, o)$會收斂到$Q^*(s, o)$;如前所述,由此很容易決定出一個optimal policy。
:::
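The update above maps directly onto code. Below is a minimal Python sketch of a single SMDP Q-learning backup; `Q` is assumed to be a mapping from (state, option) pairs to values (e.g., a `collections.defaultdict(float)`), and `available(s)` is a hypothetical helper returning the options whose initiation set contains `s`.

```python
def smdp_q_update(Q, s, o, r, s_next, k, available, alpha=0.125, gamma=0.9):
    """One SMDP Q-learning backup, applied when option o, started in state s,
    terminates in s_next after k time steps with cumulative discounted reward r."""
    best_next = max(Q[(s_next, o2)] for o2 in available(s_next))
    Q[(s, o)] += alpha * (r + gamma**k * best_next - Q[(s, o)])
```

Handling of the terminal state (where there is no next option to maximize over) is omitted from this sketch.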
:::info
As an illustration, we applied SMDP Q-learning to the rooms example with the goal at $G_1$ and at $G_2$ (Fig. 2). As in the case of planning, we used three different sets of options, $\mathcal{A}$, $\mathcal{H}$, and $\mathcal{A} \bigcup \mathcal{H}$. In all cases, options were selected from the set according to an $\epsilon$-greedy method. That is, options were usually selected at random from among those with maximal option value (i.e., $o_t$ was such that $Q(s_t, o_t) = \max_{o \in \mathcal{O}_{s_t}} Q(s_t, o)$), but with probability $\epsilon$ the option was instead selected randomly from all available options. The probability of random action, $\epsilon$, was 0.1 in all our experiments. The initial state of each episode was in the upper-left corner. Fig. 6 shows learning curves for both goals and all sets of options. In all cases, multi-step options enabled the goal to be reached much more quickly, even on the very first episode. With the goal at $G_1$, these methods maintained an advantage over conventional Q-learning throughout the experiment, presumably because they did less exploration. The results were similar with the goal at $G_2$, except that the $\mathcal{H}$ method performed worse than the others in the long term. This is because the best solution requires several steps of primitive options (the hallway options alone find the best solution running between hallways that sometimes stumbles upon $G_2$). For the same reason, the advantages of the $\mathcal{A} \bigcup \mathcal{H}$ method over the $\mathcal{A}$ method were also reduced.
:::

:::success
做為一個說明,我們把SMDP Q-learning用在Fig. 2.那個以$G_1$、$G_2$做為目標的rooms example。就跟planning一樣,我們使用三個不同的options的集合,$\mathcal{A}$、$\mathcal{H}$以及$\mathcal{A} \bigcup \mathcal{H}$。在所有的情況下,options會根據$\epsilon$-greedy從集合中去選擇。也就是說,options通常會從有著maximal option value的那些options裡面隨機選一個(即,$o_t$會使得$Q(s_t, o_t) = \max_{o \in \mathcal{O}_{s_t}} Q(s_t, o)$),不過會有$\epsilon$的機率改從所有可用的options裡隨機選一個。在我們所有的實驗中,這個隨機選擇的機率$\epsilon$都設置為0.1。每個episode的初始狀態(initial state)是左上角。Fig. 6.呈現的是兩個目標與所有options集合的學習曲線。在所有的情況下,multi-step options都能夠更快到達目標位置,甚至是在第一個episode就是如此。以$G_1$為目標,在整個實驗過程中,這些方法一直維持著相對傳統Q-learning的優勢,大概是因為它們做了比較少的探索。以$G_2$為目標的結果是類似的,除了$\mathcal{H}$這個方法長期表現比其它方法差之外。這是因為最佳解需要幾步的primitive options(單純的hallway options只能找到在hallways之間來回移動、有時候偶然撞見$G_2$的最佳解)。出於相同的原因,$\mathcal{A} \bigcup \mathcal{H}$相對$\mathcal{A}$的優勢也降低了。
:::

:::info
![](https://hackmd.io/_uploads/HkqP4jwaF.png)

Fig. 6. Performance of SMDP Q-learning in the rooms example with various goals and sets of options. After 100 episodes, the data points are averages over groups of 10 episodes to make the trends clearer. The step size parameter was optimized to the nearest power of 2 for each goal and set of options. The results shown used $\alpha = 1/8$ in all cases except that with $\mathcal{O} = \mathcal{H}$ and $G_1$ ($\alpha = 1/16$), and that with $\mathcal{O} = \mathcal{A} \bigcup \mathcal{H}$ and $G_2$ ($\alpha = 1/4$).

Fig. 6. 在各種目標與options集合的rooms example上,SMDP Q-learning的效能。在100個episodes之後,圖表上的data points是以每10個episodes為一組的平均,以此讓整個趨勢更為明顯。對每個目標與options的集合,其step size參數優化為最接近的2的次方(power of 2)。除了$\mathcal{O}=\mathcal{H}$搭配$G_1$($\alpha=1/16$)、以及$\mathcal{O} = \mathcal{A} \bigcup \mathcal{H}$搭配$G_2$($\alpha = 1/4$)之外,其它的情況皆使用$\alpha = 1/8$。
:::
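To tie the pieces together, here is a sketch of one SMDP Q-learning episode with the $\epsilon$-greedy selection rule described above (ties among maximal-value options broken at random). `available(s)` and `execute_option(s, o)` are hypothetical environment hooks: the first returns the options whose initiation set contains `s`, the second runs an option to termination and returns the cumulative discounted reward, the resulting state, the elapsed number of steps $k$, and a done flag.

```python
import random
from collections import defaultdict

def run_episode(start_state, available, execute_option,
                Q=None, epsilon=0.1, alpha=0.125, gamma=0.9, rng=random):
    """One episode of SMDP Q-learning with epsilon-greedy option selection (a sketch)."""
    Q = Q if Q is not None else defaultdict(float)
    s, done = start_state, False
    while not done:
        opts = list(available(s))
        if rng.random() < epsilon:
            o = rng.choice(opts)                                    # explore
        else:
            best = max(Q[(s, x)] for x in opts)
            o = rng.choice([x for x in opts if Q[(s, x)] == best])  # greedy, random tie-break
        r, s_next, k, done = execute_option(s, o)
        target = r if done else r + gamma**k * max(Q[(s_next, x)] for x in available(s_next))
        Q[(s, o)] += alpha * (target - Q[(s, o)])
        s = s_next
    return Q
```

The defaults ($\epsilon = 0.1$, $\alpha = 1/8$, $\gamma = 0.9$) follow the experimental settings reported above; in practice $\alpha$ would be tuned per goal and option set as in Fig. 6.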