# Actor-Critic Algorithms
###### tags: `論文翻譯` `deeplearning` `rl` `Reinforcement Learning`
[TOC]
## Notes
Blocks are categorized as follows: original passages appear on a blue background and translated passages on a green background. Some of the technical terminology follows the glossary of the National Academy for Educational Research (terms.naer.edu.tw).
:::info
Original text
:::
:::success
Translation
:::
:::warning
If any part of the translation reads awkwardly, please leave a comment and point it out.
:::
:::danger
* [paper hyperlink](https://proceedings.neurips.cc/paper/1999/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf)
:::
## Abstract
:::info
We propose and analyze a class of actor-critic algorithms for simulation-based optimization of a Markov decision process over a parameterized family of randomized stationary policies. These are two-time-scale algorithms in which the critic uses TD learning with a linear approximation architecture and the actor is updated in an approximate gradient direction based on information provided by the critic. We show that the features for the critic should span a subspace prescribed by the choice of parameterization of the actor. We conclude by discussing convergence properties and some open problems.
:::
:::success
We propose and analyze a class of actor-critic algorithms for simulation-based optimization of a Markov decision process over a parameterized family of randomized [stationary policies](http://terms.naer.edu.tw/detail/2125290/). These are two-[time-scale](http://terms.naer.edu.tw/detail/2126364/) algorithms: the critic uses TD learning with a linear approximation architecture, while the actor is updated in an approximate gradient direction based on information provided by the critic. We show that the features used by the critic should span a subspace [prescribed](http://terms.naer.edu.tw/detail/17615700/) by the choice of [parameterization](http://terms.naer.edu.tw/detail/254345/) of the actor. We conclude by discussing convergence properties and some open problems.
:::
## 1 Introduction
:::info
The vast majority of Reinforcement Learning (RL) [9] and Neuro-Dynamic Programming (NDP) [1] methods fall into one of the following two categories:
* (a) Actor-only methods work with a parameterized family of policies. The gradient of the performance, with respect to the actor parameters, is directly estimated by simulation, and the parameters are updated in a direction of improvement [4, 5, 8, 13]. A possible drawback of such methods is that the gradient estimators may have a large variance. Furthermore, as the policy changes, a new gradient is estimated independently of past estimates. Hence, there is no "learning," in the sense of accumulation and consolidation of older information.
* (b) Critic-only methods rely exclusively on value function approximation and aim at learning an approximate solution to the Bellman equation, which will then hopefully prescribe a near-optimal policy. Such methods are indirect in the sense that they do not try to optimize directly over a policy space. A method of this type may succeed in constructing a "good" approximation of the value function, yet lack reliable guarantees in terms of near-optimality of the resulting policy.
:::
:::success
The vast majority of Reinforcement Learning (RL) [9] and Neuro-Dynamic Programming (NDP) [1] methods fall into one of the following two categories:
* (a) Actor-only methods work with a parameterized family of policies. The gradient of the performance with respect to the actor parameters is estimated directly by simulation, and the parameters are then updated in a direction of improvement [4, 5, 8, 13]. A possible [drawback](http://terms.naer.edu.tw/detail/3468352/) of such methods is that the gradient estimators may have a large variance. Moreover, as the policy changes, each new gradient is estimated independently of past estimates, so there is no "learning" in the sense of accumulating and consolidating older information.
* (b) Critic-only methods rely exclusively on value function approximation and aim at learning an approximate solution to the Bellman equation, in the hope that this will yield a near-optimal policy. Such methods are indirect in the sense that they do not try to optimize directly over the policy space. A method of this type may succeed in constructing a "good" approximation of the value function, yet still lack reliable guarantees on the near-optimality of the resulting policy.
:::
:::info
Actor-critic methods aim at combining the strong points of actor-only and critic-only methods. The critic uses an approximation architecture and simulation to learn a value function, which is then used to update the actor's policy parameters in a direction of performance improvement. Such methods, as long as they are gradient-based, may have desirable convergence properties, in contrast to critic-only methods for which convergence is guaranteed in very limited settings. They hold the promise of delivering faster convergence (due to variance reduction), when compared to actor-only methods. On the other hand, theoretical understanding of actor-critic methods has been limited to the case of lookup table representations of policies [6].
:::
:::success
Actor-critic methods aim to combine the respective strengths of actor-only and critic-only methods. The critic uses an approximation architecture and simulation to learn a value function, which is then used to update the actor's policy parameters in a direction of performance improvement. As long as they are gradient-based, such methods can have desirable convergence properties, in contrast to critic-only methods, whose convergence is guaranteed only in very limited settings. Compared to actor-only methods, they also promise faster convergence (thanks to variance reduction). On the other hand, the theoretical understanding of actor-critic methods has so far been limited to the case of [lookup-table](http://terms.naer.edu.tw/detail/19252526/) representations of policies [6].
:::
:::info
In this paper, we propose some actor-critic algorithms and provide an overview of a convergence proof. The algorithms are based on an important observation. Since the number of parameters that the actor has to update is relatively small (compared to the number of states), the critic need not attempt to compute or approximate the exact value function, which is a high-dimensional object. In fact, we show that the critic should ideally compute a certain "projection" of the value function onto a low-dimensional subspace spanned by a set of "basis functions," that are completely determined by the parameterization of the actor. Finally, as the analysis in [11] suggests for TD algorithms, our algorithms can be extended to the case of arbitrary state and action spaces as long as certain ergodicity assumptions are satisfied.
:::
:::success
In this paper, we propose some actor-critic algorithms and give an [overview](http://terms.naer.edu.tw/detail/19398954/) of a convergence proof. The algorithms are based on an important observation: since the number of parameters the actor has to update is relatively small (compared to the number of states), the critic need not attempt to compute or approximate the exact value function, which is a high-dimensional object. In fact, we show that the critic should ideally compute a certain "[projection](http://terms.naer.edu.tw/detail/2122466/)" of the value function onto a low-dimensional subspace spanned by a set of "[basis functions](http://terms.naer.edu.tw/detail/397042/)" that are completely determined by the parameterization of the actor. Finally, as the analysis in [11] suggests for TD algorithms, our algorithms can be extended to arbitrary state and action spaces, as long as certain [ergodicity](http://terms.naer.edu.tw/detail/2115614/) assumptions are satisfied.
:::
:::info
We close this section by noting that ideas similar to ours have been presented in the simultaneous and independent work of Sutton et al. [10].
:::
:::success
We close this section by noting that ideas similar to ours were presented in the simultaneous and independent work of Sutton et al. [10].
:::
## 2 Markov decision processes and parameterized family of RSP's
:::info
Consider a Markov decision process with finite state space $S$, and finite action space $A$. Let $g:S \times A \to \mathbb{R}$ be a given cost function. A randomized stationary policy (RSP) is a mapping $\mu$ that assigns to each state $x$ a probability distribution over the action space $A$. We consider a set of randomized stationary policies $\mathbb{P} = \left\{\mu_\theta;\theta \in \mathbf{R}^n \right\}$, parameterized in terms of a vector $\theta$. For each pair $(x,u)\in S\times A, \mu_\theta(x,u)$ denotes the probability of taking action $u$ when the state $x$ is encountered, under the policy corresponding to $\theta$. Let $p_{xy}(u)$ denote the probability that the next state is $y$, given that the current state is $x$ and the current action is $u$. Note that under any RSP, the sequence of states $\left\{X_n\right\}$ and of state-action pairs $\left\{X_n, U_n \right\}$ of the Markov decision process form Markov chains with state spaces $S$ and $S\times A$, respectively. We make the following assumptions about the family of policies $\mathbb{P}$.
:::
:::success
Consider a Markov decision process with a finite state space $S$ and a finite action space $A$. Let $g:S \times A \to \mathbb{R}$ be a given cost function. A randomized stationary policy (RSP) is a mapping $\mu$ that assigns to each state $x$ a probability distribution over the action space $A$. We consider a set of randomized stationary policies $\mathbb{P} = \left\{\mu_\theta;\theta \in \mathbb{R}^n \right\}$, parameterized by a vector $\theta$. For each pair $(x,u)\in S\times A$, $\mu_\theta(x,u)$ denotes the probability of taking action $u$ when state $x$ is encountered, under the policy corresponding to $\theta$. Let $p_{xy}(u)$ denote the probability that the next state is $y$, given that the current state is $x$ and the current action is $u$. Note that under any RSP, the sequence of states $\left\{X_n\right\}$ and the sequence of state-action pairs $\left\{X_n, U_n \right\}$ of the Markov decision process form Markov chains with state spaces $S$ and $S\times A$, respectively. We make the following assumptions about the family of policies $\mathbb{P}$.
:::
:::warning
Note that this paper writes states as $x$ and actions as $u$, unlike the now more common notation $s, a$.
:::
:::info
(A1) For all $x \in S$ and $u \in A$ the map $\theta \mapsto \mu_\theta(x,u)$ is twice differentiable with bounded first and second derivatives. Furthermore, there exists an $\mathbb{R}^n$-valued function $\psi_\theta(x, u)$ such that $\nabla\mu_\theta(x,u) = \mu_\theta(x,u)\psi_\theta(x,u)$, where the mapping $\theta \mapsto \psi_\theta(x,u)$ is bounded and has bounded first derivatives for any fixed $x$ and $u$.
:::
:::success
(A1) For all $x \in S$ and $u \in A$, the map $\theta \mapsto \mu_\theta(x,u)$ is twice differentiable with [bounded](http://terms.naer.edu.tw/detail/19322655/) first and second derivatives. Furthermore, there exists an $\mathbb{R}^n$-valued function $\psi_\theta(x, u)$ such that $\nabla\mu_\theta(x,u) = \mu_\theta(x,u)\psi_\theta(x,u)$, where the mapping $\theta \mapsto \psi_\theta(x,u)$ is bounded and has bounded first derivatives for any fixed $x$ and $u$.
:::
:::info
(A2) For each $\theta \in \mathbb{R}^n$, the Markov chains $\left\{X_n\right\}$ and $\left\{X_n, U_n \right\}$ are irreducible and aperiodic, with stationary probabilities $\pi_\theta(x)$ and $\eta_\theta(x,u)=\pi_\theta(x)\mu_\theta(x,u)$, respectively, under the RSP $\mu_\theta$.
:::
:::success
(A2) For each $\theta \in \mathbb{R}^n$, the [Markov chains](http://terms.naer.edu.tw/detail/3650660/) $\left\{X_n\right\}$ and $\left\{X_n, U_n \right\}$ are [irreducible](http://terms.naer.edu.tw/detail/2118447/) and [aperiodic](http://terms.naer.edu.tw/detail/2111274/) under the RSP $\mu_\theta$, with [stationary](http://terms.naer.edu.tw/detail/2125279/) probabilities $\pi_\theta(x)$ and $\eta_\theta(x,u)=\pi_\theta(x)\mu_\theta(x,u)$, respectively.
:::
:::info
In reference to Assumption (A1), note that whenever $\mu_\theta(x,u)$ is nonzero we have
$$
\psi_\theta(x,u) = \dfrac{\nabla \mu_\theta(x,u)}{\mu_\theta(x,u)} = \nabla \ln \mu_\theta(x,u)
$$
Consider the average cost function $\lambda:\mathbb{R}^n\mapsto\mathbb{R}$, given by
$$
\lambda(\theta) = \sum_{x\in S,u\in A}g(x,u)\eta_\theta(x,u)
$$
We are interested in minimizing $\lambda(\theta)$ over all $\theta$. For each $\theta \in \mathbb{R}^n$, let $V_\theta : S \mapsto \mathbb{R}$ be the "differential" cost function, defined as solution of Poisson equation:
$$
\lambda(\theta) + V_\theta(x) = \sum_{u\in A} \mu_\theta(x,u) \left[g(x,u) + \sum_y p_{xy}(u) V_\theta(y)\right]
$$
:::
:::success
In reference to Assumption (A1), note that whenever $\mu_\theta(x,u)$ is nonzero we have
$$
\psi_\theta(x,u) = \dfrac{\nabla \mu_\theta(x,u)}{\mu_\theta(x,u)} = \nabla \ln \mu_\theta(x,u)
$$
Consider the average cost function $\lambda:\mathbb{R}^n\mapsto\mathbb{R}$, given by
$$
\lambda(\theta) = \sum_{x\in S,u\in A}g(x,u)\eta_\theta(x,u)
$$
We are interested in minimizing $\lambda(\theta)$ over all $\theta$. For each $\theta \in \mathbb{R}^n$, let $V_\theta : S \mapsto \mathbb{R}$ be the "[differential](http://terms.naer.edu.tw/detail/2114654/)" cost function, defined as the solution of the [Poisson equation](http://terms.naer.edu.tw/detail/3189906/):
$$
\lambda(\theta) + V_\theta(x) = \sum_{u\in A} \mu_\theta(x,u) \left[g(x,u) + \sum_y p_{xy}(u) V_\theta(y)\right]
$$
:::
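:::warning
The paper does not commit to a particular policy parameterization, so the following is only a minimal sketch, assuming a softmax (Gibbs) RSP over a hypothetical state-action feature map `f`. For that family, $\psi_\theta(x,u)=\nabla\ln\mu_\theta(x,u)=f(x,u)-\sum_v \mu_\theta(x,v)f(x,v)$, which is bounded and smooth in $\theta$, so Assumption (A1) holds.
```python
import numpy as np

# Hypothetical softmax RSP: mu_theta(x, u) proportional to exp(theta . f(x, u)).
# `features` stores f(x, u) as an array of shape (|S|, |A|, n).

def softmax_policy(theta, features, x):
    """Return mu_theta(x, .) over all actions."""
    logits = features[x] @ theta          # shape (|A|,)
    logits -= logits.max()                # numerical stability
    w = np.exp(logits)
    return w / w.sum()

def psi(theta, features, x, u):
    """psi_theta(x, u) = gradient of ln mu_theta(x, u) with respect to theta."""
    mu_x = softmax_policy(theta, features, x)      # shape (|A|,)
    return features[x, u] - mu_x @ features[x]     # f(x,u) minus its mean under mu

# Toy usage: 3 states, 2 actions, n = 4 actor parameters.
rng = np.random.default_rng(0)
features = rng.normal(size=(3, 2, 4))
theta = np.zeros(4)
print(softmax_policy(theta, features, x=0))   # uniform for theta = 0
print(psi(theta, features, x=0, u=1))
```
:::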
:::info
Intuitively, $V_\theta(x)$ can be viewed as the "disadvantage" of state $x$: it is the expected excess cost - on top of the average cost - incurred if we start at state $x$. It plays a role similar to that played by the more familiar value function that arises in total or discounted cost Markov decision problems. Finally, for every $\theta \in \mathbb{R}^n$, we define the $q$-function $q_\theta: S \times A \to \mathbb{R}$, by
$$
q_\theta(x,u) = g(x,u) - \lambda(\theta) + \sum_y p_{xy} (u) V_\theta(y)
$$
:::
:::success
Intuitively, $V_\theta(x)$ can be viewed as the "[disadvantage](http://terms.naer.edu.tw/detail/3467174/)" of state $x$: it is the expected excess cost, on top of the average cost, incurred if we start at state $x$. It plays a role similar to that of the more familiar value function arising in total-cost or discounted-cost Markov decision problems. Finally, for every $\theta \in \mathbb{R}^n$, we define the $q$-function $q_\theta: S \times A \to \mathbb{R}$ by
$$
q_\theta(x,u) = g(x,u) - \lambda(\theta) + \sum_y p_{xy} (u) V_\theta(y)
$$
:::
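:::warning
To make the definitions above concrete, here is a small sketch (my own, not from the paper) that computes $\pi_\theta$, $\lambda(\theta)$, the differential cost $V_\theta$ from the Poisson equation, and $q_\theta$, assuming the full model ($\mu_\theta$, the transition probabilities $p_{xy}(u)$, and the cost $g$) is available as NumPy arrays.
```python
import numpy as np

def average_cost_quantities(mu, P, g):
    """mu: (|S|, |A|) policy, P: (|A|, |S|, |S|) with P[u][x, y] = p_xy(u),
    g: (|S|, |A|) cost. Returns pi_theta, lambda(theta), V_theta, q_theta."""
    S, A = mu.shape
    # Transition matrix and one-step cost of the chain {X_n} under mu.
    P_mu = sum(mu[:, u:u + 1] * P[u] for u in range(A))     # (S, S)
    g_mu = (mu * g).sum(axis=1)                             # (S,)

    # Stationary distribution: pi P_mu = pi with sum(pi) = 1.
    A_eq = np.vstack([P_mu.T - np.eye(S), np.ones((1, S))])
    b_eq = np.concatenate([np.zeros(S), [1.0]])
    pi, *_ = np.linalg.lstsq(A_eq, b_eq, rcond=None)

    lam = float(pi @ g_mu)                                  # average cost lambda(theta)

    # Poisson equation (I - P_mu) V = g_mu - lam; V is only determined up to an
    # additive constant, so pin it down with the extra constraint sum(V) = 0.
    A_po = np.vstack([np.eye(S) - P_mu, np.ones((1, S))])
    b_po = np.concatenate([g_mu - lam, [0.0]])
    V, *_ = np.linalg.lstsq(A_po, b_po, rcond=None)

    # q_theta(x, u) = g(x, u) - lambda(theta) + sum_y p_xy(u) V(y)
    q = g - lam + np.stack([P[u] @ V for u in range(A)], axis=1)
    return pi, lam, V, q
```
:::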
:::info
We recall the following result, as stated in [8]. (Different versions of this result have been established in [3, 4, 5].)
**Theorem 1.**
$$
\dfrac{\partial}{\partial \theta_i}\lambda(\theta) = \sum_{x,u} \eta_\theta(x, u) q_\theta(x,u) \psi^i_\theta(x, u) \tag{1}
$$
where $\psi^i_\theta(x, u)$ stands for the $i$th component of $\psi_\theta$.
:::
:::success
We recall the following result, as stated in [8]. (Different versions of this result have been established in [3, 4, 5].)
**Theorem 1.**
$$
\dfrac{\partial}{\partial \theta_i}\lambda(\theta) = \sum_{x,u} \eta_\theta(x, u) q_\theta(x,u) \psi^i_\theta(x, u) \tag{1}
$$
where $\psi^i_\theta(x, u)$ stands for the $i$th component of $\psi_\theta$.
:::
:::info
In [8], the quantity $q_\theta(x,u)$ in the above formula is interpreted as the expected excess cost incurred over a certain renewal period of the Markov chain $\left\{X_n,U_n \right\}$, under the RSP $\mu_\theta$, and is then estimated by means of simulation, leading to actor-only algorithms. Here, we provide an alternative interpretation of the formula in Theorem 1, as an inner product, and thus derive a different set of algorithms, which readily generalize to the case of an infinite space as well.
:::
:::success
In [8], the quantity $q_\theta(x,u)$ in the above formula is interpreted as the expected excess cost incurred over a certain renewal period of the Markov chain $\left\{X_n,U_n \right\}$ under the RSP $\mu_\theta$, and is then estimated by simulation, which leads to actor-only algorithms. Here we give an alternative interpretation of the formula in Theorem 1, as an inner product, and from it derive a different set of algorithms, which also readily generalize to the case of an infinite space.
:::
:::info
For any $\theta \in \mathbb{R}^n$, we define the inner product $\langle \cdot, \cdot \rangle_\theta$ of two real valued functions $q_1, q_2$ on $S \times A$, viewed as vectors in $\mathbb{R}^{\vert S \vert \vert A \vert}$, by
$$
\langle q_1, q_2 \rangle_\theta = \sum_{x,u}\eta_\theta(x,u)q_1(x,u)q_2(x,u)
$$
With this notation we can rewrite the formula (1) as
$$
\dfrac{\partial}{\partial\theta_i}\lambda(\theta) = \langle q_\theta, \psi^i_\theta \rangle_\theta, i=1,\cdots, n
$$
Let $\Vert \cdot \Vert_\theta$ denote the norm induced by this inner product on $\mathbb{R}^{\vert S \vert \vert A \vert}$. For $\theta \in \mathbb{R}^n$ let $\Psi_\theta$ denote the span of the vectors $\left\{ \psi^i_\theta;1\leq i\leq n\right\}$ in $\mathbb{R}^{\vert S \vert \vert A \vert}$. (This is the same as the set of all functions $f$ on $S \times A$ of the form $f(x, u) = \sum^n_{i=1}\alpha_i\psi^i_\theta(x,u)$, for some scalars $\alpha_1,\cdots,\alpha_n$.)
:::
:::success
For any $\theta \in \mathbb{R}^n$, we define the inner product $\langle \cdot, \cdot \rangle_\theta$ of two real-valued functions $q_1, q_2$ on $S \times A$, viewed as vectors in $\mathbb{R}^{\vert S \vert \vert A \vert}$, by
$$
\langle q_1, q_2 \rangle_\theta = \sum_{x,u}\eta_\theta(x,u)q_1(x,u)q_2(x,u)
$$
With this [notation](http://terms.naer.edu.tw/detail/2120707/), we can rewrite formula (1) as
$$
\dfrac{\partial}{\partial\theta_i}\lambda(\theta) = \langle q_\theta, \psi^i_\theta \rangle_\theta, i=1,\cdots, n
$$
Let $\Vert \cdot \Vert_\theta$ denote the norm [induced](https://terms.naer.edu.tw/search/?q=induced&field=ti&op=AND&q=noun:%22%E6%95%B8%E5%AD%B8%E5%90%8D%E8%A9%9E%22&field=&op=AND&num=10&page=2) by this inner product on $\mathbb{R}^{\vert S \vert \vert A \vert}$. For $\theta \in \mathbb{R}^n$, let $\Psi_\theta$ denote the [span](http://terms.naer.edu.tw/detail/2125009/) of the vectors $\left\{ \psi^i_\theta;1\leq i\leq n\right\}$ in $\mathbb{R}^{\vert S \vert \vert A \vert}$. (This is the same as the set of all functions $f$ on $S \times A$ of the form $f(x, u) = \sum^n_{i=1}\alpha_i\psi^i_\theta(x,u)$ for some [scalars](http://terms.naer.edu.tw/detail/2124076/) $\alpha_1,\cdots,\alpha_n$.)
:::
:::info
Note that although the gradient of $\lambda$ depends on the $q$-function, which is a vector in a possibly very high dimensional space $\mathbb{R}^{\vert S \vert \vert A \vert}$, the dependence is only through its inner products with vectors in $\Psi_\theta$ . Thus, instead of "learning" the function $q_\theta$, it would suffice to learn the projection of $q_\theta$ on the subspace $\Psi_\theta$.
:::
:::success
Note that although the gradient of $\lambda$ depends on the $q$-function, which is a vector in the possibly very high-dimensional space $\mathbb{R}^{\vert S \vert \vert A \vert}$, this [dependence](http://terms.naer.edu.tw/detail/2114475/) is only through its inner products with the vectors in $\Psi_\theta$. Thus, instead of "learning" the function $q_\theta$, it suffices to learn the [projection](http://terms.naer.edu.tw/detail/3635503/) of $q_\theta$ onto the subspace $\Psi_\theta$.
:::
:::info
Indeed, let $\prod_\theta: \mathbb{R}^{\vert S \vert \vert A \vert} \mapsto \Psi_\theta$ be the projection operator defined by
$$
\prod_\theta q = \arg\min_{\hat{q}\in\Psi_\theta} \Vert q - \hat{q} \Vert_\theta
$$
Since
$$
\langle q_\theta,\psi_\theta \rangle_\theta = \langle \prod_\theta q_\theta, \psi_\theta \rangle_\theta \tag{2}
$$
it is enough to compute the projection of $q_\theta$ onto $\Psi_\theta$.
:::
:::success
Indeed, let $\prod_\theta: \mathbb{R}^{\vert S \vert \vert A \vert} \mapsto \Psi_\theta$ be the [projection operator](http://terms.naer.edu.tw/detail/2122470/) defined by
$$
\prod_\theta q = \arg\min_{\hat{q}\in\Psi_\theta} \Vert q - \hat{q} \Vert_\theta
$$
Since
$$
\langle q_\theta,\psi_\theta \rangle_\theta = \langle \prod_\theta q_\theta, \psi_\theta \rangle_\theta \tag{2}
$$
it is enough to compute the projection of $q_\theta$ onto $\Psi_\theta$.
:::
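:::warning
Below is a minimal sketch (my own, with illustrative names) of the $\eta_\theta$-weighted inner product and of the projection $\prod_\theta q$, computed as a weighted least-squares fit onto the span of the features. The final lines numerically check formula (2): $\langle q,\psi^i\rangle_\theta = \langle \prod_\theta q,\psi^i\rangle_\theta$ for every feature.
```python
import numpy as np

# q and each feature psi^i are stored as vectors of length |S||A|;
# eta is the stationary state-action distribution. All arrays are assumed given.

def inner(eta, q1, q2):
    """<q1, q2>_theta = sum_{x,u} eta_theta(x,u) q1(x,u) q2(x,u)."""
    return np.sum(eta * q1 * q2)

def project(eta, q, Psi):
    """Project q onto the span of the columns of Psi (shape (|S||A|, n))
    in the eta-weighted norm, i.e. solve a weighted least-squares problem."""
    w = np.sqrt(eta)
    alpha, *_ = np.linalg.lstsq(w[:, None] * Psi, w * q, rcond=None)
    return Psi @ alpha

# Sanity check of formula (2) on random data: |S||A| = 12, n = 3.
rng = np.random.default_rng(1)
eta = rng.dirichlet(np.ones(12))
Psi = rng.normal(size=(12, 3))
q = rng.normal(size=12)
q_proj = project(eta, q, Psi)
print([np.isclose(inner(eta, q, Psi[:, i]), inner(eta, q_proj, Psi[:, i]))
       for i in range(3)])
```
:::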
## 3 Actor-critic algorithms
:::info
We view actor-critic algorithms as stochastic gradient algorithms on the parameter space of the actor. When the actor parameter vector is $\theta$, the job of the critic is to compute an approximation of the projection $\prod_\theta q_\theta$ of $q_\theta$ onto $\Psi_\theta$. The actor uses this approximation to update its policy in an approximate gradient direction. The analysis in [11, 12] shows that this is precisely what TD algorithms try to do, i.e., to compute the projection of an exact value function onto a subspace spanned by feature vectors. This allows us to implement the critic by using a TD algorithm. (Note, however, that other types of critics are possible, e.g., based on batch solution of least squares problems, as long as they aim at computing the same projection.)
:::
:::success
We view actor-critic algorithms as stochastic gradient algorithms on the parameter space of the actor. When the actor parameter vector is $\theta$, the job of the critic is to compute an approximation of the projection $\prod_\theta q_\theta$ of $q_\theta$ onto $\Psi_\theta$. The actor uses this approximation to update its policy in an approximate gradient direction. The analysis in [11, 12] shows that this is precisely what TD algorithms try to do, i.e., compute the projection of an exact value function onto a subspace [spanned](http://terms.naer.edu.tw/detail/2125009/) by feature vectors. This allows us to implement the critic with a TD algorithm. (Note, however, that other types of critics are possible, e.g., based on batch solution of [least-squares](http://terms.naer.edu.tw/detail/19376009/) problems, as long as they aim at computing the same projection.)
:::
:::info
We note some minor differences with the common usage of TD. In our context, we need the projection of $q$-functions, rather than value functions. But this is easily achieved by replacing the Markov chain $\left\{x_t\right\}$ in [11, 12] by the Markov chain $\left\{X_n, U_n\right\}$. A further difference is that [11, 12] assume that the control policy and the feature vectors are fixed. In our algorithms, the control policy as well as the features need to change as the actor updates its parameters. As shown in [6, 2], this need not pose any problems, as long as the actor parameters are updated on a slower time scale.
:::
:::success
We note some minor differences from the common usage of TD. In our context, we need the projection of $q$-functions rather than of value functions. But this is easily handled: simply replace the Markov chain $\left\{x_t \right\}$ in [11, 12] with the Markov chain $\left\{X_n, U_n \right\}$. A further difference is that [11, 12] assume the control policy and the feature vectors are fixed. In our algorithms, the control policy and the features must change as the actor updates its parameters. As shown in [6, 2], this poses no problem as long as the actor parameters are updated on a slower [time scale](http://terms.naer.edu.tw/detail/2126364/).
:::
:::info
We are now ready to describe two actor-critic algorithms, which differ only as far as the critic updates are concerned. In both variants, the critic is a TD algorithm with a linearly parameterized approximation architecture for the $q$-function, of the form
$$
Q^\theta_r(x,u)=\sum^m_{j=1}r^j \phi^j_\theta(x,u)
$$
where $r = (r^1,\cdots,r^m) \in \mathbb{R}^m$ denotes the parameter vector of the critic. The features $\phi^j_\theta,j=1,\cdots,m,$ used by the critic are dependent on the actor parameter vector $\theta$ and are chosen such that their span in $\mathbb{R}^{\vert S \vert \vert A \vert}$, denoted by $\Phi_\theta$, contains $\Psi_\theta$. Note that the formula (2) still holds if $\prod_\theta$ is redefined as projection onto $\Phi_\theta$ as long as $\Phi_\theta$ contains $\Psi_\theta$. The most straightforward choice would be to let $m=n$ and $\phi^i_\theta=\psi^i_\theta$ for each $i$. Nevertheless, we allow the possibility that $m>n$ and $\Phi_\theta$ properly contains $\Psi_\theta$, so that the critic uses more features than that are actually necessary. This added flexibility may turn out to be useful in a number of ways:
1. It is possible that, for certain values of $\theta$, the features $\psi_\theta$ are either close to zero or almost linearly dependent. For these values of $\theta$, the operator $\prod_\theta$ becomes ill-conditioned and the algorithms can become unstable. This might be avoided by using a richer set of features $\psi^i_\theta$.
2. For the second algorithm that we propose (TD($\alpha$) with $\alpha <1$), the critic can only compute an approximate - rather than exact - projection. The use of additional features can result in a reduction of the approximation error.
:::
:::success
We are now ready to describe two actor-critic algorithms, which differ only in how the critic is updated. In both [variants](http://terms.naer.edu.tw/detail/19437516/), the critic is a TD algorithm with a linearly parameterized approximation architecture for the $q$-function, of the form
$$
Q^\theta_r(x,u)=\sum^m_{j=1}r^j \phi^j_\theta(x,u)
$$
where $r = (r^1,\cdots,r^m) \in \mathbb{R}^m$ denotes the parameter vector of the critic. The features $\phi^j_\theta,j=1,\cdots,m,$ used by the critic depend on the actor parameter vector $\theta$ and are chosen so that their [span](http://terms.naer.edu.tw/detail/9363494/) in $\mathbb{R}^{\vert S \vert \vert A \vert}$, denoted by $\Phi_\theta$, contains $\Psi_\theta$. Note that formula (2) still holds if $\prod_\theta$ is redefined as the projection onto $\Phi_\theta$, as long as $\Phi_\theta$ contains $\Psi_\theta$. The most straightforward choice would be to let $m=n$ and $\phi^i_\theta=\psi^i_\theta$ for each $i$. Nevertheless, we allow the possibility that $m>n$ and that $\Phi_\theta$ properly contains $\Psi_\theta$, so that the critic uses more features than are actually necessary. This added flexibility may turn out to be useful in a number of ways:
1. It is possible that, for certain values of $\theta$, the features $\psi_\theta$ are either close to zero or almost linearly dependent. For these values of $\theta$, the [operator](http://terms.naer.edu.tw/detail/3189594/) $\prod_\theta$ becomes [ill-conditioned](http://terms.naer.edu.tw/detail/3649612/) and the algorithms can become unstable. This might be avoided by using a richer set of features $\psi^i_\theta$.
2. For the second algorithm that we propose (TD($\alpha$) with $\alpha <1$), the critic can only compute an approximate, rather than exact, projection. The use of additional features can reduce the approximation error.
:::
:::warning
I was not entirely sure how to render the clause above about the span of the features in $\mathbb{R}^{\vert S \vert \vert A \vert}$; please comment if it is off.
:::
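:::warning
As a small sketch (my own, under the straightforward choice $m=n$ and $\phi_\theta=\psi_\theta$ mentioned above), the critic's approximation is just a dot product between its parameter vector $r$ and the actor's features. `score(theta, x, u)` is an assumed callable returning $\psi_\theta(x,u)$, e.g. the `psi` helper from the softmax-policy sketch earlier in this note.
```python
import numpy as np

# Linear approximation architecture of the critic,
#   Q_r^theta(x, u) = sum_j r^j * phi_theta^j(x, u),
# with the straightforward choice phi_theta = psi_theta (so m = n).

def q_approx(r, theta, x, u, score):
    phi_xu = score(theta, x, u)        # phi_theta(x, u) = psi_theta(x, u)
    return float(np.dot(r, phi_xu))
```
:::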
:::info
Along with the parameter vector $r$, the critic stores some auxiliary parameters: these are a (scalar) estimate $\lambda$, of the average cost, and an $m$-vector $z$ which represents Sutton's eligibility trace [1, 9]. The actor and critic updates take place in the course of a simulation of a single sample path of the controlled Markov chain. Let $r_k, z_k, \lambda_k$ be the parameters of the critic, and let $\theta_k$ be the parameter vector of the actor, at time $k$. Let $(X_k,U_k)$ be the state-action pair at that time. Let $X_{k+1}$ be the new state, obtained after action $U_k$ is applied. A new action $U_{k+1}$ is generated according to the RSP corresponding to the actor parameter vector $\theta_k$. The critic carries out an update similar to the average cost temporal-difference method of [12]:
$$
\lambda_{k+1} = \lambda_k + \gamma_k(g(X_k,U_k) - \lambda_k)
$$
$$
r_{k+1} = r_k + \gamma_k(g(X_k,U_k)-\lambda_k + Q_{r_{k}}^{\theta_{k}}(X_{k+1},U_{k+1}) - Q_{r_k}^{\theta_k}(X_k,U_k))z_k
$$
:::
:::success
Along with the parameter vector $r$, the critic stores some auxiliary parameters: a (scalar) estimate $\lambda$ of the average cost, and an $m$-vector $z$ representing Sutton's eligibility trace [1, 9]. The actor and critic updates take place along a single simulated [sample path](http://terms.naer.edu.tw/detail/2124026/) of the controlled Markov chain. Let $r_k, z_k, \lambda_k$ be the parameters of the critic and $\theta_k$ the parameter vector of the actor at time $k$. Let $(X_k,U_k)$ be the state-action pair at that time, and let $X_{k+1}$ be the new state obtained after action $U_k$ is applied. A new action $U_{k+1}$ is then generated according to the RSP corresponding to the actor parameter vector $\theta_k$. The critic carries out an update similar to the average cost temporal-difference method of [12]:
$$
\lambda_{k+1} = \lambda_k + \gamma_k(g(X_k,U_k) - \lambda_k)
$$
$$
r_{k+1} = r_k + \gamma_k(g(X_k,U_k)-\lambda_k + Q_{r_{k}}^{\theta_{k}}(X_{k+1},U_{k+1}) - Q_{r_k}^{\theta_k}(X_k,U_k))z_k
$$
:::
:::info
(Here, $\gamma_k$ is a positive stepsize parameter.) The two variants of the critic use different ways of updating $z_k$:
* TD(1) Critic: Let $x^*$ be a state in $S$.
$$
\begin{align}
z_{k+1} &= z_k + \phi_{\theta_k}(X_{k+1}, U_{k+1}) \text{ if $X_{k+1} \neq x^*$} \\
&= \phi_{\theta_k}(X_{k+1}, U_{k+1}), \text{ otherwise. }
\end{align}
$$
* TD($\alpha$) Critic, $0 \leq \alpha < 1$:
$$
z_{k+1} = \alpha z_k + \phi_{\theta_k}(X_{k+1}, U_{k+1})
$$
* Actor: Finally, the actor updates its parameter vector by letting
$$
\theta_{k+1} = \theta_k - \beta_k\Gamma(r_k)Q^{\theta_k}_{r_k}(X_{k+1}, U_{k+1})\psi_{\theta_k}(X_{k+1}, U_{k+1})
$$
Here, $\beta_k$ is a positive stepsize and $\Gamma(r_k) > 0$ is a normalization factor satisfying:
* (A3) $\Gamma(\cdot)$ is Lipschitz continuous.
* (A4) There exists $C > 0$ such that
$$
\Gamma(r) \leq \dfrac{C}{1 + \Vert r \Vert}
$$
:::
:::success
(Here, $\gamma_k$ is a positive stepsize parameter, i.e., a positive learning rate.) The two variants of the critic use different ways of updating $z_k$:
* TD(1) Critic: Let $x^*$ be a state in $S$.
$$
\begin{align}
z_{k+1} &= z_k + \phi_{\theta_k}(X_{k+1}, U_{k+1}) \text{ if $X_{k+1} \neq x^*$} \\
&= \phi_{\theta_k}(X_{k+1}, U_{k+1}), \text{ otherwise. }
\end{align}
$$
* TD($\alpha$) Critic, $0 \leq \alpha < 1$:
$$
z_{k+1} = \alpha z_k + \phi_{\theta_k}(X_{k+1}, U_{k+1})
$$
* Actor: Finally, the actor updates its parameter vector by letting
$$
\theta_{k+1} = \theta_k - \beta_k\Gamma(r_k)Q^{\theta_k}_{r_k}(X_{k+1}, U_{k+1})\psi_{\theta_k}(X_{k+1}, U_{k+1})
$$
Here, $\beta_k$ is a positive stepsize and $\Gamma(r_k) > 0$ is a normalization factor satisfying:
* (A3) $\Gamma(\cdot)$ is Lipschitz continuous.
* (A4) There exists $C > 0$ such that
$$
\Gamma(r) \leq \dfrac{C}{1 + \Vert r \Vert}
$$
:::
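:::warning
Below is a minimal sketch (my own, not the paper's implementation) of one simulation step of the TD($\alpha$) variant, taking $\phi_\theta=\psi_\theta$ and the normalization $\Gamma(r)=C/(1+\Vert r\Vert)$, which satisfies (A3)-(A4). The callables `policy`, `score`, and `step` are assumptions: `policy(theta, x)` returns $\mu_\theta(x,\cdot)$, `score(theta, x, u)` returns $\psi_\theta(x,u)$ (e.g., the helpers from the earlier softmax sketch), and `step(x, u)` simulates one transition, returning $X_{k+1}$ and the cost $g(X_k,U_k)$. The hard-coded stepsizes are one choice consistent with assumption (A6) in the next section.
```python
import numpy as np

def gamma_factor(r, C=1.0):
    """One normalization factor satisfying (A3)-(A4): Gamma(r) = C / (1 + ||r||)."""
    return C / (1.0 + np.linalg.norm(r))

def actor_critic_step(x, u, theta, r, z, lam, k, policy, score, step, rng,
                      alpha=0.9, C=1.0):
    x_next, cost = step(x, u)                              # X_{k+1}, g(X_k, U_k)
    mu_next = policy(theta, x_next)
    u_next = rng.choice(len(mu_next), p=mu_next)           # U_{k+1} ~ mu_theta(X_{k+1}, .)

    phi = score(theta, x, u)                               # phi_theta(X_k, U_k)
    phi_next = score(theta, x_next, u_next)                # phi_theta(X_{k+1}, U_{k+1})

    gamma_k = 1.0 / (k + 1) ** 0.6                         # critic stepsize (faster scale)
    beta_k = 1.0 / (k + 1)                                 # actor stepsize (slower scale)

    # Critic: average-cost TD update of lambda_k and r_k (using the current trace z_k).
    td = cost - lam + np.dot(r, phi_next) - np.dot(r, phi)
    lam_new = lam + gamma_k * (cost - lam)
    r_new = r + gamma_k * td * z
    z_new = alpha * z + phi_next                           # TD(alpha) eligibility trace

    # Actor: step against Gamma(r_k) * Q_{r_k}(X_{k+1}, U_{k+1}) * psi(X_{k+1}, U_{k+1}).
    theta_new = theta - beta_k * gamma_factor(r, C) * np.dot(r, phi_next) * phi_next

    return x_next, u_next, theta_new, r_new, z_new, lam_new
```
:::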
:::info
The above presented algorithms are only two out of many variations. For instance, one could also consider "episodic" problems in which one starts from a given initial state and runs the process until a random termination time (at which time the process is reinitialized at $x^*$), with the objective of minimizing the expected cost until termination. In this setting, the average cost estimate $\lambda_k$ is unnecessary and is removed from the critic update formula. If the critic parameter $r_k$ were to be reinitialized each time that $x^*$ is entered, one would obtain a method closely related to Williams' REINFORCE algorithm [13]. Such a method does not involve any value function learning, because the observations during one episode do not affect the critic parameter $r$ during another episode. In contrast, in our approach, the observations from all past episodes affect current critic parameter $r$, and in this sense critic is "learning". This can be advantageous because, as long as $\theta$ is slowly changing, the observations from recent episodes carry useful information on the $q$-function under the current policy.
:::
:::success
The algorithms presented above are only two of many possible variations. For instance, one could also consider "episodic" problems, in which one starts from a given initial state and runs the process until a random termination time (at which point the process is reinitialized at $x^*$), with the objective of minimizing the expected cost until termination. In this setting, the average cost estimate $\lambda_k$ is unnecessary and is removed from the critic update formula. If the critic parameter $r_k$ were reinitialized each time $x^*$ is entered, one would obtain a method closely related to Williams' REINFORCE algorithm [13]. Such a method involves no value function learning, because the observations made during one episode do not affect the critic parameter $r$ during another episode. In contrast, in our approach the observations from all past episodes affect the current critic parameter $r$, and in this sense the critic is "learning". This can be advantageous because, as long as $\theta$ changes slowly, the observations from recent episodes carry useful information about the $q$-function under the current policy.
:::
## 4 Convergence of actor-critic algorithms
:::info
Since our actor-critic algorithms are gradient-based, one cannot expect to prove convergence to a globally optimal policy (within the given class of RSP's). The best that one could hope for is the convergence of $\nabla\lambda(\theta)$ to zero; in practical terms, this will usually translate to convergence to a local minimum of $\lambda(\theta)$. Actually, because the $TD(\alpha)$ critic will generally converge to an approximation of the desired projection of the value function, the corresponding convergence result is necessarily weaker, only guaranteeing that $\nabla\lambda(\theta_k)$ becomes small (infinitely often). Let us now introduce some further assumptions.
:::
:::success
Since our actor-critic algorithms are gradient-based, one cannot expect to prove convergence to a globally optimal policy (within the given class of RSP's). The best one could hope for is convergence of $\nabla\lambda(\theta)$ to zero; in practice this usually translates into convergence to a local minimum of $\lambda(\theta)$. In fact, because the $TD(\alpha)$ critic generally converges only to an approximation of the desired projection of the value function, the corresponding convergence result is necessarily weaker, guaranteeing only that $\nabla\lambda(\theta_k)$ becomes small (infinitely often). Let us now introduce some further assumptions.
:::
:::info
(A5) For each $\theta \in \mathbb{R}^n$, we define an $m \times m$ matrix $G(\theta)$ by
$$
G(\theta) = \sum_{x,u}\eta_\theta(x,u)\phi_\theta(x,u)\phi_\theta(x,u)^T
$$
We assume that $G(\theta)$ is uniformly positive definite, that is, there exists some $\epsilon_1>0$ such that for all $r\in\mathbb{R}^m$ and $\theta \in\mathbb{R}^n$
$$
r^TG(\theta)r\geq \epsilon_1\Vert r \Vert^2
$$
:::
:::success
(A5) For each $\theta \in \mathbb{R}^n$, we define an $m \times m$ matrix $G(\theta)$ by
$$
G(\theta) = \sum_{x,u}\eta_\theta(x,u)\phi_\theta(x,u)\phi_\theta(x,u)^T
$$
We assume that $G(\theta)$ is uniformly [positive definite](http://terms.naer.edu.tw/detail/2122050/); that is, there exists some $\epsilon_1>0$ such that for all $r\in\mathbb{R}^m$ and $\theta \in\mathbb{R}^n$,
$$
r^TG(\theta)r\geq \epsilon_1\Vert r \Vert^2
$$
:::
:::info
(A6) We assume that the stepsize sequences $\left\{\gamma_k \right\}$ and $\left\{\beta_k \right\}$ are positive, nonincreasing, and satisfy
$$
\delta_k > 0, \forall{k}, \sum_{k}\delta_k =\infty, \sum_k \delta^2_k < \infty
$$
where $\delta_k$ stands for either $\beta_k$ or $\gamma_k$. We also assume that
$$
\dfrac{\beta_k}{\gamma_k} \to 0
$$
Note that the last assumption requires that the actor parameters be updated at a time scale slower than that of critic.
:::
:::success
(A6) We assume that the stepsize sequences $\left\{\gamma_k \right\}$ and $\left\{\beta_k \right\}$ are positive, nonincreasing, and satisfy
$$
\delta_k > 0, \forall{k}, \sum_{k}\delta_k =\infty, \sum_k \delta^2_k < \infty
$$
where $\delta_k$ stands for either $\beta_k$ or $\gamma_k$. We also assume that
$$
\dfrac{\beta_k}{\gamma_k} \to 0
$$
Note that the last assumption requires the actor parameters to be updated on a slower time scale than that of the critic.
:::
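:::warning
For concreteness, here is one (hypothetical) pair of stepsize schedules satisfying (A6): both sequences are positive and nonincreasing, $\sum_k\delta_k=\infty$, $\sum_k\delta_k^2<\infty$, and $\beta_k/\gamma_k=(k+1)^{-0.4}\to 0$, so the actor indeed moves on the slower time scale.
```python
def gamma(k):
    """Critic stepsize: the sum diverges (exponent 0.6 <= 1), the sum of squares converges."""
    return 1.0 / (k + 1) ** 0.6

def beta(k):
    """Actor stepsize: decays faster, so beta(k) / gamma(k) -> 0."""
    return 1.0 / (k + 1)
```
:::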
:::info
**Theorem 2.** In an actor-critic algorithm with a $TD(1)$ critic,
$$
\lim\inf_k\Vert \nabla \lambda(\theta_k) \Vert = 0 \space \text{ w.p. 1.}
$$
Furthermore, if $\left\{\theta_k \right\}$ is bounded w.p. 1 then
$$
\lim_k \Vert\nabla\lambda(\theta_k)\Vert = 0 \space \text{ w.p. 1.}
$$
:::
:::success
**Theorem 2.** For an actor-critic algorithm with a $TD(1)$ critic,
$$
\lim\inf_k\Vert \nabla \lambda(\theta_k) \Vert = 0 \space \text{ w.p. 1.}
$$
Furthermore, if $\left\{\theta_k \right\}$ is bounded w.p. 1, then
$$
\lim_k \Vert\nabla\lambda(\theta_k)\Vert = 0 \space \text{ w.p. 1.}
$$
:::
:::warning
"w.p. 1" stands for "with probability 1".
:::
:::info
**Theorem 3.** For every $\epsilon > 0$, there exists $\alpha$ sufficiently close to 1, such that $\lim\inf_k\Vert\nabla\lambda(\theta_k)\Vert\leq\epsilon \text{ w.p. 1}$.
Note that the theoretical guarantees appear to be stronger in the case of the TD(1) critic. However, we expect that TD($\alpha$) will perform better in practice because of much smaller variance for the parameter $r_k$. (Similar issues arise when considering actor-only algorithms. The experiments reported in [7] indicate that introducing a forgetting factor $\alpha < 1$ can result in much faster convergence, with very little loss of performance.) We now provide an overview of the proofs of these theorems. Since $\beta_k / \gamma_k \to 0$, the size of the actor updates becomes negligible compared to the size of the critic updates. Therefore the actor looks stationary, as far as the critic is concerned. Thus, the analysis in [1] for the TD(1) critic and the analysis in [12] for the TD($\alpha$) critic (with $\alpha < 1$) can be used, with appropriate modifications, to conclude that the critic's approximation of $\prod_{\theta_k}q_{\theta_k}$ will be "asymptotically correct". If $r(\theta)$ denotes the value to which the critic converges when the actor parameters are fixed at $\theta$, then the update for the actor can be rewritten as
$$
\theta_{k+1} = \theta_k - \beta_k \Gamma(r(\theta_k))Q^{\theta_k}_{r(\theta_k)}(X_{k+1}, U_{k+1})\psi_{\theta_k}(X_{k+1}, U_{k+1}) + \beta_k e_k
$$
where $e_k$ is an error that becomes asymptotically negligible. At this point, standard proof techniques for stochastic approximation algorithms can be used to complete the proof.
:::
:::success
**Theorem 3.** For every $\epsilon > 0$, there exists $\alpha$ sufficiently close to 1 such that $\lim\inf_k\Vert\nabla\lambda(\theta_k)\Vert\leq\epsilon \text{ w.p. 1}$.
Note that the theoretical guarantees appear to be stronger in the case of the TD(1) critic. However, we expect TD($\alpha$) to perform better in practice, because the parameter $r_k$ has a much smaller variance. (Similar issues arise with actor-only algorithms. The experiments reported in [7] indicate that introducing a forgetting factor $\alpha < 1$ can lead to much faster convergence, with very little loss of performance.) We now give an overview of the proofs of these theorems. Since $\beta_k / \gamma_k \to 0$, the size of the actor updates becomes negligible compared to that of the critic updates, so as far as the critic is concerned the actor looks [stationary](http://terms.naer.edu.tw/detail/2125279/). Thus, the analysis in [1] for the TD(1) critic and the analysis in [12] for the TD($\alpha$) critic (with $\alpha < 1$) can be used, with appropriate modifications, to conclude that the critic's approximation of $\prod_{\theta_k}q_{\theta_k}$ will be "asymptotically correct". If $r(\theta)$ denotes the value to which the critic converges when the actor parameters are fixed at $\theta$, then the actor update can be rewritten as
$$
\theta_{k+1} = \theta_k - \beta_k \Gamma(r(\theta_k))Q^{\theta_k}_{r(\theta_k)}(X_{k+1}, U_{k+1})\psi_{\theta_k}(X_{k+1}, U_{k+1}) + \beta_k e_k
$$
where $e_k$ is an error that becomes [asymptotically](http://terms.naer.edu.tw/detail/19315491/) negligible. At this point, standard proof techniques for stochastic approximation algorithms can be used to complete the proof.
:::
## 5 Conclusions
:::info
The key observation in this paper is that in actor-critic methods, the actor parameterization and the critic parameterization need not, and should not be chosen independently. Rather, an appropriate approximation architecture for the critic is directly prescribed by the parameterization used in actor.
:::
:::success
The key observation of this paper is that in actor-critic methods, the actor parameterization and the critic parameterization need not, and should not, be chosen independently. Rather, an appropriate approximation architecture for the critic is directly prescribed by the parameterization used in the actor.
:::
:::info
Capitalizing on the above observation, we have presented a class of actor-critic algorithms, aimed at combining the advantages of actor-only and critic-only methods. In contrast to existing actor-critic methods, our algorithms apply to high-dimensional problems (they do not rely on lookup table representations), and are mathematically sound in the sense that they possess certain convergence properties.
:::
:::success
Building on the above observation, we have presented a class of actor-critic algorithms aimed at combining the advantages of actor-only and critic-only methods. In contrast to existing actor-critic methods, our algorithms apply to high-dimensional problems (they do not rely on lookup-table representations) and are mathematically sound in the sense that they possess certain convergence properties.
:::