---
tags: RL
---
# Reinforcement Learning | Policy Gradient
Previous note:
https://hackmd.io/@tsen159/RLNote
It covers an introduction to RL, Markov decision processes, and model-based evaluation and control.
___
Model-free RL algorithms can be divided, according to the objective we optimize, into value-based, policy-based, and hybrid actor-critic methods:
* Value-based: methods built on a value function, which try to learn the value function of the optimal policy directly rather than the optimal policy itself
* Policy-based: methods that learn the policy directly; compared with value-based methods they are better suited to high-dimensional or continuous action spaces and can learn stochastic policies
* Actor-critic: hybrid methods that learn both a policy (actor) and a value function (critic)

## Policy Optimization
We treat policy-based RL as an optimization problem. Consider a parametrized policy $\pi_\theta$ and define
$$
V^{\pi_\theta}(\mu) := \mathbb{E}_{s_0 \sim \mu}[V^{\pi_\theta}(s_0)]
$$
The goal is to find the policy parameter $\theta$ that maximizes $V^{\pi_{\theta}}(\mu)$, i.e., to find
$$
\arg \max_\theta V^{\pi_{\theta}}(\mu)
$$
> $\mu$ is the distribution of the initial state $s_0$; $\mathbb{E}_{s_0 \sim \mu}$ can be read as a weighted sum over all possible initial states $s_0$.

Policy optimization methods are either gradient-based or gradient-free. Gradient-free methods include the cross-entropy method, hill climbing, and so on; gradient-based methods are usually more efficient, and policy gradient is one such gradient-based method.
:::info
An example of policy optimization is as follows:
1. Parametrize the policy by a neural network.
$$
S \rightarrow \pi_\theta(\cdot) \rightarrow \text{action distribution}
$$
2. Update $\pi_\theta$ iteratively, e.g., by gradient ascent
$$
\theta \leftarrow \theta + \alpha \nabla_\theta V^{\pi_{\theta}}(\mu)
$$
:::
How can the policy be parametrized? Some examples of parametric policies (a short code sketch follows the list):
**Discrete action space**
* Softmax policies:
$$
\pi_\theta(a|s) = \frac{\exp(\theta_{s, a})}{\sum_{a^\prime \in A} \exp(\theta_{s, a^\prime})}
$$
> One parameter $\theta_{s,a}$ for every $(s, a)$ pair
* Linear softmax policies:
$$
\pi_\theta(a|s) = \frac{\exp(\theta^T\phi_{s, a})}{\sum_{a^\prime \in A} \exp(\theta^T\phi_{s, a^\prime})}
$$
> A feature vector $\phi_{s,a}$ designed for every $(s, a)$ pair
* Neural softmax policies:
$$
\pi_\theta(a|s) = \frac{\exp(f_\theta(s, a))}{\sum_{a^\prime \in A} \exp(f_\theta(s, a^\prime))}
$$
> $f$ is a neural network

**Continuous action space**
* Gaussian policies:
$$
a \sim N(f_\theta(s), \sigma^2)
$$
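As a concrete illustration, here is a minimal numpy sketch of sampling from a linear softmax policy (discrete actions) and a Gaussian policy (continuous actions); the feature vectors, dimensions, and $\sigma$ are made up purely for the example.
```python
import numpy as np

rng = np.random.default_rng(0)

# --- Linear softmax policy (discrete actions) ---
n_actions, d = 3, 4                      # hypothetical sizes
theta = rng.normal(size=d)               # policy parameters
phi = rng.normal(size=(n_actions, d))    # made-up features phi(s, a) for one fixed state s

def softmax_policy(theta, phi):
    """pi_theta(a|s) proportional to exp(theta^T phi(s, a))."""
    logits = phi @ theta
    logits -= logits.max()               # numerical stability
    p = np.exp(logits)
    return p / p.sum()

probs = softmax_policy(theta, phi)
a = rng.choice(n_actions, p=probs)       # sample an action from the distribution

# --- Gaussian policy (continuous actions) ---
def gaussian_policy_sample(theta, phi_s, sigma=0.5):
    """a ~ N(f_theta(s), sigma^2) with a linear mean f_theta(s) = theta^T phi(s)."""
    mean = theta @ phi_s
    return rng.normal(mean, sigma)

phi_s = rng.normal(size=d)               # made-up state features phi(s)
a_cont = gaussian_policy_sample(theta, phi_s)
print(probs, a, a_cont)
```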
## Policy Gradient
Define $V(\theta) = V^{\pi_\theta}(\mu)$. Policy gradient algorithms seek to maximize $V(\theta)$ by gradient ascent, updating the parameters as
$$
\theta \leftarrow \theta + \alpha \nabla_\theta V(\theta)
$$
where $\nabla_\theta V(\theta)$ is the policy gradient
$$
\nabla_\theta V(\theta) =
\left[
\begin{array}{ccc}
\frac{\partial V(\theta)}{\partial \theta_1} \\
\frac{\partial V(\theta)}{\partial \theta_2} \\
\vdots \\
\frac{\partial V(\theta)}{\partial \theta_n}
\end{array}
\right]
$$
Definition:
:::warning
**Sample return** $R(\tau)$ along a trajectory $\tau = (s_0, a_0, r_1, ...)$
$$
R(\tau) := \sum_{t=0}^\infty \gamma^t \cdot r_{t+1}(\tau)
$$
:::
We can then rewrite the value function defined earlier as:
$$
V(\theta) = V^{\pi_\theta}(\mu) = \mathbb{E}_{\tau \sim P^{\pi_\theta}_\mu}[R(\tau)] = \sum_\tau R(\tau) P^{\pi_\theta}_\mu (\tau)
$$
where $P^{\pi_\theta}_\mu$ is the distribution of $\tau$, and $P^{\pi_\theta}_\mu (\tau)$ is the probability of a particular trajectory $\tau$ when executing policy $\pi_\theta$.
The policy gradient is therefore
$$
\begin{aligned}
\nabla_\theta V(\theta) &= \sum_\tau R(\tau) \nabla_\theta P^{\pi_\theta}_\mu (\tau) \\
&= \sum_\tau R(\tau) (P^{\pi_\theta}_\mu (\tau) \cdot \nabla_\theta \log P^{\pi_\theta}_\mu (\tau)) \ \ \ \ \small\text{//by chain rule}\\
&= \sum_\tau R(\tau) \Bigr(P^{\pi_\theta}_\mu (\tau) \cdot \nabla_\theta \log(\overbrace{\mu(s_0)}^{\text{initial state dist.}} \pi_\theta(a_0|s_0)P(s_1|s_0, a_0)\pi_\theta(a_1|s_1) ...) \Bigr) \\
&= \sum_\tau R(\tau) \Bigr(P^{\pi_\theta}_\mu (\tau) \cdot \nabla_\theta \bigl(\log \mu(s_0)+\sum_{t=0}^\infty \bigl(\log \pi_\theta(a_t|s_t) +\log P(s_{t+1}|s_t, a_t)\bigr)\bigr) \Bigr) \\
&= \sum_\tau R(\tau) \Bigr(P^{\pi_\theta}_\mu (\tau) \cdot \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) \Bigr) \ \ \ \ \small\text{//derivative is 0 for terms independent of $\theta$}\\
&= \mathbb{E}_{\tau \sim P^{\pi_\theta}_\mu} \Bigr[R(\tau) \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta (a_t|s_t) \Bigr]
\end{aligned}
$$
> By the chain rule, $\nabla_x \log f(x) = \frac{1}{f(x)}\nabla_x f(x)$

Here $\nabla_\theta \log \pi_\theta (a|s)$ is called the score function.
The result we derived,
$$
\nabla_\theta V^{\pi_\theta}(\mu) = \mathbb{E}_{\tau \sim P^{\pi_\theta}_\mu} \Bigr[R(\tau) \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta (a_t|s_t) \Bigr]
$$
is one way of computing the policy gradient, known as the total reward method.
:::info
**Total Reward Policy Gradient** (P1)
$$
\nabla_\theta V^{\pi_\theta}(\mu) = \mathbb{E}_{\tau \sim P^{\pi_\theta}_\mu} \Bigr[R(\tau) \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta (a_t|s_t) \Bigr]
$$
:::
Intuitively, this expression says that a trajectory $\tau$ whose return $R(\tau)$ is large relative to other trajectories contributes more to the gradient.
___
The policy gradient can also be written in other forms:
:::info
**REINFORCE** (P2)
$$
\nabla_\theta V^{\pi_\theta}(\mu) = \mathbb{E}_{\tau \sim P^{\pi_\theta}_\mu} \Bigr[\sum_{t=0}^\infty \gamma^t Q^{\pi_\theta}(s_t, a_t) \nabla_\theta \log \pi_\theta (a_t|s_t) \Bigr]
$$
:::
***proof***:
Recall that
$$
V^{\pi_\theta}(s_0) = \sum_{a_0 \in A} \pi_\theta (a_0|s_0) Q^{\pi_\theta} (s_0, a_0)
$$
therefore,
$$
\begin{aligned}
\nabla_\theta V^{\pi_\theta}(s_0) &= \sum_{a_0} \nabla_\theta (\pi_\theta (a_0|s_0) Q^{\pi_\theta}(s_0, a_0)) \\
&= \underbrace{\sum_{a_0} \nabla_\theta (\pi_\theta (a_0|s_0)) Q^{\pi_\theta}(s_0, a_0)}_{(1)} + \underbrace{\sum_{a_0} \pi_\theta (a_0|s_0) \nabla_\theta Q^{\pi_\theta}(s_0, a_0)}_{(2)} \ \ \ \small\text{//by product rule} \\
\\
(1) &= \sum_{a_0} \pi_\theta (a_0|s_0) \cdot \nabla_\theta (\log \pi_\theta (a_0|s_0)) \cdot Q^{\pi_\theta}(s_0, a_0) \ \ \ \small\text{//by chain rule} \\
&= \mathbb{E}_{\tau \sim P^{\pi_\theta}_{s_0}} \Bigr[ \nabla_\theta \log \pi_\theta (a_0|s_0) \cdot Q^{\pi_\theta}(s_0, a_0) \Bigr] \\
\\
(2) &= \sum_{a_0} \pi_\theta (a_0|s_0) \nabla_\theta Q^{\pi_\theta}(s_0, a_0) \\
&= \sum_{a_0}\pi_\theta (a_0|s_0) \nabla_\theta (r(s_0, a_0) + \gamma \sum_{s_1} P(s_1|s_0, a_0)\cdot V^{\pi_\theta}(s_1)) \\
&= \sum_{a_0}\pi_\theta (a_0|s_0) (\gamma \sum_{s_1} P(s_1|s_0, a_0)\cdot \nabla_\theta V^{\pi_\theta}(s_1))\\
&= \mathbb{E}_{\tau \sim P^{\pi_\theta}_{s_0}} [\gamma \nabla_\theta V^{\pi_\theta} (s_1)]\\
\\
=> \nabla_\theta V^{\pi_\theta}(s_0) &= \mathbb{E}_{\tau \sim P^{\pi_\theta}_{s_0}} \Bigr[ \nabla_\theta \log \pi_\theta (a_0|s_0) \cdot Q^{\pi_\theta}(s_0, a_0) \Bigr] + \mathbb{E}_{\tau \sim P^{\pi_\theta}_{s_0}} [\gamma \nabla_\theta V^{\pi_\theta} (s_1)]
\end{aligned}
$$
which is a recursion.
Unrolling it through $\nabla_\theta V^{\pi_\theta}(s_0), \nabla_\theta V^{\pi_\theta}(s_1), \dots$ and taking the expectation over $s_0 \sim \mu$, we get
$$
\nabla_\theta V^{\pi_\theta}(\mu) = \mathbb{E}_{\tau \sim P^{\pi_\theta}_\mu} \Bigr[\sum_{t=0}^\infty \gamma^t Q^{\pi_\theta}(s_t, a_t) \nabla_\theta \log \pi_\theta (a_t|s_t) \Bigr]
$$
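Spelling out the unrolling (every expectation below is over $\tau \sim P^{\pi_\theta}_{s_0}$):
$$
\begin{aligned}
\nabla_\theta V^{\pi_\theta}(s_0)
&= \mathbb{E}\bigl[\nabla_\theta \log \pi_\theta(a_0|s_0)\, Q^{\pi_\theta}(s_0,a_0)\bigr] + \gamma\, \mathbb{E}\bigl[\nabla_\theta V^{\pi_\theta}(s_1)\bigr] \\
&= \mathbb{E}\bigl[\nabla_\theta \log \pi_\theta(a_0|s_0)\, Q^{\pi_\theta}(s_0,a_0)\bigr] + \gamma\, \mathbb{E}\bigl[\nabla_\theta \log \pi_\theta(a_1|s_1)\, Q^{\pi_\theta}(s_1,a_1)\bigr] + \gamma^2\, \mathbb{E}\bigl[\nabla_\theta V^{\pi_\theta}(s_2)\bigr] \\
&= \dots = \mathbb{E}\Bigl[\sum_{t=0}^\infty \gamma^t Q^{\pi_\theta}(s_t,a_t)\, \nabla_\theta \log \pi_\theta(a_t|s_t)\Bigr]
\end{aligned}
$$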
___
Definition:
:::warning
**Discounted state visitation distribution (occupancy measure)**
$$
d^\pi_{s_0}(s) := (1- \gamma ) \sum_{t=0}^\infty \gamma^t P(s_t = s| s_0, \pi)
$$
:::
This is the probability of visiting state $s$ when starting from $s_0$ and following policy $\pi$. The whole expression can be read as a $\gamma^t$-weighted sum, over all possible timesteps, of the probability of being in $s$ at that timestep, with $1-\gamma$ as a normalizing factor.
Averaging over the initial state distribution, the visitation probability of a state $s$ is then:
$$
d^\pi_\mu(s) := \mathbb{E}_ {s_0 \sim \mu} [d^\pi_{s_0}(s)]
$$
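As a sanity check, here is a small numpy sketch that computes $d^\pi_{s_0}$ for a made-up two-state Markov chain induced by some fixed policy, and verifies that the $(1-\gamma)$ factor makes it a proper distribution:
```python
import numpy as np

gamma = 0.9
# Hypothetical 2-state Markov chain induced by a fixed policy pi:
# P[s, s'] = P(s_{t+1} = s' | s_t = s, pi)
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])

def occupancy(P, s0, gamma, horizon=1000):
    """d^pi_{s0}(s) = (1 - gamma) * sum_t gamma^t * P(s_t = s | s0, pi)."""
    dist = np.zeros(P.shape[0])
    dist[s0] = 1.0                      # P(s_0 = s | s_0)
    d = np.zeros(P.shape[0])
    for t in range(horizon):
        d += (gamma ** t) * dist
        dist = dist @ P                 # one-step state distribution update
    return (1 - gamma) * d

d = occupancy(P, s0=0, gamma=gamma)
print(d, d.sum())                       # the entries sum to (almost exactly) 1
```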
The policy gradient can then be expressed as:
:::info
**Q-value and discounted state visitation** (P3)
$$
\nabla_\theta V^{\pi_\theta}(\mu) = \frac{1}{1- \gamma} \mathbb{E}_{s \sim d^{\pi_\theta}_\mu} \mathbb{E}_{a \sim \pi_\theta(\cdot | s)} \Bigr[ Q^{\pi_\theta} (s, a) \nabla_\theta \log \pi_\theta (a|s) \Bigr]
$$
:::
In this form the policy gradient no longer refers to whole trajectories; it only involves individual $(s, a)$ pairs.
To summarize the policy gradient expressions introduced so far:
:::info
**Expression of Policy Gradient (Policy Gradient Theorem)**
***Total Reward***
$$
\nabla_\theta V^{\pi_\theta}(\mu) = \mathbb{E}_{\tau \sim P^{\pi_\theta}_\mu} \Bigr[R(\tau) \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta (a_t|s_t) \Bigr]
$$
***REINFORCE***
$$
\nabla_\theta V^{\pi_\theta}(\mu) = \mathbb{E}_{\tau \sim P^{\pi_\theta}_\mu} \Bigr[\sum_{t=0}^\infty \gamma^t Q^{\pi_\theta}(s_t, a_t) \nabla_\theta \log \pi_\theta (a_t|s_t) \Bigr]
$$
***Q-value and discounted state visitation***
$$
\nabla_\theta V^{\pi_\theta}(\mu) = \frac{1}{1- \gamma} \mathbb{E}_{s \sim d^{\pi_\theta}_\mu} \mathbb{E}_{a \sim \pi_\theta(\cdot | s)} \Bigr[ Q^{\pi_\theta} (s, a) \nabla_\theta \log \pi_\theta (a|s) \Bigr]
$$
:::
## Score Function
When discussing the policy gradient above, we simply assumed that $\pi_\theta (a|s)$ is differentiable. Below are some differentiable policy classes and their score functions:
### Linear Softmax Policy
The probability of each action is proportional to an exponentiated weight.
$$
\pi_\theta(a|s) = \frac{\exp(\theta^T\phi_{s, a})}{\sum_{a^\prime \in A} \exp(\theta^T\phi_{s, a^\prime})}
$$
Its score function is
$$
\nabla_\theta \log \pi_\theta (a|s) = \phi_{s,a} - \mathbb{E}_{a^\prime \sim \pi_\theta(\cdot|s)} [\phi_{s,a^\prime}]
$$
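A quick numerical check of this identity, comparing the analytic score with a finite-difference gradient (the features and dimensions are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(1)
n_actions, d = 4, 3
theta = rng.normal(size=d)
phi = rng.normal(size=(n_actions, d))        # made-up features phi(s, a) for one state

def pi(theta):
    logits = phi @ theta
    p = np.exp(logits - logits.max())
    return p / p.sum()

a = 2                                        # some action
# Analytic score: phi(s,a) - E_{a'~pi}[phi(s,a')]
score = phi[a] - pi(theta) @ phi

# Finite-difference gradient of log pi(a|s) w.r.t. theta
eps = 1e-6
num = np.array([
    (np.log(pi(theta + eps * e)[a]) - np.log(pi(theta - eps * e)[a])) / (2 * eps)
    for e in np.eye(d)
])
print(np.allclose(score, num, atol=1e-5))    # True
```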
### Gaussian Policy
Each action is drawn from a Gaussian distribution:
$$
a \sim N(f_\theta(s), \sigma^2)
$$
The mean is linear in the state features, $f_\theta(s) = \phi(s)^T \theta$; the variance can be a fixed $\sigma^2$ or parametrized as well.
Its score function is
$$
\nabla_\theta \log \pi_\theta (a|s) = \frac{(a - f_\theta(s))\phi(s)}{\sigma^2}
$$
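The same finite-difference check for the Gaussian score, assuming a linear mean $f_\theta(s) = \theta^T \phi(s)$ and a fixed $\sigma$ (all values are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma = 3, 0.7
theta, phi_s = rng.normal(size=d), rng.normal(size=d)  # made-up parameters / state features
a = 0.4                                                # some sampled action

def log_pi(theta):
    mean = theta @ phi_s
    return -0.5 * ((a - mean) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

score = (a - theta @ phi_s) * phi_s / sigma ** 2       # analytic score function
eps = 1e-6
num = np.array([(log_pi(theta + eps * e) - log_pi(theta - eps * e)) / (2 * eps)
                for e in np.eye(d)])
print(np.allclose(score, num, atol=1e-5))              # True
```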
## Stochastic Policy Gradient
In practice, computing the exact policy gradient is very difficult: in the model-free setting we do not know the trajectory distribution $P^{\pi_\theta}_\mu$ or the state distribution $d^{\pi_\theta}_\mu$.
We therefore use sampling to estimate the expectation:
$$
\begin{aligned}
\nabla_\theta V^{\pi_\theta}(\mu) &= \mathbb{E}_{\tau \sim P^{\pi_\theta}_\mu} \Bigr[R(\tau) \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta (a_t|s_t) \Bigr] \\
&\approx R(\tau^\prime) \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta (a_t^\prime|s_t^\prime) \\
\end{aligned}
$$
$g(\theta_k, \tau^\prime) = R(\tau^\prime) \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta (a_t^\prime|s_t^\prime)$ is called a stochastic estimate of the PG, and $g(\theta_k, \tau^\prime)$ is an unbiased estimate of the exact PG $\nabla_\theta V^{\pi_\theta} (\mu)|_{\theta = \theta_k}$.
:::info
**Unbiased Estimate**
$x$ is an unbiased estimate of some deterministic quantity $\bar{x}$ if
$$
\mathbb{E}[x] = \bar{x}
$$
which is equivalent to
$$
x = \bar{x}+ \epsilon \ \text{where} \ \mathbb{E}[\epsilon] = 0
$$
:::
We can estimate $\nabla_\theta V^{\pi_\theta}(\mu)$ by sampling one or more trajectories; with $n$ sampled trajectories $\tau^{(1)}, \dots, \tau^{(n)}$, for example:
$$
\nabla_\theta V^{\pi_\theta}(\mu) \approx \frac{1}{n} \sum_{i=1}^n R(\tau^{(i)}) \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta (a_t^{(i)}|s_t^{(i)})
$$
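A minimal sketch of this $n$-trajectory estimator. It assumes each trajectory is given as a list of `(state, action, reward)` tuples and that a `grad_log_pi(theta, s, a)` function returning the score vector is available; both are placeholders for whatever policy class is actually used.
```python
import numpy as np

def total_reward_pg_estimate(theta, trajectories, grad_log_pi, gamma=0.99):
    """Average of R(tau) * sum_t grad log pi(a_t|s_t) over sampled trajectories."""
    grad = np.zeros_like(theta)
    for tau in trajectories:             # tau = [(s_0, a_0, r_1), (s_1, a_1, r_2), ...]
        R = sum(gamma ** t * r for t, (_, _, r) in enumerate(tau))   # sample return R(tau)
        score_sum = sum(grad_log_pi(theta, s, a) for s, a, _ in tau)
        grad += R * score_sum
    return grad / len(trajectories)
```
The resulting estimate can be plugged directly into the gradient ascent update $\theta \leftarrow \theta + \alpha \nabla_\theta V^{\pi_\theta}(\mu)$.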
### Stochastic Gradient Descent
More generally, define the optimization problem:
$$
\theta^* = \arg\min_{\theta \in \Theta} F(\theta),\ \text{where } F(\theta) := \mathbb{E}_\xi [f(\theta;\xi)]
$$
where $\xi$ captures the randomness inside the expectation and is typically unknown, so the expectation is hard to compute exactly. We can instead use stochastic gradient descent:
$$
\begin{aligned}
\theta_{k+1} &= \theta_k - \eta_k \cdot \mathbb{E}[\nabla_\theta f(\theta_k;\xi_k)] \\
&\approx \theta_k - \eta_k \cdot g(\theta_k;\xi_k)
\end{aligned}
$$
SGD typically behaves more erratically than exact gradient descent: since each parameter update relies only on a sample, the update direction fluctuates from step to step instead of following the exact gradient.
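As a tiny illustration of SGD, consider minimizing $F(\theta)=\mathbb{E}_\xi[(\theta-\xi)^2]$ with $\xi \sim N(1,1)$, whose minimizer is $\theta^* = \mathbb{E}[\xi] = 1$ (the numbers are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(3)
theta, eta = 5.0, 0.05                 # initial parameter and step size

for k in range(2000):
    xi = rng.normal(1.0, 1.0)          # draw one sample of the randomness
    g = 2 * (theta - xi)               # stochastic gradient of f(theta; xi) = (theta - xi)^2
    theta -= eta * g                   # noisy descent step
print(theta)                           # hovers around the true minimizer 1.0
```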
### Algorithms with SPG
Recall the REINFORCE policy gradient:
$$
\nabla_\theta V^{\pi_\theta}(\mu) = \mathbb{E}_{\tau \sim P^{\pi_\theta}_\mu} \Bigr[\sum_{t=0}^\infty \gamma^t Q^{\pi_\theta}(s_t, a_t) \nabla_\theta \log \pi_\theta (a_t|s_t) \Bigr]
$$
If, in each iteration, we sample one trajectory $\tau = (s_0, a_0, r_1, s_1, a_1, ...)$ under policy $\pi_\theta$, then the return from timestep $t$ is
$$
G_t(\tau) = \sum_{m=t}^\infty \gamma^{m-t} r_{m+1}
$$
and
$$
Q^{\pi_\theta}(s_t, a_t) = \mathbb{E}[G_t(\tau) \mid s_t, a_t]
$$
We then obtain an unbiased estimate $\hat{\nabla_\tau}$ of $\nabla_\theta V^{\pi_\theta}(\mu)$:
$$
\hat{\nabla_\tau} := \sum_{t=0}^\infty \gamma^t G_t(\tau) \nabla_\theta \log \pi_\theta (a_t|s_t)
$$
:::info
**REINFORCE Algorithm**
1. Initialize $\theta_0$ and step size $\eta$.
2. Sample a trajectory $\tau \sim P^{\pi_\theta}_\mu$ and make the update as
$$
\begin{aligned}
\theta_{k+1} &= \theta_k + \eta \cdot \hat{\nabla_\tau} \\
&= \theta_k + \eta\Bigr(\sum_{t=0}^\infty \gamma^t G_t(\tau) \nabla_\theta \log \pi_\theta (a_t|s_t)\Bigr)
\end{aligned}
$$
and repeat step 2 until termination.
:::
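A compact sketch of the REINFORCE algorithm with a tabular softmax policy on a made-up two-state, two-action MDP; the transition probabilities, rewards, and hyperparameters are invented purely for illustration.
```python
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions, gamma, eta = 2, 2, 0.9, 0.05
theta = np.zeros((n_states, n_actions))          # softmax policy parameters theta_{s,a}

# Hypothetical MDP: transition probabilities P[s, a, s'] and rewards r[s, a]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def pi(s):
    logits = theta[s] - theta[s].max()
    p = np.exp(logits)
    return p / p.sum()

def sample_trajectory(horizon=50):
    s, tau = rng.integers(n_states), []
    for _ in range(horizon):                     # truncate the infinite horizon
        a = rng.choice(n_actions, p=pi(s))
        s_next = rng.choice(n_states, p=P[s, a])
        tau.append((s, a, r[s, a]))
        s = s_next
    return tau

for episode in range(2000):
    tau = sample_trajectory()
    grad = np.zeros_like(theta)
    G = 0.0
    # Compute returns G_t backwards, then accumulate gamma^t * G_t * score
    for t in reversed(range(len(tau))):
        s, a, rw = tau[t]
        G = rw + gamma * G                       # G_t = r_{t+1} + gamma * G_{t+1}
        score = -pi(s)                           # d log pi(a|s) / d theta_{s, .}
        score[a] += 1.0
        grad[s] += (gamma ** t) * G * score
    theta += eta * grad                          # gradient ascent step
print(pi(0), pi(1))                              # learned action distributions
```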
Of course, the total reward policy gradient can also serve as the update rule inside the algorithm:
:::info
**Algorithm with Total Reward Method**
1. Initialize $\theta_0$ and step size $\eta$.
2. Sample a trajectory $\tau \sim P^{\pi_\theta}_\mu$ and make the update as
$$
\begin{aligned}
\theta_{k+1} &= \theta_k + \eta \cdot \bar{\nabla_\tau} \\
&= \theta_k + \eta\Bigr(R(\tau) \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta (a_t|s_t) \Bigr)
\end{aligned}
$$
and repeat step 2 until termination.
:::
Alternatively, the Q-value and discounted state visitation expression can be used as the update rule:
:::info
**Algorithm with Q-value and discounted state visitation**
1. Initialize $\theta_0$ and step size $\eta$.
2. Draw a batch $B$ of n state-action pairs from $d^{\pi_\theta}_\mu$ and $\pi_\theta$ and construct
$$
\tilde{\nabla_\tau} := \frac{1}{1-\gamma} \cdot \Bigl(\frac{1}{n} \sum_{(s,a)\in B} Q^{\pi_\theta}(s,a) \nabla_\theta \log \pi_\theta (a|s)\Bigr)
$$
then apply
$$
\theta_{k+1} = \theta_k + \eta \cdot \tilde{\nabla_\tau}
$$
:::
Note that the estimate $\tilde{\nabla_\tau}$ built with the Q-value and discounted state visitation method is *not* an unbiased estimate of the exact PG!
Imagine approximating $d^{\pi_\theta}_\mu$ by sampling many trajectories with $s_0 \sim \mu$ and throwing every $(s,a)$ pair, shuffled, into a buffer. Since we can only ever sample finitely many trajectories, the buffer is not exactly $d^{\pi_\theta}_\mu$, and therefore $\tilde{\nabla_\tau}$ is biased.
Still, as long as $n$ is large enough, $\tilde{\nabla_\tau}$ approximates the exact PG well.
___
Policy gradient methods that rely on sampling for the estimate, as above, are called Monte-Carlo policy gradient methods.
## Variance Reduction
Monte-Carlo policy gradient methods have high variance, meaning that many updates are needed before we obtain a good estimate.
To observe this phenomenon, consider a simple example with the REINFORCE policy gradient:
Suppose the environment has a single state; both actions $a$ and $b$ lead to the terminal state, with rewards 100 and 99 respectively.

Assume a softmax policy, so the parameters are $\theta(s,a)$ and $\theta(s,b)$. Under the softmax policy,
$$
\pi_\theta(a|s) = \frac{\exp(\theta_{s, a})}{\sum_{a^\prime \in A} \exp(\theta_{s, a^\prime})}
$$
Then
$$
\frac{\partial \log \pi_\theta(a|s)}{\partial \theta(s,a)} = 1 - \pi_\theta(a|s), \
\frac{\partial \log \pi_\theta(a|s)}{\partial \theta(s,b)} = - \pi_\theta(b|s)
$$
$$
\frac{\partial \log \pi_\theta(b|s)}{\partial \theta(s,a)} = - \pi_\theta(a|s), \
\frac{\partial \log \pi_\theta(b|s)}{\partial \theta(s,b)} = 1 - \pi_\theta(b|s)
$$
Suppose currently $\theta(s,a) = \theta(s,b) = 0$, so $\pi_\theta (a|s) = \pi_\theta(b|s) = 0.5$. The true PG (taking $\gamma = 1$) is then
$$
\begin{aligned}
\nabla_\theta V^{\pi_\theta}(\mu) &= 0.5\Bigr(\gamma Q(s,a) \nabla_\theta \log \pi_\theta(a|s)\Bigr) + 0.5 \Bigr(\gamma Q(s,b) \nabla_\theta \log \pi_\theta(b|s)\Bigr) \\
&= 0.5\Bigr(\gamma Q(s,a)
\left[
\begin{array}{ccc}
1 - \pi_\theta(s,a) \\
-\pi_\theta(s,b)
\end{array}
\right]
\Bigr) + 0.5 \Bigr(\gamma Q(s,b)
\left[
\begin{array}{ccc}
- \pi_\theta(s,a) \\
1 - \pi_\theta(s,b)
\end{array}
\right]
\Bigr)\\
&= 0.5\Bigr(1 \cdot 100 \cdot
\left[
\begin{array}{ccc}
0.5 \\
-0.5
\end{array}
\right]
\Bigr) + 0.5 \Bigr(1 \cdot 99 \cdot
\left[
\begin{array}{ccc}
-0.5 \\
0.5
\end{array}
\right]
\Bigr)\\
&=
\left[
\begin{array}{ccc}
0.25 \\
-0.25
\end{array}
\right]
\end{aligned}
$$
But when we sample, only one action is chosen at a time, so the estimate $\hat{\nabla_\theta}$ we obtain is
$$
\begin{equation}
\hat{\nabla_\theta} =
\left\{ \begin{aligned}
100 \cdot
\left[
\begin{array}{ccc}
0.5 \\
-0.5
\end{array}
\right] \ \text{if action a is sampled} \\
99 \cdot
\left[
\begin{array}{ccc}
-0.5 \\
0.5
\end{array}
\right] \ \text{if action b is sampled}
\end{aligned} \right.
\end{equation}
$$
Comparing the true gradient with these two possible sample gradients, we see that the sample gradients differ enormously from each other and from the true gradient.
This is exactly why estimating the policy gradient by sampling leads to high variance.
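A short numeric sketch that reproduces this example, computing the true gradient and the two possible single-sample estimates (rewards 100 and 99, $\gamma = 1$, both actions equally likely):
```python
import numpy as np

Q = np.array([100.0, 99.0])                      # Q(s, a), Q(s, b)
p = np.array([0.5, 0.5])                         # pi_theta(a|s), pi_theta(b|s)

# Score vectors w.r.t. (theta(s,a), theta(s,b)) when a or b is chosen
scores = np.array([[1 - p[0], -p[1]],
                   [-p[0], 1 - p[1]]])

true_pg = (p[:, None] * Q[:, None] * scores).sum(axis=0)
print(true_pg)                                   # [ 0.25 -0.25]

samples = Q[:, None] * scores                    # the two possible single-sample estimates
print(samples)                                   # [[ 50.  -50. ], [-49.5  49.5]]
var = (p[:, None] * (samples - true_pg) ** 2).sum(axis=0)
print(var)                                       # huge compared to the gradient itself
```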
We can reduce the variance with the following techniques:
1. Baseline
2. Critic
3. Advantage function (baseline + critic)
### Baseline
The idea of a baseline is to subtract a baseline function $B(s)$ from $Q^{\pi_\theta}(s, a)$ in order to reduce the variance. Taking REINFORCE as an example:
$$
\mathbb{E}_{\tau \sim P^{\pi_\theta}_\mu} \Bigr[\sum_{t=0}^\infty \gamma^t \bigr(Q^{\pi_\theta}(s_t, a_t) - B(s_t)\bigr) \nabla_\theta \log \pi_\theta (a_t|s_t) \Bigr]
$$
Subtracting $B(s)$ does not change the expectation; this follows from showing that $\mathbb{E}_{\tau \sim P^{\pi_\theta}_\mu} [B(s_t) \nabla_\theta \log \pi_\theta(a_t|s_t)] = 0$:
$$
\begin{aligned}
\mathbb{E}_{\tau \sim P^{\pi_\theta}_\mu} [B(s_t) \nabla_\theta \log \pi_\theta(a_t|s_t)] &= \sum_s P(s_t = s) \sum_a \pi_\theta (a|s) \nabla_\theta \log \pi_\theta(a|s) B(s) \\
&= \sum_s P(s_t = s) B(s) \nabla_\theta \sum_a \pi_\theta (a|s) \ \ \ \small\text{//by chain rule} \\
&= \sum_s P(s_t = s) B(s) \nabla_\theta 1 \\
&= 0 \\
\end{aligned}
$$
:::info
**REINFORCE with baseline**
1. Initialize $\theta_0$ and step size $\eta$.
2. Sample a trajectory $\tau \sim P^{\pi_\theta}_\mu$ and make the update as
$$
\begin{aligned}
\theta_{k+1} &= \theta_k + \eta \cdot \hat{\nabla_\tau} \\
&= \theta_k + \eta\Bigr(\sum_{t=0}^\infty \gamma^t \bigr(G_t-B(s_t)\bigr) \nabla_\theta \log \pi_\theta (a_t|s_t)\Bigr)
\end{aligned}
$$
and repeat step 2 until termination.
:::
To reduce the variance, a common choice is $B(s) = V^{\pi_\theta} (s)$ (proof omitted).
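Continuing the two-action example above: with the baseline $B(s) = V^{\pi_\theta}(s) = 0.5 \cdot 100 + 0.5 \cdot 99 = 99.5$, both single-sample estimates collapse onto the true gradient, so the variance drops to zero while the expectation is unchanged (a minimal sketch):
```python
import numpy as np

Q = np.array([100.0, 99.0])
p = np.array([0.5, 0.5])
B = p @ Q                                        # V(s) = 99.5 as the baseline

scores = np.array([[0.5, -0.5],                  # score vector if action a is sampled
                   [-0.5, 0.5]])                 # score vector if action b is sampled

samples = (Q - B)[:, None] * scores              # single-sample estimates with baseline
print(samples)                                   # both rows equal [0.25, -0.25]
print((p[:, None] * samples).sum(axis=0))        # expectation unchanged: [0.25, -0.25]
```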
### Critic
The idea of this approach is to learn a **critic** that estimates the action-value function:
$$
Q_w(s,a) \approx Q^{\pi_\theta}(s,a)
$$
Methods of this kind are collectively called actor-critic algorithms:
* Critic: updates the parameters $w$ of the action-value function
* Actor: updates the policy parameters $\theta$, in the direction suggested by the critic
Taking the Q-value and discounted state visitation form as an example, the approximate policy gradient is
$$
\nabla_\theta V^{\pi_\theta}(\mu) \approx \frac{1}{1- \gamma} \mathbb{E}_{s \sim d^{\pi_\theta}_\mu} \mathbb{E}_{a \sim \pi_\theta(\cdot | s)} \Bigr[ Q_w (s, a) \nabla_\theta \log \pi_\theta (a|s) \Bigr]
$$
Based on this, an actor-critic algorithm with a Q-function critic can look as follows:
:::info
**Q-Value Actor-Critic**
1. Initialize $\theta, w$, step size $\eta, s_0$ and sample $a_0 \sim \pi_\theta$
2. For each step $t=0, 1, 2, ...$
Sample reward $r_{t+1}$, transition to $s_{t+1}$, and sample action $a_{t+1} \sim \pi_\theta(\cdot|s_{t+1})$
$\theta \leftarrow \theta + \eta Q_w(s_t, a_t) \nabla_\theta \log \pi_\theta(a_t|s_t)$
Update $w$ for $Q_w(s, a)$ (possibly using $r_{t+1}, s_{t+1}, a_{t+1}$)
:::
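A minimal tabular sketch of this loop on the same made-up two-state MDP used in the REINFORCE sketch above; the critic is a Q-table updated with a SARSA-style TD step, which is just one possible choice for the critic update (the box above leaves it open).
```python
import numpy as np

rng = np.random.default_rng(5)
n_states, n_actions = 2, 2
gamma, eta_actor, eta_critic = 0.9, 0.01, 0.1

theta = np.zeros((n_states, n_actions))          # actor: softmax policy parameters
Qw = np.zeros((n_states, n_actions))             # critic: tabular Q_w(s, a)

# Hypothetical MDP: transitions P[s, a, s'] and rewards r[s, a]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def pi(s):
    logits = theta[s] - theta[s].max()
    prob = np.exp(logits)
    return prob / prob.sum()

s = rng.integers(n_states)
a = rng.choice(n_actions, p=pi(s))
for t in range(20000):
    reward = r[s, a]
    s_next = rng.choice(n_states, p=P[s, a])
    a_next = rng.choice(n_actions, p=pi(s_next))

    # Actor: step in the direction suggested by the critic
    score = -pi(s)
    score[a] += 1.0                              # d log pi(a|s) / d theta_{s, .}
    theta[s] += eta_actor * Qw[s, a] * score

    # Critic: SARSA-style TD update of Q_w (one possible critic update)
    td_error = reward + gamma * Qw[s_next, a_next] - Qw[s, a]
    Qw[s, a] += eta_critic * td_error

    s, a = s_next, a_next
print(pi(0), pi(1), Qw)
```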
### Advantage Function
This approach combines the baseline and the critic.
:::warning
**Advantage function**
$$
A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s)
$$
:::
We can then rewrite (P3) as
$$
\nabla_\theta V^{\pi_\theta}(\mu) = \frac{1}{1- \gamma} \mathbb{E}_{s \sim d^{\pi_\theta}_\mu} \mathbb{E}_{a \sim \pi_\theta(\cdot | s)} \Bigr[ A^{\pi_\theta} (s, a) \nabla_\theta \log \pi_\theta (a|s) \Bigr]
$$
:::info
**Policy Gradient with Advantage** (P4)
$$
\nabla_\theta V^{\pi_\theta}(\mu) = \frac{1}{1- \gamma} \mathbb{E}_{s \sim d^{\pi_\theta}_\mu} \mathbb{E}_{a \sim \pi_\theta(\cdot | s)} \Bigr[ A^{\pi_\theta} (s, a) \nabla_\theta \log \pi_\theta (a|s) \Bigr]
$$
**REINFORCE with Advantage** (P5)
$$
\nabla_\theta V^{\pi_\theta}(\mu) = \mathbb{E}_{\tau \sim P^{\pi_\theta}_\mu} \Bigr[\sum_{t=0}^\infty \gamma^t A^{\pi_\theta}(s_t, a_t) \nabla_\theta \log \pi_\theta (a_t|s_t) \Bigr]
$$
:::
## Next
Next note:
https://hackmd.io/@tsen159/RLNote3
It covers Model-Free Prediction.