---
hackmd:
url: https://hackmd.io/6MLuha84QoWYXYh6A6tqRA
title: Policy-based Methods(Policy Gradient Family)
lastSync: 2025-12-12T10:07:07.312Z
---
# Policy-based Methods (Policy Gradient Family)
## Core Idea - Policy Gradient
Directly learn the policy $\pi_\theta(a|s)$ by maximizing the expected return.
### Objective Functions
We want to maximize the expected return of the policy:
- Episodic:
$$J(\theta) = V^{\pi_\theta}(s_0) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T} r_t \right]$$
- Average value (continuing):
$$J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s) V^{\pi_\theta}(s)$$
- Average reward:
$$J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a|s) R_s^a$$
All of these are value-like metrics for measuring how good a policy is.
### Policy Gradient (One-Step)
Consider a simple class of one-step MDPs:
- Start from state $s_0 \sim d(s)$
- Terminate after one time step with reward $r = R_{s,a}$

Use likelihood ratios to compute the policy gradient:
$$J_0(\theta) = V^{\pi_\theta}(s_0) = \mathbb{E}_{\pi_\theta}[r] = \sum_{a \in A} \pi_\theta(s_0, a) R_{s_0,a}$$
$$
\begin{aligned}
\nabla_\theta J_0(\theta) &= \sum_{a \in A} \nabla_\theta \pi_\theta(s_0, a) R_{s_0,a} \\
&= \sum_{a \in A} \pi_\theta(s_0, a) \nabla_\theta \log \pi_\theta(s_0, a) R_{s_0,a} \\
&= \mathbb{E}_{\pi_\theta} [\nabla_\theta \log \pi_\theta(s, a) \cdot r]
\end{aligned}
$$
Note: $\nabla_\theta \log \pi_\theta(s_0, a)$ is called the score function.
Now let $s_0 \sim d(s)$:
$$
\begin{aligned}
J(\theta) &= \mathbb{E}_{d(s), \pi_\theta}[r] \\
&= \sum_{s \in S} d(s) \sum_{a \in A} \pi_\theta(s, a) R_{s,a}
\end{aligned}$$
$$\nabla_\theta J(\theta) = \mathbb{E}_{d(s), \pi_\theta} [\nabla_\theta \log \pi_\theta(s, a) \cdot r]
$$
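As a quick numerical check of this identity, the sketch below compares the analytic one-step gradient with the likelihood-ratio (score function) estimate for a softmax policy over a three-armed bandit. The reward vector `R` and the softmax parameterization are illustrative assumptions, not part of the derivation above.
```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([1.0, 2.0, 0.5])        # assumed toy rewards R_{s0,a}
theta = rng.normal(size=3)           # softmax policy: pi(a) proportional to exp(theta_a)

p = np.exp(theta - theta.max()); p /= p.sum()

# Analytic gradient of J = sum_a pi(a) R_a (softmax Jacobian applied to R).
analytic = p * (R - p @ R)

# Likelihood-ratio estimate: E[ grad_theta log pi(a) * r ].
a = rng.choice(3, size=200_000, p=p)
score = np.eye(3)[a] - p             # grad_theta log pi(a) = e_a - p for a softmax policy
sampled = (score * R[a][:, None]).mean(axis=0)

print(analytic)                      # the two should agree up to Monte Carlo noise
print(sampled)
```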
### Policy Gradient Theorem
- Generalizes the likelihood-ratio approach to multi-step MDPs.
- Replaces the instantaneous reward $r$ with the long-term value $Q^\pi(s, a)$.
For any differentiable policy and any of the objective functions $J(\theta)$ above:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} [\nabla_\theta \log \pi_\theta(s, a) Q^{\pi_\theta}(s, a)]$$
Key property: the gradient depends only on the policy and $Q$, not on the environment transition dynamics.
:::info
Update by gradient ascent: $\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)$.
:::
## Method Evolution Line
### REINFORCE
- Uses the policy gradient theorem
- Updates the parameters by stochastic gradient ascent
- Uses the return $G_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t, a_t)$
$$\Delta \theta_t = \alpha \nabla_\theta \log \pi_\theta(s_t, a_t) \cdot G_t$$
If $G_t$ is large, $\Delta \theta_t$ moves further in the direction of the score function.
* Uses the MC return $G_t$
* High variance (see the sketch below)
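A minimal REINFORCE update sketch in PyTorch, assuming a small discrete-action policy network (`policy`), a 4-dimensional observation, and one pre-collected episode (`states`, `actions`, `rewards`); these names and sizes are illustrative, not from the source.
```python
import torch
from torch import nn
from torch.distributions import Categorical

gamma = 0.99
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # assumed sizes
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, rewards):
    """states: (T, 4) float tensor, actions: (T,) long tensor, rewards: list of T floats."""
    # Monte Carlo returns G_t = r_t + gamma * G_{t+1}, computed backwards.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))

    log_probs = Categorical(logits=policy(states)).log_prob(actions)
    loss = -(log_probs * returns).mean()   # ascend E[log pi * G_t] == descend its negative
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```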
### Baseline Trick
* Uses $A = G - V$ in place of $G$
* Reduces variance
### Actor-Critic
#### Critic - TD error
$$
L_Q(s_t, a_t|\omega) = \frac{1}{2} \left( \underbrace{r_{t+1} + \gamma Q(s_{t+1}, a'|\omega)}_{\text{Target } y_t} - Q(s_t, a_t|\omega) \right)^2
$$
where $\omega$ denotes the Critic parameters and $a' \sim \pi_\theta(\cdot|s_{t+1})$ is the next action sampled from the current policy.
- Gradient
$$
\nabla_\omega L_Q(s_t, a_t|\omega) = - \left( (r_{t+1} + \gamma Q(s_{t+1}, a'|\omega)) - Q(s_t, a_t|\omega) \right) \nabla_\omega Q(s_t, a_t|\omega)
$$
#### Actor
The Actor updates the policy $\pi_\theta$ in the direction suggested by the Critic to maximize expected return.
- Objective
The objective is to maximize the expected value:
$$
J(\theta) = \mathbb{E}_{s,a}^{\pi_\theta}[Q(s, a|\omega)]
$$
- Gradient (Approximate Policy Gradient)
$$
\nabla_\theta J(\theta) = \mathbb{E}_{s,a}^{\pi_\theta} \left[ {\nabla_\theta \log \pi(s_t, a_t|\theta)} \cdot Q(s_t, a_t|\omega) \right]
$$
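A sketch of one Q actor-critic step implementing the two gradients above; the network names (`actor`, `critic`), the discrete action space, and the single-transition update are assumptions for illustration.
```python
import torch
from torch import nn
from torch.distributions import Categorical

gamma = 0.99
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # pi_theta(a|s), assumed sizes
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # Q_omega(s, .) for each action
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def qac_step(s, a, r, s_next):
    """s, s_next: (4,) float tensors; a: scalar long tensor; r: float."""
    # Critic: TD target y = r + gamma * Q(s', a'), with a' sampled from the current policy.
    with torch.no_grad():
        a_next = Categorical(logits=actor(s_next)).sample()
        y = r + gamma * critic(s_next)[a_next]
    critic_loss = 0.5 * (y - critic(s)[a]) ** 2
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend grad log pi(a|s) * Q(s, a), treating Q as a constant.
    log_pi = Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -log_pi * critic(s)[a].detach()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```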
### Advantage-based Methods
To reduce variance without changing the expectation, we can subtract a baseline function $B(s)$ from the policy gradient. A good baseline is the state value function $B(s) = V^{\pi_\theta}(s)$.
Subtracting a baseline works because the term integrates to zero:
$$\mathbb{E}_{\pi_\theta} [\nabla_\theta \log \pi_\theta(s, a) B(s)] = \sum_{s \in S} d^{\pi_\theta}(s) B(s) \nabla_\theta \underbrace{\left( \sum_{a \in A} \pi_\theta(s, a) \right)}_{=1 \text{ (constant)}} = 0$$
We can therefore rewrite the policy gradient using the Advantage Function $A^{\pi_\theta}(s, a)$:
$$A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$$
The updated gradient becomes:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} [\nabla_\theta \log \pi_\theta(s, a) \, A^{\pi_\theta}(s, a)]$$
* Uses the true advantage $A = Q - V$
* Further reduces variance
### Advantage AC
In a practical Actor-Critic algorithm (like A2C or A3C), we usually do not learn $Q$ and $V$ as two separate neural networks, because that is inefficient. Instead, we use the Temporal Difference (TD) Error to estimate the Advantage using only the Value function $V(s)$.
Recall the Bellman equation: $Q(s, a) \approx r + \gamma V(s')$. Substituting this into the advantage definition above gives the TD error ($\delta$):
$$
\delta = \underbrace{r + \gamma V_\phi(s')}_{\text{estimate of } Q(s, a)} - V_\phi(s)
$$
- Critic Update (minimize MSE):
$$
\mathcal{L}_{critic} = \left( (r + \gamma V_\phi(s')) - V_\phi(s) \right)^2
$$
$$
\mathcal{L}_{critic} = \left( \delta \right)^2
$$
- Actor Update (maximize Advantage):
$$
\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(s, a) \underbrace{(r + \gamma V_\phi(s') - V_\phi(s))}_{\text{Estimate of } A(s,a)}
$$
$$
\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(s, a) \cdot \delta
$$
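A one-step advantage actor-critic update sketch using only a state-value network, mirroring the $\delta$-based equations above (the names `actor`, `value_net` and the single-transition setting are assumed).
```python
import torch
from torch import nn
from torch.distributions import Categorical

gamma = 0.99
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))      # pi_theta, assumed sizes
value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))  # V_phi(s)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def a2c_step(s, a, r, s_next, done):
    # delta = r + gamma * V(s') - V(s): critic regression error and advantage estimate.
    with torch.no_grad():
        target = r + gamma * (0.0 if done else value_net(s_next).squeeze())
    delta = target - value_net(s).squeeze()

    critic_loss = delta.pow(2)                     # (r + gamma V(s') - V(s))^2
    value_opt.zero_grad(); critic_loss.backward(); value_opt.step()

    log_pi = Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -log_pi * delta.detach()          # ascend log pi * delta
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```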
:::info
This relationship follows directly from the recursive definition of the return $G_t$:
Expand the definition of the return:
$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
$$
Recursive form:
Factor out the first term $R_{t+1}$ (i.e. $r$) and pull $\gamma$ out of the remainder:
$$
G_t = R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots) = R_{t+1} + \gamma G_{t+1}
$$
Take expectations:
$Q(s, a)$ is the expectation of $G_t$ after taking $a$ in $s$.
$V(s')$ is the expectation of $G_{t+1}$ starting from $s'$.
Substituting these into the equation:
$$
\underbrace{\mathbb{E}[G_t | s, a]}_{Q(s, a)} = \mathbb{E}[R_{t+1} | s, a] + \gamma \underbrace{\mathbb{E}[G_{t+1} | s']}_{V(s')}
$$
which simplifies to:
$$
Q(s, a) = \mathbb{E}_{s' \sim P} [r + \gamma V(s')]
$$
:::
### GAE
A technique for handling the bias-variance trade-off.
$$
A^{GAE}_t = \sum_{l=0}^\infty (\gamma\lambda)^l \cdot \delta_{t+l}
$$
$$
A^{GAE}_t = (r_t + \gamma V(s_{t+1}) - V(s_t)) + (\gamma\lambda)(r_{t+1} + \gamma V(s_{t+2}) - V(s_{t+1})) + \dots
$$
- Key Interpretations
This expansion highlights how $\lambda$ controls the trade-off between bias and variance:
If $\lambda = 0$: The higher-order terms vanish. The expansion collapses to just the first term:
$$
A^{GAE}_t = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
$$
(This is high bias, low variance).
\
If $\lambda = 1$:
The decay depends only on $\gamma$. The expansion becomes the sum of all future TD errors, which mathematically telescopes to the actual Monte Carlo return minus the value function:
$$
A^{GAE}_t = \sum_{l=0}^\infty \gamma^l \delta_{t+l} = \left( \sum_{l=0}^\infty \gamma^l r_{t+l} \right) - V(s_t)
$$
(This is low bias, high variance).
> Rule of thumb: the parameter $\lambda$ controls the trade-off. $\lambda=1$ behaves like MC (high variance); $\lambda=0$ behaves like TD(0) (high bias).
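A common backward-recursion implementation of the estimator above; it assumes a single trajectory with no termination before the final step and that `values` includes the bootstrap value $V(s_T)$.
```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """rewards: [r_0, ..., r_{T-1}];  values: [V(s_0), ..., V(s_T)] (bootstrap included)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# lam = 0 reduces each A_t to the one-step TD error delta_t;
# lam = 1 telescopes to the discounted MC return minus V(s_t).
print(gae(rewards=[1.0, 1.0, 1.0], values=[0.5, 0.4, 0.3, 0.0]))
```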
:::warning
### Summary of Policy Gradient Algorithms
The policy gradient has many equivalent forms:
$$
\begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{\pi_\theta} [\nabla_\theta \log \pi_\theta(s, a) \, G_t] & \text{(REINFORCE)} \\
&= \mathbb{E}_{\pi_\theta} [\nabla_\theta \log \pi_\theta(s, a) \, Q^w(s, a)] & \text{(TD Actor-Critic)} \\
&= \mathbb{E}_{\pi_\theta} [\nabla_\theta \log \pi_\theta(s, a) \, \delta] & \text{(Advantage Actor-Critic)} \\
&= \mathbb{E}_{\pi_\theta} [\nabla_\theta \log \pi_\theta(s, a) \, A^{(k)}] & \text{((k-step) Advantage Actor-Critic)} \\
&= \mathbb{E}_{\pi_\theta} [\nabla_\theta \log \pi_\theta(s, a) \, A^{GAE}] & \text{(GAE($\lambda$) Advantage Actor-Critic)}
\end{aligned}
$$
Each leads to a stochastic gradient ascent algorithm.
:::
### TRPO / PPO
==A policy optimization algorithm==
- TRPO: adds a hard KL-divergence constraint to each update.
- PPO: a simplified version of TRPO that limits the update size directly in the objective via clipping.
* PPO objective:
$$
L^{CLIP}(\theta) = \mathbb{E} \left[ \min(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)A_t) \right]
$$
:::warning
The core idea of PPO:
Do not let the new and old policies diverge too much (i.e. $r_t(\theta)$ should not move far from 1), in order to keep training stable and approximately monotonically improving.
:::
### TRPO
Starting point:
$$
\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \dots \sim \tilde{\pi}} \left[ \sum_{t=0}^{\infty} \gamma^t A_{\pi}(s_t, a_t) \right]
$$
This implies that the return of the new policy can be expressed in terms of the advantage of the old policy.
#### State Visitation Density
We define the ==un-normalized discounted visitation frequencies== $\rho_{\tilde{\pi}}(s)$ as:
$$
\rho_{\tilde{\pi}}(s) := \sum_{t=0}^{\infty} \gamma^t P(s_t = s | \tilde{\pi})
$$
Using this, the return of the new policy $\tilde{\pi}$ can be rewritten as:
$$
\eta(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a|s) A_{\pi}(s, a)
$$
##### Monotonic Improvement Guarantee
This reformulation implies:
Any policy update $\pi \rightarrow \tilde{\pi}$ that has a non-negative expected advantage at every state $s$ is guaranteed to increase the policy performance $\eta$.
Specifically, if $\sum_a \tilde{\pi}(a|s) A_{\pi}(s, a) \ge 0$ for all states, then $\eta(\tilde{\pi}) \ge \eta(\pi)$.
#### Local Approximation & First Order Match
Directly optimizing $\eta$ is difficult due to the complex dependency of $\rho_{\tilde{\pi}}(s)$ on $\tilde{\pi}$. TRPO introduces a local approximation $L_{\pi}(\tilde{\pi})$:
$$
L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a|s) A_{\pi}(s, a)
$$
If the policy $\pi_{\theta}$ is differentiable w.r.t $\theta$, then $L_{\pi}$ matches $\eta$ to the first order:
- $L_{\pi_{\theta_{old}}}(\pi_{\theta_{old}}) = \eta(\pi_{\theta_{old}})$
- $\nabla_{\theta} L_{\pi_{\theta_{old}}}(\pi_{\theta}) |_{\theta=\theta_{old}} = \nabla_{\theta} \eta(\pi_{\theta}) |_{\theta=\theta_{old}}$
Implication: a sufficiently small step that improves $L_{\pi_{old}}$ will also improve $\eta$.
#### Theoretical Bound & MM Algorithm
The main theoretical result provides a lower bound on the improvement:
$$
\eta(\tilde{\pi}) \ge L_{\pi_{old}}(\tilde{\pi}) - C \cdot D_{KL}^{max}(\pi_{old}, \tilde{\pi})
$$
where $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$ is a penalty coefficient and $\epsilon = \max_{s,a} |A_{\pi}(s,a)|$ is the maximum advantage magnitude.
This leads to a **Monotonic Improvement** guarantee (MM Algorithm):
1. Define a lower bound function $M_i(\pi) = L_{\pi_i}(\pi) - C \cdot D_{KL}^{max}(\pi_i, \pi)$.
2. Maximizing $M_i(\pi)$ guarantees $\eta(\pi_{i+1}) \ge \eta(\pi_i)$.

#### From Penalty to Trust Region Constraint
In practice, the theoretical penalty coefficient $C$ is too large, resulting in very small step sizes.
To take larger steps robustly, TRPO replaces the penalty with a hard constraint on the KL divergence, forming a **Trust Region**:
$$
\max_{\theta} \; L_{\theta_{old}}(\theta) \quad \text{subject to} \quad D_{KL}^{max}(\theta_{old}, \theta) \le \delta
$$
#### Practical Implementation (Approximation)
To make this computationally feasible, we transform the objective:
1. **Expectation:** Replace sum over states $\sum_s \rho_{\theta_{old}}(s)[\dots]$ with expectation $\frac{1}{1-\gamma} \mathbb{E}_{s \sim \rho_{\theta_{old}}}[\dots]$.
2. **Importance Sampling:** Replace sum over actions with an importance sampling estimator:
$$\sum_a \pi_{\theta}(a|s_n) A_{\theta_{old}}(s_n, a) = \mathbb{E}_{a \sim \pi_{\theta_{old}}(a|s_n)} \left[ \frac{\pi_{\theta}(a|s_n)}{\pi_{\theta_{old}}(a|s_n)} A_{\theta_{old}}(s_n, a) \right]$$
#### Final Surrogate Objective
The final problem solved in practice utilizes a **Surrogate Objective** $L^{CPI}$ (Conservative Policy Iteration):
$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t \left[ \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t \right]$$
Subject to a constraint on the average KL divergence: $\mathbb{E}[D_{KL}] \le \delta$.
### PPO
Problems of TRPO:
- Still relatively complicated
- Not compatible with architectures that include noise (such as dropout) or parameter sharing
PPO simplifies TRPO by removing the complex KL constraint and instead penalizing policy changes that move the probability ratio too far from 1.

#### Probability Ratio
Let $r_t(\theta)$ denote the probability ratio (not reward):
$$
r_t(\theta) = \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}
$$
Note: $r_t(\theta_{old}) = 1$.
#### PPO Clip Objective
To penalize large policy updates, PPO uses the following objective:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t) \right]$$
- $\epsilon$ is a hyper-parameter (e.g., 0.2).
- **Term 1:** $r_t(\theta)\hat{A}_t$ is the original $L^{CPI}$.
- **Term 2:** The "clipped" version prevents the ratio from growing too large (if advantage is positive) or too small (if advantage is negative).
##### Gradient Behavior
- If the ratio is clipped, $\nabla_{\theta} L^{CLIP}$ becomes 0, effectively dropping the gradient for that sample to prevent destructive large updates.
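The clipped objective and its "gradient drops to zero once clipped" behavior can be written directly in PyTorch; the function below is a sketch in which `log_probs_old` and `advantages` are assumed to be precomputed, gradient-free tensors.
```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """L^CLIP over a batch of sampled (s_t, a_t); returns a loss to minimize."""
    ratio = torch.exp(log_probs_new - log_probs_old)            # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # The min makes a sample contribute zero gradient once its ratio is clipped
    # in the direction that would otherwise keep improving the surrogate.
    return -torch.min(unclipped, clipped).mean()
```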
### Total Objective (PPO Architecture)
PPO often uses a single network with a shared embedding layer for both the policy and the value function.
**Total Objective to Maximize:**
$$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t [ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_{\theta}](s_t) ]$$
1. **Policy Loss:** $L_t^{CLIP}(\theta)$.
2. **Value Loss:** $L_t^{VF}(\theta) = (V_{\theta}(s_t) - V_t^{target})^2$.
- Estimates the value of the current state via TD-like learning.
3. **Entropy Bonus:** $S[\pi_{\theta}](s_t)$.
- Augments the objective to ensure sufficient exploration.
- A large entropy value means the action distribution is close to uniform.
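Putting the three terms together, a sketch of the total loss to minimize; the coefficients `c1=0.5`, `c2=0.01` are common but assumed values, and `entropy` is the per-sample policy entropy.
```python
import torch

def ppo_total_loss(log_probs_new, log_probs_old, advantages,
                   values_pred, values_target, entropy,
                   c1=0.5, c2=0.01, eps=0.2):
    """Minimize the negative of L^CLIP - c1 * L^VF + c2 * S."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    l_clip = torch.min(ratio * advantages,
                       torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()
    l_vf = (values_pred - values_target).pow(2).mean()          # value regression loss
    return -(l_clip - c1 * l_vf + c2 * entropy.mean())
```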
:::warning
#### The entropy term here means something different than in SAC
Here it is only a nudge: a regularization term that encourages exploration but is not the ultimate goal. In SAC, entropy is part of the reward itself, which forces the agent to learn to act while explicitly accounting for exploration.
:::
### Deterministic PG → DDPG → TD3
#### DDPG
The policy no longer outputs a probability distribution; it directly outputs an action value $a = \mu_\theta(s)$.
Its update rule and model architecture are the same as Actor-Critic:
1. **Critic gradient**: how the Q-value changes with the action $a$.
2. **Actor gradient**: how the action $a$ changes with the parameters $\theta$; applying the chain rule gives
$$
\begin{aligned} \nabla_\theta J(\mu_\theta) &\approx \mathbb{E}_{\mu} [\nabla_\theta Q(s_t, \mu(s_t|\theta)|\omega)] \\ &= \mathbb{E}_{\mu} [\nabla_a Q(s_t, a|\omega)|_{a=\mu(s_t|\theta)} \nabla_\theta \mu(s_t|\theta)] \end{aligned}
$$
This means the Actor is updated in the direction that maximizes the Q-value estimated by the Critic (see the sketch below).
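A sketch of the DDPG actor step: autograd applies the chain rule $\nabla_a Q \cdot \nabla_\theta \mu$ automatically when we backpropagate through the critic. The network shapes and names are assumptions.
```python
import torch
from torch import nn

state_dim, action_dim = 8, 2                                     # assumed dimensions
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())       # a = mu_theta(s)
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))                          # Q_omega(s, a)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

def ddpg_actor_step(states):
    """states: (B, state_dim) batch sampled from the replay buffer."""
    actions = actor(states)                                       # differentiable w.r.t. theta
    q = critic(torch.cat([states, actions], dim=-1))
    actor_loss = -q.mean()                                        # maximize Q  ->  minimize -Q
    actor_opt.zero_grad()
    actor_loss.backward()    # chain rule: dQ/da flows back into the actor parameters
    actor_opt.step()         # only the actor optimizer steps; the critic is updated separately
```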
Two key features:
- **Experience Replay**: store transitions $(s, a, r, s')$ in a replay buffer to break temporal correlations in the data.
- **Soft Target Update**: the target network parameters $\theta'$ are not copied directly but updated slowly ($\tau \ll 1$):
$$
\theta' \leftarrow \tau\theta + (1-\tau)\theta'
$$
Even so, DDPG still suffers heavily from **overestimation** of values, which is the main motivation for TD3.
#### TD3 (Twin Delayed DDPG)
An improved version of DDPG that introduces three tricks to address overestimation (see the sketch after this list):
- Clipped Double Q-learning: take the minimum of two Q-networks
    - Trades conservative underestimation for avoiding overestimation.
- Target Policy Smoothing: add noise to the target action
    - Nearby continuous actions should produce similar values.
    - The Critic's estimate over actions may not be smooth, so the Actor can exploit erroneous sharp peaks to obtain inflated scores.
$$
\tilde{a} \leftarrow \mu_{\phi'}(s') + \epsilon, \quad \epsilon \sim \text{clip}(\mathcal{N}(0, \sigma), -c, c)
$$
- Delayed Policy Update: update the Actor only once per several Critic updates
    - Update the Actor only after the Critic has stabilized.
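A sketch of the TD3 critic target that combines the first two tricks (target policy smoothing and clipped double Q); the target networks `actor_target`, `q1_target`, `q2_target` and the action range $[-1, 1]$ are assumptions.
```python
import torch

def td3_target(r, s_next, done, actor_target, q1_target, q2_target,
               gamma=0.99, sigma=0.2, c=0.5):
    """Returns y = r + gamma * (1 - done) * min(Q1', Q2')(s', a~) with a smoothed target action."""
    with torch.no_grad():
        mu = actor_target(s_next)
        # Target policy smoothing: clipped Gaussian noise on the target action.
        noise = (torch.randn_like(mu) * sigma).clamp(-c, c)
        a_next = (mu + noise).clamp(-1.0, 1.0)
        sa = torch.cat([s_next, a_next], dim=-1)
        # Clipped double Q-learning: bootstrap from the smaller of the two target critics.
        q_next = torch.min(q1_target(sa), q2_target(sa))
        return r + gamma * (1.0 - done) * q_next
```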
### Entropy-regularized PG → SAC
The biggest difference from DDPG/TD3 is that SAC uses a **stochastic policy** and adopts the **Maximum Entropy Reinforcement Learning** framework.
Mechanism: add an entropy term $H(\pi)$ to the objective to encourage exploration.
Objective:
$$J(\pi) = \mathbb{E}[Q(s,a) + \alpha H(\pi(\cdot|s))]$$
Policy update (quantity to minimize):
$$\alpha \log \pi(a|s) - Q(s,a)$$
Advantage: prevents the policy from collapsing prematurely into a local optimum, giving strong robustness.
#### SAC (Soft Actor-Critic)
##### Value Network ($V_{\psi}$)
Estimates the "Soft Value" of a state.
$$J_V(\psi) = \mathbb{E}_{s_t \sim D} \left[ \frac{1}{2} (V_{\psi}(s_t) - \hat{V}_{\psi}(s_t))^2 \right]$$
Where the target $\hat{V}_{\psi}$ is the expected soft Q-value minus the scaled log-probability (i.e. plus the policy entropy):
$$\hat{V}_{\psi}(s_t) = \mathbb{E}_{a_t \sim \pi_{\phi}} [Q_{\theta}(s_t, a_t) - \alpha \log \pi_{\phi}(a_t|s_t)]$$
- Trained by minimizing squared residual error (TD error).
##### Q-Value Network ($Q_{\theta}$)
Estimates the Soft Q-Value.
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim D} \left[ \frac{1}{2} (Q_{\theta}(s_t, a_t) - \hat{Q}_{\theta}(s_t, a_t))^2 \right]$$
Where the target $\hat{Q}_{\theta}$ includes the next state value:
$$\hat{Q}_{\theta}(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p} [V_{\psi}(s_{t+1})]$$
- Trained by minimizing soft Bellman residual error.
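A sketch of how the two targets above can be computed; `policy.sample` returning an action and its log-probability is an assumed interface (see the policy sketch further below), and `value_target_net` would normally be a slowly updated copy of $V_\psi$.
```python
import torch

def soft_value_target(q_net, policy, s, alpha=0.2):
    """One-sample estimate of V_hat(s) = E_{a~pi}[ Q_theta(s, a) - alpha * log pi(a|s) ]."""
    with torch.no_grad():
        a, log_pi = policy.sample(s)                 # assumed interface: (action, log pi(a|s))
        return q_net(torch.cat([s, a], dim=-1)) - alpha * log_pi

def soft_q_target(r, s_next, done, value_target_net, gamma=0.99):
    """Q_hat(s, a) = r + gamma * V_psi(s') for non-terminal transitions."""
    with torch.no_grad():
        return r + gamma * (1.0 - done) * value_target_net(s_next)
```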
##### Policy Network ($\pi_{\phi}$)
The policy (actor) is trained to maximize the expected Soft Q-value.
**Objective (Single-Step Actor Update):** We want to maximize $\mathbb{E}_{a \sim \pi}[Q_{\theta}(s, a) + \alpha \mathcal{H}(\pi(\cdot|s))]$. Expanding the entropy, this is equivalent to maximizing:
$$\mathbb{E}_{a \sim \pi}[Q_{\phi}(s, a) - \alpha \log \pi(a|s)]$$
**Loss Function (Minimizing KL Divergence):**
$$J_{\pi}(\phi) = \mathbb{E}_{s_t \sim D, \epsilon_t \sim N} [\alpha \log \pi_{\phi}(f_{\phi}(\epsilon_t; s_t)|s_t) - Q_{\theta}(s_t, f_{\phi}(\epsilon_t; s_t))]$$
- **Reparameterization Trick:** To make sampling differentiable, we draw a noise vector $\epsilon_t$ from a fixed distribution (e.g. a spherical Gaussian) and transform it into an action:
$$
a_t = f_{\phi}(\epsilon_t; s_t)
$$
Where $\epsilon_t$ is a noise vector.
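A sketch of a reparameterized tanh-Gaussian policy of this form; the layer sizes, the `log_std` clamping range, and the tanh squashing with its change-of-variables correction follow common SAC implementations but are assumptions here.
```python
import torch
from torch import nn
from torch.distributions import Normal

class TanhGaussianPolicy(nn.Module):
    """a_t = f_phi(eps_t; s_t) = tanh(mu_phi(s) + sigma_phi(s) * eps),  eps ~ N(0, I)."""
    def __init__(self, state_dim=8, action_dim=2, hidden=64):    # assumed sizes
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def sample(self, s):
        h = self.body(s)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-20, 2)
        dist = Normal(mu, log_std.exp())
        u = dist.rsample()                   # reparameterized sample: gradients flow into phi
        a = torch.tanh(u)                    # squash into a bounded action range
        # log pi(a|s) with the tanh change-of-variables correction.
        log_pi = (dist.log_prob(u) - torch.log(1 - a.pow(2) + 1e-6)).sum(-1, keepdim=True)
        return a, log_pi
```
The actor loss $J_{\pi}(\phi)$ is then the batch mean of $\alpha \log \pi_{\phi}(a|s) - Q_{\theta}(s, a)$ computed from `sample`, with gradients flowing through `rsample()` into $\phi$.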
