## On-policy vs. Off-policy
Policies $\pi$ in RL algorithms come in two types (a short code sketch of both follows this list):
- Deterministic policy $\pi(s)$: a function that maps the state space to the action space, $S \to A$
- Stochastic policy $\pi(A_t \mid S_t)$: a distribution over actions given the state $S_t$ (e.g., a Gaussian), so the sampled action is itself random
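The difference can be made concrete with a minimal NumPy sketch. The state/action dimensions, weight shapes, and function names below are made up purely for illustration: the deterministic policy always returns the same action for a given state, while the stochastic one samples a different action on each call.

```python
import numpy as np

# Hypothetical dimensions, chosen only for illustration.
STATE_DIM, ACTION_DIM = 4, 2

def deterministic_policy(state, weights):
    """pi(s): maps a state directly to a single action, S -> A."""
    return weights @ state  # one fixed action for this state

def stochastic_policy(state, mean_weights, log_std):
    """pi(a|s): samples an action from a Gaussian conditioned on the state."""
    mean = mean_weights @ state
    std = np.exp(log_std)
    return np.random.normal(mean, std)  # different call, different action

state = np.random.randn(STATE_DIM)
w = np.random.randn(ACTION_DIM, STATE_DIM)
log_std = np.zeros(ACTION_DIM)

print(deterministic_policy(state, w))        # always the same for this state
print(stochastic_policy(state, w, log_std))  # varies between calls
```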
#### Off-policy: the learning is from data off the target policy
Intuitively, off-policy means the training data and the current decision-making are out of sync: the transitions used for updates were not necessarily produced by the policy being learned.
Take Q-Learning as an example.
The $S'$, $A'$ used to update $Q$ are clearly not the state and action that the current policy generates at $S_t$; in particular, the update target uses $\max_a Q(S', a)$, a greedy action that the $\varepsilon$-greedy behavior policy may never actually take.
This makes Q-Learning a textbook off-policy algorithm.
---
### Pseudocode sketch:
:::info
Q-learning (off-policy TD control) for estimating $\pi \approx \pi_*$
Algorithm parameters: step size $\alpha \in (0, 1]$, small $\varepsilon > 0$
Initialize $Q(s,a)$ for all $s \in \mathcal{S}^+$, $a \in \mathcal{A}(s)$, arbitrarily except that $Q(\text{terminal}, \cdot) = 0$
Loop for each episode:
Initialize $S$
Loop for each step of episode:
Choose $A$ from $S$ using policy derived from $Q$ (e.g., $\varepsilon$-greedy)
Take action $A$, observe $R$, $S'$
$Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_a Q(S', a) - Q(S, A) \right]$
$S \leftarrow S'$
until $S$ is terminal
:::
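The pseudocode translates almost line-for-line into runnable code. Below is a minimal tabular sketch; the chain environment, its size, and the hyperparameter values are my own choices for illustration, not part of the algorithm statement. Note that actions are chosen $\varepsilon$-greedily, while the update target takes $\max_a Q(S', a)$: exactly the off-policy gap discussed above.

```python
import numpy as np

# A tiny hand-rolled chain environment, used only so the snippet is
# self-contained; swap in any environment you like.
N_STATES, N_ACTIONS = 6, 2      # states 0..5; actions: 0 = left, 1 = right
TERMINAL = N_STATES - 1         # reaching the right end terminates the episode

def step(s, a):
    s_next = min(max(s + (1 if a == 1 else -1), 0), TERMINAL)
    reward = 1.0 if s_next == TERMINAL else 0.0
    return s_next, reward, s_next == TERMINAL

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = np.zeros((N_STATES, N_ACTIONS))   # Q(terminal, .) stays 0, never updated

for episode in range(500):
    s, done = 0, False
    while not done:
        # Behavior policy: epsilon-greedy with respect to the current Q.
        if np.random.rand() < epsilon:
            a = np.random.randint(N_ACTIONS)
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Target policy: greedy (max over a'), which is generally NOT the
        # action the behavior policy will take next -- this gap is what
        # makes Q-learning off-policy.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q[:-1], axis=1))  # learned greedy actions for non-terminal states
```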
---
#### PPO, by contrast, is on-policy: the target policy and the behavior policy are the same
For details of the PPO algorithm, see [[RL] Proximal Policy Optimization(PPO)](/HZ_LnjtgRoSskA7oLWBBUg).
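For reference, the on-policy nature shows up directly in PPO's clipped surrogate objective. The sketch below uses illustrative function and argument names (not taken from the linked note): `log_probs_old` must come from the same freshly collected rollouts as the advantages, so the data cannot be replayed from an old buffer the way Q-learning's transitions can.

```python
import numpy as np

def ppo_clip_objective(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Minimal sketch of PPO's clipped surrogate loss (negated for minimization).

    log_probs_old must be produced by the policy that just collected the
    trajectories -- the behavior and target policies coincide (on-policy).
    """
    ratio = np.exp(log_probs_new - log_probs_old)  # pi_theta(a|s) / pi_theta_old(a|s)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))
```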