$\nabla_{\theta}\mathbb{E}_{\hat{s} \sim \pi_{\theta}}[R(\hat{s})]$

$=\mathbb{E}_{\hat{s} \sim \pi_{\theta}}[R(\hat{s})\nabla_{\theta}\log \pi_{\theta}(\hat{s})]$

$=\mathbb{E}_{\hat{s} \sim \pi_{\theta}}[\hat{A}_{\theta}\nabla_{\theta}\log \pi_{\theta}(\hat{s})]$

$=\mathbb{E}_{\hat{s} \sim \pi_{\theta_{old}}}[\frac{\pi_\theta(a_t, s_t)}{\pi_{\theta_{old}}(a_t, s_t)}\hat{A}_{\theta_{old}}\nabla_{\theta}\log \pi_{\theta}(\hat{s})]$ (importance sampling: the samples now come from $\pi_{\theta_{old}}$ and are reweighted by the probability ratio, so the advantage is the one estimated under $\theta_{old}$)

$=\mathbb{E}_{\hat{s} \sim \pi_{\theta_{old}}}[\frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)} \frac{\pi_\theta(s_t)}{\pi_{\theta_{old}}(s_t)}\hat{A}_{\theta_{old}}\nabla_{\theta}\log \pi_{\theta}(\hat{s})]$

$\approx \mathbb{E}_{\hat{s} \sim \pi_{\theta_{old}}}[\frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)} \hat{A}_{\theta_{old}}\nabla_{\theta}\log \pi_{\theta}(\hat{s})]$ (assuming the state distributions under $\theta$ and $\theta_{old}$ are close, so their ratio is about 1)

New (surrogate) objective function:

$\mathbb{E}_{\hat{s} \sim \pi_{\theta}}[R(\hat{s})] \approx \mathbb{E}_{\hat{s} \sim \pi_{\theta_{old}}}[\frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)} \hat{A}_{\theta_{old}}]$

Since $\theta$ and $\theta_{old}$ should not differ too much, add a KL penalty:

$\mathbb{E}_{\hat{s} \sim \pi_{\theta_{old}}}[\frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)} \hat{A}_{\theta_{old}}] - \beta KL(\theta, \theta_{old})$
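
The penalized objective above can be estimated from a batch of samples. The following is a minimal sketch, not a reference implementation. It assumes the batch was collected under $\pi_{\theta_{old}}$, that per-sample log-probabilities and advantage estimates are already available, and that the KL term is approximated by the sample mean of $\log \pi_{\theta_{old}} - \log \pi_{\theta}$. The function name `ppo_penalty_objective` and its parameters are hypothetical.

```python
import math


def ppo_penalty_objective(logp_new, logp_old, advantages, beta):
    """Sample estimate of E[ratio * A_old] - beta * KL (hypothetical helper).

    logp_new:   log pi_theta(a_t | s_t) for each sample
    logp_old:   log pi_theta_old(a_t | s_t) for each sample
    advantages: advantage estimates computed under theta_old
    beta:       weight of the KL penalty
    """
    n = len(advantages)
    # Importance ratio pi_theta / pi_theta_old, computed in log space.
    ratios = [math.exp(ln - lo) for ln, lo in zip(logp_new, logp_old)]
    surrogate = sum(r * a for r, a in zip(ratios, advantages)) / n
    # Assumption: KL is approximated by the mean log-ratio over actions
    # sampled from the old policy (a noisy, sample-based estimate).
    kl_estimate = sum(lo - ln for ln, lo in zip(logp_new, logp_old)) / n
    return surrogate - beta * kl_estimate


# With theta == theta_old every ratio is 1 and the KL estimate is 0,
# so the objective reduces to the mean advantage.
print(ppo_penalty_objective([-1.0, -2.0], [-1.0, -2.0], [1.0, 3.0], beta=0.5))  # 2.0
```

Working with log-probabilities keeps the ratio numerically stable when the individual probabilities are small. In practice $\beta$ would be tuned or adapted, which this sketch does not attempt.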