---
title: Policy Gradient
tags: cs 593 rl
---
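Policy gradient: $\pi_{t+1} = \mathrm{Proj}(\pi_t + \alpha \nabla_\pi \eta(\pi)|_{\pi_t})$, where $\eta$ is the objective function, $\alpha$ is the step size, and $\mathrm{Proj}$ projects back onto the set of valid policies.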
### Undiscounted episodic MDP
$\eta(\pi) = E_\pi[\sum_{h=1}^H r_h]$
$\psi(\tau) = \sum_{h=1}^H r_h$, where $\tau = (X_1, A_1, \dots, X_H, A_H)$
$Pr(\tau, \pi) = P_1(X_1)\pi(A_1, X_1) \cdots P(X_H, A_{H-1}, X_{H-1})\pi(A_H, X_H)$
$\nabla_\pi \log Pr(\tau, \pi) = \nabla_\pi \sum_h \log \pi(A_h, X_h)$, since $P_1$ and $P$ do not depend on $\pi$
$\nabla_\pi \eta(\pi) = \int_\tau \nabla_\pi Pr(\tau, \pi)\psi(\tau)d\tau$
$= \int_\tau \frac{Pr(\tau, \pi)}{Pr(\tau, \pi)}\nabla_\pi Pr(\tau, \pi)\psi(\tau)d\tau$
$= \int_\tau Pr(\tau, \pi) \nabla_\pi \log(Pr(\tau, \pi)) \psi(\tau)d\tau$
$= E[\nabla_\pi \log(Pr(\tau, \pi)) \psi(\tau)]$
$= E[(\nabla_\pi \sum_h \log \pi(A_h, X_h)) \psi(\tau)]$
Given $T$ trajectories $\{\tau^t\}_{t=1}^T$ sampled under $\pi$, the Monte Carlo estimate is
$\nabla_{\pi} \hat \eta(\pi) = \frac{1}{T} \sum_{t=1}^T (\nabla_\pi \sum_h \log \pi(A_h^t, X_h^t)) \cdot \psi(\tau^t)$
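
The estimator above can be sketched in a few lines. This is a minimal illustration, not the course's reference implementation: it assumes a tabular softmax policy with logits `theta[x][a]` (the score $\nabla \log \pi$ then has a closed form), and the function names and the trajectory layout `(state, action, reward)` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score_function_gradient(trajectories, theta):
    """Monte Carlo estimate of grad eta for a tabular softmax policy.

    trajectories: list of episodes, each a list of (state, action, reward).
    theta: theta[x][a] are the logits of pi(a | x) (assumed layout).
    """
    n_states, n_actions = len(theta), len(theta[0])
    grad = [[0.0] * n_actions for _ in range(n_states)]
    for episode in trajectories:
        psi = sum(r for _, _, r in episode)  # total return psi(tau)
        for x, a, _ in episode:
            probs = softmax(theta[x])
            for b in range(n_actions):
                # d/d theta[x][b] of log pi(a | x) = 1{a == b} - pi(b | x)
                indicator = 1.0 if a == b else 0.0
                grad[x][b] += (indicator - probs[b]) * psi
    T = len(trajectories)
    return [[g / T for g in row] for row in grad]

# Two hand-written one-step episodes in a 2-state, 2-action problem.
theta = [[0.0, 0.0], [0.0, 0.0]]
episodes = [[(0, 1, 1.0)], [(0, 0, 0.0)]]
grad_hat = score_function_gradient(episodes, theta)
```

Only the first episode has a nonzero return, so only the logits of state 0 receive a nonzero gradient, pushing probability toward action 1.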
### Discounted infinite horizon MDP
$V_{\pi}(X_1) = E_\pi[\sum_{t=1}^\infty \gamma^{t-1} r_t | \sigma(X_1)]$
$Q_{\pi}(X_1, A_1) = E_\pi[\sum_{t=1}^\infty \gamma^{t-1} r_t | \sigma(X_1, A_1)]$
We know that $V_\pi(X_1) = E_\pi[Q_\pi (X_1, A_1) | \sigma(X_1)]$
In Euclidean spaces, $V_\pi(x) = \int_\mathcal A \pi(a, x)Q_\pi(x, a)da$
$\nabla_\pi V_\pi(x) = \int_\mathcal A \nabla_\pi \pi(a, x)Q_\pi(x, a) da + \int_\mathcal A \pi(a, x) \nabla_\pi Q_\pi(x, a) da$
$= \int_\mathcal A \nabla_\pi \pi(a, x)Q_\pi(x, a) da + \int_\mathcal A \pi(a, x) \nabla_\pi (R(x, a) + \gamma \int_\mathcal X P(x'|x,a)V_\pi(x')dx') da$
$= \int_\mathcal A \nabla_\pi \pi(a, x)Q_\pi(x, a) da + \gamma \int_\mathcal A \pi(a, x) \int_\mathcal X P(x'|x,a) \nabla_\pi V_\pi(x')dx'da$, since $R$ and $P$ do not depend on $\pi$
Define the discounted occupancy measure (kernel): $\bar \mu_\pi^\gamma (S, x) = \lim_{T\to\infty} \frac{1}{1 - \gamma}E_\pi[\sum_{t=1}^T \gamma^{t-1}I(x_t \in S) | \sigma(x_1 = x)]$
$d\bar\mu_\pi^\gamma(\cdot, x) = \mu_\pi^\gamma(x', x)dx'$, i.e. $\mu_\pi^\gamma(\cdot, x)$ is the density of $\bar\mu_\pi^\gamma(\cdot, x)$
Side note: $V_\pi(x) = (1 - \gamma)\int_\mathcal X \int_\mathcal A R(x',a) \pi(a, x') \mu_\pi^\gamma(x',x)dadx'$
Unrolling the recursion for $\nabla_\pi V_\pi$ above gives $\nabla_\pi V_\pi(x) = (1 - \gamma)\int_\mathcal X \int_\mathcal A \nabla_\pi \pi(a, x') Q_\pi(x', a) \mu_\pi^\gamma(x',x)dadx'$
with $\nabla_\pi \eta(\pi) = E_{x \sim P_1}[\nabla_\pi V_\pi(x)]$
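
The side-note identity relating $V_\pi$ to the occupancy measure can be checked numerically on a small tabular problem. The sketch below is illustrative only: the 2-state, 2-action MDP, its numbers, and the function names are all assumptions, the infinite sums are truncated at a finite horizon, and the occupancy measure uses the $\frac{1}{1-\gamma}$ scaling from these notes.

```python
GAMMA = 0.9
# Hypothetical MDP (illustrative numbers, not from the notes).
P = [[[0.8, 0.2], [0.1, 0.9]],   # P[x][a][y] = P(y | x, a)
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0], [0.0, 2.0]]     # R[x][a]
pi = [[0.6, 0.4], [0.3, 0.7]]    # pi[x][a] = pi(a | x)
N_STATES, N_ACTIONS = 2, 2

# Chain and expected one-step reward induced by pi.
P_pi = [[sum(pi[x][a] * P[x][a][y] for a in range(N_ACTIONS))
         for y in range(N_STATES)] for x in range(N_STATES)]
r_pi = [sum(pi[x][a] * R[x][a] for a in range(N_ACTIONS))
        for x in range(N_STATES)]

def occupancy(start, horizon=2000):
    """Truncated mu(y, start) = 1/(1-gamma) * sum_t gamma^(t-1) Pr(x_t = y | x_1 = start)."""
    dist = [1.0 if x == start else 0.0 for x in range(N_STATES)]
    mu = [0.0] * N_STATES
    discount = 1.0
    for _ in range(horizon):
        for x in range(N_STATES):
            mu[x] += discount * dist[x]
        dist = [sum(dist[x] * P_pi[x][y] for x in range(N_STATES))
                for y in range(N_STATES)]
        discount *= GAMMA
    return [m / (1.0 - GAMMA) for m in mu]

def value(sweeps=2000):
    """Iterative policy evaluation: V = r_pi + gamma * P_pi V."""
    V = [0.0] * N_STATES
    for _ in range(sweeps):
        V = [r_pi[x] + GAMMA * sum(P_pi[x][y] * V[y] for y in range(N_STATES))
             for x in range(N_STATES)]
    return V

V = value()
mu0 = occupancy(start=0)
# Right-hand side of the side note, evaluated at x = 0.
V0_from_mu = (1.0 - GAMMA) * sum(mu0[y] * r_pi[y] for y in range(N_STATES))
```

With this scaling the occupancy measure has total mass $\frac{1}{(1-\gamma)^2}$ rather than 1, which is why the identity carries a factor $(1-\gamma)$.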
### Undiscounted infinite horizon MDP
$\rho(\pi) = \lim_{T \to \infty} \frac{1}{T} E_{\pi}[\sum_{t=1}^T r_t]$
$\bar \mu_\pi (S) = \lim_{T \to \infty} \frac{1}{T} E_\pi[\sum_{t=1}^T I(X_t \in S)]$
$d\bar\mu_\pi = \mu_\pi (x') dx'$
$\rho(\pi) = \int_\mathcal X \int_\mathcal A (\pi(a,x')r(x',a)da)\mu_\pi(x') dx'$
$V_\pi(x) = \lim_{T \to \infty} E_\pi[\sum_{t=1}^T (r_t - \rho(\pi)) | \sigma(x_1 = x)]$
$Q_\pi(x, a) = \lim_{T \to \infty} E_\pi[\sum_{t=1}^T (r_t - \rho(\pi)) | \sigma(x_1 = x, a_1 = a)]$
$V_\pi(x) = E_\pi[Q_\pi(x,a)|\sigma(x_1 = x)] = \int_\mathcal A Q_\pi(x,a) \pi(a,x)da$
$\nabla_\pi V_\pi(x) = \int_\mathcal A \nabla_\pi \pi(a, x)Q_\pi(x, a) da + \int_\mathcal A \pi(a, x) \nabla_\pi (R(x, a) - \rho(\pi) + \int_\mathcal X P(x'|x,a)V_\pi(x')dx') da$
$= \int_\mathcal A \nabla_\pi \pi(a, x)Q_\pi(x, a) da + \int_\mathcal A \pi(a, x) (-\nabla_\pi \rho(\pi) + \int_\mathcal X P(x'|x,a) \nabla_\pi V_\pi(x')dx')da$
Rearranging, and using $\int_\mathcal A \pi(a,x)da = 1$: $\nabla_\pi \rho(\pi) = \int_\mathcal A \nabla_\pi \pi(a,x) Q_\pi(x,a)da + \int_\mathcal A \pi(a,x) \int_\mathcal X P(x'|x,a) \nabla_\pi V_\pi(x')dx'da - \nabla_\pi V_\pi(x)$
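Integrating against the stationary density $\mu_\pi$, the two $\nabla_\pi V_\pi$ terms cancel by stationarity: $\nabla_\pi \rho(\pi) = \int_\mathcal X \nabla_\pi \rho(\pi) \mu_\pi(x) dx = \int_\mathcal X \mu_\pi(x) \int_\mathcal A \nabla_\pi \pi(a,x) Q_\pi(x,a) da dx$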
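
The two expressions for the average reward in this section, the time-average definition and the integral against the stationary density, can be compared numerically. The sketch below is a hedged illustration: the small MDP, its numbers, and the helper names are assumptions, and power iteration presumes the chain induced by $\pi$ is ergodic.

```python
# Hypothetical 2-state, 2-action MDP (illustrative numbers, not from the notes).
P = [[[0.8, 0.2], [0.1, 0.9]],   # P[x][a][y] = P(y | x, a)
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0], [0.0, 2.0]]     # R[x][a]
pi = [[0.6, 0.4], [0.3, 0.7]]    # pi[x][a] = pi(a | x)
N_STATES, N_ACTIONS = 2, 2

# Chain and expected one-step reward induced by pi.
P_pi = [[sum(pi[x][a] * P[x][a][y] for a in range(N_ACTIONS))
         for y in range(N_STATES)] for x in range(N_STATES)]
r_pi = [sum(pi[x][a] * R[x][a] for a in range(N_ACTIONS))
        for x in range(N_STATES)]

def step(dist):
    """Push a state distribution one step through the chain P_pi."""
    return [sum(dist[x] * P_pi[x][y] for x in range(N_STATES))
            for y in range(N_STATES)]

def stationary_distribution(iters=1000):
    """Power iteration; assumes the chain is ergodic so the limit exists."""
    mu = [1.0 / N_STATES] * N_STATES
    for _ in range(iters):
        mu = step(mu)
    return mu

def time_averaged_reward(start, T=10000):
    """(1/T) * sum_{t=1}^T E[r_t | x_1 = start]: the definition of rho, truncated at T."""
    dist = [1.0 if x == start else 0.0 for x in range(N_STATES)]
    total = 0.0
    for _ in range(T):
        total += sum(dist[x] * r_pi[x] for x in range(N_STATES))
        dist = step(dist)
    return total / T

mu = stationary_distribution()
rho_from_mu = sum(mu[x] * r_pi[x] for x in range(N_STATES))
rho_from_time_average = time_averaged_reward(start=0)
```

The time average depends on the start state only through a transient that vanishes at rate $1/T$, so both quantities agree up to that truncation error.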
### Policy Parameterization
Consider discounted tabular MDPs:
Direct Parametrization:
$\pi_\theta(a,x) = \theta_{ax} \geq 0, \sum_a \theta_{ax} = 1$
Softmax Parametrization:
$\pi_\theta(a,x) = \frac{\exp\theta_{ax}}{\sum_{a'} \exp \theta_{a'x}}$
General functional classes:
$\pi_\theta(a,x) = f_\theta(x,a), \sum_a f_\theta(x,a) = 1$
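
The first two parametrizations differ in how they keep $\pi_\theta(\cdot, x)$ a valid distribution: the direct one needs each row of $\theta$ to stay on the probability simplex, while softmax accepts unconstrained logits. A minimal sketch of both maps follows; the sort-based Euclidean projection is one common choice and is an assumption here, since the notes do not specify the projection operator, and both function names are hypothetical.

```python
import math

def softmax(logits):
    """Softmax parametrization for one state: unconstrained logits to probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) = 1} (sort-based method)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        # The condition holds for a prefix of k values; keep the last one that passes.
        if u_k - candidate > 0:
            tau = candidate
    return [max(x - tau, 0.0) for x in v]

# Direct parametrization: a gradient step may leave the simplex, so project back.
row_after_step = [0.9, 0.4]
projected_row = project_to_simplex(row_after_step)

# Softmax parametrization: any real logits give a valid distribution.
softmax_row = softmax([2.0, -1.0])
```

Projection can place rows on the boundary of the simplex (exact zeros), whereas softmax probabilities are always strictly positive.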
Define the advantage function $A_\pi(x,a) = Q_\pi(x,a) - V_\pi(x)$
Performance Difference Lemma (Kakade and Langford, 2002):
$\eta(\pi) - \eta(\pi') = E_{\mu_\pi^\gamma}[E_\pi[A_{\pi'}(X,A)|X]]$
(On final: derive for undiscounted infinite horizon)
Policy gradient for the direct parametrization:
$\mu_{\pi_\theta}^\gamma(x') = \sum_x \mu_{\pi_\theta}^\gamma(x', x)P_1(x)$
$\nabla_\theta \eta(\pi_\theta) = (1 - \gamma) \sum_{x'} \sum_a \nabla_\theta \pi_\theta(a,x') Q_{\pi_\theta}(x',a)\mu_{\pi_\theta}^\gamma(x')$
$\frac{\partial \eta(\pi_\theta)}{\partial \theta_{ax}} = (1 - \gamma)Q_{\pi_\theta}(x,a)\mu_{\pi_\theta}^\gamma(x)$
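
The partial-derivative formula for the direct parametrization can be sanity-checked against finite differences. This is a sketch under stated assumptions: the small MDP, its numbers, and all function names are hypothetical, infinite sums are truncated, $\mu$ uses the $\frac{1}{1-\gamma}$ scaling from these notes, and the derivative is the unconstrained partial (the perturbed row is not renormalized).

```python
GAMMA = 0.9
# Hypothetical 2-state, 2-action MDP (illustrative numbers, not from the notes).
P = [[[0.8, 0.2], [0.1, 0.9]],   # P[x][a][y] = P(y | x, a)
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0], [0.0, 2.0]]     # R[x][a]
P1 = [0.5, 0.5]                  # initial state distribution
theta = [[0.6, 0.4], [0.3, 0.7]] # direct parametrization: theta[x][a] = pi(a | x)
N_STATES, N_ACTIONS = 2, 2

def q_and_v(theta, sweeps=2000):
    """Fixed-point iteration for Q = R + gamma * P V, V(x) = sum_a theta[x][a] Q(x, a)."""
    V = [0.0] * N_STATES
    for _ in range(sweeps):
        Q = [[R[x][a] + GAMMA * sum(P[x][a][y] * V[y] for y in range(N_STATES))
              for a in range(N_ACTIONS)] for x in range(N_STATES)]
        V = [sum(theta[x][a] * Q[x][a] for a in range(N_ACTIONS))
             for x in range(N_STATES)]
    return Q, V

def eta(theta):
    """Objective eta = E_{x ~ P1}[V(x)]."""
    _, V = q_and_v(theta)
    return sum(P1[x] * V[x] for x in range(N_STATES))

def occupancy(theta, horizon=2000):
    """mu(y) = 1/(1-gamma) * sum_t gamma^(t-1) Pr(x_t = y), with x_1 ~ P1."""
    dist = list(P1)
    mu = [0.0] * N_STATES
    discount = 1.0
    for _ in range(horizon):
        for x in range(N_STATES):
            mu[x] += discount * dist[x]
        dist = [sum(dist[x] * theta[x][a] * P[x][a][y]
                    for x in range(N_STATES) for a in range(N_ACTIONS))
                for y in range(N_STATES)]
        discount *= GAMMA
    return [m / (1.0 - GAMMA) for m in mu]

def analytic_gradient(theta):
    """d eta / d theta[x][a] = (1 - gamma) * Q(x, a) * mu(x)."""
    Q, _ = q_and_v(theta)
    mu = occupancy(theta)
    return [[(1.0 - GAMMA) * Q[x][a] * mu[x] for a in range(N_ACTIONS)]
            for x in range(N_STATES)]

def finite_difference_gradient(theta, eps=1e-6):
    """Central differences, perturbing one entry of theta at a time."""
    grad = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for x in range(N_STATES):
        for a in range(N_ACTIONS):
            plus = [row[:] for row in theta]
            minus = [row[:] for row in theta]
            plus[x][a] += eps
            minus[x][a] -= eps
            grad[x][a] = (eta(plus) - eta(minus)) / (2 * eps)
    return grad

g_analytic = analytic_gradient(theta)
g_numeric = finite_difference_gradient(theta)
```

Agreement between `g_analytic` and `g_numeric` only validates the unconstrained partials; an actual update would still need the projection step to return to the simplex.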
PL condition:
Proof: