# Supervised Policy Gradients

$$
\newcommand{\L}{\mathcal{L}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\N}{\mathcal{N}}
$$

## Problem

The goal is to learn a policy $\pi_\theta(a_t | s_t)$ for a sequential decision-making problem, where supervision is available in the form of a loss function $\mathcal{L}(s_t, \theta)$. This could be, for example, the L2 distance between the mean action of the policy and the optimal ground-truth action. Note the differences from a reward function in standard RL:

- $\mathcal{L}$ is a function of the parameters $\theta$ and $s_t$ and is differentiable w.r.t. $\theta$; reward is a function of $(s_t, a_t)$.
- $\mathcal{L}$ does not depend on the sampled action $a_t \sim \pi_\theta(a_t|s_t)$, but only on the entire distribution $\pi_\theta(\cdot | s_t)$ defined by $\theta$.

For some time horizon $T$, we want to find the parameters that minimize the expected cumulative loss:

$$
\theta^* = \arg \min_\theta J(\pi_\theta) = \arg \min_\theta \E_{\pi_\theta} \sum_{t=1}^T \L(s_t, \theta)
$$

The *trajectory* $\tau$ is defined as $\tau = (s_0, a_0, s_1, a_1, ..., s_T, a_T)$. The environment is assumed to be an MDP.

## Gradient of the objective

We proceed along the same lines as the [derivation of policy gradients for a finite time horizon](http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-5.pdf). However, in our case the loss is a function of $\theta$, so we use the product rule of differentiation to obtain two terms.

$$
\begin{align}
\nabla_\theta J(\pi_\theta) &= \nabla_\theta \int_\tau \sum_{t=1}^T \L(s_t, \theta) \cdot p_\theta(\tau) \cdot d\tau\\
&= \int_\tau \sum_{t=1}^T \nabla_\theta \L(s_t, \theta) \cdot p_\theta(\tau) \cdot d\tau + \int_\tau \sum_{t=1}^T \L(s_t, \theta) \cdot \nabla_\theta p_\theta(\tau) \cdot d\tau\\
&= \int_\tau \sum_{t=1}^T \nabla_\theta \L(s_t, \theta) \cdot p_\theta(\tau) \cdot d\tau + \int_\tau \sum_{t=1}^T \L(s_t, \theta) \cdot \nabla_\theta \log p_\theta(\tau) \cdot p_\theta(\tau) \cdot d\tau\\
&= \underbrace{\E_{\tau \sim p_\theta} \sum_{t=1}^T \nabla_\theta \L(s_t, \theta)}_{\text{supervised term}} + \underbrace{\E_{\tau \sim p_\theta} \sum_{t=1}^T \L(s_t, \theta) \cdot \nabla_\theta \log p_\theta(\tau)}_{\text{RL term}}
\end{align}
$$

The derivation is based on [this](https://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf) and [this](https://arxiv.org/pdf/1412.7755.pdf) paper. A combination of a supervised loss term and an RL term has been used in prior work on active vision such as [this](https://www.seas.upenn.edu/~dineshj/publication/ramakrishnan-2019-emergence/ramakrishnan-2019-emergence.pdf) (page 10: Equations 1, 2, 4).
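To make the two-term estimator concrete, here is a minimal single-trajectory sketch in PyTorch. The `policy`, `env`, and `loss_fn` objects are hypothetical stand-ins (a state-conditioned `torch.distributions` policy, an environment whose `reset`/`step` return states, and a differentiable per-step loss $\L(s_t, \theta)$); they are assumptions made for illustration, not part of the derivation above.

```python
import torch


def surrogate_objective(policy, env, loss_fn, T):
    """Build a surrogate scalar whose autograd gradient equals the two-term
    estimator above: supervised term + score-function (RL) term."""
    state = env.reset()
    step_losses, log_probs = [], []
    for _ in range(T):
        dist = policy(state)                    # pi_theta(. | s_t), a torch.distributions object
        action = dist.sample()                  # sampling: no gradient flows through a_t
        step_losses.append(loss_fn(state, policy))      # L(s_t, theta), differentiable in theta
        log_probs.append(dist.log_prob(action).sum())   # log pi_theta(a_t | s_t)
        state = env.step(action)
    total_loss = torch.stack(step_losses).sum()     # sum_t L(s_t, theta)
    total_log_prob = torch.stack(log_probs).sum()   # log p_theta(tau), up to theta-independent terms
    # Gradient of the surrogate w.r.t. theta:
    #   grad(total_loss)                           -> supervised term
    #   total_loss.detach() * grad(total_log_prob) -> RL term
    return total_loss + total_loss.detach() * total_log_prob
```

Calling `.backward()` on the returned value and taking an optimizer step then follows the gradient estimator; the `.detach()` makes the loss act as a constant weight inside the RL term rather than contributing a second supervised gradient.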
## RL term vanishes for deterministic continuous policies

Assume that we have a deterministic policy $\mu_\theta(s_t)$ in a continuous action space. We will show that in this case the RL term vanishes. Let us view the deterministic policy as the limiting case of a Gaussian policy with diminishing variance, i.e.

$$
\pi_{\theta, \sigma}(a_t \mid s_t) = \N\big(a_t;\, \mu_\theta(s_t), \sigma^2\big), \qquad \mu_\theta(s_t) = \lim_{\sigma \rightarrow 0} \pi_{\theta, \sigma}(\cdot \mid s_t)
$$

Then, the RL term is

$$
\begin{align}
\text{RL term} &= \lim_{\sigma \rightarrow 0} \int_\tau \sum_{t=1}^T \L(s_t, \theta) \cdot \nabla_\theta \log p_{\theta, \sigma}(\tau) \cdot p_{\theta, \sigma}(\tau) \cdot d\tau\\
&= \lim_{\sigma \rightarrow 0} \int_\tau \sum_{t=1}^T \L(s_t, \theta) \cdot \sum_{t'=1}^T \nabla_\theta \log \pi_{\theta, \sigma} (a_{t'} | s_{t'}) \cdot p_{\theta, \sigma}(\tau) \cdot d\tau\\
&= \lim_{\sigma \rightarrow 0} \int_\tau \sum_{t=1}^T \L(s_t, \theta) \cdot \sum_{t'=1}^T \left[ \nabla_\theta \log \frac{1}{\sqrt{2\pi\sigma^2}} -\frac{1}{2} \nabla_\theta \frac{\Big(\mu_\theta(s_{t'}) - a_{t'}\Big)^2}{\sigma^2} \right] \cdot p_{\theta, \sigma}(\tau) \cdot d\tau\\
&= \lim_{\sigma \rightarrow 0} \int_\tau \sum_{t=1}^T \L(s_t, \theta) \cdot \sum_{t'=1}^T \left[-\frac{\Big(\mu_\theta(s_{t'}) - a_{t'} \Big)}{\sigma^2} \nabla_\theta \mu_\theta(s_{t'}) \right] \cdot p_{\theta, \sigma}(\tau) \cdot d\tau\\
&= \int_\tau \sum_{t=1}^T \L(s_t, \theta) \cdot \sum_{t'=1}^T \left[ -\Big(\mu_\theta(s_{t'}) - a_{t'} \Big) \nabla_\theta \mu_\theta(s_{t'}) \right] \cdot \lim_{\sigma \rightarrow 0} \frac{p_{\theta, \sigma}(\tau)}{\sigma^2} \cdot d\tau\\
&= 0
\end{align}
$$

Note that $p_{\theta, \sigma}(\tau)$ contains $\exp \Big( - \frac{k}{\sigma^2} \Big)$ factors from the Gaussian policy, which vanish faster than any power of $\sigma$ as $\sigma \rightarrow 0$ whenever $k > 0$. Thus, the only trajectories $\tau$ with $\lim_{\sigma \rightarrow 0} \frac{p_{\theta, \sigma}(\tau)}{\sigma^2} \neq 0$ are those with $a_{t'} = \mu_\theta(s_{t'})$ for all $1 \leq t' \leq T$; but for exactly those trajectories the factor $\big(\mu_\theta(s_{t'}) - a_{t'}\big)$ is zero, so the integrand vanishes either way. Hence the RL term is 0, and in the case of deterministic continuous actions one can ignore it.
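As a sanity check on this conclusion, training a deterministic continuous policy reduces to ordinary backpropagation of the per-step supervised loss at the visited states. The sketch below assumes hypothetical `policy` (returning the mean action $\mu_\theta(s_t)$ as a tensor), `env`, `expert_action`, and a standard PyTorch `optimizer`; none of these come from the text above.

```python
import torch


def supervised_only_update(policy, env, expert_action, optimizer, T):
    """Deterministic continuous policy: the RL term vanishes, so we only
    backpropagate the per-step supervised loss at the visited states."""
    state = env.reset()
    losses = []
    for _ in range(T):
        mean_action = policy(state)                   # mu_theta(s_t)
        target = expert_action(state)                 # ground-truth action (treated as a constant)
        losses.append(((mean_action - target) ** 2).sum())  # example L(s_t, theta): L2 distance
        # Step the environment with the deterministic action; detach so that
        # no gradient is taken through the dynamics.
        state = env.step(mean_action.detach())
    loss = torch.stack(losses).sum()
    optimizer.zero_grad()
    loss.backward()      # supervised term only; no score-function term
    optimizer.step()
    return loss.item()
```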