---
tags: MDP Gradient
---
# Planning Option Gradient
[Up](/UGEvKdcPT0uopCxjKCoM-g)
## (Intrinsic) Reward gradient [(I)RG](/97qXRqidSMK7xFjGzRi0PA)
Given the extrinsic MDP $\color{red}{M_{e}} := <P, \color{red}r, \gamma, S, A, s_0>$,
where $s_0$ is the initial state distribution and $\gamma<1$.
We want to find the optimal $\theta$ parametrizing an intrinsic reward $r_\theta$ such that the policy obtained by planning on $r_\theta$ maximizes the extrinsic return:
$$
\arg \max_{r_\theta} \enspace
V_{\hat\pi}(s_0) = \operatorname E\left[\sum_{t=0}^\infty \gamma^t \color{red}r_t\mid s_0 = s, a \sim \hat\pi\right]
$$
An equivalent formulation for the tabular case, by Neumann-series convergence:
$$
\arg \max_{r_\theta} \enspace (I - \gamma P_\hat\pi)^{-1} \color{red}r_\hat\pi
$$
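As a concrete reference, here is a minimal sketch of evaluating this tabular objective, assuming a transition tensor `P` of shape `(S, A, S')`, a reward matrix `r` of shape `(S, A)` and a policy matrix `pi` (all names are illustrative):
```python
import jax.numpy as jnp

def policy_value(P, r, pi, gamma=0.99):
    """Tabular V_pi = (I - gamma P_pi)^{-1} r_pi, evaluated as a linear solve."""
    P_pi = jnp.einsum('sap,sa->sp', P, pi)   # state-to-state transitions under pi
    r_pi = jnp.sum(pi * r, axis=-1)          # expected one-step reward under pi
    return jnp.linalg.solve(jnp.eye(P.shape[0]) - gamma * P_pi, r_pi)
```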
The intrinsic reward $r_\theta$ does not explicitly appear in this objective, but let's assume there is an unknown function
$$\phi_{P, \gamma, S,A, s_0}(\color{green}{r_\theta})=\phi(r_\theta)\rightarrow \hat\pi$$
This function takes as input the MDP $\color{green}{M_\theta} := <P, \color{green}{r_\theta}, \gamma, S, A, s_0>$ and returns an optimal policy for that MDP.
The new objective function is:
$$V_{\hat\pi} = [I - \gamma (P\cdot \phi(\color{green}{M_\theta}))\enspace]^{-1} (r \cdot \phi(\color{green}{M_\theta})) $$
To find a gradient we will need a gradient of $V$ w.r.t. $r_\theta$:
$$\frac{\partial V_{\hat\pi}}{\partial r_\theta} = \frac{\partial V_{\hat\pi}}{\partial \phi} \cdot \frac{\partial \phi}{\partial r_\theta} $$
In our case $\phi(\color{green}{M_\theta})$ is a differentiable planner, more concretely smooth value iteration, which applies the smooth Bellman backup operator
$$T Q := r_\theta + \gamma P\enspace\text{softmax}_\tau(Q)$$ to a vector $Q$ a finite number of times $N<\infty$, until $\epsilon$-convergence,
i.e.
$Q^\star = T(T(T(\dots T(Q))))$
And
$\hat\pi=\text{softmax}_\tau(Q^\star)$
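A minimal sketch of such a planner $\phi$, under the same tabular assumptions as above and with one common reading of the operator ($\text{softmax}_\tau(Q)$ acts as a policy and the backup takes the expected $Q$ under it); the temperature `tau` and the $\epsilon$ threshold are illustrative:
```python
import jax
import jax.numpy as jnp

def smooth_bellman(Q, r_theta, P, gamma, tau):
    """T Q = r_theta + gamma * P * softmax_tau(Q): expected next-state value
    under the soft policy induced by Q."""
    pi = jax.nn.softmax(Q / tau, axis=-1)             # (S, A) soft policy
    V = jnp.sum(pi * Q, axis=-1)                      # (S,)   soft state value
    return r_theta + gamma * jnp.einsum('sap,p->sa', P, V)

def smooth_value_iteration(r_theta, P, gamma=0.99, tau=0.1, eps=1e-6, max_iters=10_000):
    """Apply T a finite number of times, until epsilon-convergence."""
    Q = jnp.zeros_like(r_theta)
    for _ in range(max_iters):
        Q_next = smooth_bellman(Q, r_theta, P, gamma, tau)
        if jnp.max(jnp.abs(Q_next - Q)) < eps:
            return Q_next, jax.nn.softmax(Q_next / tau, axis=-1)
        Q = Q_next
    return Q, jax.nn.softmax(Q / tau, axis=-1)        # hat{pi} = softmax_tau(Q*)
```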
### Implicit Function Theorem (IFT), [differentiation and iterative processes](https://who.rocq.inria.fr/Jean-Charles.Gilbert/preprint/08+ad-iterative-processes.pdf)
This iterative process converges when the Bellman error is equal to 0; the state-wise Bellman error is:
$BE(\hat\pi, r_\theta) := r_{\hat\pi} + \gamma P_{\hat\pi} V_{\hat\pi} - V_{\hat\pi}$
$BE(\hat\pi, r_\theta) := r_{\hat\pi} + (\gamma P_{\hat\pi} - I) V_{\hat\pi}$
By definition the implicit function $\phi(r_\theta)$ maps a reward $r_\theta$ to an optimal policy $\hat\pi^\star$ that maximizes $r_\theta$; this means that $BE(\hat\pi^\star, r_\theta)=0$.
Here the policy has been made explicit in the Bellman optimality condition $B_\pi(V)$.
IFT (Dini, Pisa) [^1] requires that:
1. We have an equilibrium $BE(\hat\pi^\star, r_\theta) = 0$ around $(\hat\pi^\star, r_\theta)$ :heavy_check_mark:
2. $BE$ is continuously differentiable (thanks to the softmax) :heavy_check_mark:
3. $\frac{\partial BE}{\partial \pi} (\hat\pi^\star, r_\theta)$ is invertible :question:
Then there exists a function $\phi : r_\theta \mapsto \pi_{r_\theta}$
and its gradient is:
$$ \underset{\hat{\pi}\times r_\theta}{\frac{\partial \phi}{\partial r_\theta}(r_\theta)} = -\underset{s\times \hat\pi}{\left[ \frac{\partial BE}{\partial \pi} (\hat\pi^\star, r_\theta) \right ]^{-1} } \underset{s\times r_\theta}{\frac{\partial BE}{\partial r_\theta}} $$
To better understand what's going on, the graph shows isolevels of the Bellman error for a 1-state, 1-continuous-action MDP.

Given a reward function $r_{\theta,0}$ and an initial policy $\pi_0 = \text{softmax}(Q_0)$, the $\phi$ function returns one of the $\pi^\star$ policies that are roots of the Bellman error.

Starting from a valid solution $(\hat\pi^\star, r_\theta)$, in an $\epsilon$-ball $BE$ can be approximated to first order as:
$$BE(\pi, r)\approx BE(\hat\pi^\star, r_\theta)+\nabla BE(\pi, r) \mid_{(\hat\pi^\star, r_\theta)} \cdot\, [(\pi, r)- (\hat\pi^\star, r_\theta)]$$
We are interested only in the values that respect the equilibrium $BE=0$:
$$\underbrace{BE(\hat\pi^\star, r_\theta)}_0+\nabla BE(\pi, r)\mid_{(\hat\pi^\star, r_\theta)} \cdot [(\pi, r)- (\hat\pi^\star, r_\theta)] = 0$$
Since $\pi$ is implicitly defined by $\phi$ the chain rule applies:
$$\frac{\partial BE}{\partial \pi} \frac{\partial \phi}{\partial r_\theta} + \frac{\partial BE}{\partial r_\theta} = 0$$
$$\frac{\partial \phi}{\partial r_\theta}(r_\theta) = -\left [ \frac{\partial BE}{\partial \pi} (\pi^\star, r_\theta) \right ]^{-1} \frac{\partial BE}{\partial r_\theta} $$
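A sketch of this gradient, reusing `smooth_bellman` and `smooth_value_iteration` from the sketch above. For simplicity the equilibrium is written in $Q$-space, $F(Q, r_\theta) := TQ - Q = 0$, rather than directly over $\hat\pi$; the IFT recipe is the same, and $\partial\hat\pi/\partial r_\theta$ follows by chaining through the softmax:
```python
def planner_gradient(r_theta, P, gamma=0.99, tau=0.1):
    """d(hat_pi)/d(r_theta) via the implicit function theorem at Q = Q*."""
    Q_star, _ = smooth_value_iteration(r_theta, P, gamma, tau)
    S, A = r_theta.shape

    def residual(q_flat, r_flat):
        Q, r = q_flat.reshape(S, A), r_flat.reshape(S, A)
        return (smooth_bellman(Q, r, P, gamma, tau) - Q).reshape(-1)  # F(Q, r)

    q, r = Q_star.reshape(-1), r_theta.reshape(-1)
    dF_dQ = jax.jacobian(residual, argnums=0)(q, r)       # (SA, SA)
    dF_dr = jax.jacobian(residual, argnums=1)(q, r)       # (SA, SA)
    dQ_dr = -jnp.linalg.solve(dF_dQ, dF_dr)               # IFT: -[dF/dQ]^{-1} dF/dr

    softmax_flat = lambda q_: jax.nn.softmax(q_.reshape(S, A) / tau, axis=-1).reshape(-1)
    dpi_dQ = jax.jacobian(softmax_flat)(q)                # (SA, SA)
    return dpi_dQ @ dQ_dr                                 # chain rule to d(pi)/d(r_theta)
```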
## Multi-Step Models (Random Return Time)
### Single policy, Non-uniform decision points
In this setting the policy acts at every timestep, but control (and learning) is not returned at every timestep.
Given a continuation function $\lambda: S\rightarrow[0,1]$ and a termination function $\beta(s) := 1-\lambda(s)$ describing the probability of returning control,
let's call *decision points* the times at which the agent selects an action.
We know that under a policy $\pi$ the expected next state after one transition is
$$s^\prime=\underset{s \times s^\prime}{P_\pi}\cdot s$$
The probability of transitioning from $s$ to $s^\prime$ and continuing execution:
$$\underset{s \times s^\prime} {P_\pi^\lambda} := \underset{s \times s^\prime}{P_\pi} \enspace \underset{s^\prime}\lambda$$
The probability of transitioning from $s$ to $s^\prime$ and returning control (geometric distribution p=$\beta$):
$$\underset{s \times s^\prime}{P^\beta_\pi} := \underset{s \times s^\prime}{P_\pi} \enspace \underset{s^\prime}\beta$$
For a multi-step policy the *steady state discounted continuation distribution* is now (the $t=0$ term is kept so that $d_\pi^\lambda$ also accounts for the immediate step):
$$I + \gamma P_\pi^\lambda +(\gamma P_\pi^\lambda)^2+(\gamma P_\pi^\lambda)^3+\dots = \sum_{t=0}^\infty (\gamma P_\pi^\lambda)^t = \underbrace{(I - \gamma P_\pi^\lambda)^{-1}}_{\text{Neumann series}}=\underset{s \times s^\prime}{d_\pi^\lambda}$$
Equivalently, for a multi-step policy the *steady state discounted termination distribution* is now:
$$\gamma P_{\pi}^\beta+(\gamma P_{\pi}^\lambda)(\gamma P_\pi^\beta)+(\gamma P_\pi^\lambda)^2(\gamma P_\pi^\beta)+\dots = \Big(\sum_{t=0}^\infty (\gamma P_\pi^\lambda)^{t}\Big)\gamma P_\pi^\beta =$$
$$= \underbrace{(I - \gamma P_\pi^\lambda)^{-1}}_{\text{Neumann series}}\gamma P_\pi^\beta=\gamma\, d_\pi^\lambda P_\pi^\beta=\gamma \underset{s \times s^\prime}{d_\pi^\beta}$$
Rename the terms
$$\begin{align}
\text{execution distribution:} && \underset{s \times s^\prime}{d^\lambda} := d_{\pi, \lambda}\\
\text{decision distribution:} && \underset{s \times s^\prime}{d^\beta} := d_{\pi, \beta}
\end{align}$$
A multi-step value function is now:
$V^\lambda_\pi =
\underbrace{d^\lambda \enspace r_\pi}_{\text{execution V}}
+
\enspace
\underbrace{\gamma \enspace d^\beta \cdot V_\pi}_{\text{future V}}$
(aside) This is the classical one-step backup pre-multiplied by the matrix $d^\lambda$:
$V^\lambda_\pi = d^\lambda (r_\pi + \gamma P^\beta_\pi \cdot V_\pi )$
The reward model for a *decision point*:
$$
\underset{s \times a}{r^\lambda}=
\underset{s \times s^\prime}{d^\lambda}
\enspace
\underset{s \times a}{r}
$$
And the *decision point* transition model:
$$
\underset{s \times s^\prime}{P^\lambda_\pi}=
\underset{s \times s^\prime}{d^\beta}=
\underset{s \times s^\prime}{d_\pi^\lambda}
\underset{s \times s^\prime}{P_\pi^\beta}
$$ [^2]
The generalized policy evaluation equation for $\lambda$-models is then:
\begin{align}
& v_\pi=r^\lambda_\pi \enspace + \enspace \gamma P^\lambda_\pi v_\pi \\
& = d_{\pi,\lambda} \enspace (r_\pi \enspace + \enspace \gamma P^\beta_\pi v_\pi )
\end{align}
The smooth multi-step value iteration operator is now:
$$T Q := r^\lambda + \gamma P^\lambda_\pi\enspace\text{softmax}_\tau(Q)$$ applied to a vector $Q$ until $\epsilon$-convergence, as before.
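A sketch of these $\lambda$-model quantities under the same tabular assumptions, with `lam` a per-state continuation vector and $\gamma$ kept explicit as in this note (so the returned decision-point model is $d^\beta = d^\lambda_\pi P^\beta_\pi$):
```python
import jax.numpy as jnp

def lambda_model(P, r, pi, lam, gamma=0.99):
    """Decision-point reward and transition models for a fixed policy pi."""
    S = P.shape[0]
    P_pi = jnp.einsum('sap,sa->sp', P, pi)               # one-step transitions under pi
    r_pi = jnp.sum(pi * r, axis=-1)
    P_lam = P_pi * lam[None, :]                          # transition and keep executing
    P_beta = P_pi * (1.0 - lam)[None, :]                 # transition and return control
    d_lam = jnp.linalg.inv(jnp.eye(S) - gamma * P_lam)   # Neumann series
    r_lam = d_lam @ r_pi                                 # r^lambda_pi: reward model
    P_dec = d_lam @ P_beta                               # decision-point transition model (d^beta)
    return r_lam, P_dec, d_lam
```
As a sanity check, `jnp.linalg.solve(jnp.eye(S) - gamma * P_dec, r_lam)` recovers the ordinary $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$, which is exactly the generalized policy evaluation equation above.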
### Option-models
Let's describe an option $O$ as the tuple $<\pi^o,\beta^o>$,
and a multi-step policy $\mu: S\rightarrow A \cup \mathcal{O}$ over the set of options $\mathcal{O}=\{O^{0},..,O^{K}\}$.
### Single Option, No primitives $A=\emptyset, \mathcal{O}=\{O^0\}$
In this case $\mu$ coincides with $\pi$ from earlier, and by replacing terms we get:
\begin{align}
& \pi^o=\pi \\
& \beta^o=1-\lambda \\
& d^o=d_{\pi^o,\lambda} \\
\end{align}
\begin{align}
v=v_\mu=r^o \enspace + \enspace \gamma P^o v_\mu \\
\end{align}
### Only Options $A=\emptyset$
$r^\mathcal{O}$ now stacks the $K$ per-option reward models (each $r^{o^i}$ is an $S$-vector, giving an $S\times K$ matrix) and $P^\mathcal{O}$ stacks the $K$ per-option transition models $P^{o^i}$ (each $S\times S^\prime$)
\begin{align}
v_\mu=r^\mathcal{O}_\mu \enspace + \enspace \gamma P^\mathcal{O}_\mu v_\mu \\
\end{align}
From the point of view of $\mu$, choosing an option is equivalent to following a multi-step model where at every decision point we pick among the options as if they were primitive actions.
**TL;DR**
Reward model (one $\lambda$-reward model per option):
$\underset{s\times \mathcal{O}}{r^\mathcal{O}}=\big[\, d^{o^0} r_{\pi^{o^0}} \;\cdots\; d^{o^K} r_{\pi^{o^K}} \,\big]$
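A sketch of stacking the per-option models, reusing `lambda_model` from the sketch above and assuming each option is given as a pair `(pi_o, beta_o)` of a policy matrix and a per-state termination vector (names illustrative):
```python
import jax.numpy as jnp

def option_models(P, r, options, gamma=0.99):
    """Stack one lambda-model per option: r^O has shape (S, K), P^O has shape (K, S, S')."""
    r_O, P_O = [], []
    for pi_o, beta_o in options:
        r_o, P_o, _ = lambda_model(P, r, pi_o, 1.0 - beta_o, gamma)
        r_O.append(r_o)        # reward model of option o, shape (S,)
        P_O.append(P_o)        # decision-point transitions of option o, shape (S, S')
    return jnp.stack(r_O, axis=-1), jnp.stack(P_O, axis=0)
```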
## Multi-Option Multi-Intrinsic Motivation Single-Task
===THIS IS TODO===
This paper may have some explicit help:
[Approximate Value Iteration with Temporally Extended Actions](https://www.jair.org/index.php/jair/article/view/10947/26080)
We now have two policies: $\pi_\mu = \text{softmax}(\mu)$, a policy over options that maximizes $r$,
and a policy over primitives $\pi^o_\theta$ that maximizes $r_\theta$.
The higher layer of abstraction is similar:
$$J_{\nu} = (I - \gamma P_{\pi_\mu})^{-1} r_{\pi_\mu} $$
Expanded as:
$$= [I - \gamma (P \cdot \pi_\mu)]^{-1} (r \cdot \pi_\mu) $$
where:
$$\pi_\mu = f(r, P, \pi_\theta)$$
The planner is now parametrized by the option policy
$$\pi_{r_\theta} = f(\theta, P)$$
which has the same planner as the reward from earlier:
$$J^o_\tau = [I - \gamma (P\cdot f_{P}(\theta))\enspace]^{-1} (r \cdot f_{P}(\theta)) $$
and the same gradient as earlier.
$$\frac{\partial J^o_\tau}{\partial\theta}$$
now by chain rule:
$$\frac{\partial J_\nu}{\partial\theta}=\frac{\partial J_\nu}{\partial\mu}\cdot\frac{\partial \mu}{\partial\theta}$$
TODO: replace with $\pi_{r_\theta}$
<!--
Given P and $\pi^o$, a policy that continues with prob $\lambda(s, s^\prime)$:
The infinite time horizon state distribution is $(I - \gamma P_{\pi, o, \#})^{-1}$
The discounted weighting of states $d^T_{\alpha,\gamma,\pi}= \alpha^T(I-\gamma P_\pi)^{-1}$
Since this is a stationary distribution, we can drop the $\alpha$
The option reward model:
$r^o = d^o r^o_{\pi}$
Expected discounted return until termination
The option transition model:
$P_{\pi,o}=d_{\alpha,\gamma,\pi} \enspace \gamma(P^o_\pi - P^o_{\pi,\#})$
Expectation over the next state upon termination.
$(P^o_\pi - P^o_{\pi,\#})$ represents the probability of transitioning and terminating upon entering the **next** state
Where
$r^o_{\pi}=r^o=\pi^o \cdot r$
$P_{\pi,o,\#}=P^o_\lambda=\pi^o \cdot P \cdot \lambda$
a
b
c
d
e
f
g
<!--
Lets define a $\lambda$-reward model:
$r^o = \sum (\gamma P^o_{\pi})^t r^o$
$\lambda$ transition model:
$P^o = \sum (\gamma P_\pi^o)^t (\gamma P_\pi - \gamma P_\pi^o)$
By Neumman series converge
$V^o = (I - \gamma P^o)^{-1}r_\pi$
$P^o = (I - \gamma P^o)^{-1} \gamma(P_\pi - P_\pi^o)$
-->
<!--
Given MDP and $\pi$, an option defined by $<\pi^o, \#^o_s>$:
$r^o=\sum (\gamma P_{\#, o})^t r_{\pi^o}$
$P_\#: s \times s^\prime \rightarrow [0, 1]$ is the prob for the option to continue (not terminate) upon entering $s^\prime$.
$\# = 1 - \beta$
$P_{\pi^o,\#}=\pi^o \cdot P \cdot \lambda = P_{\pi^o} \lambda$
here $\lambda: s \times s^\prime \rightarrow [0, 1]$ is a ~~termination~~ continuation matrix
Option termination probability:
$P^o=P \cdot (1 - \beta)$
## Notes
Solving a RL problem
$\max_v J(r)$
Equilibrium:
$v^\star = \text{lse}_{\pi} r + \gamma P_\pi v^\star_\pi$
The dependnece chain is:
$\theta \mapsto \hat{r} \mapsto \mathcal{O} \mapsto \pi$
\begin{align*}
&\text{maximize}\enspace E[\ \log P_0(S_0) + \sum^{T-1}_{t=0}\ log\ \pi_{\mathcal{O}}P] \\
&\text{subject to}\enspace f(x, \theta)=0\\
\end{align*}
$v=(I - \gamma P_{\pi_h})^{-1}r_\pi$
For us, we want to find the intrisnic reward $\hat{r}$ that maximizes the J functionAnd the updates by following IFT are equivalent by lagrangian multipliers:
The RL problem has a valid solution when:
$B(\theta_r, \theta_\pi) = 0$
Where B is a function that goes:
1. $\mathcal{O} = \phi(\theta_r, P)$
2. $\theta_\pi = \psi(r, \mathcal{O}, P)$
There is an implicit function that maps $\phi: \theta_r \mapsto \{\theta_{\pi_1}, \theta_{\pi_2}, \theta_{\pi_3}, \theta_{\pi_4}, \}$, this $\phi$ is our $\text{intrinsic phase}$ with a fixed number of options.
There is another implicit function that maps $\psi:\{\theta_{\pi_1}, \theta_{\pi_2}, \theta_{\pi_3}, \theta_{\pi_4}, \} \mapsto {\theta_\mathcal{O}}$, this $\phi$ is our $\text{extrinsic phase}$ with a fixed number of options.
Start at $B=0$
The gradient along the equilibrium is:
\begin{align*}
& D_\theta B(\theta_\pi, \theta_r) = 0\\
& D_\theta B(\phi(\theta_r), \theta_r) = 0\\
& D_{\theta_r} B\ D_{\theta_r} \phi+D_{\theta_\pi} B = 0\\
& D_\theta\phi = -[D_\phi h]^{-1}\ D_{\theta_\pi} B\\
\end{align*}
$\hbar$
\begin{align*}
&\text{maximize}\enspace E[\enspace\text{J}(\hat{r}, \theta)\enspace] \\
&\text{subject to}\enspace \text{BellmanOptimality}(\hat{r}, \theta)=0\\
\end{align*}
And the updates by following IFT are equivalent by lagrangian multipliers:
$\theta_{k+1} = \theta_k + \frac{\partial\phi(\theta)}{\partial \theta}$
with $\frac{\partial\phi(x)}{\partial \theta}=-{\frac{\partial h(\cdot)}{\partial x}}^{-1}\frac{\partial h(\cdot)}{\partial \theta}$
Solving a RL problem with options
$\max_v J(r)$
$\max_v J(r)$
# Experiments
## Train environments
```
wwwwwwwwwwwww
w w w
w w w
w w
w w w
w w w
ww wwww w
w www www
w w w
w w w
w w
w w w
wwwwwwwwwwwww
```
4 rooms, train on 80% goal positions, test on 20%
## Other stuff
Each timestep when using a policy with options the state changes as
$s_1 = P(s_1 | P_{\pi, o}, s_0)$
$P_{\pi, o} = P \cdot \pi_o = P \cdot O \cdot \pi$
where $O$ is some made up option policy function that looks like:
$$O =
\begin{bmatrix}
1 & 1 & 1\\
2 & 2 & 2\\
1 & 2 & 1
\end{bmatrix}
$$
each row is an option (row 1, 2 are degenerate ones) and columns are actions
ideally O would be 3d, SxAxO so that each row is a policy
Usually:
$P_{\pi, o} = P \cdot \boxed{\pi_o} = P \cdot \boxed{O \cdot \pi}$
What if...
$P_{\pi, o} = \boxed{P \cdot O} \cdot \pi = P_O \cdot \pi$
$$P_O = P \cdot
\begin{bmatrix}
1 & 1 & 1\\
2 & 2 & 2\\
& \pi_1 \\
& \pi_2 \\
\end{bmatrix}
$$
Our loss can be stuff like
$KL(\pi_0, \pi_1)$ so that the options are different
-->
More references:
[Continuous Case](https://en.wikipedia.org/wiki/Kolmogorov_equations_(Markov_jump_process))
How does [Nachum, Near-Optimal](https://arxiv.org/pdf/1810.01257.pdf) relate to option gradient?
[Linear options](https://web.eecs.umich.edu/~baveja/Papers/linoptAAMAS.pdf) has a similar view for the linear case.
Compare with [Distral](https://arxiv.org/abs/1707.04175)
[Learning Intrinsic Rewards as a Bi-Level Optimization Problem](http://www.auai.org/uai2020/proceedings/66_main_paper.pdf)
Same but with no IFT
https://drive.google.com/file/d/1xsiub90zIWn_qQHzUEmKQ2G-ReXI9Bia/view
What is happening here could very well be the envelope theorem rather than the IFT;
in general, review all that value-function stuff:
http://www.pitt.edu/~luca/ECON2001/lecture_13.pdf
[^1]: Implicit Function Theorem https://who.rocq.inria.fr/Jean-Charles.Gilbert/preprint/08+ad-iterative-processes.pdf
[^2]: In the PL thesis it's written as:
$P^\lambda_\pi = d_{\pi, \lambda} \enspace \gamma (P_\pi - P_{\pi, \lambda})$
and since $(P_\pi - P_{\pi, \lambda}) = P_\pi - P_\pi \lambda = P_\pi (I - \lambda) = P_\pi\beta$, this is
$P^\lambda_\pi = d_{\pi, \lambda} \enspace \gamma P_\pi(I-\lambda) = \gamma\, d_{\pi,\lambda} P^\beta_\pi$ (the thesis folds the discount $\gamma$ into the model, whereas here it is kept explicit).