---
tags: MDP Gradient
---

# Planning Option Gradient

[Up](/UGEvKdcPT0uopCxjKCoM-g)

## (Intrinsic) Reward gradient [(I)RG](/97qXRqidSMK7xFjGzRi0PA)

Given the extrinsic MDP $\color{red}{M_{e}} := <P, \color{red}r, \gamma, S, A, s_0>$, where $s_0$ is the initial state distribution and $\gamma<1$.

We want to find the optimal $\theta$ parametrizing an intrinsic reward $r_\theta$ whose optimal policy $\hat\pi$ maximizes the extrinsic return:

$$
\arg \max_{r_\theta} \enspace V_{\hat\pi}(s_0) = \operatorname E\left[\sum_{t=0}^\infty \gamma^t \color{red}r_t\mid s_0 = s, a \sim \hat\pi\right]
$$

An alternative formulation for the tabular case, by Neumann series convergence:

$$
\arg \max_{r_\theta} \enspace (I - \gamma P_\hat\pi)^{-1} \color{red}r_\hat\pi
$$

Here $r_\theta$ does not appear explicitly in the objective, but let's assume there is an unknown function

$$\phi_{P, \gamma, S,A, s_0}(\color{green}{r_\theta})=\phi(r_\theta)\rightarrow \hat\pi$$

This function takes as input the MDP $\color{green}{M_\theta} := <P, \color{green}{r_\theta}, \gamma, S, A, s_0>$ and returns an optimal policy for that MDP. The new objective function is:

$$V_{\hat\pi} = [I - \gamma (P\cdot \phi(\color{green}{M_\theta}))\enspace]^{-1} (r \cdot \phi(\color{green}{M_\theta})) $$

To find a gradient we need the gradient of $V$ w.r.t. $r_\theta$:

$$\frac{\partial V_{\hat\pi}}{\partial r_\theta} = \frac{\partial V_{\hat\pi}}{\partial \phi} \cdot \frac{\partial \phi}{\partial r_\theta} $$

In our case $\phi(\color{green}{M_\theta})$ is a differentiable planner, more concretely smooth value iteration, which applies the smooth Bellman operator

$$T Q := r_\theta + \gamma P\enspace\text{softmax}_\tau(Q)$$

to a vector $Q$ a finite number of times $N<\infty$, up to $\epsilon$-convergence, i.e. $Q^\star = T(T(T(\dots T(Q))))$, and $\hat\pi=\text{softmax}(Q^\star)$.

### Implicit Function Theorem (IFT), [differentiation and iterative processes](https://who.rocq.inria.fr/Jean-Charles.Gilbert/preprint/08+ad-iterative-processes.pdf)

This iterative process converges when the Bellman error is equal to 0. The state-wise Bellman error is:

$BE(\pi, r_\theta) := r_{\hat\pi} + \gamma P_\hat\pi V_\hat\pi - V_\hat\pi$

$BE(\pi, r_\theta) := r_{\hat\pi} + (\gamma P_\hat\pi - I) V_\hat\pi$

By definition the implicit function $\phi(r_\theta)$ maps a reward $r_\theta$ to an optimal policy $\hat\pi^\star$ that maximizes $r_\theta$; this means that $BE(\hat\pi^\star, r_\theta)=0$.

Making the policy explicit in the Bellman optimality condition $B_\pi(V)$, the IFT (Dini, Pisa) [^1] requires that:

1. We have an equilibrium $BE(\pi, r_\theta) = 0$ around $(\pi^\star, r_\theta)$ :heavy_check_mark:
2. $BE$ is continuously differentiable (thanks to softmax) :heavy_check_mark:
3. $\frac{\partial BE}{\partial \pi} (\pi^\star, r_\theta)$ is invertible :question:

Then there exists a function $\phi : r_\theta \mapsto \pi_{r_\theta}$ and its gradient is:

$$
\underset{\hat{\pi}\times r_\theta}{\frac{\partial \phi}{\partial r_\theta}(r_\theta)} = -\underset{\hat\pi\times s}{\left[ \frac{\partial BE}{\partial \pi} (\pi^\star, r_\theta) \right ]^{-1} } \underset{s\times r_\theta}{\frac{\partial BE}{\partial r_\theta}}
$$

To better understand what's going on, the graph shows isolevels of the Bellman error for a 1-state, 1-continuous-action MDP.

![Bellman Error](https://i.imgur.com/Pccnxuv.jpg)

Given a reward function $r_{\theta,0}$ and an initial policy $\pi_0 = \text{softmax}(Q_0)$, the $\phi$ function returns one of the $\pi^\star$ policies that are roots of the Bellman error.
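Below is a minimal numpy sketch of the smooth value-iteration planner $\phi$, assuming a tabular MDP with a transition tensor `P` of shape `(S, A, S)` and a reward table `r` of shape `(S, A)`, and reading $\text{softmax}_\tau(Q)$ as the soft (log-sum-exp) state value. The helper names and the tiny random MDP are illustrative, not part of the note.

```python
import numpy as np

def softmax(Q, tau=1.0):
    """Row-wise softmax with temperature tau: the soft-greedy policy."""
    z = (Q - Q.max(axis=1, keepdims=True)) / tau
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def soft_value(Q, tau=1.0):
    """Soft maximum over actions: tau * logsumexp(Q / tau), one value per state."""
    m = Q.max(axis=1)
    return m + tau * np.log(np.exp((Q - m[:, None]) / tau).sum(axis=1))

def smooth_value_iteration(P, r, gamma=0.9, tau=1.0, eps=1e-8, max_iter=100_000):
    """Apply the smooth Bellman operator T Q = r + gamma * P softmax_tau(Q)
    until eps-convergence; return Q* and the soft-greedy policy softmax(Q*)."""
    Q = np.zeros_like(r)
    for _ in range(max_iter):
        Q_next = r + gamma * P @ soft_value(Q, tau)   # T Q, shape (S, A)
        if np.max(np.abs(Q_next - Q)) < eps:          # eps-convergence
            return Q_next, softmax(Q_next, tau)
        Q = Q_next
    return Q, softmax(Q, tau)

# Usage on a tiny random MDP.
rng = np.random.default_rng(0)
S, A = 5, 3
P = rng.dirichlet(np.ones(S), size=(S, A))            # P[s, a] is a distribution over s'
r = rng.normal(size=(S, A))
Q_star, pi_hat = smooth_value_iteration(P, r)
```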
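A hedged sketch of the implicit gradient itself. Instead of the policy-space Bellman error, the equilibrium here is written as the residual of the smooth operator, $F(Q, r_\theta) = TQ - Q = 0$ (the same fixed point), and $\frac{\partial Q^\star}{\partial r_\theta} = -\left[\frac{\partial F}{\partial Q}\right]^{-1}\frac{\partial F}{\partial r_\theta}$ is obtained by solving a linear system, with both Jacobians taken by finite differences for transparency; an analytic or autodiff Jacobian would replace them in practice. The residual choice, the helper names, and the final cross-check are assumptions on top of the note; $\frac{\partial \hat\pi}{\partial r_\theta}$ then follows by one more chain rule through $\text{softmax}(Q^\star)$.

```python
import numpy as np

def bellman_op(Q, r, P, gamma=0.9, tau=1.0):
    """Smooth Bellman operator T Q = r + gamma * P (tau * logsumexp(Q / tau))."""
    m = Q.max(axis=1)
    V = m + tau * np.log(np.exp((Q - m[:, None]) / tau).sum(axis=1))
    return r + gamma * P @ V

def solve_Q(r, P, gamma=0.9, tau=1.0, eps=1e-11):
    """Iterate T to its fixed point Q*, i.e. the root of F(Q, r) = T Q - Q."""
    Q = np.zeros_like(r)
    while True:
        Q_next = bellman_op(Q, r, P, gamma, tau)
        if np.max(np.abs(Q_next - Q)) < eps:
            return Q_next
        Q = Q_next

def ift_gradient(r, P, gamma=0.9, tau=1.0, h=1e-5):
    """dQ*/dr = -[dF/dQ]^{-1} dF/dr at the equilibrium F(Q*, r) = 0,
    with both Jacobians approximated by central finite differences."""
    Q_star = solve_Q(r, P, gamma, tau)
    F = lambda Q, r_: bellman_op(Q, r_, P, gamma, tau) - Q   # residual, shape (S, A)
    n = r.size
    dF_dQ = np.zeros((n, n))
    dF_dr = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        d = e.reshape(r.shape)
        dF_dQ[:, j] = (F(Q_star + d, r) - F(Q_star - d, r)).ravel() / (2 * h)
        dF_dr[:, j] = (F(Q_star, r + d) - F(Q_star, r - d)).ravel() / (2 * h)
    return -np.linalg.solve(dF_dQ, dF_dr)                    # shape (S*A, S*A)

# Cross-check one column against a direct finite difference of the planner itself.
rng = np.random.default_rng(1)
S, A, j, h = 4, 2, 3, 1e-5
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.normal(size=(S, A))
dQ_dr = ift_gradient(r, P)
d = np.zeros(S * A)
d[j] = h
d = d.reshape(S, A)
fd = (solve_Q(r + d, P) - solve_Q(r - d, P)).ravel() / (2 * h)
assert np.allclose(dQ_dr[:, j], fd, atol=1e-4)
```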
![](https://i.imgur.com/ctOupD5.jpg)

Starting from a valid solution $(\hat\pi^\star, r_\theta)$, in an $\epsilon$-ball $BE$ can be approximated as:

$$BE(\hat\pi^\star, r_\theta)+\nabla BE(\pi, r) \mid_{(\hat\pi^\star, r_\theta)} \cdot\, [(\pi, r)- (\hat\pi^\star, r_\theta)]$$

We are interested only in the values that respect the equilibrium $BE=0$:

$$\underbrace{BE(\hat\pi^\star, r_\theta)}_0+\nabla BE(\pi, r)\mid_{(\hat\pi^\star, r_\theta)} \cdot [(\pi, r)- (\hat\pi^\star, r_\theta)] = 0$$

Since $\pi$ is implicitly defined by $\phi$, the chain rule applies:

$$\frac{\partial BE}{\partial \pi} \frac{\partial \phi}{\partial r_\theta} + \frac{\partial BE}{\partial r_\theta} = 0$$

$$\frac{\partial \phi}{\partial r_\theta}(r_\theta) = -\left [ \frac{\partial BE}{\partial \pi} (\pi^\star, r_\theta) \right ]^{-1} \frac{\partial BE}{\partial r_\theta} $$

## Multi-Step Models (Random Return Time)

### Single policy, Non-uniform decision points

In this setting the policy acts at every timestep, but control (and learning) is not returned at every timestep. We are given a continuation function $\lambda(s)\rightarrow[0,1]$ and a termination function $\beta(s) := 1-\lambda(s)$ describing the probability of returning control. Let's call *decision points* the times at which the agent selects an action.

Under a transition with policy $\pi$, the expected next state is

$$s^\prime=\underset{s \times s^\prime}{P_\pi}\cdot s$$

The probability of transitioning from $s$ to $s^\prime$ and continuing execution:

$$\underset{s \times s^\prime} {P_\pi^\lambda} := \underset{s \times s^\prime}{P_\pi} \enspace \underset{s^\prime}\lambda$$

The probability of transitioning from $s$ to $s^\prime$ and returning control (geometric distribution, $p=\beta$):

$$\underset{s \times s^\prime}{P^\beta_\pi} := \underset{s \times s^\prime}{P_\pi} \enspace \underset{s^\prime}\beta$$

For a multi-step policy the *steady state discounted continuation distribution* is now:

$$I + \gamma P_\pi^\lambda +(\gamma P_\pi^\lambda)^2+(\gamma P_\pi^\lambda)^3+\dots = \sum_0^\infty (\gamma P_\pi^\lambda)^t = \underbrace{(I - \gamma P_\pi^\lambda)^{-1}}_{\text{Neumann series}}=\underset{s \times s^\prime}{d_\pi^\lambda}$$

Equivalently, for a multi-step policy the *steady state discounted termination distribution* is now:

$$(\gamma P_{\pi}^\beta)+(\gamma P_\pi^\lambda)(\gamma P_\pi^\beta)+(\gamma P_\pi^\lambda)^2(\gamma P_\pi^\beta)+\dots = \Big[\sum_0^\infty (\gamma P_\pi^\lambda)^{t}\Big]\,\gamma P_\pi^\beta$$
$$= \underbrace{(I - \gamma P_\pi^\lambda)^{-1}}_{\text{Neumann series}}\gamma P_\pi^\beta = \gamma\, d_\pi^\lambda P_\pi^\beta = \gamma \underset{s \times s^\prime}{d_\pi^\beta}$$
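A small numpy check of the two distributions just derived, and of the value decomposition they feed into just below, assuming a fixed policy summarized by a state-to-state matrix `P_pi`, a per-state reward `r_pi`, and a per-state continuation probability `lam`; the names and the random example are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
S, gamma = 6, 0.9
P_pi = rng.dirichlet(np.ones(S), size=S)    # state-to-state transition matrix under pi
r_pi = rng.normal(size=S)                   # expected one-step reward per state under pi
lam = rng.uniform(0.1, 0.9, size=S)         # continuation probability lambda(s')
beta = 1.0 - lam                            # termination probability beta(s')

P_lam = P_pi * lam                          # transition AND continue:  P_pi diag(lambda)
P_beta = P_pi * beta                        # transition AND terminate: P_pi diag(beta)
I = np.eye(S)

# Execution distribution: d^lambda = sum_t (gamma P^lambda)^t = (I - gamma P^lambda)^{-1}.
d_lam = np.linalg.inv(I - gamma * P_lam)
series = sum(np.linalg.matrix_power(gamma * P_lam, t) for t in range(500))
assert np.allclose(d_lam, series)

# Decision distribution: d^beta = d^lambda P^beta (the termination series above is gamma d^beta).
d_beta = d_lam @ P_beta

# The ordinary value of pi satisfies the multi-step decomposition v = d^lambda r_pi + gamma d^beta v.
v = np.linalg.solve(I - gamma * P_pi, r_pi)
assert np.allclose(v, d_lam @ r_pi + gamma * d_beta @ v)
```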
Rename the terms:

$$\begin{align} \text{execution distribution:} && \underset{s \times s^\prime}{d^\lambda} := d_{\pi, \lambda}\\ \text{decision distribution:} && \underset{s \times s^\prime}{d^\beta} := d_{\pi, \beta} \end{align}$$

A multi-step value function is now:

$V^\lambda_\pi = \underbrace{d^\lambda \enspace r_\pi}_{\text{execution V}} + \enspace \underbrace{\gamma \enspace d^\beta \cdot V_\pi}_{\text{future V}}$

(aside) This is almost equivalent to a classical value function pre-multiplied by an affine matrix: $V^\lambda_\pi = d^\lambda (r_\pi + \gamma P^\beta_\pi \cdot V_\pi )$

The reward model for a *decision point*:

$$ \underset{s \times a}{r^\lambda}= \underset{s \times s^\prime}{d^\lambda} \enspace \underset{s \times a}{r} $$

And the *decision point* transition model:

$$ \underset{s \times s^\prime}{P^\lambda_\pi}= \gamma \underset{s \times s^\prime}{d^\beta}= \underset{s \times s^\prime}{d_\pi^\lambda} \enspace \gamma \underset{s \times s^\prime}{P_\pi^\beta} $$ [^2]

The generalized policy evaluation equations for $\lambda$-models are then:

\begin{align} v_\pi &= r^\lambda_\pi \enspace + \enspace P^\lambda_\pi v_\pi \\ & = d_{\pi,\lambda} \enspace (r_\pi \enspace + \enspace \gamma P_\pi \beta\, v_\pi ) \end{align}

The smooth multi-step value iteration operator is now:

$$T Q: r^\lambda + P^\lambda_\pi\enspace\text{softmax}_\tau(Q)$$

applied to a vector $Q$ until convergence, as before.

### Option-models

Let's describe an option $\mathcal{O}$ as the tuple $<\pi^o,\beta^o>$, and a multi-step policy $\mu: S\rightarrow A \cup \mathcal{O}$ over options $\{O^{0},..,O^{K}\}$.

### Single Option, No primitives $A=\emptyset, \mathcal{O}=\{O^0\}$

In this case $\mu$ coincides with $\pi$ from earlier, and by replacing terms we get:

\begin{align} & \pi^o=\pi \\ & \beta^o=1-\lambda \\ & d^o=d_{\pi^o,\lambda} \\ \end{align}

\begin{align} v=v_\mu=r^o \enspace + \enspace P^o v_\mu \\ \end{align}

### Only Options $A=\emptyset$

$r^\mathcal{O}$ is now a $K\times1$ vector where each element is $r^{o^i}$, and $P^\mathcal{O}$ is a $K\times S$ matrix.

\begin{align} v_\mu=r^\mathcal{O}_\mu \enspace + \enspace P^\mathcal{O}_\mu v_\mu \\ \end{align}

From the point of view of $\mu$, choosing an option is equivalent to following a multi-step model where at every decision point we pick among the options as if they were actions.

**TL;DR** Reward Model: $\underset{s\times a}{r^\mathcal{O}}=$

## Multi-Option Multi-Intrinsic Motivation Single-Task

===THIS IS TODO===

This paper has some explicit help? [Approximate Value Iteration with Temporally Extended Actions](https://www.jair.org/index.php/jair/article/view/10947/26080)

We now have two policies: $\pi_\mu = \text{softmax}(\mu)$, a policy over options that maximizes $r$, and a policy over primitives $\pi^o_\theta$ that maximizes $r_\theta$.

The higher layer of abstraction is similar:

$$J_{\nu} = (I - \gamma P_{\pi_\mu})^{-1} r_{\pi_\mu} $$

Expanded as:

$$= [I - \gamma (P \cdot \pi_\mu)]^{-1} (r \cdot \pi_\mu) $$

where:

$$\pi_\mu = f(r, P, \pi_\theta)$$

The planner is now parametrized by the option policy

$$\pi_{r_\theta} = f(\theta, P)$$

which uses the same planner as the intrinsic reward objective from earlier:

$$J^o_\tau = [I - \gamma (P\cdot f_{P}(\theta))\enspace]^{-1} (r \cdot f_{P}(\theta)) $$

and has the same gradient as earlier:
$$\frac{\partial J^o_\tau}{\partial\theta}$$

Now by the chain rule:

$$\frac{\partial J_\nu}{\partial\theta}=\frac{\partial J_\nu}{\partial\mu}\cdot\frac{\partial \mu}{\partial\theta}$$

TODO: replace with $\pi_{r_\theta}$

<!--

Given P and $\pi^o$, a policy that continues with prob $\lambda(s, s^\prime)$:

The infinite time horizon state distribution is $(I - \gamma P_{\pi, o, \#})^{-1}$

The discounted weighting of states $d^T_{\alpha,\gamma,\pi}= \alpha^T(I-\gamma P_\pi)^{-1}$

Since this is a stationary distribution, we can drop the $\alpha$

The option reward model: $r^o = d^o r^o_{\pi}$ Expected discounted return until termination

The option transition model: $P_{\pi,o}=d_{\alpha,\gamma,\pi} \enspace \gamma(P^o_\pi - P^o_{\pi,\#})$ Expectation over the next state upon termination. $(P^o_\pi - P^o_{\pi,\#})$ represents the probability of transitioning and terminating upon entering the **next** state

Where $r^o_{\pi}=r^o=\pi^o \cdot r$ and $P_{\pi,o,\#}=P^o_\lambda=\pi^o \cdot P \cdot \lambda$

Let's define a $\lambda$-reward model: $r^o = \sum (\gamma P^o_{\pi})^t r^o$

$\lambda$ transition model: $P^o = \sum (\gamma P_\pi^o)^t (\gamma P_\pi - \gamma P_\pi^o)$

By Neumann series convergence $V^o = (I - \gamma P^o)^{-1}r_\pi$ and $P^o = (I - \gamma P^o)^{-1} \gamma(P_\pi - P_\pi^o)$

-->

<!--

Given MDP and $\pi$, an option defined by $<\pi^o, \#^o_s>$:

$r^o=\sum (\gamma P_{\#, o})^t r_{\pi^o}$

$P_\#: s \times s^\prime \rightarrow [0, 1]$ is the prob for the option to continue (not terminate) upon entering $s^\prime$. $\# = 1 - \beta$

$P_{\pi^o,\#}=\pi^o \cdot P \cdot \lambda = P_{\pi^o} \lambda$ here $\lambda: s \times s^\prime \rightarrow [0, 1]$ is a ~~termination~~ continuation matrix

Option termination probability: $P^o=P \cdot (1 - \beta)$

## Notes

Solving a RL problem $\max_v J(r)$

Equilibrium: $v^\star = \text{lse}_{\pi} r + \gamma P_\pi v^\star_\pi$

The dependence chain is: $\theta \mapsto \hat{r} \mapsto \mathcal{O} \mapsto \pi$

\begin{align*} &\text{maximize}\enspace E[\ \log P_0(S_0) + \sum^{T-1}_{t=0}\ \log\ \pi_{\mathcal{O}}P] \\ &\text{subject to}\enspace f(x, \theta)=0\\ \end{align*}

$v=(I - \gamma P_{\pi_h})^{-1}r_\pi$

For us, we want to find the intrinsic reward $\hat{r}$ that maximizes the J function. The RL problem has a valid solution when: $B(\theta_r, \theta_\pi) = 0$

Where B is a function that goes:
1. $\mathcal{O} = \phi(\theta_r, P)$
2. $\theta_\pi = \psi(r, \mathcal{O}, P)$

There is an implicit function that maps $\phi: \theta_r \mapsto \{\theta_{\pi_1}, \theta_{\pi_2}, \theta_{\pi_3}, \theta_{\pi_4}\}$, this $\phi$ is our $\text{intrinsic phase}$ with a fixed number of options.

There is another implicit function that maps $\psi:\{\theta_{\pi_1}, \theta_{\pi_2}, \theta_{\pi_3}, \theta_{\pi_4}\} \mapsto {\theta_\mathcal{O}}$, this $\psi$ is our $\text{extrinsic phase}$ with a fixed number of options.
Start at $B=0$. The gradient along the equilibrium is:

\begin{align*} & D_\theta B(\theta_\pi, \theta_r) = 0\\ & D_\theta B(\phi(\theta_r), \theta_r) = 0\\ & D_{\theta_r} B\ D_{\theta_r} \phi+D_{\theta_\pi} B = 0\\ & D_\theta\phi = -[D_\phi h]^{-1}\ D_{\theta_\pi} B\\ \end{align*}

\begin{align*} &\text{maximize}\enspace E[\enspace\text{J}(\hat{r}, \theta)\enspace] \\ &\text{subject to}\enspace \text{BellmanOptimality}(\hat{r}, \theta)=0\\ \end{align*}

And the updates by following the IFT are equivalent by Lagrangian multipliers: $\theta_{k+1} = \theta_k + \frac{\partial\phi(\theta)}{\partial \theta}$ with $\frac{\partial\phi(x)}{\partial \theta}=-{\frac{\partial h(\cdot)}{\partial x}}^{-1}\frac{\partial h(\cdot)}{\partial \theta}$

Solving a RL problem with options: $\max_v J(r)$

# Experiments

## Train environments

```
wwwwwwwwwwwww
w     w     w
w     w     w
w           w
w     w     w
w     w     w
ww wwww     w
w     www www
w     w     w
w     w     w
w           w
w     w     w
wwwwwwwwwwwww
```

4 rooms, train on 80% goal positions, test on 20%

## Other stuff

Each timestep when using a policy with options the state changes as $s_1 = P(s_1 | P_{\pi, o}, s_0)$

$P_{\pi, o} = P \cdot \pi_o = P \cdot O \cdot \pi$ where $O$ is some made up option policy function that looks like:

$$O = \begin{bmatrix} 1 & 1 & 1\\ 2 & 2 & 2\\ 1 & 2 & 1 \end{bmatrix} $$

each row is an option (rows 1, 2 are degenerate ones) and columns are actions; ideally O would be 3d, SxAxO, so that each row is a policy

Usually: $P_{\pi, o} = P \cdot \boxed{\pi_o} = P \cdot \boxed{O \cdot \pi}$

What if... $P_{\pi, o} = \boxed{P \cdot O} \cdot \pi = P_O \cdot \pi$

$$P_O = P \cdot \begin{bmatrix} 1 & 1 & 1\\ 2 & 2 & 2\\ & \pi_1 \\ & \pi_2 \\ \end{bmatrix} $$

Our loss can be stuff like $KL(\pi_0, \pi_1)$ so that the options are different

-->

More references:

- [Continuous Case](https://en.wikipedia.org/wiki/Kolmogorov_equations_(Markov_jump_process))
- How does [Nachum, Near-Optimal](https://arxiv.org/pdf/1810.01257.pdf) relate to the option gradient?
- [Linear options](https://web.eecs.umich.edu/~baveja/Papers/linoptAAMAS.pdf) has a similar view for the linear case.
- Compare with [Distral](https://arxiv.org/abs/1707.04175)
- [Learning Intrinsic Rewards as a Bi-Level Optimization Problem](http://www.auai.org/uai2020/proceedings/66_main_paper.pdf)
- Same but with no IFT: https://drive.google.com/file/d/1xsiub90zIWn_qQHzUEmKQ2G-ReXI9Bia/view
- What is happening here could very well be the envelope theorem, not IFT; in general, review all that value-function stuff: http://www.pitt.edu/~luca/ECON2001/lecture_13.pdf

[^1]: Implicit Function Theorem https://who.rocq.inria.fr/Jean-Charles.Gilbert/preprint/08+ad-iterative-processes.pdf

[^2]: In PL thesis it's written as $P^\lambda_\pi = d_{\pi, \lambda} \enspace \gamma (P_\pi - P_{\pi, \lambda})$, but $(P_\pi - P_{\pi, \lambda}) = P_\pi - P_\pi \lambda = P_\pi (I - \lambda) = P_\pi\beta$, so $P^\lambda_\pi = d_{\pi, \lambda} \enspace \gamma P_\pi(I-\lambda)$.