---
tags: MDP Gradient
---
# Planning Option Gradient
[Up](/UGEvKdcPT0uopCxjKCoM-g)
## (Intrinsic) Reward gradient [(I)RG](/97qXRqidSMK7xFjGzRi0PA)
Given the extrinsic MDP $\color{red}{M_{e}} := <P, \color{red}r, \gamma, S, A, s_0>$,
where $s_0$ is the initial state distribution and $\gamma<1$.
We want to find the optimal $\theta$ parametrizing an intrinsic reward $r_\theta$ such that the policy obtained by planning on $r_\theta$ maximizes the extrinsic return:
$$
\arg \max_{r_\theta} \enspace
V_{\hat\pi}(s_0) = \operatorname E\left[\sum_{t=0}^\infty \gamma^t \color{red}r_t\mid s_0 = s, a \sim \hat\pi\right]
$$
An equivalent formulation for the tabular case, by Neumann-series convergence:
$$
\arg \max_{r_\theta} \enspace (I - \gamma P_\hat\pi)^{-1} \color{red}r_\hat\pi
$$
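As a concrete reference, here is a minimal sketch of evaluating this tabular objective, assuming a transition tensor `P` of shape `(S, A, S')`, a reward matrix `r` of shape `(S, A)` and a policy matrix `pi` (all names are illustrative):
```python
import jax.numpy as jnp

def policy_value(P, r, pi, gamma=0.99):
    """Tabular V_pi = (I - gamma P_pi)^{-1} r_pi, evaluated as a linear solve."""
    P_pi = jnp.einsum('sap,sa->sp', P, pi)   # state-to-state transitions under pi
    r_pi = jnp.sum(pi * r, axis=-1)          # expected one-step reward under pi
    return jnp.linalg.solve(jnp.eye(P.shape[0]) - gamma * P_pi, r_pi)
```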
The intrinsic reward $r_\theta$ does not explicitly appear in this objective, but let's assume there is an unknown function
$$\phi_{P, \gamma, S,A, s_0}(\color{green}{r_\theta})=\phi(r_\theta)\rightarrow \hat\pi$$
This function takes as input the MDP $\color{green}{M_\theta} := <P, \color{green}{r_\theta}, \gamma, S, A, s_0>$ and returns an optimal policy for that MDP.
The new objective function is:
$$V_{\hat\pi} = [I - \gamma (P\cdot \phi(\color{green}{M_\theta}))\enspace]^{-1} (r \cdot \phi(\color{green}{M_\theta})) $$
To find a gradient we will need a gradient of $V$ w.r.t. $r_\theta$:
$$\frac{\partial V_{\hat\pi}}{\partial r_\theta} = \frac{\partial V_{\hat\pi}}{\partial \phi} \cdot \frac{\partial \phi}{\partial r_\theta} $$
In our case $\phi(\color{green}{M_\theta})$ is a differentiable planner, more concretely smooth value iteration, which applies the smooth Bellman backup operator
$$T Q := r_\theta + \gamma P\enspace\text{softmax}_\tau(Q)$$ to a vector $Q$ a finite number of times $N<\infty$, until $\epsilon$-convergence,
i.e.
$Q^\star = T(T(T(\dots T(Q))))$
And
$\hat\pi=\text{softmax}_\tau(Q^\star)$
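A minimal sketch of such a planner $\phi$, under the same tabular assumptions as above and with one common reading of the operator ($\text{softmax}_\tau(Q)$ acts as a policy and the backup takes the expected $Q$ under it); the temperature `tau` and the $\epsilon$ threshold are illustrative:
```python
import jax
import jax.numpy as jnp

def smooth_bellman(Q, r_theta, P, gamma, tau):
    """T Q = r_theta + gamma * P * softmax_tau(Q): expected next-state value
    under the soft policy induced by Q."""
    pi = jax.nn.softmax(Q / tau, axis=-1)             # (S, A) soft policy
    V = jnp.sum(pi * Q, axis=-1)                      # (S,)   soft state value
    return r_theta + gamma * jnp.einsum('sap,p->sa', P, V)

def smooth_value_iteration(r_theta, P, gamma=0.99, tau=0.1, eps=1e-6, max_iters=10_000):
    """Apply T a finite number of times, until epsilon-convergence."""
    Q = jnp.zeros_like(r_theta)
    for _ in range(max_iters):
        Q_next = smooth_bellman(Q, r_theta, P, gamma, tau)
        if jnp.max(jnp.abs(Q_next - Q)) < eps:
            return Q_next, jax.nn.softmax(Q_next / tau, axis=-1)
        Q = Q_next
    return Q, jax.nn.softmax(Q / tau, axis=-1)        # hat{pi} = softmax_tau(Q*)
```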
### Implicit Function Theorem (IFT), [differentiation and iterative processes](https://who.rocq.inria.fr/Jean-Charles.Gilbert/preprint/08+ad-iterative-processes.pdf)
This iterative process converges when the Bellman error is equal to 0; the state-wise Bellman error is:
$BE(\hat\pi, r_\theta) := r_{\hat\pi} + \gamma P_{\hat\pi} V_{\hat\pi} - V_{\hat\pi}$
$BE(\hat\pi, r_\theta) := r_{\hat\pi} + (\gamma P_{\hat\pi} - I) V_{\hat\pi}$
By definition the implicit function $\phi(r_\theta)$ maps a reward $r_\theta$ to an optimal policy $\hat\pi^\star$ that maximizes $r_\theta$; this means that $BE(\hat\pi^\star, r_\theta)=0$.
Here the policy has been made explicit in the Bellman optimality condition $B_\pi(V)$.
IFT (Dini, Pisa) [^1] requires that:
1. We have an equilibrium $BE(\hat\pi^\star, r_\theta) = 0$ around $(\hat\pi^\star, r_\theta)$ :heavy_check_mark:
2. $BE$ is continuously differentiable (thanks to the softmax) :heavy_check_mark:
3. $\frac{\partial BE}{\partial \pi} (\hat\pi^\star, r_\theta)$ is invertible :question:
Then there exists a function $\phi : r_\theta \mapsto \pi_{r_\theta}$
and its gradient is:
$$ \underset{\hat{\pi}\times r_\theta}{\frac{\partial \phi}{\partial r_\theta}(r_\theta)} = -\underset{s\times \hat\pi}{\left[ \frac{\partial BE}{\partial \pi} (\hat\pi^\star, r_\theta) \right ]^{-1} } \underset{s\times r_\theta}{\frac{\partial BE}{\partial r_\theta}} $$
To better understand what's going on, the graph shows isolevels of the Bellman error for a 1-state, 1-continuous-action MDP.

Given a reward function $r_{\theta,0}$ and an initial policy $\pi_0 = \text{softmax}(Q_0)$, the $\phi$ function returns one of the $\pi^\star$ policies that are roots of the Bellman error.

Starting from a valid solution $(\hat\pi^\star, r_\theta)$, in an $\epsilon$-ball $BE$ can be approximated to first order as:
$$BE(\pi, r)\approx BE(\hat\pi^\star, r_\theta)+\nabla BE(\pi, r) \mid_{(\hat\pi^\star, r_\theta)} \cdot\, [(\pi, r)- (\hat\pi^\star, r_\theta)]$$
We are interested only in the values that respect the equilibrium $BE=0$:
$$\underbrace{BE(\hat\pi^\star, r_\theta)}_0+\nabla BE(\pi, r)\mid_{(\hat\pi^\star, r_\theta)} \cdot [(\pi, r)- (\hat\pi^\star, r_\theta)] = 0$$
Since $\pi$ is implicitly defined by $\phi$ the chain rule applies:
$$\frac{\partial BE}{\partial \pi} \frac{\partial \phi}{\partial r_\theta} + \frac{\partial BE}{\partial r_\theta} = 0$$
$$\frac{\partial \phi}{\partial r_\theta}(r_\theta) = -\left [ \frac{\partial BE}{\partial \pi} (\pi^\star, r_\theta) \right ]^{-1} \frac{\partial BE}{\partial r_\theta} $$
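A sketch of this gradient, reusing `smooth_bellman` and `smooth_value_iteration` from the sketch above. For simplicity the equilibrium is written in $Q$-space, $F(Q, r_\theta) := TQ - Q = 0$, rather than directly over $\hat\pi$; the IFT recipe is the same, and $\partial\hat\pi/\partial r_\theta$ follows by chaining through the softmax:
```python
def planner_gradient(r_theta, P, gamma=0.99, tau=0.1):
    """d(hat_pi)/d(r_theta) via the implicit function theorem at Q = Q*."""
    Q_star, _ = smooth_value_iteration(r_theta, P, gamma, tau)
    S, A = r_theta.shape

    def residual(q_flat, r_flat):
        Q, r = q_flat.reshape(S, A), r_flat.reshape(S, A)
        return (smooth_bellman(Q, r, P, gamma, tau) - Q).reshape(-1)  # F(Q, r)

    q, r = Q_star.reshape(-1), r_theta.reshape(-1)
    dF_dQ = jax.jacobian(residual, argnums=0)(q, r)       # (SA, SA)
    dF_dr = jax.jacobian(residual, argnums=1)(q, r)       # (SA, SA)
    dQ_dr = -jnp.linalg.solve(dF_dQ, dF_dr)               # IFT: -[dF/dQ]^{-1} dF/dr

    softmax_flat = lambda q_: jax.nn.softmax(q_.reshape(S, A) / tau, axis=-1).reshape(-1)
    dpi_dQ = jax.jacobian(softmax_flat)(q)                # (SA, SA)
    return dpi_dQ @ dQ_dr                                 # chain rule to d(pi)/d(r_theta)
```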
## Multi-Step Models (Random Return Time)
### Single policy, Non-uniform decision points
In this setting the policy acts at every timestep, but control (and learning) is not returned at every timestep.
Given a continuation function $\lambda: S\rightarrow[0,1]$ and a termination function $\beta(s) := 1-\lambda(s)$ describing the probability of returning control,
let's call *decision points* the times at which the agent selects an action.
We know that under a policy $\pi$ the expected next state after one transition is
$$s^\prime=\underset{s \times s^\prime}{P_\pi}\cdot s$$
The probability of transitioning from $s$ to $s^\prime$ and continuing execution:
$$\underset{s \times s^\prime} {P_\pi^\lambda} := \underset{s \times s^\prime}{P_\pi} \enspace \underset{s^\prime}\lambda$$
The probability of transitioning from $s$ to $s^\prime$ and returning control (geometric distribution p=$\beta$):
$$\underset{s \times s^\prime}{P^\beta_\pi} := \underset{s \times s^\prime}{P_\pi} \enspace \underset{s^\prime}\beta$$
For a multi-step policy the *steady state discounted continuation distribution* is now (the $t=0$ term is kept so that $d_\pi^\lambda$ also accounts for the immediate step):
$$I + \gamma P_\pi^\lambda +(\gamma P_\pi^\lambda)^2+(\gamma P_\pi^\lambda)^3+\dots = \sum_{t=0}^\infty (\gamma P_\pi^\lambda)^t = \underbrace{(I - \gamma P_\pi^\lambda)^{-1}}_{\text{Neumann series}}=\underset{s \times s^\prime}{d_\pi^\lambda}$$
Equivalently, for a multi-step policy the *steady state discounted termination distribution* is now:
$$\gamma P_{\pi}^\beta+(\gamma P_{\pi}^\lambda)(\gamma P_\pi^\beta)+(\gamma P_\pi^\lambda)^2(\gamma P_\pi^\beta)+\dots = \Big(\sum_{t=0}^\infty (\gamma P_\pi^\lambda)^{t}\Big)\gamma P_\pi^\beta =$$
$$= \underbrace{(I - \gamma P_\pi^\lambda)^{-1}}_{\text{Neumann series}}\gamma P_\pi^\beta=\gamma\, d_\pi^\lambda P_\pi^\beta=\gamma \underset{s \times s^\prime}{d_\pi^\beta}$$
Rename the terms
$$\begin{align}
\text{execution distribution:} && \underset{s \times s^\prime}{d^\lambda} := d_{\pi, \lambda}\\
\text{decision distribution:} && \underset{s \times s^\prime}{d^\beta} := d_{\pi, \beta}
\end{align}$$
A multi-step value function is now:
$V^\lambda_\pi =
\underbrace{d^\lambda \enspace r_\pi}_{\text{execution V}}
+
\enspace
\underbrace{\gamma \enspace d^\beta \cdot V_\pi}_{\text{future V}}$
(aside) This is the classical one-step backup pre-multiplied by the matrix $d^\lambda$:
$V^\lambda_\pi = d^\lambda (r_\pi + \gamma P^\beta_\pi \cdot V_\pi )$
The reward model for a *decision point*:
$$
\underset{s \times a}{r^\lambda}=
\underset{s \times s^\prime}{d^\lambda}
\enspace
\underset{s \times a}{r}
$$
And the *decision point* transition model:
$$
\underset{s \times s^\prime}{P^\lambda_\pi}=
\underset{s \times s^\prime}{d^\beta}=
\underset{s \times s^\prime}{d_\pi^\lambda}
\underset{s \times s^\prime}{P_\pi^\beta}
$$ [^2]
The generalized policy evaluation equation for $\lambda$-models is then:
\begin{align}
& v_\pi=r^\lambda_\pi \enspace + \enspace \gamma P^\lambda_\pi v_\pi \\
& = d_{\pi,\lambda} \enspace (r_\pi \enspace + \enspace \gamma P^\beta_\pi v_\pi )
\end{align}
The smooth multi-step value iteration operator is now:
$$T Q := r^\lambda + \gamma P^\lambda_\pi\enspace\text{softmax}_\tau(Q)$$ applied to a vector $Q$ until $\epsilon$-convergence, as before.
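A sketch of these $\lambda$-model quantities under the same tabular assumptions, with `lam` a per-state continuation vector and $\gamma$ kept explicit as in this note (so the returned decision-point model is $d^\beta = d^\lambda_\pi P^\beta_\pi$):
```python
import jax.numpy as jnp

def lambda_model(P, r, pi, lam, gamma=0.99):
    """Decision-point reward and transition models for a fixed policy pi."""
    S = P.shape[0]
    P_pi = jnp.einsum('sap,sa->sp', P, pi)               # one-step transitions under pi
    r_pi = jnp.sum(pi * r, axis=-1)
    P_lam = P_pi * lam[None, :]                          # transition and keep executing
    P_beta = P_pi * (1.0 - lam)[None, :]                 # transition and return control
    d_lam = jnp.linalg.inv(jnp.eye(S) - gamma * P_lam)   # Neumann series
    r_lam = d_lam @ r_pi                                 # r^lambda_pi: reward model
    P_dec = d_lam @ P_beta                               # decision-point transition model (d^beta)
    return r_lam, P_dec, d_lam
```
As a sanity check, `jnp.linalg.solve(jnp.eye(S) - gamma * P_dec, r_lam)` recovers the ordinary $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$, which is exactly the generalized policy evaluation equation above.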
### Option-models
Let's describe an option $O$ as the tuple $<\pi^o,\beta^o>$,
and a multi-step policy $\mu: S\rightarrow A \cup \mathcal{O}$ over the set of options $\mathcal{O}=\{O^{0},..,O^{K}\}$.
### Single Option, No primitives $A=\emptyset, \mathcal{O}=\{O^0\}$
In this case $\mu$ coincides with $\pi$ from earlier, and by replacing terms we get:
\begin{align}
& \pi^o=\pi \\
& \beta^o=1-\lambda \\
& d^o=d_{\pi^o,\lambda} \\
\end{align}
\begin{align}
v=v_\mu=r^o \enspace + \enspace \gamma P^o v_\mu \\
\end{align}
### Only Options $A=\emptyset$
$r^\mathcal{O}$ now stacks the $K$ per-option reward models (each $r^{o^i}$ is an $S$-vector, giving an $S\times K$ matrix) and $P^\mathcal{O}$ stacks the $K$ per-option transition models $P^{o^i}$ (each $S\times S^\prime$)
\begin{align}
v_\mu=r^\mathcal{O}_\mu \enspace + \enspace \gamma P^\mathcal{O}_\mu v_\mu \\
\end{align}
From the point of view of $\mu$, choosing an option is equivalent to following a multi-step model where at every decision point we pick among the options as if they were primitive actions.
**TL;DR**
Reward model (one $\lambda$-reward model per option):
$\underset{s\times \mathcal{O}}{r^\mathcal{O}}=\big[\, d^{o^0} r_{\pi^{o^0}} \;\cdots\; d^{o^K} r_{\pi^{o^K}} \,\big]$
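A sketch of stacking the per-option models, reusing `lambda_model` from the sketch above and assuming each option is given as a pair `(pi_o, beta_o)` of a policy matrix and a per-state termination vector (names illustrative):
```python
import jax.numpy as jnp

def option_models(P, r, options, gamma=0.99):
    """Stack one lambda-model per option: r^O has shape (S, K), P^O has shape (K, S, S')."""
    r_O, P_O = [], []
    for pi_o, beta_o in options:
        r_o, P_o, _ = lambda_model(P, r, pi_o, 1.0 - beta_o, gamma)
        r_O.append(r_o)        # reward model of option o, shape (S,)
        P_O.append(P_o)        # decision-point transitions of option o, shape (S, S')
    return jnp.stack(r_O, axis=-1), jnp.stack(P_O, axis=0)
```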
## Multi-Option Multi-Intrinsic Motivation Single-Task
===THIS IS TODO===
This paper may have some explicit help:
[Approximate Value Iteration with Temporally Extended Actions](https://www.jair.org/index.php/jair/article/view/10947/26080)
We now have two policies: $\pi_\mu = \text{softmax}(\mu)$, a policy over options that maximizes $r$,
and a policy over primitives $\pi^o_\theta$ that maximizes $r_\theta$.
The higher layer of abstraction is similar:
$$J_{\nu} = (I - \gamma P_{\pi_\mu})^{-1} r_{\pi_\mu} $$
Expanded as:
$$= [I - \gamma (P \cdot \pi_\mu)]^{-1} (r \cdot \pi_\mu) $$
where:
$$\pi_\mu = f(r, P, \pi_\theta)$$
The planner is now parametrized by the option policy
$$\pi_{r_\theta} = f(\theta, P)$$
which has the same planner as the reward from earlier:
$$J^o_\tau = [I - \gamma (P\cdot f_{P}(\theta))\enspace]^{-1} (r \cdot f_{P}(\theta)) $$
and the same gradient as earlier.
$$\frac{\partial J^o_\tau}{\partial\theta}$$
now by chain rule:
$$\frac{\partial J_\nu}{\partial\theta}=\frac{\partial J_\nu}{\partial\mu}\cdot\frac{\partial \mu}{\partial\theta}$$
TODO: replace with $\pi_{r_\theta}$
<!--
Given P and $\pi^o$, a policy that continues with prob $\lambda(s, s^\prime)$:
The infinite time horizon state distribution is $(I - \gamma P_{\pi, o, \#})^{-1}$
The discounted weighting of states $d^T_{\alpha,\gamma,\pi}= \alpha^T(I-\gamma P_\pi)^{-1}$
Since this is a stationary distribution, we can drop the $\alpha$
The option reward model:
$r^o = d^o r^o_{\pi}$
Expected discounted return until termination
The option transition model:
$P_{\pi,o}=d_{\alpha,\gamma,\pi} \enspace \gamma(P^o_\pi - P^o_{\pi,\#})$
Expectation over the next state upon termination.
$(P^o_\pi - P^o_{\pi,\#})$ represents the probability of transitioning and terminating upon entering the **next** state
Where
$r^o_{\pi}=r^o=\pi^o \cdot r$
$P_{\pi,o,\#}=P^o_\lambda=\pi^o \cdot P \cdot \lambda$
a
b
c
d
e
f
g
<!--
Lets define a $\lambda$-reward model:
$r^o = \sum (\gamma P^o_{\pi})^t r^o$
$\lambda$ transition model:
$P^o = \sum (\gamma P_\pi^o)^t (\gamma P_\pi - \gamma P_\pi^o)$
By Neumman series converge
$V^o = (I - \gamma P^o)^{-1}r_\pi$
$P^o = (I - \gamma P^o)^{-1} \gamma(P_\pi - P_\pi^o)$
-->
<!--
Given MDP and $\pi$, an option defined by $<\pi^o, \#^o_s>$:
$r^o=\sum (\gamma P_{\#, o})^t r_{\pi^o}$
$P_\#: s \times s^\prime \rightarrow [0, 1]$ is the prob for the option to continue (not terminate) upon entering $s^\prime$.
$\# = 1 - \beta$
$P_{\pi^o,\#}=\pi^o \cdot P \cdot \lambda = P_{\pi^o} \lambda$
here $\lambda: s \times s^\prime \rightarrow [0, 1]$ is a ~~termination~~ continuation matrix
Option termination probability:
$P^o=P \cdot (1 - \beta)$
## Notes
Solving a RL problem
$\max_v J(r)$
Equilibrium:
$v^\star = \text{lse}_{\pi} r + \gamma P_\pi v^\star_\pi$
The dependnece chain is:
$\theta \mapsto \hat{r} \mapsto \mathcal{O} \mapsto \pi$
\begin{align*}
&\text{maximize}\enspace E[\ \log P_0(S_0) + \sum^{T-1}_{t=0}\ log\ \pi_{\mathcal{O}}P] \\
&\text{subject to}\enspace f(x, \theta)=0\\
\end{align*}
$v=(I - \gamma P_{\pi_h})^{-1}r_\pi$
For us, we want to find the intrisnic reward $\hat{r}$ that maximizes the J functionAnd the updates by following IFT are equivalent by lagrangian multipliers:
The RL problem has a valid solution when:
$B(\theta_r, \theta_\pi) = 0$
Where B is a function that goes:
1. $\mathcal{O} = \phi(\theta_r, P)$
2. $\theta_\pi = \psi(r, \mathcal{O}, P)$
There is an implicit function that maps $\phi: \theta_r \mapsto \{\theta_{\pi_1}, \theta_{\pi_2}, \theta_{\pi_3}, \theta_{\pi_4}, \}$, this $\phi$ is our $\text{intrinsic phase}$ with a fixed number of options.
There is another implicit function that maps $\psi:\{\theta_{\pi_1}, \theta_{\pi_2}, \theta_{\pi_3}, \theta_{\pi_4}, \} \mapsto {\theta_\mathcal{O}}$, this $\phi$ is our $\text{extrinsic phase}$ with a fixed number of options.
Start at $B=0$
The gradient along the equilibrium is:
\begin{align*}
& D_\theta B(\theta_\pi, \theta_r) = 0\\
& D_\theta B(\phi(\theta_r), \theta_r) = 0\\
& D_{\theta_r} B\ D_{\theta_r} \phi+D_{\theta_\pi} B = 0\\
& D_\theta\phi = -[D_\phi h]^{-1}\ D_{\theta_\pi} B\\
\end{align*}
$\hbar$
\begin{align*}
&\text{maximize}\enspace E[\enspace\text{J}(\hat{r}, \theta)\enspace] \\
&\text{subject to}\enspace \text{BellmanOptimality}(\hat{r}, \theta)=0\\
\end{align*}
And the updates by following IFT are equivalent by lagrangian multipliers:
$\theta_{k+1} = \theta_k + \frac{\partial\phi(\theta)}{\partial \theta}$
with $\frac{\partial\phi(x)}{\partial \theta}=-{\frac{\partial h(\cdot)}{\partial x}}^{-1}\frac{\partial h(\cdot)}{\partial \theta}$
Solving a RL problem with options
$\max_v J(r)$
$\max_v J(r)$
# Experiments
## Train environments
```
wwwwwwwwwwwww
w w w
w w w
w w
w w w
w w w
ww wwww w
w www www
w w w
w w w
w w
w w w
wwwwwwwwwwwww
```
4 rooms, train on 80% goal positions, test on 20%
## Other stuff
Each timestep when using a policy with options the state changes as
$s_1 = P(s_1 | P_{\pi, o}, s_0)$
$P_{\pi, o} = P \cdot \pi_o = P \cdot O \cdot \pi$
where $O$ is some made up option policy function that looks like:
$$O =
\begin{bmatrix}
1 & 1 & 1\\
2 & 2 & 2\\
1 & 2 & 1
\end{bmatrix}
$$
each row is an option (row 1, 2 are degenerate ones) and columns are actions
ideally O would be 3d, SxAxO so that each row is a policy
Usually:
$P_{\pi, o} = P \cdot \boxed{\pi_o} = P \cdot \boxed{O \cdot \pi}$
What if...
$P_{\pi, o} = \boxed{P \cdot O} \cdot \pi = P_O \cdot \pi$
$$P_O = P \cdot
\begin{bmatrix}
1 & 1 & 1\\
2 & 2 & 2\\
& \pi_1 \\
& \pi_2 \\
\end{bmatrix}
$$
Our loss can be stuff like
$KL(\pi_0, \pi_1)$ so that the options are different
-->
More references:
[Continuous Case](https://en.wikipedia.org/wiki/Kolmogorov_equations_(Markov_jump_process))
How does [Nachum, Near-Optimal](https://arxiv.org/pdf/1810.01257.pdf) relate to option gradient?
[Linear options](https://web.eecs.umich.edu/~baveja/Papers/linoptAAMAS.pdf) has a similar view for the linear case.
Compare with [Distral](https://arxiv.org/abs/1707.04175)
[Learning Intrinsic Rewards as a Bi-Level Optimization Problem](http://www.auai.org/uai2020/proceedings/66_main_paper.pdf)
Same but with no IFT
https://drive.google.com/file/d/1xsiub90zIWn_qQHzUEmKQ2G-ReXI9Bia/view
What is happening here could very well be the envelope theorem rather than the IFT;
in general, review all that value-function stuff:
http://www.pitt.edu/~luca/ECON2001/lecture_13.pdf
[^1]: Implicit Function Theorem https://who.rocq.inria.fr/Jean-Charles.Gilbert/preprint/08+ad-iterative-processes.pdf
[^2]: In the PL thesis it's written as:
$P^\lambda_\pi = d_{\pi, \lambda} \enspace \gamma (P_\pi - P_{\pi, \lambda})$
and since $(P_\pi - P_{\pi, \lambda}) = P_\pi - P_\pi \lambda = P_\pi (I - \lambda) = P_\pi\beta$, this is
$P^\lambda_\pi = d_{\pi, \lambda} \enspace \gamma P_\pi(I-\lambda) = \gamma\, d_{\pi,\lambda} P^\beta_\pi$ (the thesis folds the discount $\gamma$ into the model, whereas here it is kept explicit).