---
title : "The Option-Critic Architecture - Notes"
tags : "IvLabs, RL"
---

# The Option-Critic Architecture

Link to the [Research Paper](https://arxiv.org/abs/1609.05140)

{%pdf https://arxiv.org/pdf/1609.05140.pdf%}

Policy Gradient theorems for options

# Introduction

Existing work has focused on finding subgoals
- Can't scale up due to combinatorial flavour
- Learning policies associated with subgoals can be expensive - data and computation time

New method based on the policy gradient theorem - enables simultaneous learning of
- The intra-option policies
- The termination functions
- The policy over options

Works with
- Linear/Non-linear function approximators
- Discrete/Continuous state spaces
- Discrete/Continuous action spaces

# Learning Options

- At any time - distill all of the available experience into every component of the system
    - Value function
    - Policy over options
    - Intra-option policies
    - Termination functions

Call-and-return option execution model
- An agent picks option $\omega$ according to its policy over options $\pi_\Omega$
- Then follows the intra-option policy $\pi_\omega$ until termination ($\beta_\omega$)

$\pi_{\omega,\theta}$ - Intra-option policy of option $\omega$ parameterized by $\theta$

$\beta_{\omega,\nu}$ - Termination function of $\omega$ parameterized by $\nu$

Option-Value function

$Q_{\Omega}(s,\omega) = \underset{a}{\sum}\pi_{\omega,\theta}(a|s)Q_U(s,\omega,a)$

$Q_U(s,\omega,a) : \mathcal S\times\Omega\times\mathcal A\rightarrow\Bbb R$ - Value of executing an action in the context of a state-option pair

$Q_U(s,\omega,a) = r(s,a) + \gamma\underset{s'}{\sum}P(s'|s,a)U(\omega, s')$

$U:\Omega\times\mathcal S\rightarrow\Bbb R$ - Option-Value function upon arrival

$U(\omega,s') = (1-\beta_{\omega,\nu}(s'))Q_{\Omega}(s',\omega) + \beta_{\omega,\nu}(s')V_{\Omega}(s')$

Theorem 1 - Intra-Option Policy Gradient Theorem
- Given a set of Markov options with stochastic intra-option policies differentiable in their parameters $\theta$
- The gradient of the expected discounted return with respect to $\theta$ and initial condition $(s_0,\omega_0)$ is

$\qquad \displaystyle\underset{s,\omega}{\sum}\mu_{\Omega}(s,\omega|s_0,\omega_0)\underset{a}{\sum}\dfrac{\partial\pi_{\omega,\theta}(a|s)}{\partial\theta}Q_U(s,\omega,a)$

- $\mu_{\Omega}(s,\omega|s_0,\omega_0)$ - discounted weighting of state-option pairs along trajectories starting from $(s_0,\omega_0)$

This gradient describes the effect of a local change at the primitive level on the global expected return

Gradient for the termination function

$\displaystyle\frac{\partial Q_{\Omega}(s,\omega)}{\partial\nu} = \underset{a}{\sum}\pi_{\omega,\theta}(a|s)\underset{s'}{\sum}\gamma P(s'|s,a)\frac{\partial U(\omega,s')}{\partial\nu}$

Theorem 2 - Termination Gradient Theorem
- Given a set of Markov options with stochastic termination functions differentiable in their parameters $\nu$
- The gradient of the expected discounted return objective with respect to $\nu$ and initial condition $(s_1,\omega_0)$ is

$\qquad -\displaystyle\underset{s',\omega}{\sum}\mu_{\Omega}(s',\omega|s_1,\omega_0)\frac{\partial\beta_{\omega,\nu}(s')}{\partial\nu}A_{\Omega}(s',\omega)$

Advantage function $A_{\Omega}(s',\omega) = Q_{\Omega}(s',\omega) - V_{\Omega}(s')$ - it plays the role of a baseline here, which reduces the variance in the gradient estimates

When the option choice is suboptimal with respect to the expected value over all options, the advantage function is negative, which drives the gradient corrections up and increases the odds of terminating

Interrupting execution model
- Termination is forced whenever the value of $Q_{\Omega}(s',\omega)$ for the current option $\omega$ is less than $V_{\Omega}(s')$
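To make the two gradients concrete, here is a minimal tabular sketch (mine, not the paper's code): softmax intra-option policies parameterized by `theta`, sigmoid termination functions parameterized by `nu`, and critic tables `Q_U` and `Q_Omega` that are assumed to be given. Using a greedy policy over options for $V_\Omega$ is a simplifying assumption (the paper's algorithm uses an $\epsilon$-soft policy over options); all names and shapes are illustrative.

```python
import numpy as np

# Minimal tabular sketch of the intra-option policy gradient and the termination
# gradient. Assumed shapes: S states, O options, A actions.
S, O, A = 5, 2, 3
rng = np.random.default_rng(0)
theta = np.zeros((O, S, A))          # intra-option policy parameters
nu = np.zeros((O, S))                # termination parameters
Q_U = rng.normal(size=(S, O, A))     # stand-in critic estimate of Q_U(s, w, a)
Q_Omega = rng.normal(size=(S, O))    # stand-in critic estimate of Q_Omega(s, w)

def pi(w, s):
    """Softmax intra-option policy pi_{w,theta}(.|s)."""
    z = theta[w, s] - theta[w, s].max()
    e = np.exp(z)
    return e / e.sum()

def beta(w, s):
    """Sigmoid termination probability beta_{w,nu}(s)."""
    return 1.0 / (1.0 + np.exp(-nu[w, s]))

def U(w, s_next):
    """Option-value upon arrival: (1 - beta) * Q_Omega(s', w) + beta * V_Omega(s')."""
    V = Q_Omega[s_next].max()        # assumes a greedy policy over options
    b = beta(w, s_next)
    return (1 - b) * Q_Omega[s_next, w] + b * V

def actor_step(s, w, a, s_next, lr_theta=0.1, lr_nu=0.1):
    """One stochastic step along the two gradients from Theorems 1 and 2."""
    # Intra-option policy gradient: d log pi(a|s) / d theta  *  Q_U(s, w, a)
    probs = pi(w, s)
    grad_log = -probs
    grad_log[a] += 1.0               # d log softmax / d theta for the taken action
    theta[w, s] += lr_theta * grad_log * Q_U[s, w, a]

    # Termination gradient (ascent on -dbeta/dnu * A): decrease beta when A > 0
    b = beta(w, s_next)
    advantage = Q_Omega[s_next, w] - Q_Omega[s_next].max()
    nu[w, s_next] -= lr_nu * b * (1 - b) * advantage   # sigmoid derivative b*(1-b)

actor_step(s=0, w=1, a=2, s_next=3)
```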
# Algorithms and Architecture

![](https://i.imgur.com/LGWmdR4.png)

- Learn the critic at a fast timescale while updating the intra-option policies and termination functions at a slower rate
- Actor - intra-option policies, termination functions and the policy over options
- Critic - $Q_{U}$ and $A_{\Omega}$

A toy sketch of this loop is given at the end of these notes.

![](https://i.imgur.com/rQJyP0B.png)

# Experiments
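Below is the minimal end-to-end sketch of the call-and-return loop from the architecture section above (again mine, not the authors' implementation). Assumptions: a toy chain MDP stands in for the environment, the intra-option policies and termination probabilities are fixed stand-ins, the policy over options is $\epsilon$-greedy, and the critic performs a one-step update of $Q_U$ toward $r + \gamma U(\omega, s')$. The actor updates for $\theta$ and $\nu$ from the previous sketch would slot in right after the critic step.

```python
import numpy as np

S, O, A, gamma, eps = 5, 2, 2, 0.99, 0.1
rng = np.random.default_rng(1)
Q_Omega = np.zeros((S, O))
Q_U = np.zeros((S, O, A))
beta = np.full((O, S), 0.5)              # stand-in termination probabilities
pi_w = np.full((O, S, A), 1.0 / A)       # stand-in intra-option policies

def toy_step(s, a):
    """Toy chain: action 1 moves right, action 0 moves left; +1 reward at the right end."""
    s_next = min(s + 1, S - 1) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == S - 1)

def U(w, s_next):
    """Option-value upon arrival under a greedy policy over options."""
    V = Q_Omega[s_next].max()
    return (1 - beta[w, s_next]) * Q_Omega[s_next, w] + beta[w, s_next] * V

s = 0
w = int(rng.integers(O))                  # initial option
for _ in range(200):
    a = int(rng.choice(A, p=pi_w[w, s]))  # act with the current intra-option policy
    s_next, r = toy_step(s, a)

    # Critic (fast timescale): one-step update of Q_U toward r + gamma * U(w, s')
    target = r + gamma * U(w, s_next)
    Q_U[s, w, a] += 0.5 * (target - Q_U[s, w, a])
    Q_Omega[s, w] = pi_w[w, s] @ Q_U[s, w]    # Q_Omega as expectation over actions

    # (Actor updates of theta and nu from the previous sketch would go here.)

    # Call-and-return: terminate with probability beta, then re-pick an option
    if rng.random() < beta[w, s_next]:
        if rng.random() < eps:                # epsilon-greedy policy over options
            w = int(rng.integers(O))
        else:
            w = int(Q_Omega[s_next].argmax())
    s = s_next
```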