---
title : "The Option-Critic Architecture - Notes"
tags : "IvLabs, RL"
---
# The Option-Critic Architecture
Link to the [Research Paper](https://arxiv.org/abs/1609.05140)
{%pdf https://arxiv.org/pdf/1609.05140.pdf%}
Policy Gradient theorems for options
# Introduction
Existing work has focused on finding subgoals
- Can't scale up due to its combinatorial flavour
- Learning the policies associated with subgoals can be expensive in terms of data and computation time
New method based on the policy gradient theorem - enables simultaneous learning of
- The intra-option policies
- The termination functions
- The policy over options
Works with
- Linear/Non-Linear function approximators
- Discrete/Continuous State space
- Discrete/Continuous Action space
# Learning Options
- At any time - distill all of the available experience into every component of the system
- Value function
- Policy over options
- Intra-option policies
    - Termination functions
Call-and-return options execution model
- An agent picks option $\omega$ according to its policy over options $\pi_\Omega$
- Then follows the intra-option policy $\pi_\omega$ until termination ($\beta_\omega$)
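A minimal sketch of this call-and-return loop is below; the helpers `policy_over_options`, `intra_option_policy` and `terminates` are hypothetical samplers for $\pi_\Omega$, $\pi_\omega$ and $\beta_\omega$, and the environment API is assumed to be gym-like.
```python
# Sketch of the call-and-return execution model (names are illustrative assumptions).
def run_episode(env, policy_over_options, intra_option_policy, terminates):
    s, done = env.reset(), False
    omega = policy_over_options(s)           # pick an option with pi_Omega
    while not done:
        a = intra_option_policy(s, omega)    # follow the intra-option policy pi_omega
        s, r, done = env.step(a)
        if terminates(s, omega):             # beta_omega(s') fired: control returns to pi_Omega
            omega = policy_over_options(s)
```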
$\pi_{\omega,\theta}$ - Intra-option policy of option $\omega$ parametrized by $\theta$
$\beta_{\omega,\nu}$ - Termination function of $\omega$ parameterized by $\nu$
Option-Value function
$Q_{\Omega}(s,\omega) = \underset{a}{\sum}\pi_{\omega,\theta}(a|s)Q_U(s,\omega,a)$
$Q_U(s,\omega,a) : \mathcal S\times\Omega\times\mathcal A\rightarrow\Bbb R$ - Value of executing an action in the context of a state-option pair
$Q_U(s,\omega,a) = r(s,a) + \gamma\underset{s'}{\sum}P(s'|s,a)U(\omega, s')$
$U:\Omega\times\mathcal S\rightarrow\Bbb R$ - Option-Value function upon arrival
$U(\omega,s') = (1-\beta_{\omega,\nu}(s'))Q_{\Omega}(s',\omega) + \beta_{\omega,\nu}(s')V_{\Omega}(s')$
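A small tabular sketch of these definitions, assuming the array layouts `Q_U[s, omega, a]`, `pi[omega, s, a]`, `beta[omega, s]`, `P[s, a, s']`, `r[s, a]` and a greedy policy over options so that $V_\Omega(s') = \max_\omega Q_\Omega(s',\omega)$; these conventions are illustration choices, not taken from the paper's code.
```python
import numpy as np

def Q_Omega(Q_U, pi, s, omega):
    # Q_Omega(s, omega) = sum_a pi_{omega,theta}(a|s) Q_U(s, omega, a)
    return np.dot(pi[omega, s], Q_U[s, omega])

def U(Q_U, pi, beta, omega, s_next):
    # U(omega, s') = (1 - beta_omega(s')) Q_Omega(s', omega) + beta_omega(s') V_Omega(s')
    v = max(Q_Omega(Q_U, pi, s_next, w) for w in range(pi.shape[0]))  # greedy V_Omega
    return (1 - beta[omega, s_next]) * Q_Omega(Q_U, pi, s_next, omega) + beta[omega, s_next] * v

def Q_U_backup(Q_U, pi, beta, P, r, gamma, s, omega, a):
    # Q_U(s, omega, a) = r(s, a) + gamma * sum_{s'} P(s'|s, a) U(omega, s')
    return r[s, a] + gamma * sum(P[s, a, sp] * U(Q_U, pi, beta, omega, sp)
                                 for sp in range(P.shape[2]))
```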
Theorem 1 - Intra-Option Policy Gradient Theorem
- Given a set of Markov options with stochastic intra-option policies differentiable in their parameters $\theta$
- The gradient of the expected discounted return with respect to $\theta$ and initial condition $(s_0,\omega_0)$ is
$\qquad \displaystyle\underset{s,\omega}{\sum}\mu_{\Omega}(s,\omega|s_0,\omega_0)\underset{a}{\sum}\dfrac{\partial\pi_{\omega,\theta}(a|s)}{\partial\theta}Q_U(s,\omega,a)$
- $\mu_{\Omega}(s,\omega|s_0,\omega_0)$ - discounted weighting of state-option pairs along trajectories starting from $(s_0,\omega_0)$
This gradient describes the effect of a local change at the primitive level on the global expected return
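One way to turn Theorem 1 into a per-step stochastic update is sketched below, assuming a softmax intra-option policy over tabular preferences `theta[omega, s, a]`, a critic estimate of $Q_U(s,\omega,a)$ passed in as `q_u`, and an assumed learning rate; it is an illustration, not the paper's exact implementation.
```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# theta[omega, s] holds one preference per action, so pi_{omega,theta}(.|s) = softmax(theta[omega, s]).
def intra_option_pg_update(theta, s, omega, a, q_u, lr=0.01):
    """One stochastic ascent step along Theorem 1: theta += lr * d log pi(a|s)/d theta * Q_U(s, omega, a)."""
    pi = softmax(theta[omega, s])
    grad_log = -pi
    grad_log[a] += 1.0            # gradient of log softmax w.r.t. the preferences
    theta[omega, s] += lr * grad_log * q_u
```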
Gradient for termination function
$\displaystyle\frac{\partial Q_{\Omega}(s,\omega)}{\partial\nu} = \underset{a}{\sum}\pi_{\omega,\theta}(a|s)\underset{s'}{\sum}\gamma P(s'|s,a)\frac{\partial U(\omega,s')}{\partial\nu}$
Theorem 2 - Termination Gradient Theorem
- Given - set of Markov options with stochastic termination functions differentiable in their parameters $\nu$
- The gradient of the expected discounted return objective with respect to $\nu$ and initial condition $(s_1,\omega_0)$ is
$\qquad -\displaystyle\underset{s',\omega}{\sum}\mu_{\Omega}(s',\omega|s_1,\omega_0)\frac{\partial\beta_{\omega,\nu}(s')}{\partial\nu}A_{\Omega}(s',\omega)$
Advantage function $A_{\Omega}(s',\omega) = Q_{\Omega}(s',\omega) - V_{\Omega}(s')$ - using it as a baseline reduces the variance in the gradient estimates
When the option choice is suboptimal with respect to the expected value over all options, the advantage is negative, which drives the gradient correction up and increases the odds of terminating
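A matching sketch for the termination update of Theorem 2, assuming sigmoid termination functions over tabular logits `nu[omega, s]` and an advantage estimate supplied by the critic; the parameterization and step size are assumptions.
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# nu[omega, s] is a scalar logit, so beta_{omega,nu}(s) = sigmoid(nu[omega, s]).
def termination_update(nu, s_next, omega, advantage, lr=0.01):
    """One stochastic ascent step along Theorem 2: nu -= lr * d beta/d nu * A_Omega(s', omega)."""
    beta = sigmoid(nu[omega, s_next])
    d_beta = beta * (1.0 - beta)  # derivative of the sigmoid w.r.t. its logit
    nu[omega, s_next] -= lr * d_beta * advantage
```
A negative advantage makes the update increase the logit, so $\beta_{\omega,\nu}(s')$ grows and the option becomes more likely to terminate, matching the interpretation above.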
Interrupting Execution model
- Termination is forced whenever the value of $Q_{\Omega}(s',\omega)$ for the current option $\omega$ is less than $V_{\Omega}(s')$
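A sketch of the interruption check, assuming a tabular estimate `Q[s, omega]` of $Q_\Omega$ and, for simplicity, a greedy re-selection by the policy over options.
```python
import numpy as np

def maybe_interrupt(Q, s_next, omega):
    """Force termination when continuing the current option is worse than switching."""
    if Q[s_next, omega] < Q[s_next].max():   # Q_Omega(s', omega) < V_Omega(s')
        omega = int(Q[s_next].argmax())      # terminate and let the policy over options pick again
    return omega
```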
# Algorithms and Architecture
![](https://i.imgur.com/LGWmdR4.png)
- Learn the option values at a fast timescale while updating the intra-option policies and termination functions at a slower rate
- Actor - Intra-option policies, termination functions and policy over options
- Critic - $Q_{U}$ and $A_{\Omega}$
![](https://i.imgur.com/rQJyP0B.png)
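A tabular sketch of the full actor-critic loop, reusing the helpers from the sketches above (`softmax`, `sigmoid`, `intra_option_pg_update`, `termination_update`). The intra-option Q-learning critic target, the $\epsilon$-greedy policy over options, the environment API and all step sizes are simplifying assumptions, not a reproduction of the paper's exact Algorithm 1.
```python
import numpy as np

def option_critic_tabular(env, n_states, n_options, n_actions, n_episodes=1000,
                          gamma=0.99, eps=0.1, lr_critic=0.5, lr_actor=0.01):
    """Tabular option-critic loop: intra-option Q-learning critic,
    policy-gradient actor for intra-option policies and terminations."""
    Q = np.zeros((n_states, n_options))                  # Q_Omega(s, omega)
    Q_U = np.zeros((n_states, n_options, n_actions))     # Q_U(s, omega, a)
    theta = np.zeros((n_options, n_states, n_actions))   # intra-option policy preferences
    nu = np.zeros((n_options, n_states))                 # termination logits

    def pick_option(s):                                  # epsilon-greedy policy over options
        return np.random.randint(n_options) if np.random.rand() < eps else int(Q[s].argmax())

    for _ in range(n_episodes):
        s, done = env.reset(), False
        omega = pick_option(s)
        while not done:
            a = np.random.choice(n_actions, p=softmax(theta[omega, s]))
            s_next, r, done = env.step(a)

            # Critic: one-step target built from U(omega, s')
            beta = sigmoid(nu[omega, s_next])
            u = (1 - beta) * Q[s_next, omega] + beta * Q[s_next].max()
            target = r if done else r + gamma * u
            Q_U[s, omega, a] += lr_critic * (target - Q_U[s, omega, a])
            # (the paper derives Q_Omega from Q_U; updating it toward the same target is a simplification)
            Q[s, omega] += lr_critic * (target - Q[s, omega])

            # Actor: Theorem 1 and Theorem 2 updates (sketched earlier)
            intra_option_pg_update(theta, s, omega, a, Q_U[s, omega, a], lr_actor)
            termination_update(nu, s_next, omega, Q[s_next, omega] - Q[s_next].max(), lr_actor)

            # Call-and-return: hand control back to the policy over options on termination
            if np.random.rand() < beta:
                omega = pick_option(s_next)
            s = s_next
    return Q, Q_U, theta, nu
```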
# Experiments