---
title : "The Option-Critic Architecture - Notes"
tags : "IvLabs, RL"
---

# The Option-Critic Architecture

Link to the [Research Paper](https://arxiv.org/abs/1609.05140)

{%pdf https://arxiv.org/pdf/1609.05140.pdf%}

Policy Gradient theorems for options

# Introduction

Existing work has focused on finding subgoals
- Can't scale up due to combinatorial flavour
- Learning policies associated with subgoals can be expensive - data and computation time

New method based on the policy gradient theorem - enables simultaneous learning of
- The intra-option policies
- The termination functions
- The policy over options

Works with
- Linear/Non-linear function approximators
- Discrete/Continuous state spaces
- Discrete/Continuous action spaces

# Learning Options

- At any time - distill all of the available experience into every component of the system
    - Value function
    - Policy over options
    - Intra-option policies
    - Termination functions

Call-and-return option execution model
- An agent picks option $\omega$ according to its policy over options $\pi_\Omega$
- Then follows the intra-option policy $\pi_\omega$ until termination ($\beta_\omega$)

$\pi_{\omega,\theta}$ - Intra-option policy of option $\omega$ parameterized by $\theta$

$\beta_{\omega,\nu}$ - Termination function of $\omega$ parameterized by $\nu$

Option-Value function

$Q_{\Omega}(s,\omega) = \underset{a}{\sum}\pi_{\omega,\theta}(a|s)Q_U(s,\omega,a)$

$Q_U(s,\omega,a) : \mathcal S\times\Omega\times\mathcal A\rightarrow\Bbb R$ - Value of executing an action in the context of a state-option pair

$Q_U(s,\omega,a) = r(s,a) + \gamma\underset{s'}{\sum}P(s'|s,a)U(\omega, s')$

$U:\Omega\times\mathcal S\rightarrow\Bbb R$ - Option-Value function upon arrival

$U(\omega,s') = (1-\beta_{\omega,\nu}(s'))Q_{\Omega}(s',\omega) + \beta_{\omega,\nu}(s')V_{\Omega}(s')$

Theorem 1 - Intra-Option Policy Gradient Theorem
- Given a set of Markov options with stochastic intra-option policies differentiable in their parameters $\theta$
- The gradient of the expected discounted return with respect to $\theta$ and initial condition $(s_0,\omega_0)$ is

$\qquad \displaystyle\underset{s,\omega}{\sum}\mu_{\Omega}(s,\omega|s_0,\omega_0)\underset{a}{\sum}\dfrac{\partial\pi_{\omega,\theta}(a|s)}{\partial\theta}Q_U(s,\omega,a)$

- $\mu_{\Omega}(s,\omega|s_0,\omega_0)$ - discounted weighting of state-option pairs along trajectories starting from $(s_0,\omega_0)$

This gradient describes the effect of a local change at the primitive level on the global expected return

Gradient for the termination function

$\displaystyle\frac{\partial Q_{\Omega}(s,\omega)}{\partial\nu} = \underset{a}{\sum}\pi_{\omega,\theta}(a|s)\underset{s'}{\sum}\gamma P(s'|s,a)\frac{\partial U(\omega,s')}{\partial\nu}$

Theorem 2 - Termination Gradient Theorem
- Given a set of Markov options with stochastic termination functions differentiable in their parameters $\nu$
- The gradient of the expected discounted return objective with respect to $\nu$ and initial condition $(s_1,\omega_0)$ is

$\qquad -\displaystyle\underset{s',\omega}{\sum}\mu_{\Omega}(s',\omega|s_1,\omega_0)\frac{\partial\beta_{\omega,\nu}(s')}{\partial\nu}A_{\Omega}(s',\omega)$

Advantage function $A_{\Omega}(s',\omega) = Q_{\Omega}(s',\omega) - V_{\Omega}(s')$ - it plays the role of a baseline here, which reduces the variance in the gradient estimates

When the option choice is suboptimal with respect to the expected value over all options, the advantage function is negative, which drives the gradient corrections up and increases the odds of terminating

Interrupting execution model
- Termination is forced whenever the value of $Q_{\Omega}(s',\omega)$ for the current option $\omega$ is less than $V_{\Omega}(s')$
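To make the two gradients concrete, here is a minimal tabular sketch (mine, not the paper's code): softmax intra-option policies parameterized by `theta`, sigmoid termination functions parameterized by `nu`, and critic tables `Q_U` and `Q_Omega` that are assumed to be given. Using a greedy policy over options for $V_\Omega$ is a simplifying assumption (the paper's algorithm uses an $\epsilon$-soft policy over options); all names and shapes are illustrative.

```python
import numpy as np

# Minimal tabular sketch of the intra-option policy gradient and the termination
# gradient. Assumed shapes: S states, O options, A actions.
S, O, A = 5, 2, 3
rng = np.random.default_rng(0)
theta = np.zeros((O, S, A))          # intra-option policy parameters
nu = np.zeros((O, S))                # termination parameters
Q_U = rng.normal(size=(S, O, A))     # stand-in critic estimate of Q_U(s, w, a)
Q_Omega = rng.normal(size=(S, O))    # stand-in critic estimate of Q_Omega(s, w)

def pi(w, s):
    """Softmax intra-option policy pi_{w,theta}(.|s)."""
    z = theta[w, s] - theta[w, s].max()
    e = np.exp(z)
    return e / e.sum()

def beta(w, s):
    """Sigmoid termination probability beta_{w,nu}(s)."""
    return 1.0 / (1.0 + np.exp(-nu[w, s]))

def U(w, s_next):
    """Option-value upon arrival: (1 - beta) * Q_Omega(s', w) + beta * V_Omega(s')."""
    V = Q_Omega[s_next].max()        # assumes a greedy policy over options
    b = beta(w, s_next)
    return (1 - b) * Q_Omega[s_next, w] + b * V

def actor_step(s, w, a, s_next, lr_theta=0.1, lr_nu=0.1):
    """One stochastic step along the two gradients from Theorems 1 and 2."""
    # Intra-option policy gradient: d log pi(a|s) / d theta  *  Q_U(s, w, a)
    probs = pi(w, s)
    grad_log = -probs
    grad_log[a] += 1.0               # d log softmax / d theta for the taken action
    theta[w, s] += lr_theta * grad_log * Q_U[s, w, a]

    # Termination gradient (ascent on -dbeta/dnu * A): decrease beta when A > 0
    b = beta(w, s_next)
    advantage = Q_Omega[s_next, w] - Q_Omega[s_next].max()
    nu[w, s_next] -= lr_nu * b * (1 - b) * advantage   # sigmoid derivative b*(1-b)

actor_step(s=0, w=1, a=2, s_next=3)
```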
# Algorithms and Architecture

![](https://i.imgur.com/LGWmdR4.png)

- Learn the critic at a fast timescale while updating the intra-option policies and termination functions at a slower rate
- Actor - intra-option policies, termination functions and the policy over options
- Critic - $Q_{U}$ and $A_{\Omega}$

A toy sketch of this loop is given at the end of these notes.

![](https://i.imgur.com/rQJyP0B.png)

# Experiments
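Below is the minimal end-to-end sketch of the call-and-return loop from the architecture section above (again mine, not the authors' implementation). Assumptions: a toy chain MDP stands in for the environment, the intra-option policies and termination probabilities are fixed stand-ins, the policy over options is $\epsilon$-greedy, and the critic performs a one-step update of $Q_U$ toward $r + \gamma U(\omega, s')$. The actor updates for $\theta$ and $\nu$ from the previous sketch would slot in right after the critic step.

```python
import numpy as np

S, O, A, gamma, eps = 5, 2, 2, 0.99, 0.1
rng = np.random.default_rng(1)
Q_Omega = np.zeros((S, O))
Q_U = np.zeros((S, O, A))
beta = np.full((O, S), 0.5)              # stand-in termination probabilities
pi_w = np.full((O, S, A), 1.0 / A)       # stand-in intra-option policies

def toy_step(s, a):
    """Toy chain: action 1 moves right, action 0 moves left; +1 reward at the right end."""
    s_next = min(s + 1, S - 1) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == S - 1)

def U(w, s_next):
    """Option-value upon arrival under a greedy policy over options."""
    V = Q_Omega[s_next].max()
    return (1 - beta[w, s_next]) * Q_Omega[s_next, w] + beta[w, s_next] * V

s = 0
w = int(rng.integers(O))                  # initial option
for _ in range(200):
    a = int(rng.choice(A, p=pi_w[w, s]))  # act with the current intra-option policy
    s_next, r = toy_step(s, a)

    # Critic (fast timescale): one-step update of Q_U toward r + gamma * U(w, s')
    target = r + gamma * U(w, s_next)
    Q_U[s, w, a] += 0.5 * (target - Q_U[s, w, a])
    Q_Omega[s, w] = pi_w[w, s] @ Q_U[s, w]    # Q_Omega as expectation over actions

    # (Actor updates of theta and nu from the previous sketch would go here.)

    # Call-and-return: terminate with probability beta, then re-pick an option
    if rng.random() < beta[w, s_next]:
        if rng.random() < eps:                # epsilon-greedy policy over options
            w = int(rng.integers(O))
        else:
            w = int(Q_Omega[s_next].argmax())
    s = s_next
```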