Learning policies associated with subgoals can be expensive - in both data and computation time
New method based on the policy gradient theorem - enables simultaneous learning of
The intra-option policies
The termination functions
The policy over them (the policy over options)
Works with
Linear/Non-Linear function approximators
Discrete/Continuous state space
Discrete/Continuous action space
Learning Options
At any time - distill all of the available experience into every component of the system
Value function
Policy over options
Intra-option policies
Termination functions
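A minimal sketch of these components in a tabular setting (all sizes and names below are illustrative assumptions, not taken from the source): a critic table for the option values, softmax intra-option policies, sigmoid terminations, and an epsilon-greedy policy over options derived from the critic.

```python
import numpy as np

# Illustrative sizes, not taken from the source.
n_states, n_actions, n_options = 10, 4, 2
rng = np.random.default_rng(0)

# Critic: option-value function Q_Omega(s, w).
Q_omega = np.zeros((n_states, n_options))

# Intra-option policies pi_{w,theta}(a|s): softmax over per-option logits theta.
theta = np.zeros((n_options, n_states, n_actions))

# Termination functions beta_{w,vartheta}(s): sigmoid of per-option logits vartheta.
vartheta = np.zeros((n_options, n_states))

def pi(option, state):
    """Softmax intra-option policy for one option in one state."""
    logits = theta[option, state]
    e = np.exp(logits - logits.max())
    return e / e.sum()

def beta(option, state):
    """Sigmoid termination probability."""
    return 1.0 / (1.0 + np.exp(-vartheta[option, state]))

def policy_over_options(state, epsilon=0.1):
    """Epsilon-greedy policy over options derived from the critic."""
    if rng.random() < epsilon:
        return int(rng.integers(n_options))
    return int(np.argmax(Q_omega[state]))
```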
Call-and-return options execution model
An agent picks an option $\omega$ according to its policy over options $\pi_\Omega$
Then follows the intra-option policy $\pi_{\omega,\theta}$ until termination (as dictated by $\beta_{\omega,\vartheta}$), at which point the procedure repeats
$\pi_{\omega,\theta}$ - intra-option policy of option $\omega$, parametrized by $\theta$
$\beta_{\omega,\vartheta}$ - termination function of $\omega$, parameterized by $\vartheta$
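A minimal sketch of the call-and-return execution loop, assuming a toy random environment and arbitrary stand-ins for the learned components (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_options = 10, 4, 2

# Illustrative stand-ins for the learned components.
Q_omega = rng.normal(size=(n_states, n_options))                      # critic Q_Omega(s, w)
pi = rng.dirichlet(np.ones(n_actions), size=(n_options, n_states))    # pi_{w,theta}(a|s)
beta = rng.uniform(size=(n_options, n_states))                        # beta_{w,vartheta}(s)

def toy_step(state, action):
    """Toy MDP dynamics: random next state and reward, for illustration only."""
    return int(rng.integers(n_states)), float(rng.normal())

state = 0
option = int(np.argmax(Q_omega[state]))                   # pick an option with the policy over options
for t in range(20):
    action = rng.choice(n_actions, p=pi[option, state])   # follow the intra-option policy
    state, reward = toy_step(state, action)
    if rng.random() < beta[option, state]:                # termination sampled from beta
        option = int(np.argmax(Q_omega[state]))           # return control to the policy over options
```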
Option-Value function
$Q_\Omega(s, \omega) = \sum_a \pi_{\omega,\theta}(a \mid s)\, Q_U(s, \omega, a)$
$Q_U(s, \omega, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, U(\omega, s')$ - value of executing an action in the context of a state-option pair
$U(\omega, s') = (1 - \beta_{\omega,\vartheta}(s'))\, Q_\Omega(s', \omega) + \beta_{\omega,\vartheta}(s')\, V_\Omega(s')$ - option-value function upon arrival
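A small sketch of how these three quantities relate, assuming a tabular setting with an explicit transition model $P$ and reward $r$; the arrays are arbitrary (not mutually consistent), and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_options, gamma = 5, 3, 2, 0.99

# Illustrative model and (arbitrary) learned quantities.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))     # P(s'|s,a)
r = rng.normal(size=(n_states, n_actions))                           # r(s,a)
pi = rng.dirichlet(np.ones(n_actions), size=(n_options, n_states))   # pi_{w,theta}(a|s)
beta = rng.uniform(size=(n_options, n_states))                       # beta_{w,vartheta}(s)
Q_omega = rng.normal(size=(n_states, n_options))                     # Q_Omega(s, w)
V_omega = Q_omega.max(axis=1)            # V_Omega(s) under a greedy policy over options

def U(option, next_state):
    """Option-value function upon arrival in next_state while committed to `option`."""
    b = beta[option, next_state]
    return (1 - b) * Q_omega[next_state, option] + b * V_omega[next_state]

def Q_U(state, option, action):
    """Value of executing `action` in the context of the state-option pair."""
    return r[state, action] + gamma * sum(
        P[state, action, s2] * U(option, s2) for s2 in range(n_states)
    )

def Q_Omega_from_parts(state, option):
    """Option-value function as the pi-weighted average of Q_U."""
    return sum(pi[option, state, a] * Q_U(state, option, a) for a in range(n_actions))
```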
Theorem 1 - Intra-Option Policy Gradient Theorem
Given a set of Markov options with stochastic intra-option policies differentiable in their parameters $\theta$
The gradient of the expected discounted return with respect to $\theta$ and initial condition $(s_0, \omega_0)$ is
$\frac{\partial Q_\Omega(s_0, \omega_0)}{\partial \theta} = \sum_{s, \omega} \mu_\Omega(s, \omega \mid s_0, \omega_0) \sum_a \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a)$
$\mu_\Omega(s, \omega \mid s_0, \omega_0)$ - discounted weighting of state-option pairs along trajectories starting from $(s_0, \omega_0)$
This gradient describes the effect of a local change at the primitive level on the global expected return
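A sample-based sketch of the update this theorem suggests for a tabular softmax intra-option policy, using the log-likelihood form $\theta \leftarrow \theta + \alpha\, \frac{\partial \log \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a)$ with a critic estimate of $Q_U$ (sizes, names, and the step size are illustrative assumptions):

```python
import numpy as np

n_states, n_actions, n_options = 10, 4, 2
theta = np.zeros((n_options, n_states, n_actions))   # intra-option policy parameters
alpha_theta = 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def intra_option_policy_update(state, option, action, q_u_estimate):
    """One-step update: theta += alpha * d log pi_{w,theta}(a|s)/d theta * Q_U(s, w, a)."""
    probs = softmax(theta[option, state])
    grad_log = -probs
    grad_log[action] += 1.0                            # gradient of log-softmax w.r.t. logits
    theta[option, state] += alpha_theta * grad_log * q_u_estimate

# Example: reinforce action 2 of option 0 in state 3, given a critic estimate of Q_U.
intra_option_policy_update(state=3, option=0, action=2, q_u_estimate=1.5)
```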
Gradient for termination function
Theorem 2 - Termination Gradient Theorem
Given a set of Markov options with stochastic termination functions differentiable in their parameters $\vartheta$
The gradient of the expected discounted return objective with respect to $\vartheta$ and initial condition $(s_1, \omega_0)$ is
$\frac{\partial U(\omega_0, s_1)}{\partial \vartheta} = -\sum_{s', \omega} \mu_\Omega(s', \omega \mid s_1, \omega_0)\, \frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta}\, A_\Omega(s', \omega)$
Advantage function $A_\Omega(s', \omega) = Q_\Omega(s', \omega) - V_\Omega(s')$ - using $V_\Omega$ as a baseline reduces the variance in the gradient estimates
When the option choice is suboptimal with respect to the expected value over all options $V_\Omega(s')$, the advantage function is negative, which drives the gradient correction up and increases the odds of terminating
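A sample-based sketch of the corresponding termination update for tabular sigmoid terminations, $\vartheta \leftarrow \vartheta - \alpha\, \frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta}\, A_\Omega(s', \omega)$; here $V_\Omega$ is approximated by the max over options, and all names, sizes, and step sizes are illustrative assumptions:

```python
import numpy as np

n_states, n_options = 10, 2
vartheta = np.zeros((n_options, n_states))   # termination parameters (sigmoid logits)
Q_omega = np.zeros((n_states, n_options))    # critic estimate of Q_Omega(s, w)
alpha_vartheta = 0.05

def termination_update(option, next_state):
    """One-step update: vartheta -= alpha * d beta_{w,vartheta}(s')/d vartheta * A_Omega(s', w)."""
    b = 1.0 / (1.0 + np.exp(-vartheta[option, next_state]))
    advantage = Q_omega[next_state, option] - Q_omega[next_state].max()  # A_Omega(s', w), <= 0 here
    grad_beta = b * (1.0 - b)                                            # derivative of the sigmoid
    vartheta[option, next_state] -= alpha_vartheta * grad_beta * advantage

# A suboptimal option in next_state gives a negative advantage, so vartheta goes up
# and the option becomes more likely to terminate there.
Q_omega[4] = [1.0, 0.2]
termination_update(option=1, next_state=4)
```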
Interrupting Execution model
Termination is forced whenever the value $Q_\Omega(s', \omega)$ of the current option $\omega$ is less than $V_\Omega(s')$
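A minimal sketch of that interruption check under the same tabular setup, with $V_\Omega$ again taken as the max over options (the array values and names are illustrative):

```python
import numpy as np

Q_omega = np.array([[0.2, 0.8],
                    [0.9, 0.1]])    # illustrative Q_Omega(s, w): 2 states, 2 options

def should_interrupt(next_state, current_option):
    """Force termination when the current option's value falls below V_Omega(s')."""
    v = Q_omega[next_state].max()    # V_Omega(s') under a greedy policy over options
    return Q_omega[next_state, current_option] < v

# In state 1, option 1 is worth 0.1 < V_Omega(1) = 0.9, so execution is interrupted.
print(should_interrupt(next_state=1, current_option=1))   # True
```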