
The Option-Critic Architecture

Link to the Research Paper

Policy Gradient theorems for options

Introduction

Existing work has focused on finding subgoals

  • Can't scale up due to the combinatorial flavour of subgoal discovery
  • Learning the policies associated with subgoals can be expensive in data and computation time

New method based on the policy gradient theorem - enables simultaneous learning of

  • The intra-option policies
  • The termination functions
  • The policy over options

Works with

  • Linear/Non-Linear function approximators
  • Discrete/Continuous State space
  • Discrete/Continuous Action space

Learning Options

  • At any time - distill all of the available experience into every component of the system
    • Value function
    • Policy over options
    • Intra-option policies
    • Termination functions

Call-and-return options execution model

  • An agent picks option $\omega$ according to its policy over options $\pi_\Omega$
  • Then follows the intra-option policy $\pi_\omega$ until termination ($\beta_\omega$) - a minimal sketch of this loop follows below
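
A minimal sketch of the call-and-return loop, assuming generic `policy_over_options`, `intra_option_policies`, and `terminations` objects (hypothetical names for illustration, not from the paper's code):

```python
import numpy as np

def run_episode(env, policy_over_options, intra_option_policies, terminations, rng):
    """Call-and-return execution: pick an option, follow its intra-option
    policy until the termination function fires, then pick again."""
    s = env.reset()
    done = False
    omega = policy_over_options.sample(s, rng)           # omega ~ pi_Omega(.|s)
    while not done:
        a = intra_option_policies[omega].sample(s, rng)  # a ~ pi_omega(.|s)
        s, r, done = env.step(a)
        # Terminate the current option with probability beta_omega(s')
        if rng.random() < terminations[omega](s):
            omega = policy_over_options.sample(s, rng)
```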

$\pi_{\omega,\theta}$ - Intra-option policy of option $\omega$ parameterized by $\theta$

$\beta_{\omega,\nu}$ - Termination function of option $\omega$ parameterized by $\nu$

Option-Value function

$$Q_\Omega(s,\omega) = \sum_a \pi_{\omega,\theta}(a|s)\, Q_U(s,\omega,a)$$

$Q_U(s,\omega,a): \mathcal{S} \times \Omega \times \mathcal{A} \to \mathbb{R}$ - Value of executing an action in the context of a state-option pair

$$Q_U(s,\omega,a) = r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, U(\omega, s')$$

$U: \Omega \times \mathcal{S} \to \mathbb{R}$ - Option-value function upon arrival

$$U(\omega, s') = \big(1 - \beta_{\omega,\nu}(s')\big)\, Q_\Omega(s',\omega) + \beta_{\omega,\nu}(s')\, V_\Omega(s')$$
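
A minimal tabular sketch of these relations, assuming arrays `Q_U[s, w, a]`, `pi[w, s, a]`, `beta[w, s]`, and `pi_Omega[s, w]` (hypothetical variable names):

```python
import numpy as np

def option_values(Q_U, pi, beta, pi_Omega):
    """Compute Q_Omega, V_Omega and U from the tabular quantities above.
    Shapes: Q_U (S, W, A), pi (W, S, A), beta (W, S), pi_Omega (S, W)."""
    # Q_Omega(s, w) = sum_a pi_w(a|s) * Q_U(s, w, a)
    Q_Omega = np.einsum('wsa,swa->sw', pi, Q_U)
    # V_Omega(s) = sum_w pi_Omega(w|s) * Q_Omega(s, w)
    V_Omega = np.einsum('sw,sw->s', pi_Omega, Q_Omega)
    # U(w, s') = (1 - beta_w(s')) * Q_Omega(s', w) + beta_w(s') * V_Omega(s')
    U = (1.0 - beta) * Q_Omega.T + beta * V_Omega[None, :]
    return Q_Omega, V_Omega, U
```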

Theorem 1 - Intra-Option Policy Gradient Theorem

  • Given a set of Markov options with stochastic intra-option policies differentiable in their parameters $\theta$
  • The gradient of the expected discounted return with respect to $\theta$ and initial condition $(s_0, \omega_0)$ is
    $$\sum_{s,\omega} \mu_\Omega(s,\omega \mid s_0,\omega_0) \sum_a \frac{\partial \pi_{\omega,\theta}(a|s)}{\partial \theta}\, Q_U(s,\omega,a)$$
  • $\mu_\Omega(s,\omega \mid s_0,\omega_0)$ - discounted weighting of state-option pairs along trajectories starting from $(s_0,\omega_0)$

This gradient describes the effect of a local change at the primitive level on the global expected return
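
A minimal sketch of the intra-option (actor) update implied by Theorem 1, using a softmax intra-option policy over a tabular parameterization; the names are illustrative, not the paper's reference implementation:

```python
import numpy as np

def intra_option_policy_update(theta, s, omega, a, Q_U_sa, lr=0.01):
    """One stochastic-gradient step: theta += lr * d log pi_{omega,theta}(a|s)/d theta * Q_U(s, omega, a).
    theta has shape (n_options, n_states, n_actions) and defines softmax policies."""
    prefs = theta[omega, s]                       # action preferences for (s, omega)
    pi = np.exp(prefs - prefs.max())
    pi /= pi.sum()
    grad_log = -pi                                # d log pi(a|s)/d prefs = onehot(a) - pi
    grad_log[a] += 1.0
    theta[omega, s] += lr * grad_log * Q_U_sa     # likelihood-ratio form of the gradient
    return theta
```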

Gradient for termination function

$$\frac{\partial Q_\Omega(s,\omega)}{\partial \nu} = \sum_a \pi_{\omega,\theta}(a|s) \sum_{s'} \gamma\, P(s'|s,a)\, \frac{\partial U(\omega,s')}{\partial \nu}$$

Theorem 2 - Termination Gradient Theorem

  • Given - a set of Markov options with stochastic termination functions differentiable in their parameters $\nu$
  • The gradient of the expected discounted return objective with respect to $\nu$ and initial condition $(s_1, \omega_0)$ is

    $$-\sum_{s',\omega} \mu_\Omega(s',\omega \mid s_1,\omega_0)\, \frac{\partial \beta_{\omega,\nu}(s')}{\partial \nu}\, A_\Omega(s',\omega)$$

Advantage Function $A_\Omega(s,\omega) = Q_\Omega(s,\omega) - V_\Omega(s)$ - subtracting $V_\Omega$ as a baseline reduces the variance in gradient estimates

When the option choice is suboptimal with respect to the expected value over all options, the advantage function is negative; this drives the gradient correction up and increases the odds of terminating the option
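
A minimal sketch of the termination (actor) update implied by Theorem 2, using a per-(option, state) sigmoid termination function; the names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def termination_update(nu, s_next, omega, A_Omega, lr=0.01):
    """One stochastic-gradient step: nu -= lr * d beta_{omega,nu}(s')/d nu * A_Omega(s', omega).
    nu has shape (n_options, n_states)."""
    beta = sigmoid(nu[omega, s_next])
    dbeta_dnu = beta * (1.0 - beta)     # derivative of the sigmoid
    # Negative advantage -> beta is pushed up -> the option terminates more often
    nu[omega, s_next] -= lr * dbeta_dnu * A_Omega
    return nu
```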

Interrupting Execution model

  • Termination is forced whenever the value of $Q_\Omega(s,\omega)$ for the current option $\omega$ is less than $V_\Omega(s)$
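
The interruption rule itself is a one-line check, assuming precomputed `Q_Omega` and `V_Omega` arrays (hypothetical names):

```python
def should_interrupt(Q_Omega, V_Omega, s, omega):
    """Force termination when continuing the current option is worse than
    re-choosing among all options."""
    return Q_Omega[s, omega] < V_Omega[s]
```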

Algorithms and Architecture


  • Learn the critic at a fast timescale while updating the intra-option policies and termination functions at a slower rate
  • Actor - intra-option policies, termination functions and the policy over options
  • Critic - $Q_U$ and $A_\Omega$
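
A high-level sketch of one option-critic actor-critic step, combining a fast tabular critic update for $Q_U$ with slower actor updates for $\theta$ and $\nu$; the tabular setup and names are assumptions for illustration, not the authors' reference code:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def option_critic_step(env, s, omega, theta, nu, Q_U, Q_Omega,
                       rng, gamma=0.99, lr_critic=0.5, lr_actor=0.01):
    # Actor: sample a primitive action from the current intra-option policy
    pi = softmax(theta[omega, s])
    a = rng.choice(len(pi), p=pi)
    s_next, r, done = env.step(a)

    # Critic target: r + gamma * U(omega, s'), mixing "continue" and "re-choose"
    beta = sigmoid(nu[omega, s_next])
    V_next = Q_Omega[s_next].max()                 # greedy policy over options
    U = (1.0 - beta) * Q_Omega[s_next, omega] + beta * V_next
    target = r + (0.0 if done else gamma * U)
    Q_U[s, omega, a] += lr_critic * (target - Q_U[s, omega, a])
    Q_Omega[s, omega] += lr_critic * (target - Q_Omega[s, omega])

    # Actor updates (slower timescale): Theorems 1 and 2
    grad_log = -pi
    grad_log[a] += 1.0
    theta[omega, s] += lr_actor * grad_log * Q_U[s, omega, a]
    A = Q_Omega[s_next, omega] - V_next            # advantage of continuing omega
    nu[omega, s_next] -= lr_actor * beta * (1.0 - beta) * A

    # Call-and-return: re-pick the option if it terminates
    if rng.random() < beta:
        omega = int(Q_Omega[s_next].argmax())
    return s_next, omega, done
```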


Experiments