
Stackelberg Actor-Critic: Game-Theoretic Reinforcement Learning Algorithms

Link: https://arxiv.org/abs/2109.12286
Authors: Liyuan Zheng, Tanner Fiez, Zane Alumbaugh, Benjamin Chasnov, Lillian J. Ratliff
Written by: Pronoma Banerjee


Abstract and Introduction

The paper discusses the following:

  1. It models the actor-critic interaction in actor-critic reinforcement learning algorithms as a two-player general-sum Stackelberg game with a leader-follower structure.
  2. Modeling contributions: Proposes a framework of Stackelberg actor-critic algorithms in which the leader follows the total derivative of its objective, defined via the implicit function theorem, while the follower updates using individual gradient dynamics, and compares this with previous approaches that use individual gradients for both the actor and the critic.
  3. Theoretical contributions: Develops a policy gradient theorem (Theorem 1) for the refined total-derivative update, builds a meta-framework of Stackelberg actor-critic algorithms adapting the standard actor-critic, deep deterministic policy gradient, and soft actor-critic algorithms, and provides a local convergence guarantee (Theorem 2) to a local Stackelberg equilibrium defined by gradient-based sufficient conditions.
  4. Experimental contributions: Shows through examples that Stackelberg gradient dynamics mitigate cycling and accelerate convergence. Further experiments show that Stackelberg actor-critic algorithms always perform at least as well as, and often significantly outperform, their standard actor-critic counterparts.

Motivation and Preliminaries

We first start by stating the implicit function theorem used for the objective function of the leader in a Stackelberg game:

Implicit function theorem
Let $f = (f_1, f_2, \ldots, f_n)$ be a vector-valued function defined on an open set $S \subseteq \mathbb{R}^{n+k}$ with values in $\mathbb{R}^n$. Suppose $f \in C^1$ on $S$. Let $(x_0, t_0)$ be a point in $S$ for which $f(x_0, t_0) = 0$ and for which the $n \times n$ determinant $\det[D_j f_i(x_0, t_0)] \neq 0$. Then there exists a $k$-dimensional open set $T_0$ containing $t_0$ and one and only one vector-valued function $g$, defined on $T_0$ and having values in $\mathbb{R}^n$, such that:

  1. $g \in C^1$ on $T_0$,
  2. $g(t_0) = x_0$,
  3. $f(g(t), t) = 0$ for all $t \in T_0$.
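
To connect the theorem to how it is used below, here is a small numerical check (our own example, not from the paper) that the implicit derivative $g'(t) = -(D_x f(x, t))^{-1} D_t f(x, t)$ matches a finite-difference estimate in a case where $g$ is known in closed form.

```python
# Numerical check of the implicit-derivative formula g'(t) = -(D_x f)^{-1} D_t f
# for f(x, t) = x^2 + t^2 - 1, whose zero set near (x0, t0) = (0.6, 0.8) is the
# explicit branch g(t) = sqrt(1 - t^2).
import numpy as np

def f(x, t):
    return x**2 + t**2 - 1.0

def g(t):                        # the implicit function, known in closed form here
    return np.sqrt(1.0 - t**2)

t0 = 0.8
x0 = g(t0)                       # f(x0, t0) = 0 and D_x f(x0, t0) = 2*x0 != 0

# Implicit function theorem: differentiate f(g(t), t) = 0 in t.
dfdx = 2.0 * x0
dfdt = 2.0 * t0
implicit_slope = -dfdt / dfdx    # = -(D_x f)^{-1} D_t f

# Finite-difference slope of g as a sanity check.
eps = 1e-6
fd_slope = (g(t0 + eps) - g(t0 - eps)) / (2 * eps)

print(implicit_slope, fd_slope)  # both are approximately -4/3
```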

Game-theoretic preliminaries

A Stackelberg game is a game between two agents where one agent is deemed the leader and the other the follower. The leader optimizes its objective assuming that the follower will play its best response. We frame the leader's objective using the implicit function theorem.

Let $f_1(x_1, x_2)$ and $f_2(x_1, x_2)$ be the objective functions that the leader and follower want to minimize, respectively, where $x_1 \in X_1 \subseteq \mathbb{R}^{d_1}$ and $x_2 \in X_2 \subseteq \mathbb{R}^{d_2}$ are their decision variables (strategies) and $x = (x_1, x_2) \in X_1 \times X_2$ is their joint strategy.

The leader L aims to solve the following problem:

$$\min_{x_1 \in X_1} \{ f_1(x_1, x_2) \mid x_2 \in \arg\min_{y \in X_2} f_2(x_1, y) \},$$

and the follower F aims to solve the following problem:

$$\min_{x_2 \in X_2} f_2(x_1, x_2).$$

Since the leader assumes the follower chooses a best response $x_2^\ast(x_1) = \arg\min_{y} f_2(x_1, y)$, the follower's decision variables are implicitly a function of the leader's. The leader uses this when deriving the sufficient conditions for its optimization problem from the total derivative of its cost function:

$$\nabla f_1(x_1, x_2^\ast(x_1)) = \nabla_1 f_1(x) + (\nabla x_2^\ast(x_1))^\top \nabla_2 f_1(x),$$

where $\nabla x_2^\ast(x_1) = -(\nabla_{22}^2 f_2(x))^{-1} \nabla_{21} f_2(x)$.

Hence, a point $x = (x_1, x_2)$ is a local solution to L's problem if $\nabla f_1(x_1, x_2^\ast(x_1)) = 0$ and $\nabla^2 f_1(x_1, x_2^\ast(x_1)) > 0$. For the follower's problem, sufficient conditions for optimality are $\nabla_2 f_2(x_1, x_2) = 0$ and $\nabla_{22}^2 f_2(x_1, x_2) > 0$.

Next we define a concept that gives the sufficient conditions for a local Stackelberg equilibrium:

Differential Stackelberg Equilibrium
The joint strategy $x^\ast = (x_1^\ast, x_2^\ast) \in X_1 \times X_2$ is a differential Stackelberg equilibrium if $\nabla f_1(x^\ast) = 0$, $\nabla_2 f_2(x^\ast) = 0$, $\nabla^2 f_1(x^\ast) > 0$, and $\nabla_{22}^2 f_2(x^\ast) > 0$.

Stackelberg gradient dynamics
These dynamics are derived from the first-order gradient-based sufficient conditions and are given by:

$$x_{1,k+1} = x_{1,k} - \alpha_1 \nabla f_1(x_{1,k}, x_{2,k})$$
$$x_{2,k+1} = x_{2,k} - \alpha_2 \nabla_2 f_2(x_{1,k}, x_{2,k})$$

where $\nabla f_1$ is the leader's total derivative defined above and the follower uses its individual gradient.
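
To make the dynamics concrete, the following minimal numpy sketch (a toy example of ours, not from the paper) runs the Stackelberg gradient dynamics on the quadratic game $f_1(x_1, x_2) = (x_1 - 1)^2 + x_1 x_2 + x_2^2$, $f_2(x_1, x_2) = (x_2 - x_1)^2$. The follower's best response is $x_2^\ast(x_1) = x_1$, so the differential Stackelberg equilibrium is $(1/3, 1/3)$, and the iterates converge to it.

```python
# Stackelberg gradient dynamics on a toy quadratic game:
# leader cost f1 = (x1-1)^2 + x1*x2 + x2^2, follower cost f2 = (x2-x1)^2.
import numpy as np

def total_derivative_f1(x1, x2):
    d1_f1 = 2 * (x1 - 1) + x2          # partial f1 / partial x1
    d2_f1 = x1 + 2 * x2                # partial f1 / partial x2
    d21_f2 = -2.0                      # partial^2 f2 / (partial x2 partial x1)
    d22_f2 = 2.0                       # partial^2 f2 / partial x2^2
    dx2_star = -d21_f2 / d22_f2        # implicit-function-theorem derivative (= 1 here)
    return d1_f1 + dx2_star * d2_f1    # total derivative of the leader's cost

x1, x2 = -2.0, 3.0
a1, a2 = 0.05, 0.5                     # leader slower than follower (two timescales)
for _ in range(500):
    x1 -= a1 * total_derivative_f1(x1, x2)
    x2 -= a2 * 2 * (x2 - x1)           # follower: individual gradient on f2
print(x1, x2)                          # both approach 1/3
```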

Motivating Examples
The actor and critic exhibit a simple hidden structure in their parameters across all actor-critic algorithms. The actor objective typically has a hidden linear structure in the parameters $\theta$, abstractly of the form $Q_w(\theta) = w^\top \mu(\theta)$. The critic objective usually has a hidden quadratic structure in the parameters $w$, abstractly of the form $(R(\theta) - Q_w(\theta))^2$. The actor seeks the action that maximizes the value indicated by the critic, and the critic approximates the rewards of actions generated by the actor. For simplicity, we assume the critic only minimizes the mean square error of the sample action generated by the current actor $\theta$.
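
As a concrete, worked instance of this hidden structure (our own toy example, not one from the paper), take scalar parameters with $\mu(\theta) = \theta$, $Q_w(\theta) = w\theta$, and true reward $R(\theta) = 2\theta - \theta^2$. The actor (leader) ascends its total derivative while the critic (follower) descends its individual gradient; once the critic is near its best response $w^\ast(\theta) = R(\theta)/\theta$, the actor's total derivative reduces to $R'(\theta)$, so the actor climbs the true reward (peak at $\theta = 1$).

```python
# Toy hidden-structure game: actor maximizes J = w*theta (hidden linear),
# critic minimizes L = (R(theta) - w*theta)^2 (hidden quadratic), R(theta) = 2*theta - theta^2.
import numpy as np

def R(theta):  return 2 * theta - theta**2
def dR(theta): return 2 - 2 * theta

theta, w = 0.5, 0.0
a_theta, a_w = 0.01, 0.2
for _ in range(5000):
    # Pieces of the actor-leader total derivative
    #   DJ = dJ/dtheta - d^2L/(dtheta dw) * (d^2L/dw^2)^{-1} * dJ/dw.
    dJ_dtheta = w
    dJ_dw = theta
    dL_dw = -2 * theta * (R(theta) - w * theta)
    d2L_dw2 = 2 * theta**2                               # hidden quadratic structure
    d2L_dthetadw = -2 * (R(theta) - w * theta) - 2 * theta * (dR(theta) - w)
    total_dJ = dJ_dtheta - d2L_dthetadw * dJ_dw / d2L_dw2
    theta += a_theta * total_dJ                          # leader: ascent on J
    w -= a_w * dL_dw                                     # follower: descent on L
print(theta, w)                                          # approx. 1.0 and 1.0
```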

Next we briefly go through the various actor-critic algorithms. The vanilla actor-critic and the deep deterministic policy gradient optimize their objectives by performing individual gradient dynamics (gradient ascent and descent) on the actor and critic parameters, respectively. The soft actor-critic exhibits a similar structure to the others, but with entropic regularization in the actor objective.

  1. Actor-Critic (AC):

    • Relies on the critic function $Q_w(s, a)$, parameterized by $w$, to approximate $Q^{\pi}(s, a)$.
    • The actor, parameterized by $\theta$, has the objective (from the policy gradient theorem)
      $J(\theta, w) = \mathbb{E}_{s \sim \rho,\, a \sim \pi_\theta(\cdot|s)}[Q_w(s, a)]$.
    • The objective is optimized using gradient ascent, where
      $\nabla_\theta J(\theta, w) = \mathbb{E}_{s \sim \rho,\, a \sim \pi_\theta(\cdot|s)}[\nabla_\theta \log \pi_\theta(a|s)\, Q_w(s, a)]$.
    • The critic, parameterized by $w$, minimizes the mean square error between the Q-functions,
      $L(\theta, w) = \mathbb{E}_{s \sim \rho,\, a \sim \pi_\theta(\cdot|s)}[(Q_w(s, a) - Q^{\pi}(s, a))^2]$,
      where $Q^{\pi}(s, a)$ is approximated by Monte Carlo estimation or bootstrapping.
  2. Deep Deterministic Policy Gradient (DDPG):

    • Deterministic actor $\mu_\theta(s): S \to A$ with objective
      $J(\theta, w) = \mathbb{E}_{\xi \sim D}[Q_w(s, \mu_\theta(s))]$.
    • The critic objective is the mean square Bellman error,
      $L(\theta, w) = \mathbb{E}_{\xi \sim D}[(Q_w(s, a) - (r + \gamma Q_0(s', \mu_\theta(s'))))^2]$,
      where $\xi = (s, a, r, s')$, $D$ is a replay buffer, and $Q_0$ is a target $Q$ network.
  3. Soft Actor-Critic (SAC):

    • Exploits double Q-learning and employs entropic regularization to encourage exploration.
    • Actor objective:
      $J(\theta, w) = \mathbb{E}_{\xi \sim D}\big[\min_{i=1,2} Q_{w_i}(s, a_\theta(s)) - \eta \log \pi_\theta(a_\theta(s)|s)\big]$,
      where $a_\theta(s)$ is a sample from $\pi_\theta(\cdot|s)$ and $\eta$ is the entropy regularization coefficient.
    • The critic parameter is the union of both Q-network parameters, $w = \{w_1, w_2\}$, and the critic objective is defined correspondingly by
      $L(\theta, w) = \mathbb{E}_{\xi \sim D}\big[\sum_{i=1,2} (Q_{w_i}(s, a) - y(r, s'))^2\big]$,
      where $y(r, s') = r + \gamma\big(\min_{i=1,2} Q_{0,i}(s', a_\theta(s')) - \eta \log \pi_\theta(a_\theta(s')|s')\big)$.

The target networks in DDPG and SAC are updated by taking the Polyak average of the network parameters over the course of training, and the actor and critic networks are updated by individual gradient dynamics:

$$\theta \leftarrow \theta + \alpha_\theta \nabla_\theta J(\theta, w)$$
$$w \leftarrow w - \alpha_w \nabla_w L(\theta, w)$$
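
A minimal PyTorch-style sketch of these individual gradient updates for the vanilla AC objectives, assuming hypothetical `actor` and `critic` modules (with an `actor.log_prob(states, actions)` helper) and a sampled batch; it is a sketch of the update rule, not the authors' implementation.

```python
import torch

def individual_gradient_step(actor, critic, states, actions, q_targets,
                             alpha_theta=1e-3, alpha_w=1e-3):
    # Critic loss L(theta, w): mean squared error against a Q target
    # (Monte Carlo return or bootstrapped estimate).
    critic_loss = ((critic(states, actions) - q_targets) ** 2).mean()

    # Actor surrogate J(theta, w): E[log pi_theta(a|s) * Q_w(s, a)], critic detached.
    log_probs = actor.log_prob(states, actions)            # hypothetical helper
    actor_obj = (log_probs * critic(states, actions).detach()).mean()

    # Individual gradient dynamics: ascent on J for the actor, descent on L for the critic.
    actor_grads = torch.autograd.grad(actor_obj, list(actor.parameters()))
    critic_grads = torch.autograd.grad(critic_loss, list(critic.parameters()))
    with torch.no_grad():
        for p, g in zip(actor.parameters(), actor_grads):
            p.add_(alpha_theta * g)                        # theta <- theta + a_theta * grad_theta J
        for p, g in zip(critic.parameters(), critic_grads):
            p.sub_(alpha_w * g)                            # w <- w - a_w * grad_w L
```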


Stackelberg framework

Meta algorithm


Algorithm: Stackelberg Actor-Critic Framework

Input: actor-critic algorithm ALG, player designations, and learning rate sequences $\alpha_{\theta,k}, \alpha_{w,k}$.

Case 1: if the actor is the leader, update the actor and critic in ALG with:

$$\theta_{k+1} = \theta_k + \alpha_{\theta,k} \nabla J(\theta_k, w_k)$$
$$w_{k+1} = w_k - \alpha_{w,k} \nabla_w L(\theta_k, w_k)$$

Case 2: if the critic is the leader, update the actor and critic in ALG with:

$$\theta_{k+1} = \theta_k + \alpha_{\theta,k} \nabla_\theta J(\theta_k, w_k)$$
$$w_{k+1} = w_k - \alpha_{w,k} \nabla L(\theta_k, w_k)$$

Here $\nabla J$ and $\nabla L$ denote the leader's total derivatives, while $\nabla_\theta J$ and $\nabla_w L$ denote individual gradients.


When the actor is the leader (case 1), the total derivative for the actor is given by:

$$\nabla J(\theta, w) = \nabla_\theta J(\theta, w) - \nabla_{w\theta} L(\theta, w)^\top (\nabla_w^2 L(\theta, w))^{-1} \nabla_w J(\theta, w).$$

When the critic is the leader (case 2), the total derivative for the critic is given by:

$$\nabla L(\theta, w) = \nabla_w L(\theta, w) - \nabla_{\theta w} J(\theta, w)^\top (\nabla_\theta^2 J(\theta, w))^{-1} \nabla_\theta L(\theta, w).$$
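
A sketch of how the actor-as-leader total derivative could be assembled with `torch.autograd` (not the authors' implementation). It forms the critic Hessian explicitly, so it is only practical for small critics, and it includes the $\lambda I$ damping discussed later under implicit map regularization. It assumes `actor_obj` is $J(\theta, w)$ computed without detaching the critic and that the dependencies of `critic_loss` on $\theta$ are visible to autograd (as in the off-policy objectives discussed below).

```python
import torch

def _flat(grads, params=None):
    # Concatenate per-parameter gradients into one vector, zero-filling parameters
    # the differentiated quantity does not depend on (grad returned None).
    if params is not None:
        grads = [torch.zeros_like(p) if g is None else g for p, g in zip(params, grads)]
    return torch.cat([g.reshape(-1) for g in grads])

def stackelberg_actor_gradient(actor_obj, critic_loss, actor_params, critic_params, lam=1e-2):
    """Flattened regularized total derivative for the actor (leader)."""
    # First-order terms; graphs are retained because they are differentiated again below.
    gJ_theta = _flat(torch.autograd.grad(actor_obj, actor_params, retain_graph=True))
    gJ_w = _flat(torch.autograd.grad(actor_obj, critic_params, retain_graph=True))
    gL_w = _flat(torch.autograd.grad(critic_loss, critic_params, create_graph=True))

    n_w = gL_w.numel()
    H_rows, M_rows = [], []
    for i in range(n_w):
        # Row i of the critic Hessian grad_w^2 L and of the mixed block grad_{w theta} L.
        h_i = torch.autograd.grad(gL_w[i], critic_params, retain_graph=True, allow_unused=True)
        m_i = torch.autograd.grad(gL_w[i], actor_params, retain_graph=True, allow_unused=True)
        H_rows.append(_flat(h_i, critic_params))
        M_rows.append(_flat(m_i, actor_params))
    H = torch.stack(H_rows)                      # (n_w, n_w)
    M = torch.stack(M_rows)                      # (n_w, n_theta)

    # (grad_w^2 L + lam I)^{-1} grad_w J, then the total derivative.
    eye = torch.eye(n_w, dtype=H.dtype, device=H.device)
    v = torch.linalg.solve(H + lam * eye, gJ_w)
    return gJ_theta - M.t() @ v
```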

Stackelberg “Vanilla” Actor-Critic (STAC)

The critic assists the actor in learning the optimal policy by approximating the value function of the current policy. The critic aims to select a best response $w^\ast(\theta) = \arg\min_w L(\theta, w)$; thus, the actor plays the role of the leader and the critic that of the follower. The terms needed for the actor's total derivative are

$$\nabla_w J(\theta, w) = \mathbb{E}_{s \sim \rho,\, a \sim \pi_\theta(\cdot|s)}[\nabla_w Q_w(s, a)],$$
$$\nabla_w^2 L(\theta, w) = \mathbb{E}_{s \sim \rho,\, a \sim \pi_\theta(\cdot|s)}\big[2 \nabla_w Q_w(s, a) \nabla_w Q_w(s, a)^\top + 2 (Q_w(s, a) - Q^{\pi}(s, a)) \nabla_w^2 Q_w(s, a)\big].$$

  • Theorem 1: Given an MDP and actor-critic parameters $(\theta, w)$, the gradient of $L(\theta, w)$ with respect to $\theta$ is given by:

    $$\nabla_\theta L(\theta, w) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\nabla_\theta \log \pi_\theta(a_0|s_0)\,(Q_w(s_0, a_0) - Q^{\pi}(s_0, a_0))^2 + \sum_{t=1}^{T} \gamma^t \nabla_\theta \log \pi_\theta(a_t|s_t)\,(Q^{\pi}(s_0, a_0) - Q_w(s_0, a_0))\,Q^{\pi}(s_t, a_t)\Big].$$

Theorem 1 allows us to compute $\nabla_{w\theta} L(\theta, w)$ directly as $\nabla_w(\nabla_\theta L(\theta, w))$: the sampling distribution in $\nabla_\theta L(\theta, w)$ does not depend on $w$, so $\nabla_w$ can be moved inside the expectation.

The critic in AC is often designed to approximate the state value function $V^{\pi}(s)$ instead, which has computational advantages, and the policy gradient can then be computed via advantage estimation. In this formulation,

$$J(\theta, w) = \mathbb{E}_{\tau \sim \pi_\theta}[r(s_0, a_0) + \gamma V_w(s_1)] \quad \text{and} \quad L(\theta, w) = \mathbb{E}_{s \sim \rho}[(V_w(s) - V^{\pi}(s))^2].$$

Then $\nabla_\theta L(\theta, w)$ can be computed by the next proposition.

  • Proposition 1: Given an MDP and actor-critic parameters $(\theta, w)$, if the critic has the objective function $L(\theta, w) = \mathbb{E}_{s \sim \rho}[(V_w(s) - V^{\pi}(s))^2]$, then $\nabla_\theta L(\theta, w)$ is given by
    $$\mathbb{E}_{\tau \sim \pi_\theta}\Big[2 \sum_{t=0}^{T} \gamma^t \nabla_\theta \log \pi_\theta(a_t|s_t)\,(V^{\pi}(s_0) - V_w(s_0))\,Q^{\pi}(s_t, a_t)\Big].$$

Stackelberg DDPG (STDDPG) and SAC (STSAC)

In contrast to on-policy methods, where the critic is designed to evaluate the actor using sampled trajectories generated by the current policy, in off-policy methods the critic minimizes the Bellman error using samples from a replay buffer. Thus, the leader-follower designation between the actor and critic in off-policy methods is not as clear. With the actor as the leader, the algorithms resemble policy-based methods, where the critic plays an approximate best response to evaluate the current actor. With the critic as the leader, the actor plays an approximate best response to the critic's value, resulting in behavior that closely resembles value-based methods.

The objective functions of off-policy methods are defined in expectation over an arbitrary distribution from a replay buffer instead of the distribution induced by the current policy. Thus, each term in the total-derivative updates can be computed directly and estimated from samples.

Convergence guarantee

The following theorem gives a local convergence guarantee to a local Stackelberg equilibrium under the following assumptions: the maps $\nabla J : \mathbb{R}^m \to \mathbb{R}^{m_\theta}$ and $\nabla_w L : \mathbb{R}^m \to \mathbb{R}^{m_w}$ are Lipschitz and $\|\nabla J\| < \infty$; the learning rate sequences for the gradient updates satisfy $\alpha_{\theta,k} = o(\alpha_{w,k})$ and $\sum_k \alpha_{i,k} = \infty$, $\sum_k \alpha_{i,k}^2 < \infty$ for $i \in \{\theta, w\}$; and the stochastic components $\{\epsilon_{i,k}\}$ in the gradient updates are zero-mean, martingale difference sequences.

(A stochastic series is a martingale difference sequence if its expectation with respect to the past is zero.)

  • Theorem 2: Consider an MDP and actor-critic parameters $(\theta, w)$. Let $(\theta^\ast, w^\ast)$ be a locally asymptotically stable Stackelberg equilibrium of the continuous-time limiting system
    $$(\dot{\theta}, \dot{w}) = (\nabla J(\theta, w), -\nabla_w L(\theta, w)).$$
    Under the above assumptions, there exists a neighbourhood $U$ of $(\theta^\ast, w^\ast)$ for which the iterates $(\theta_k, w_k)$ of the discrete-time system
    $$\theta_{k+1} = \theta_k + \alpha_{\theta,k}(\nabla J(\theta_k, w_k) + \epsilon_{\theta,k+1}),$$
    $$w_{k+1} = w_k - \alpha_{w,k}(\nabla_w L(\theta_k, w_k) + \epsilon_{w,k+1})$$
    converge asymptotically to $(\theta^\ast, w^\ast)$ for initializations in $U$.

This result effectively guarantees, using stochastic approximation methods, that the discrete-time dynamics locally converge to a stable, game-theoretically meaningful equilibrium of the continuous-time system, given suitable learning rate sequences and unbiased gradient estimates.

Implicit Map Regularization

The total derivative in the Stackelberg gradient dynamics requires computing the inverse of the follower's Hessian $\nabla_{22}^2 f_2(x)$. Since critic networks in practical reinforcement learning problems may be highly non-convex, $(\nabla_{22}^2 f_2(x))^{-1}$ can be ill-conditioned. Thus, instead of computing this term directly in the Stackelberg actor-critic algorithms, we compute a regularized variant of the form $(\nabla_{22}^2 f_2(x) + \lambda I)^{-1}$. This regularization can be interpreted as the leader viewing the follower as optimizing a regularized cost $f_2(x) + \frac{\lambda}{2}\|x_2\|^2$, while the follower actually optimizes $f_2(x)$. Interestingly, the regularization parameter $\lambda$ serves to interpolate between the Stackelberg and individual gradient updates for the leader.

  • Proposition 2: Consider a Stackelberg game where the leader updates using the regularized total derivative
    $$\nabla^{\lambda} f_1(x) = \nabla_1 f_1(x) - \nabla_{21} f_2(x)^\top (\nabla_{22}^2 f_2(x) + \lambda I)^{-1} \nabla_2 f_1(x).$$
    As $\lambda \to 0$, $\nabla^{\lambda} f_1(x) \to \nabla f_1(x)$, and as $\lambda \to \infty$, $\nabla^{\lambda} f_1(x) \to \nabla_1 f_1(x)$.
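
A quick numerical check of this interpolation (our own example) on the quadratic toy game used earlier: as $\lambda$ grows, the regularized total derivative moves from the Stackelberg total derivative toward the leader's individual gradient.

```python
# Proposition 2 check on f1 = (x1-1)^2 + x1*x2 + x2^2, f2 = (x2-x1)^2, at x = (0.0, 1.0).
import numpy as np

x1, x2 = 0.0, 1.0
d1_f1 = 2 * (x1 - 1) + x2            # individual gradient of the leader (= -1 here)
d2_f1 = x1 + 2 * x2                  # (= 2 here)
d21_f2 = -2.0                        # mixed second derivative of the follower's cost
d22_f2 = 2.0                         # follower Hessian

stackelberg = d1_f1 - d21_f2 * d2_f1 / d22_f2          # lambda -> 0 limit (= 1 here)
for lam in [0.0, 0.1, 10.0, 1e6]:
    reg = d1_f1 - d21_f2 * d2_f1 / (d22_f2 + lam)      # regularized total derivative
    print(lam, reg)
# lam -> 0 recovers the Stackelberg total derivative (1.0);
# lam -> infinity recovers the individual gradient d1_f1 (-1.0).
```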

Experiments

Performance:

  1. STAC, STDDPG, and STSAC are implemented on several tasks in OpenAI Gym environments, with the critic unrolled m steps between each actor step. For each task, STAC with multiple critic unrolling steps performs best. This is because when the critic is closer to its best response, its actual response is closer to what the Stackelberg gradient anticipates for the actor.
  2. Both STDDPG variants (actor as leader and critic as leader) overall perform better than DDPG on the same tasks. However, STSAC does not show a clear advantage over SAC.
  3. In all experiments, when the actor is the leader, the Stackelberg versions either outperform or are comparable to the existing actor-critic algorithms, showing that the Stackelberg framework has an empirical advantage in most tasks.

Game-Theoretic Interpretations
SAC is considered the state-of-the-art model-free reinforcement learning algorithm and it significantly outperforms DDPG on most tasks. The interpretation for this is as follows:

  1. The common interpretation of its advantage is that SAC encourages exploration by penalizing low entropy policies.
  2. The authors further suggest that the hidden structures in AC and DDPG may lead to cyclic behavior in their individual gradient dynamics. SAC constructs a better-conditioned game structure by regularizing the actor objective, which leads the learning dynamics to converge more directly to the equilibrium. This also explains why we observe improved performance with STAC and STDDPG-AL compared to AC and DDPG, while the performance gap between STSAC-AL and SAC is not as significant.
  3. Experimentally, the actor as the leader always outperforms the critic as the leader. The critic objective is typically a mean square error, which yields a hidden quadratic structure, whereas the actor's objective typically has a hidden linear structure due to the parameterization of the Q network and policy. As a result, the critic's cost structure is better suited for computing an approximate local best response, since it is more likely to be well-conditioned. Thus, the critic being the follower is a more natural hierarchical structure for the game.

Conclusion

The paper revisits standard actor-critic algorithms from a game-theoretic perspective to capture their hierarchical interaction structure and introduces a Stackelberg framework for actor-critic algorithms. The rest of the paper shows, through theory and experiments, that the Stackelberg actor-critic algorithms perform at least as well as, and often outperform, their existing counterparts when the actor plays the leader.


Other Relevant References:

  1. https://hackmd.io/@FtbpSED3RQWclbmbmkChEA/rJFUQA1QO
  2. http://proceedings.mlr.press/v119/fiez20a/fiez20a-supp.pdf
  3. https://sci-hub.se/10.1007/978-1-4612-1029-0