Link: https://arxiv.org/abs/2109.12286
Authors: Liyuan Zheng, Tanner Fiez, Zane Alumbaugh, Benjamin Chasnov, Lillian J. Ratliff
Written by: Pronoma Banerjee
The paper discusses the following:
We start by stating the implicit function theorem, which is used to formulate the leader's objective in a Stackelberg game:
Implicit function theorem
Let $f = (f_1, \dots, f_n)$ be a vector-valued function defined on an open set $S$ in $\mathbb{R}^{n+k}$ with values in $\mathbb{R}^n$. Suppose $f \in C^1$ on $S$. Let $(x_0; t_0)$ be a point in $S$ for which $f(x_0; t_0) = 0$ and for which the $n \times n$ determinant $\det[\nabla_x f(x_0; t_0)] \neq 0$. Then there exists a $k$-dimensional open set $T_0$ containing $t_0$ and one and only one vector-valued function $g$, defined on $T_0$ and having values in $\mathbb{R}^n$, such that $g \in C^1$ on $T_0$, $g(t_0) = x_0$, and $f(g(t); t) = 0$ for every $t \in T_0$.
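As a quick illustration (a standard textbook example, not from the paper), consider $f(x, t) = x^2 + t^2 - 1$ with $n = k = 1$. At any point $(x_0, t_0)$ on the unit circle with $x_0 \neq 0$ we have $f(x_0, t_0) = 0$ and $\partial f / \partial x = 2x_0 \neq 0$, so the theorem yields a unique $C^1$ function $g$ on a neighborhood of $t_0$, here $g(t) = \operatorname{sign}(x_0)\sqrt{1 - t^2}$, with $f(g(t), t) = 0$. In the Stackelberg setting below, the same mechanism yields the follower's best response $x_2^*(x_1)$ as an implicit function of the leader's strategy.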
Game theoretic preliminaries
A Stackelberg game is a game between two agents where one agent is deemed the leader and the other the follower. The leader optimizes its objective assuming that the follower will play its best response. We frame the leader's objective using the implicit function theorem.
Let $f_1(x_1, x_2)$ and $f_2(x_1, x_2)$ be the objective functions that the leader and follower want to minimize, respectively, where $x_1 \in X_1$ and $x_2 \in X_2$ are their decision variables or strategies and $x = (x_1, x_2) \in X_1 \times X_2$ is their joint strategy.
The leader L aims to solve the following problem:
$$\min_{x_1 \in X_1} \big\{ f_1(x_1, x_2) \;:\; x_2 \in \arg\min_{y \in X_2} f_2(x_1, y) \big\}$$
and the follower F aims to solve the following problem:
$$\min_{x_2 \in X_2} f_2(x_1, x_2).$$
Since the leader assumes the follower chooses a best response $x_2^*(x_1) = \arg\min_{y \in X_2} f_2(x_1, y)$, the follower's decision variables are implicitly a function of the leader's. The leader utilizes this while deriving the sufficient conditions for its optimization, from the total derivative of its cost function:
$$\mathrm{D} f_1(x_1, x_2^*(x_1)) = \nabla_{x_1} f_1(x) + \big(\nabla x_2^*(x_1)\big)^\top \nabla_{x_2} f_1(x),$$
where, by the implicit function theorem,
$$\nabla x_2^*(x_1) = -\big(\nabla_{x_2}^2 f_2(x)\big)^{-1} \nabla_{x_2 x_1} f_2(x).$$
Hence, a point $x = (x_1, x_2^*(x_1))$ is a local solution to L's objective if $\mathrm{D} f_1(x) = 0$ and $\mathrm{D}^2 f_1(x) > 0$. For the follower's problem, sufficient conditions for optimality are $\nabla_{x_2} f_2(x) = 0$ and $\nabla_{x_2}^2 f_2(x) > 0$.
Next we define a concept that gives the sufficient conditions for a local Stackelberg equilibrium:
Differential Stackelberg Equilibrium
The joint strategy $x^* = (x_1^*, x_2^*)$ with $x_2^* = x_2^*(x_1^*)$ is a differential Stackelberg equilibrium if $\mathrm{D} f_1(x^*) = 0$, $\mathrm{D}^2 f_1(x^*) > 0$, $\nabla_{x_2} f_2(x^*) = 0$, and $\nabla_{x_2}^2 f_2(x^*) > 0$.
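As a toy example (our own, not taken from the paper), take $f_1(x_1, x_2) = (x_1 - 1)^2 + x_2^2$ and $f_2(x_1, x_2) = (x_2 - x_1)^2$. The follower's best response is $x_2^*(x_1) = x_1$, so $\nabla x_2^*(x_1) = 1$ and
$$\mathrm{D} f_1(x) = 2(x_1 - 1) + 1 \cdot 2x_2,$$
which, on the best-response set $x_2 = x_1$, vanishes at $x_1 = 1/2$. Since $\mathrm{D}^2 f_1 = 4 > 0$ and $\nabla_{x_2}^2 f_2 = 2 > 0$, the point $x^* = (1/2, 1/2)$ is a differential Stackelberg equilibrium.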
Stackelberg gradient dynamics
These dynamics are derived from the first-order gradient-based sufficient conditions and are given by:
$$x_{1,k+1} = x_{1,k} - \gamma_1(k)\, \mathrm{D} f_1(x_k), \qquad x_{2,k+1} = x_{2,k} - \gamma_2(k)\, \nabla_{x_2} f_2(x_k).$$
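The following is a minimal NumPy sketch of these dynamics on the toy game from the example above; the functions, step sizes, and initialization are our own illustrative choices, not the paper's.

```python
import numpy as np

# Toy game from the example above (not from the paper):
#   leader   f1(x1, x2) = (x1 - 1)^2 + x2^2
#   follower f2(x1, x2) = (x2 - x1)^2
def D_f1(x1, x2):
    # D f1 = grad_{x1} f1 + (grad x2*(x1))^T grad_{x2} f1,
    # with grad x2*(x1) = -(grad^2_{x2} f2)^{-1} grad_{x2 x1} f2 = 1 here.
    return 2 * (x1 - 1) + 2 * x2

def grad_x2_f2(x1, x2):
    return 2 * (x2 - x1)

x1, x2 = 2.0, -1.0          # arbitrary initialization
g1, g2 = 0.05, 0.1          # leader on a slower timescale than the follower
for k in range(2000):
    x1 -= g1 * D_f1(x1, x2)          # leader: Stackelberg (total-derivative) step
    x2 -= g2 * grad_x2_f2(x1, x2)    # follower: individual gradient step

print(x1, x2)   # both approach 0.5, the differential Stackelberg equilibrium
```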
Motivating Examples
The actor and critic exhibit a simple hidden structure in their parameters for all actor-critic algorithms: the actor objective typically has a hidden linear structure in the critic parameters, while the critic objective typically has a hidden quadratic structure in its parameters (a squared-error loss). The actor seeks to find the action that maximizes the value indicated by the critic, and the critic approximates the rewards of the actions generated by the actor. For simplicity, we assume the critic only minimizes the mean squared error on the sampled actions generated by the current actor $\pi_\theta$.
Next we briefly go through the various actor-critic algorithms. The vanilla actor-critic and the deep deterministic policy gradient optimize their objectives by performing individual gradient dynamics (gradient descent and ascent) on the actor and critic parameters. The soft actor-critic exhibits a similar structure to the others, but with entropic regularization in the actor objective.
Actor Critic (AC):
Deep Deterministic Policy Gradient (DDPG):
Soft Actor Critic (SAC):
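For reference, the actor and critic objectives of these three algorithms can be written in their standard textbook form (our notation, which may differ slightly from the paper's; $\mathcal{D}$ denotes the replay buffer, $w'$ and $\theta'$ target-network parameters, $\hat{Q}$ a sampled return estimate, and $\alpha$ the entropy temperature):

AC: $f_1(\theta, w) = -\,\mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\big[Q_w(s, a)\big]$, $\quad f_2(\theta, w) = \tfrac{1}{2}\,\mathbb{E}\big[(Q_w(s, a) - \hat{Q}(s, a))^2\big]$

DDPG: $f_1(\theta, w) = -\,\mathbb{E}_{s \sim \mathcal{D}}\big[Q_w(s, \mu_\theta(s))\big]$, $\quad f_2(\theta, w) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\big[(Q_w(s, a) - r - \gamma Q_{w'}(s', \mu_{\theta'}(s')))^2\big]$

SAC: $f_1(\theta, w) = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\theta}\big[\alpha \log \pi_\theta(a \mid s) - Q_w(s, a)\big]$, $\quad f_2(\theta, w) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\big[(Q_w(s, a) - r - \gamma\, \mathbb{E}_{a' \sim \pi_\theta}[Q_{w'}(s', a') - \alpha \log \pi_\theta(a' \mid s')])^2\big]$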
The target networks in DDPG and SAC are updated by taking the Polyak average of the network parameters over the course of training, and the actor and critic networks are updated by individual gradient dynamics:
$$\theta \leftarrow \theta - \gamma_1 \nabla_\theta f_1(\theta, w), \qquad w \leftarrow w - \gamma_2 \nabla_w f_2(\theta, w).$$
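Below is a minimal sketch of the Polyak (soft) target update with parameters stored as NumPy dictionaries; the names `polyak_update`, `online`, `target`, and the value of `tau` are our own illustrative choices.

```python
import numpy as np

def polyak_update(target, online, tau=0.005):
    """Soft (Polyak-averaged) target-network update used in DDPG/SAC:
    target <- tau * online + (1 - tau) * target, applied parameter-wise."""
    return {k: tau * online[k] + (1.0 - tau) * target[k] for k in online}

# Example with toy parameter dictionaries (shapes are arbitrary):
online = {"W": np.ones((2, 2)), "b": np.zeros(2)}
target = {"W": np.zeros((2, 2)), "b": np.zeros(2)}
target = polyak_update(target, online, tau=0.01)   # target["W"] is now 0.01 * ones
```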
Meta algorithm
Algorithm: Stackelberg Actor-Critic Framework
Input: actor-critic algorithm ALG, player designations, and learning rate sequences $\{\gamma_1(k)\}$ and $\{\gamma_2(k)\}$.
Case 1: if actor is leader, update actor and critic in ALG with:
$$\theta \leftarrow \theta - \gamma_1 \mathrm{D}_\theta f_1(\theta, w), \qquad w \leftarrow w - \gamma_2 \nabla_w f_2(\theta, w).$$
Case 2: if critic is leader, update actor and critic in ALG with:
$$\theta \leftarrow \theta - \gamma_1 \nabla_\theta f_1(\theta, w), \qquad w \leftarrow w - \gamma_2 \mathrm{D}_w f_2(\theta, w).$$
When the actor is the leader (case 1), the total derivative of the actor objective is given by:
$$\mathrm{D}_\theta f_1(\theta, w) = \nabla_\theta f_1(\theta, w) - \nabla_{\theta w} f_2(\theta, w)\big(\nabla_w^2 f_2(\theta, w)\big)^{-1} \nabla_w f_1(\theta, w).$$
When the critic is the leader (case 2), the total derivative of the critic objective is given by:
$$\mathrm{D}_w f_2(\theta, w) = \nabla_w f_2(\theta, w) - \nabla_{w \theta} f_1(\theta, w)\big(\nabla_\theta^2 f_1(\theta, w)\big)^{-1} \nabla_\theta f_2(\theta, w).$$
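To make the case-1 update concrete, here is a minimal JAX sketch (our own code, not from the paper) that computes the actor's total derivative for arbitrary differentiable objectives `f1(theta, w)` and `f2(theta, w)`; the function and variable names are our own, and the critic-as-leader case is symmetric with the roles of the arguments swapped.

```python
import jax
import jax.numpy as jnp

def actor_total_derivative(f1, f2, theta, w):
    """Case 1 (actor leads):
    D_theta f1 = grad_theta f1 - grad_{theta w} f2 (grad_w^2 f2)^{-1} grad_w f1."""
    g1_theta = jax.grad(f1, argnums=0)(theta, w)      # grad_theta f1
    g1_w = jax.grad(f1, argnums=1)(theta, w)          # grad_w f1
    H_w = jax.hessian(f2, argnums=1)(theta, w)        # grad_w^2 f2, shape (d_w, d_w)
    # Jacobian of grad_w f2 with respect to theta, shape (d_w, d_theta);
    # its transpose is grad_{theta w} f2.
    J = jax.jacfwd(jax.grad(f2, argnums=1), argnums=0)(theta, w)
    v = jnp.linalg.solve(H_w, g1_w)                   # (grad_w^2 f2)^{-1} grad_w f1
    return g1_theta - J.T @ v

# Example on the toy game from the earlier section, with 1-dimensional strategies:
f1 = lambda t, w: (t[0] - 1.0) ** 2 + w[0] ** 2
f2 = lambda t, w: (w[0] - t[0]) ** 2
print(actor_total_derivative(f1, f2, jnp.array([2.0]), jnp.array([0.5])))
# prints [3.], matching D f1 = 2(x1 - 1) + 2 x2 = 3 at (2, 0.5)
```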
Stackelberg “Vanilla” Actor-Critic (STAC)
The critic assists the actor in learning the optimal policy by approximating the value function of the current policy. The critic aims to select a best response $w^*(\theta) = \arg\min_w f_2(\theta, w)$. Thus, the actor plays the role of the leader and the critic the follower.
Theorem 1: Given an MDP and actor-critic parameters $(\theta, w)$, the gradient of $f_2(\theta, w)$ with respect to $\theta$ is given by an expectation over the state-action distribution induced by the current policy, obtained via the log-likelihood (score-function) trick.
Theorem 1 allows us to compute $\nabla_{\theta w} f_2(\theta, w)$ directly as $\nabla_w \nabla_\theta f_2(\theta, w)$, since the distribution of $(s, a)$ does not depend on $w$ and $\nabla_w$ can be moved inside the expectation.
The critic in AC is often designed to approximate the state value function $V_w(s)$ instead, which has computational advantages, and the policy gradient can then be computed by advantage estimation. In this formulation,
$$f_1(\theta, w) = -\,\mathbb{E}\big[\log \pi_\theta(a \mid s)\, \hat{A}_w(s, a)\big] \quad \text{with} \quad \hat{A}_w(s, a) = r(s, a) + \gamma V_w(s') - V_w(s),$$
and
$$f_2(\theta, w) = \tfrac{1}{2}\,\mathbb{E}\big[\big(r(s, a) + \gamma V_w(s') - V_w(s)\big)^2\big].$$
Then the total derivative $\mathrm{D}_\theta f_1(\theta, w)$ can be computed by the next proposition in the paper.
Stackelberg DDPG (STDDPG) and SAC (STSAC)
In comparison to on-policy methods where the critic is designed to evaluate the actor using sampled trajectories generated by the current policy, in off-policy methods the critic minimizes the Bellman error using samples from a replay buffer. Thus, the leader and follower designation between the actor and critic in off-policy methods is not as clear. Given the actor as the leader, the algorithms are similar to policy-based methods, where the critic plays an approximate best response to evaluate the current actor. On the other hand, given the critic as the leader, the actor plays an approximate best response to the critic value, resulting in behavior closely resembling that of the value-based methods.
The objective functions of off-policy methods are defined in expectation over a distribution induced by a replay buffer instead of the distribution induced by the current policy. Thus, each term in the total-derivative updates can be computed directly and estimated from samples.
Convergence guarantee
The following theorem gives a local convergence guarantee to a local Stackelberg equilibrium under the assumptions that the maps $\mathrm{D} f_1$ and $\nabla_{x_2} f_2$ are Lipschitz, the learning rate sequences for the gradient updates satisfy $\sum_k \gamma_i(k) = \infty$ and $\sum_k \gamma_i(k)^2 < \infty$ for $i \in \{1, 2\}$ with $\gamma_1(k) = o(\gamma_2(k))$ (the leader updates on a slower timescale than the follower), and the stochastic components in the gradient updates are zero-mean martingale difference sequences.
(A stochastic sequence is a martingale difference sequence if its conditional expectation given the past is zero.)
This result effectively guarantees, via stochastic approximation methods, that the discrete-time dynamics locally converge to a stable, game-theoretically meaningful equilibrium of the corresponding continuous-time system, given suitable learning rate sequences and unbiased gradient estimates.
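Concretely, the continuous-time system referred to here is the limiting two-timescale flow of the Stackelberg gradient dynamics (a standard construction in stochastic approximation, sketched here in the notation of the earlier sections): on the fast timescale the follower flows along
$$\dot{x}_2 = -\nabla_{x_2} f_2(x_1, x_2)$$
with $x_1$ held effectively fixed, while on the slow timescale the leader flows along
$$\dot{x}_1 = -\mathrm{D} f_1(x_1, x_2^*(x_1))$$
with the follower tracking its best response. Differential Stackelberg equilibria are locally stable equilibria of this flow, which is what gives the convergence guarantee its game-theoretic meaning.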
Implicit Map Regularization
The total derivative in the Stackelberg gradient dynamics requires computing the inverse of the follower Hessian $\nabla_{x_2}^2 f_2(x)$. Since critic networks in practical reinforcement learning problems may be highly non-convex, this Hessian can be ill-conditioned. Thus, instead of computing this term directly in the Stackelberg actor-critic algorithms, we compute a regularized variant of the form $\big(\nabla_{x_2}^2 f_2(x) + \lambda I\big)^{-1}$. This regularization method can be interpreted as the leader viewing the follower as optimizing a regularized cost $f_2(x) + \frac{\lambda}{2}\|x_2\|^2$, while the follower actually optimizes $f_2(x)$. Interestingly, the regularization parameter $\lambda$ can serve to interpolate between the Stackelberg and individual gradient updates for the leader.
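To spell out the interpolation (our own restatement of the two limits), the regularized leader update direction is
$$\mathrm{D}^\lambda_{x_1} f_1(x) = \nabla_{x_1} f_1(x) - \nabla_{x_1 x_2} f_2(x)\big(\nabla_{x_2}^2 f_2(x) + \lambda I\big)^{-1} \nabla_{x_2} f_1(x),$$
which recovers the Stackelberg total derivative at $\lambda = 0$, while as $\lambda \to \infty$ the correction term vanishes and the update reduces to the individual gradient $\nabla_{x_1} f_1(x)$.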
Performance: the paper's experiments compare each Stackelberg actor-critic algorithm against its standard counterpart (see the figures in the paper for the learning curves).
Game-Theoretic Interpretations
SAC is considered a state-of-the-art model-free reinforcement learning algorithm, and it significantly outperforms DDPG on most tasks; the paper offers a game-theoretic interpretation of this gap.
The paper revisits the standard actor-critic algorithms from a game-theoretic perspective to capture their hierarchical interaction structure, and introduces a Stackelberg framework for actor-critic algorithms. Through theory and experiments, the rest of the paper shows that the Stackelberg actor-critic algorithms consistently perform at least as well as their existing counterparts when the actor plays the leader.