# Stackelberg Actor-Critic: A Game-Theoretic Perspective
#### Author: [Sharath Chandra](https://sharathraparthy.github.io/)
## [Paper Link](http://aaai-rlg.mlanctot.info/papers/AAAI21-RLG_paper_51.pdf)
## Introduction
RL algorithms can be broadly classified into two families: 1) policy-based methods and 2) value-based methods. While each enjoys its own advantages, both have drawbacks: policy-based methods are known to be sample inefficient, and value-based methods can face a computational bottleneck when solving for the value-maximizing action. Actor-critic methods aim to combine the strengths of both by learning a policy (the actor) alongside a value function (the critic). These techniques are also known to reduce the variance of the return estimate.
This paper views the problem of learning a critic that approximates the expected return of the actor, and an actor that optimizes its policy based on the critic's estimate, as a game between two players. There is an implicit hierarchical structure between these two players, and the authors propose to model the actor-critic interaction with the Stackelberg game formalism. They then leverage the implicit function theorem to derive gradient updates that take into account the implicit dependence between the players' parameters.
## Game theory primer
### Simultaneous games
In game theory, there is a class of games which are simultaneous in nature: the players choose their actions simultaneously, without knowledge of the actions taken by the other players. Formally, a simultaneous game has a set of $n$ players, $\{1, 2, 3, \dots , n \}$, and each player $i$ has its own strategy set $\mathcal{S}_i$ and pay-off $u_i: \mathcal{S} \rightarrow \mathbb{R}$. The most common representation of this type of game is the matrix (normal-form) representation. A two-player simultaneous game can be represented by a $|\mathcal{S}_i| \times |\mathcal{S}_j|$ pay-off matrix, where $|\mathcal{S}_i|$ and $|\mathcal{S}_j|$ are the numbers of strategies of players $i$ and $j$. The solution concept for these games is the *Nash equilibrium*, under which no player can gain by unilaterally deviating from their strategy. Formally, a strategy profile $s^\star = (s^\star_i, s^\star_{-i})$ is said to be a Nash equilibrium if
\begin{equation}
u_i(s^\star_i, s^\star_{-i}) \geq u_i(s_i, s^\star_{-i}) \ \ \ \forall s_i \in \mathcal{S}_i
\end{equation}
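As a quick concrete check (my own toy example, not from the paper), the snippet below enumerates the pure-strategy Nash equilibria of a prisoner's-dilemma-style bimatrix game by testing the inequality above for both players.

```python
import numpy as np

# Illustrative 2x2 bimatrix game (prisoner's-dilemma-style payoffs).
# Rows: player 1's strategies, columns: player 2's strategies.
U1 = np.array([[-1.0, -3.0],
               [ 0.0, -2.0]])  # payoffs of player 1
U2 = np.array([[-1.0,  0.0],
               [-3.0, -2.0]])  # payoffs of player 2

def is_pure_nash(i, j):
    """Check u_i(s*_i, s*_{-i}) >= u_i(s_i, s*_{-i}) for both players."""
    best_for_1 = U1[i, j] >= U1[:, j].max()   # player 1 cannot gain by deviating
    best_for_2 = U2[i, j] >= U2[i, :].max()   # player 2 cannot gain by deviating
    return best_for_1 and best_for_2

equilibria = [(i, j) for i in range(2) for j in range(2) if is_pure_nash(i, j)]
print(equilibria)  # [(1, 1)]: mutual defection is the unique pure Nash equilibrium
```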
### Extensive-form games
The normal-form games discussed above do not incorporate any notion of sequence or time. Extensive-form games are another class of games that add a temporal structure: the players interact sequentially, each optimizing their own objective. These games are generally represented by a tree. They are further classified into two types, depending on whether the players can observe the other players' actions: **perfect-information games** and **imperfect-information games**. These are formally defined as follows:
#### Definition 1.0 (Perfect-information games)
A perfect-information extensive-form game is defined by a tuple $(N, A, H, Z, \chi, \rho, \sigma, u)$ composed of:
1. $N$ is a set of players, $\{1, 2, \dots, n\}$
1. $A$ is the action set for all players
1. $H$ is a finite set of possible action sequences (choice nodes)
1. $Z$ is a finite set of terminal nodes which is disjoint from $H$
1. $\chi: H \rightarrow 2^A$ is an action function which assigns to each choice node a set of available actions
1. $\rho: H \rightarrow N$ is a player function which assigns each non-terminal node $h$ to a player $i \in N$ who chooses an action at $h$
1. $\sigma: H \times A \rightarrow H \cup Z$ is a successor function which maps a choice node and an action to a new choice node or a terminal node, such that $\forall h_1, h_2 \in H$ and $a_1, a_2 \in A$, if $\sigma(h_1, a_1) = \sigma(h_2, a_2)$ then $h_1 = h_2$ and $a_1 = a_2$.
1. $u = \{u_1, u_2, \dots , u_n \}$; $u_i: Z \rightarrow \mathbb{R}$ is the utility function of the player $i \in N$
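To make the tuple concrete, here is a minimal sketch (my own illustration, not part of the formal definition) of a perfect-information game tree, where each choice node stores the acting player $\rho(h)$, the available actions $\chi(h)$, and the successors $\sigma(h, a)$ as child links, and each terminal node stores the utility vector $u(z)$.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Node:
    """A choice node h (player, actions, children) or a terminal node z (utilities)."""
    player: Optional[int] = None                                # rho(h): who moves at h
    actions: List[str] = field(default_factory=list)            # chi(h): available actions
    children: Dict[str, "Node"] = field(default_factory=dict)   # sigma(h, a): successors
    utilities: Optional[List[float]] = None                     # u(z) for terminal nodes

# A tiny two-player game: player 1 moves first, then player 2 may respond.
root = Node(player=1, actions=["L", "R"], children={
    "L": Node(player=2, actions=["l", "r"], children={
        "l": Node(utilities=[2.0, 1.0]),
        "r": Node(utilities=[0.0, 0.0]),
    }),
    "R": Node(utilities=[1.0, 2.0]),
})
```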
#### Definition 1.1 (Imperfect-information games)
An imperfect-information extensive-form game is defined by a tuple $(N, A, H, Z, \chi, \rho, \sigma, u, I)$ composed of:
1. A perfect-information extensive-form game $(N, A, H, Z, \chi, \rho, \sigma, u)$ and,
1. $I = \{I_1, I_2, \dots , I_n \}$, where $I_i = \{I_{i, 1}, \dots , I_{i, k}\}$ is an equivalence relation on (i.e., a partition of) $\{h \in H : \rho(h) = i\}$ with the property that $\chi(h) = \chi(h')$ and $\rho(h) = \rho(h')$ whenever there exists a $j$ for which $h \in I_{i, j}$ and $h' \in I_{i, j}$
## Stackelberg Games
Stackelberg games are sequential games in which the players move in a pre-specified order. The $i^{\text{th}}$ player selects a strategy $s_i \in \mathcal{S}_i$, and the $(i+1)^{\text{th}}$ player observes all the preceding strategies $s_{0:i}$. Hence there is a hierarchy between the player who moves first, *the leader*, and the players who move after the leader, *the followers*. The leader optimizes its own objective with the knowledge that the followers will select a best response. To formalize this, consider a two-player Stackelberg game where the leader's pay-off is $u_l: \mathcal{S} \rightarrow \mathbb{R}$ and the follower's pay-off is $u_f: \mathcal{S} \rightarrow \mathbb{R}$, with $\mathcal{S} = S_l \times S_f$ the joint strategy space. For sufficiently smooth pay-off functions, the optimization problems of the leader and the follower are as follows;
\begin{equation}
\min_{s_l \in S_l} \{u_l(s_l, s_f) \mid s_f \in \arg \min_{y \in S_f}u_f(s_l, y) \} \ \ \ \ \ (\text{leader's optimization problem})
\end{equation}
\begin{equation}
\min_{s_f \in S_f} u_f(s_l, s_f) \ \ \ \ \ (\text{follower's optimization problem})
\end{equation}
#### Definition 1.3 (Stackelberg equilibrium)
A strategy $s_l^\star \in S_l$ is said to be a Stackelberg equilibrium for the leader if
\begin{equation}
\sup_{s_f \in \mathcal{R}(s_l^\star)} u_l(s_l^\star, s_f) \leq \sup_{s_f \in \mathcal{R}(s_l)} u_l(s_l, s_f) \ \ \ \forall s_l \in S_l
\end{equation}
where $\mathcal{R}(s_l) = \{y \in S_f \mid u_f(s_l, y) \leq u_f(s_l, s_f) \ \forall s_f \in S_f \}$ is the follower's best-response set.
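As a toy illustration with made-up cost matrices (pay-offs are treated as costs to match the minimization convention above), the sketch below finds a Stackelberg equilibrium by enumeration: for each leader strategy it computes the follower's best-response set $\mathcal{R}(s_l)$ and then picks the leader strategy minimizing the worst-case cost over that set.

```python
import numpy as np

# Illustrative cost matrices (leader = row player, follower = column player).
C_leader   = np.array([[1.0, 3.0],
                       [2.0, 0.5]])
C_follower = np.array([[2.0, 1.0],
                       [0.5, 4.0]])

def follower_best_responses(s_l):
    """R(s_l): follower strategies minimizing the follower's cost given the leader's move."""
    costs = C_follower[s_l]
    return np.flatnonzero(np.isclose(costs, costs.min()))

# Leader minimizes its worst-case cost over the follower's best-response set.
leader_values = [max(C_leader[s_l, s_f] for s_f in follower_best_responses(s_l))
                 for s_l in range(C_leader.shape[0])]
s_l_star = int(np.argmin(leader_values))
s_f_star = int(follower_best_responses(s_l_star)[0])
print(s_l_star, s_f_star)  # leader commits to row 1, follower best-responds with column 0
```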
## Stackelberg Actor-critic
We are interested in formulating the interaction between the actor and the critic as a Stackelberg game. Here, the players are:
1. Actor (the leader): aims to carry out policy optimization based on the critic's feedback.
1. Critic (the follower): aims to minimize the error between its value estimate and the actor's true value function.
The optimization problem of this two-player general-sum Stackelberg game, with actor parameters $\theta \in \Theta$ and critic parameters $w \in W$, is given by
\begin{equation}
\max_{\theta \in \Theta} \{J(\theta, w) \mid w \in \arg \min_{w'\in W} L(\theta, w') \}
\end{equation}
where the critic's objective is now a function of both players' parameters;
\begin{equation}
L(\theta, w) = \mathbb{E}\left[(Q_w(s, a) - Q^{\pi_\theta}(s, a))^2 \right]
\end{equation}
To give an accurate value estimate to the actor, the critic aims to play the best response $w^*(\theta) = \arg \min_{w}L(\theta, w)$. The actor then uses this knowledge and aims to solve the following bilevel optimization problem:
\begin{equation}
\begin{split}
\max_\theta J(\theta, w^*(\theta))\\
\text{s.t.} \ \ w^*(\theta) = \arg \min_{w}L(\theta, w)
\end{split}
\end{equation}
Here the leader's optimization exploits the fact that the follower responds optimally to the leader's choice. We will make use of the implicit dependence of the critic's parameters $w$ on the policy parameters $\theta$.
\begin{equation}
\frac{d\, J(\theta, w^*(\theta))}{d\theta} = \frac{\partial{ J(\theta, w)}}{\partial{ \theta}} + \left(\frac{d w^*(\theta)}{d \theta}\right)^{\top}\frac{\partial{ J(\theta, w)}}{\partial{w}}
\end{equation}
where the implicit Jacobian term $\frac{d w^*(\theta)}{d \theta}$ is given by the implicit function theorem;
\begin{equation}
\frac{d w^*(\theta)}{d \theta} = -\left( \frac{\partial^2{L(\theta, w)}}{\partial{w}^2} \right)^{-1}\frac{\partial^2{L(\theta, w)}}{\partial{w}\partial{\theta}}
\end{equation}
So the total derivative can now be written as follows:
\begin{equation}
\frac{d\, J(\theta, w^*(\theta))}{d\theta} = \frac{\partial{ J(\theta, w)}}{\partial{ \theta}} - \left(\frac{\partial^2{L(\theta, w)}}{\partial{w}\partial{\theta}}\right)^{\top}\left( \frac{\partial^2{L(\theta, w)}}{\partial{w}^2} \right)^{-1}\frac{\partial{ J(\theta, w)}}{\partial{w}}
\end{equation}
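As a sanity check on this formula (using made-up quadratic $J$ and $L$, not the actor-critic objectives themselves), the sketch below computes the total derivative via the implicit function theorem and compares it against a finite-difference estimate of $\frac{d}{d\theta} J(\theta, w^*(\theta))$.

```python
import numpy as np

A = np.array([[2.0, 0.3], [0.3, 1.0]])   # Hessian of L w.r.t. w (positive definite)
b = np.array([1.0, -0.5])
c = np.array([0.7, 0.2])

def J(theta, w):   # illustrative actor objective
    return -(theta - 1.0) ** 2 + theta * (c @ w) - 0.5 * (w @ w)

def w_star(theta):  # exact follower best response of L(theta, w) = 0.5 w^T A w - theta b^T w
    return theta * np.linalg.solve(A, b)

def stackelberg_grad(theta):
    w = w_star(theta)
    dJ_dtheta = -2.0 * (theta - 1.0) + c @ w       # partial J / partial theta
    dJ_dw = theta * c - w                          # partial J / partial w
    d2L_dw2 = A                                    # partial^2 L / partial w^2
    d2L_dwdtheta = -b                              # partial^2 L / partial w partial theta
    dw_dtheta = -np.linalg.solve(d2L_dw2, d2L_dwdtheta)   # implicit Jacobian
    return dJ_dtheta + dw_dtheta @ dJ_dw           # total derivative

theta0, eps = 0.4, 1e-5
fd = (J(theta0 + eps, w_star(theta0 + eps)) - J(theta0 - eps, w_star(theta0 - eps))) / (2 * eps)
print(stackelberg_grad(theta0), fd)   # the two values should agree closely
```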
Note that the implicit Jacobian term involves the inverse Hessian of the critic loss, $\left(\frac{\partial^2{L(\theta, w)}}{\partial{w}^2}\right)^{-1}$, which is difficult to compute and invert in practice. For practical purposes one can either drop the implicit Jacobian term altogether (a first-order approximation that recovers the standard actor-critic update) or approximate the inverse, e.g. with the regularized Hessian discussed below. The gradient updates of this Stackelberg game are given by
\begin{equation}
\theta \leftarrow \theta + \alpha \frac{d\, J(\theta, w^*(\theta))}{d\theta} \ \ \ \text{(Actor / leader step)}
\end{equation}
\begin{equation}
w \leftarrow w - \eta \frac{\partial{L(\theta, w)} }{\partial{w}} \ \ \ \text{(Critic / follower step)}
\end{equation}
where $\alpha$ and $\eta$ are the learning rates of the actor (leader) and the critic (follower) respectively.
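Putting the two steps together, here is a schematic alternating update loop on the same toy quadratic $J$ and $L$ as in the sketch above (again an illustration, not the paper's implementation): the critic takes a plain gradient step on $L$, while the actor ascends the total Stackelberg gradient of $J$ evaluated at the current $w$.

```python
import numpy as np

A = np.array([[2.0, 0.3], [0.3, 1.0]])   # toy curvature of L in w
b = np.array([1.0, -0.5])
c = np.array([0.7, 0.2])

theta, w = 0.0, np.zeros(2)
alpha, eta = 0.05, 0.2                   # leader (actor) and follower (critic) step sizes

for step in range(200):
    # Follower (critic) step: plain gradient descent on L(theta, w).
    dL_dw = A @ w - theta * b
    w = w - eta * dL_dw

    # Leader (actor) step: ascend the total (Stackelberg) derivative of J.
    dJ_dtheta = -2.0 * (theta - 1.0) + c @ w
    dJ_dw = theta * c - w
    dw_dtheta = -np.linalg.solve(A, -b)  # implicit Jacobian from the IFT
    theta = theta + alpha * (dJ_dtheta + dw_dtheta @ dJ_dw)

print(theta, w)  # converges near the Stackelberg solution of this toy problem
```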
## Hessian regularization
The implicit Jacobian term involves an inverse Hessian $\left( \frac{\partial^2{L(\theta, w)}}{\partial{w}^2} \right)^{-1}$ which might be ill-conditioned. The paper therefore uses a regularized Hessian, adding $\lambda I$ to the Hessian before inversion so that the matrix is well conditioned.
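In code, the regularization simply adds $\lambda I$ before solving the linear system that appears in the implicit Jacobian (a sketch with made-up values):

```python
import numpy as np

H = np.array([[1.0, 0.999],
              [0.999, 1.0]])       # nearly singular Hessian of the critic loss w.r.t. w
g = np.array([0.3, -0.1])          # illustrative right-hand side from the implicit term
lam = 1e-2

H_reg = H + lam * np.eye(H.shape[0])       # regularized, better-conditioned Hessian
implicit_term = np.linalg.solve(H_reg, g)  # solve instead of forming the inverse explicitly
print(np.linalg.cond(H), np.linalg.cond(H_reg), implicit_term)
```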