# Optimistic Actor Critic
From the LP formulation of an infinite-horizon discounted MDP (with initial-state distribution $\alpha$, transition matrix $P$, reward vector $r$, value function $v$, and dual occupancy variable $\mu$), the Lagrangian can be written as:
\begin{equation}
L(v, \mu) = (1-\gamma)\alpha^\top v + \mu^\top (r+ (\gamma P - I)v)
\end{equation}
which we seek to solve via simultaneous saddle-point optimization:
\begin{equation}
\min_{v} \max_{\mu} L(v,\mu) = \max_{\mu} \min_v L(v,\mu) = L(v^\star, \mu^\star)
\end{equation}
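For concreteness, here is a minimal numpy sketch of this Lagrangian on a random tabular MDP. The MDP itself, the flattened $(s,a)$ indexing, and the helper `lagrangian` are assumptions of the sketch; the $I$ in the text is read as the operator that copies $v(s)$ across actions.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Random tabular MDP (shapes and names are assumptions of this sketch).
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
r = rng.uniform(size=(nS, nA))                  # r(s, a)
alpha = np.full(nS, 1.0 / nS)                   # initial-state distribution

# Flatten (s, a) pairs so the expression matches the vector form above;
# E plays the role of "I": it copies v(s) across actions.
P_sa = P.reshape(nS * nA, nS)
E = np.repeat(np.eye(nS), nA, axis=0)
r_sa = r.reshape(nS * nA)

def lagrangian(v, mu):
    """L(v, mu) = (1 - gamma) * alpha^T v + mu^T (r + (gamma P - I) v)."""
    return (1 - gamma) * alpha @ v + mu @ (r_sa + (gamma * P_sa - E) @ v)

v0 = rng.normal(size=nS)
mu0 = rng.dirichlet(np.ones(nS * nA))           # an occupancy-like dual variable
print(lagrangian(v0, mu0))
```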
### Problem
Recall that a state-action occupancy measure can be factored as $\mu(s,a) = \alpha(s)\, d(a \mid s)$, with $\alpha \in \Delta(S)$ the discounted state-visitation distribution and $d \colon S\times A \to [0,1]$ a randomized policy.
Fix $\alpha$, and solve the following Lagrangian for $(v,d)$:
\begin{equation}
L_{\alpha}(v, d) = (1-\gamma)\alpha^\top \left[v + d^\top (r+ (\gamma P - I)v)\right]
\end{equation}
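The same quantity in the factored $(v, d)$ parameterization, again as a hedged sketch (the tensor shapes and the reading of $d^\top$ as averaging over actions under $d(\cdot \mid s)$ are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
r = rng.uniform(size=(nS, nA))                  # r(s, a)
alpha = np.full(nS, 1.0 / nS)                   # the fixed state distribution alpha

def L_alpha(v, d):
    """L_alpha(v, d) = (1 - gamma) * alpha^T [ v + d^T (r + (gamma P - I) v) ]."""
    delta = r + gamma * np.einsum('sap,p->sa', P, v) - v[:, None]   # r + (gamma P - I) v, per (s, a)
    return (1 - gamma) * alpha @ (v + np.einsum('sa,sa->s', d, delta))

v0 = rng.normal(size=nS)
d0 = rng.dirichlet(np.ones(nA), size=nS)        # a random policy d(a|s)
print(L_alpha(v0, d0))
```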
### Stability
The general iterates can be written as:
\begin{align*}
d_{k+1} &= \psi(v_k, d_k) \\
v_{k+1} &= \phi(v_k, d_k) \\
\end{align*}
Define the combined iterate $z_k = (d_k, v_k)$ and write the coupled update as $z_{k+1} = \varphi(z_k)$. A fixed point $z^\star$ is a point of attraction if the eigenvalues of the Jacobian $D\varphi(z^\star)$ satisfy $|\lambda_i| \leq 1$.
\begin{equation}
D \varphi(z_k) =
\begin{bmatrix}
D_d \psi(v_k, d_k) & D_v \psi(v_k, d_k) \\
D_d \phi(v_k, d_k) & D_v \phi(v_k, d_k)
\end{bmatrix}
\end{equation}
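A quick way to run this check numerically is to build $D\varphi$ by finite differences and look at the eigenvalue moduli. The toy coupled map below is only an illustration of the check, not the actual $(\psi, \phi)$ of this note:

```python
import numpy as np

def numerical_jacobian(f, z, eps=1e-6):
    """Central finite-difference Jacobian of the update map f at z."""
    z = np.asarray(z, dtype=float)
    J = np.zeros((z.size, z.size))
    for i in range(z.size):
        e = np.zeros_like(z); e[i] = eps
        J[:, i] = (f(z + e) - f(z - e)) / (2 * eps)
    return J

# Toy coupled ascent/descent map (illustrative stand-in for psi, phi).
M = np.array([[0.3, 0.1], [0.0, 0.4]])
eta = 0.1
def varphi(z):
    d, v = z[:2], z[2:]
    return np.concatenate([d + eta * M @ v,      # d-player step
                           v - eta * M.T @ d])   # v-player step

J = numerical_jacobian(varphi, np.zeros(4))
print(np.abs(np.linalg.eigvals(J)))              # attraction needs all moduli <= 1
```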
Applying gradient descent-ascent (GDA) blindly gives the iterates:
\begin{align}
d_{k+1} &= d_k + \eta A_k \\
v_{k+1} &= v_k - \eta \left[ (\gamma P - I)^\top d_k + (1-\gamma) \alpha \right]
\end{align}
where $A_k = r + (\gamma P - I) v_k$.
\begin{equation}
D \varphi(z_k) =
\begin{bmatrix}
I & \eta (\gamma P - I) \\
-\eta (\gamma P - I)^\top & I
\end{bmatrix}
\end{equation}
By the block-determinant formula, $\det(D\varphi(z_k)) = \det\!\left(I + \eta^2 (\gamma P - I)^\top (\gamma P - I)\right) \geq 1$, so the Jacobian is nonsingular at every iterate. The coupled system is non-degenerate, so there is hope for saddle-point optimization; note, however, that the eigenvalues of $D\varphi$ are $1 \pm i\eta\sigma_j$, with $\sigma_j$ the singular values of $\gamma P - I$, so their modulus is $\geq 1$ and plain GDA is not a contraction. This is what the optimistic updates below are meant to address.
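The same check on the GDA Jacobian of the (flattened) Lagrangian, as a sketch on a random tabular MDP (the MDP and the variable names are assumptions): the determinant stays at or above one while the eigenvalue moduli creep above one, matching the remark that plain GDA is not a contraction.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, eta = 4, 3, 0.9, 0.1
P_sa = rng.dirichlet(np.ones(nS), size=nS * nA)   # (S*A) x S transition matrix
E = np.repeat(np.eye(nS), nA, axis=0)             # copies v(s) across actions ("I" in the text)
M = gamma * P_sa - E                              # the coupling matrix (gamma P - I)

# Block Jacobian of blind GDA on the bilinear Lagrangian, ordering z = (d, v).
J = np.block([[np.eye(nS * nA), eta * M],
              [-eta * M.T,      np.eye(nS)]])

print("max |eigenvalue|:", np.abs(np.linalg.eigvals(J)).max())                 # just above 1
print("det(I + eta^2 M^T M):", np.linalg.det(np.eye(nS) + eta**2 * M.T @ M))   # >= 1
```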
### Optimistic Mirror Descent
Let $B$ be the Bregman divergence of a strongly convex mirror map $b(\cdot)$. Mirror prox (MP) performs the following iterates:
\begin{align}
\hat{z}_{k+1} &= \arg\min_{z}\ \eta\, D_z F(z_k)^\top z + B(z, z_k) \\
z_{k+1} &= \arg\min_{z}\ \eta\, D_z F(\hat{z}_{k+1})^\top z + B(z, z_k)
\end{align}
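A sketch of the MP recursion in the Euclidean case ($b = \tfrac12\|\cdot\|_2^2$, so both steps reduce to gradient steps) on a toy bilinear saddle $\min_x \max_y x^\top A y$; the toy problem and the operator $F$ are assumptions of the sketch, not the Lagrangian of the note:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 2.0]])
eta = 0.2

def F(z):
    """Saddle operator (grad_x, -grad_y) of the bilinear objective x^T A y."""
    x, y = z[:2], z[2:]
    return np.concatenate([A @ y, -A.T @ x])

z = np.array([1.0, -1.0, 0.5, 2.0])
for _ in range(500):
    z_hat = z - eta * F(z)        # prediction: operator evaluated at z_k
    z = z - eta * F(z_hat)        # correction: operator evaluated at the prediction
print(np.linalg.norm(z))          # -> ~0, the unique saddle point; plain GDA spirals outwards here
```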
### Naive application
We can specialize this to our setting, taking $B = \|\cdot\|_2^2$ for the value function and the (negative) entropy $H(\cdot)$ as mirror map for the policy. The prediction step reads:
\begin{align}
\hat{v}_{k+1} &= v_k - \eta D_vL_\alpha(v_k, d_k) \\
\hat{d}_{k+1} &\propto d_k e^{\eta \hat{A}_{k+1}}
\end{align}
and the correction step:
\begin{align}
v_{k+1} &= \hat{v}_{k+1}- \eta D_vL_\alpha(\hat{v}_{k+1}, \hat{d}_{k+1}) \\
d_{k+1} &\propto d_{k} e^{\eta A_{k+1}}
\end{align}
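A sketch of these two steps on a random tabular MDP. The gradient of $L_\alpha$ in $v$, the step size, and in particular the choice of evaluating the correction at the predicted pair $(\hat{v}_{k+1}, \hat{d}_{k+1})$ are assumptions of the sketch (the open questions below flag exactly this ambiguity); the policy exponent uses the raw advantage $A$, as in the text, rather than the exact $d$-gradient of $L_\alpha$, which carries an extra $(1-\gamma)\alpha(s)$ weight.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, eta = 4, 3, 0.9, 0.05
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
r = rng.uniform(size=(nS, nA))
alpha = np.full(nS, 1.0 / nS)

def advantage(v):                               # A = r + (gamma P - I) v, per (s, a)
    return r + gamma * np.einsum('sap,p->sa', P, v) - v[:, None]

def grad_v(v, d):                               # gradient of L_alpha(v, d) in v
    w = alpha[:, None] * d                      # alpha(s) d(a|s)
    return (1 - gamma) * (alpha + np.einsum('sa,sap->p', w, gamma * P) - w.sum(axis=1))

def mw(d, A):                                   # multiplicative-weights step on the policy
    d = d * np.exp(eta * A)
    return d / d.sum(axis=1, keepdims=True)

v = np.zeros(nS)
d = np.full((nS, nA), 1.0 / nA)
for _ in range(200):
    # prediction: gradients at (v_k, d_k)
    v_hat, d_hat = v - eta * grad_v(v, d), mw(d, advantage(v))
    # correction: gradients at the predicted pair, applied from (v_k, d_k)
    v, d = v - eta * grad_v(v_hat, d_hat), mw(d, advantage(v_hat))
print(d.round(2))                               # each d(.|s) tilts toward high-advantage actions
```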
**Note**
- Which norm should one pick?
- Is there a mismatch between the two objectives?
- Could one allow communication between $v$ and $d$ by sharing intermediate values?
### Ideally:
Ideally, choosing the KL divergence $B(\mu, \mu')=\sum_{s,a} \mu(s,a) \log \frac{\mu(s,a)}{\mu'(s,a)}$ should give us the following iterates:
**prediction**
\begin{align}
\hat{v}_{k+1} &= \arg\min_v L_\alpha(v, d_k) \\
\hat{d}_{k+1} &\propto d_k e^{\hat{A}_{k+1}} \\
\end{align}
**correction**
\begin{align}
v_{k+1} &= \arg\min_v L_{\alpha}(v, \hat{d}_{k+1}) \\
d_{k+1} &\propto d_k e^{A_{k+1}} \\
\end{align}
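The exact $v$-step is not pinned down here ($L_\alpha$ is linear in $v$); one possible reading, purely an assumption of this sketch, is that the critic best-responds by policy evaluation, $v_d = (I - \gamma P_d)^{-1} r_d$, so the prediction/correction pair becomes multiplicative weights on evaluated advantages:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, eta = 4, 3, 0.9, 1.0
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
r = rng.uniform(size=(nS, nA))

def evaluate(d):                                # assumed exact v-step: v_d = (I - gamma P_d)^{-1} r_d
    P_d = np.einsum('sa,sap->sp', d, P)
    r_d = np.einsum('sa,sa->s', d, r)
    return np.linalg.solve(np.eye(nS) - gamma * P_d, r_d)

def advantage(v):                               # A = r + (gamma P - I) v
    return r + gamma * np.einsum('sap,p->sa', P, v) - v[:, None]

def mw(d, A):                                   # d <- d * exp(eta * A), renormalized per state
    d = d * np.exp(eta * A)
    return d / d.sum(axis=1, keepdims=True)

d = np.full((nS, nA), 1.0 / nA)
for _ in range(500):
    v_hat = evaluate(d)                         # prediction
    d_hat = mw(d, advantage(v_hat))
    v = evaluate(d_hat)                         # correction
    d = mw(d, advantage(v))
print(advantage(evaluate(d)).max(axis=1))       # shrinks toward 0 as d becomes greedy w.r.t. its own value
```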
### Lagrangian Dynamics
#### Value functions:
**prediction**
$D\phi(v,d) = D \hat{v}_{k+1} = (-\eta (\gamma P - I), I)$
**correction**
\begin{equation}
D v_{k+1} =
\begin{cases}
(0, I) & \text{if } \hat{d}_{k+1} = d_k \\
(-\eta (\gamma P - I), I) & \text{otherwise}
\end{cases}
\end{equation}
With no communication between players, the extrapolation step does not carry new information to the value functions.
#### Policy
TODO
**prediction**
$D\psi(v,d) = D \hat{d}_{k+1} = (?,\; d_k^\top(G_i(\delta - G_j)))$
**correction**
### What if we solve the regularized Lagrangian?
One can introduce the regularizer directly in the dual formulation and construct a regularized Lagrangian, for instance with an entropy bonus $\tau H(d)$ on the policy (one natural choice; the exact placement and weighting of the regularizer is left open):
\begin{equation}
L^{\tau}_{\alpha}(v, d) = (1-\gamma)\alpha^\top \left[v + d^\top (r+ (\gamma P - I)v)\right] + \tau H(d)
\end{equation}
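A sketch of one such regularized Lagrangian, using an $\alpha$-weighted policy entropy; the weighting and the coefficient $\tau$ are assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, tau = 4, 3, 0.9, 0.1
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
r = rng.uniform(size=(nS, nA))
alpha = np.full(nS, 1.0 / nS)

def L_reg(v, d):
    """L_alpha(v, d) plus tau * alpha-weighted entropy of the policy."""
    delta = r + gamma * np.einsum('sap,p->sa', P, v) - v[:, None]
    entropy = -np.einsum('s,sa->', alpha, d * np.log(d + 1e-12))
    return (1 - gamma) * alpha @ (v + np.einsum('sa,sa->s', d, delta)) + tau * entropy

v0 = rng.normal(size=nS)
d0 = rng.dirichlet(np.ones(nA), size=nS)
print(L_reg(v0, d0))
```

The entropy term makes the objective strongly concave in $d$, which is the usual reason to expect less rotational, better-behaved saddle dynamics.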
#### Open questions:
- Note that the dynamics are subtle: the value function depends on the policy, so when we do a mirror-prox update of the value function, should we use the predicted policy $\hat{d}_{k+1}$ or not?
- What if one introduces some relation with the norm?
- What happens if we introduce the regularizer directly in the Lagrangian and work with the regularized game?
- Which predictor should one use: $\hat{z}_{k+1}$, the latest iterate, averaged iterates, or some filtered iterates?
- How does the off-policyness of the data affect the memory of the gradient predictor?
- Can we use samples instead of exact gradients?
- Does it converge?
- What's the rate?
- Can we do the same reasoning using only operators and the Legendre-Fenchel transform?
#### References
- [Optimism in games](https://arxiv.org/pdf/1311.1869.pdf)
- [Optimism in games 2](https://arxiv.org/pdf/2002.08493.pdf)
- [Optimism in regularized games](https://arxiv.org/pdf/1507.00407.pdf)
- [Optimism in MDPs](https://arxiv.org/pdf/1910.07072.pdf)
- [Optimism in optimization](http://proceedings.mlr.press/v119/joulani20a/joulani20a.pdf)
- [Optimism in MDPs 2](https://arxiv.org/pdf/1909.10904.pdf)
- [Optimism in GANs](https://arxiv.org/pdf/1807.02629.pdf)