# Optimistic Actor Critic
From the LP formulation of an infinite-horizon discounted MDP (with initial-state distribution $\alpha$, transition matrix $P$, reward vector $r$, value function $v$, and dual occupancy variable $\mu$), the Lagrangian can be written as:
\begin{equation}
L(v, \mu) = (1-\gamma)\alpha^\top v + \mu^\top (r+ (\gamma P - I)v)
\end{equation}
which we seek to solve via simultaneous saddle-point optimization:
\begin{equation}
\min_{v} \max_{\mu} L(v,\mu) = \max_{\mu} \min_v L(v,\mu) = L(v^\star, \mu^\star)
\end{equation}
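For concreteness, here is a minimal numpy sketch of this Lagrangian on a random tabular MDP. The MDP itself, the flattened $(s,a)$ indexing, and the helper `lagrangian` are assumptions of the sketch; the $I$ in the text is read as the operator that copies $v(s)$ across actions.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Random tabular MDP (shapes and names are assumptions of this sketch).
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
r = rng.uniform(size=(nS, nA))                  # r(s, a)
alpha = np.full(nS, 1.0 / nS)                   # initial-state distribution

# Flatten (s, a) pairs so the expression matches the vector form above;
# E plays the role of "I": it copies v(s) across actions.
P_sa = P.reshape(nS * nA, nS)
E = np.repeat(np.eye(nS), nA, axis=0)
r_sa = r.reshape(nS * nA)

def lagrangian(v, mu):
    """L(v, mu) = (1 - gamma) * alpha^T v + mu^T (r + (gamma P - I) v)."""
    return (1 - gamma) * alpha @ v + mu @ (r_sa + (gamma * P_sa - E) @ v)

v0 = rng.normal(size=nS)
mu0 = rng.dirichlet(np.ones(nS * nA))           # an occupancy-like dual variable
print(lagrangian(v0, mu0))
```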
### Problem
Recall that a state-action occupancy measure can be factored as $\mu(s,a) = \alpha(s)\, d(a \mid s)$, with $\alpha \in \Delta(S)$ the discounted state-visitation distribution and $d \colon S\times A \to [0,1]$ a randomized policy.
Fix $\alpha$, and solve the following Lagrangian for $(v,d)$:
\begin{equation}
L_{\alpha}(v, d) = (1-\gamma)\alpha^\top \left[v + d^\top (r+ (\gamma P - I)v)\right]
\end{equation}
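The same quantity in the factored $(v, d)$ parameterization, again as a hedged sketch (the tensor shapes and the reading of $d^\top$ as averaging over actions under $d(\cdot \mid s)$ are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
r = rng.uniform(size=(nS, nA))                  # r(s, a)
alpha = np.full(nS, 1.0 / nS)                   # the fixed state distribution alpha

def L_alpha(v, d):
    """L_alpha(v, d) = (1 - gamma) * alpha^T [ v + d^T (r + (gamma P - I) v) ]."""
    delta = r + gamma * np.einsum('sap,p->sa', P, v) - v[:, None]   # r + (gamma P - I) v, per (s, a)
    return (1 - gamma) * alpha @ (v + np.einsum('sa,sa->s', d, delta))

v0 = rng.normal(size=nS)
d0 = rng.dirichlet(np.ones(nA), size=nS)        # a random policy d(a|s)
print(L_alpha(v0, d0))
```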
### Stability
The general iterates can be written as:
\begin{align*}
d_{k+1} &= \psi(v_k, d_k) \\
v_{k+1} &= \phi(v_k, d_k) \\
\end{align*}
Define the combined iterate $z_k = (d_k, v_k)$ and write the coupled update as $z_{k+1} = \varphi(z_k)$. A fixed point $z^\star$ is a point of attraction if the eigenvalues of the Jacobian $D\varphi(z^\star)$ satisfy $|\lambda_i| \leq 1$.
\begin{equation}
D \varphi(z_k) =
\begin{bmatrix}
D_d \psi(v_k, d_k) & D_v \psi(v_k, d_k) \\
D_d \phi(v_k, d_k) & D_v \phi(v_k, d_k)
\end{bmatrix}
\end{equation}
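A quick way to run this check numerically is to build $D\varphi$ by finite differences and look at the eigenvalue moduli. The toy coupled map below is only an illustration of the check, not the actual $(\psi, \phi)$ of this note:

```python
import numpy as np

def numerical_jacobian(f, z, eps=1e-6):
    """Central finite-difference Jacobian of the update map f at z."""
    z = np.asarray(z, dtype=float)
    J = np.zeros((z.size, z.size))
    for i in range(z.size):
        e = np.zeros_like(z); e[i] = eps
        J[:, i] = (f(z + e) - f(z - e)) / (2 * eps)
    return J

# Toy coupled ascent/descent map (illustrative stand-in for psi, phi).
M = np.array([[0.3, 0.1], [0.0, 0.4]])
eta = 0.1
def varphi(z):
    d, v = z[:2], z[2:]
    return np.concatenate([d + eta * M @ v,      # d-player step
                           v - eta * M.T @ d])   # v-player step

J = numerical_jacobian(varphi, np.zeros(4))
print(np.abs(np.linalg.eigvals(J)))              # attraction needs all moduli <= 1
```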
Applying gradient descent-ascent (GDA) blindly gives the iterates:
\begin{align}
d_{k+1} &= d_k + \eta A_k \\
v_{k+1} &= v_k - \eta \left[ (\gamma P - I)^\top d_k + (1-\gamma) \alpha \right]
\end{align}
where $A_k = r + (\gamma P - I) v_k$.
\begin{equation}
D \varphi(z_k) =
\begin{bmatrix}
I & \eta (\gamma P - I) \\
-\eta (\gamma P - I)^\top & I
\end{bmatrix}
\end{equation}
By the block-determinant formula, $\det(D\varphi(z_k)) = \det\!\left(I + \eta^2 (\gamma P - I)^\top (\gamma P - I)\right) \geq 1$, so the Jacobian is nonsingular at every iterate. The coupled system is non-degenerate, so there is hope for saddle-point optimization; note, however, that the eigenvalues of $D\varphi$ are $1 \pm i\eta\sigma_j$, with $\sigma_j$ the singular values of $\gamma P - I$, so their modulus is $\geq 1$ and plain GDA is not a contraction. This is what the optimistic updates below are meant to address.
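The same check on the GDA Jacobian of the (flattened) Lagrangian, as a sketch on a random tabular MDP (the MDP and the variable names are assumptions): the determinant stays at or above one while the eigenvalue moduli creep above one, matching the remark that plain GDA is not a contraction.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, eta = 4, 3, 0.9, 0.1
P_sa = rng.dirichlet(np.ones(nS), size=nS * nA)   # (S*A) x S transition matrix
E = np.repeat(np.eye(nS), nA, axis=0)             # copies v(s) across actions ("I" in the text)
M = gamma * P_sa - E                              # the coupling matrix (gamma P - I)

# Block Jacobian of blind GDA on the bilinear Lagrangian, ordering z = (d, v).
J = np.block([[np.eye(nS * nA), eta * M],
              [-eta * M.T,      np.eye(nS)]])

print("max |eigenvalue|:", np.abs(np.linalg.eigvals(J)).max())                 # just above 1
print("det(I + eta^2 M^T M):", np.linalg.det(np.eye(nS) + eta**2 * M.T @ M))   # >= 1
```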
### Optimistic Mirror Descent
Let $B$ be the Bregman divergence of a strongly convex mirror map $b(\cdot)$. Mirror prox (MP) performs the following iterates:
\begin{align}
\hat{z}_{k+1} &= \arg\min_{z}\ \eta\, D_z F(z_k)^\top z + B(z, z_k) \\
z_{k+1} &= \arg\min_{z}\ \eta\, D_z F(\hat{z}_{k+1})^\top z + B(z, z_k)
\end{align}
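A sketch of the MP recursion in the Euclidean case ($b = \tfrac12\|\cdot\|_2^2$, so both steps reduce to gradient steps) on a toy bilinear saddle $\min_x \max_y x^\top A y$; the toy problem and the operator $F$ are assumptions of the sketch, not the Lagrangian of the note:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 2.0]])
eta = 0.2

def F(z):
    """Saddle operator (grad_x, -grad_y) of the bilinear objective x^T A y."""
    x, y = z[:2], z[2:]
    return np.concatenate([A @ y, -A.T @ x])

z = np.array([1.0, -1.0, 0.5, 2.0])
for _ in range(500):
    z_hat = z - eta * F(z)        # prediction: operator evaluated at z_k
    z = z - eta * F(z_hat)        # correction: operator evaluated at the prediction
print(np.linalg.norm(z))          # -> ~0, the unique saddle point; plain GDA spirals outwards here
```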
### Naive application
We can specialize this to our setting, taking $B = \|\cdot\|_2^2$ for the value function and the (negative) entropy $H(\cdot)$ as mirror map for the policy. The prediction step reads:
\begin{align}
\hat{v}_{k+1} &= v_k - \eta D_vL_\alpha(v_k, d_k) \\
\hat{d}_{k+1} &\propto d_k e^{\eta \hat{A}_{k+1}}
\end{align}
and the correction step:
\begin{align}
v_{k+1} &= \hat{v}_{k+1}- \eta D_vL_\alpha(\hat{v}_{k+1}, \hat{d}_{k+1}) \\
d_{k+1} &\propto d_{k} e^{\eta A_{k+1}}
\end{align}
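A sketch of these two steps on a random tabular MDP. The gradient of $L_\alpha$ in $v$, the step size, and in particular the choice of evaluating the correction at the predicted pair $(\hat{v}_{k+1}, \hat{d}_{k+1})$ are assumptions of the sketch (the open questions below flag exactly this ambiguity); the policy exponent uses the raw advantage $A$, as in the text, rather than the exact $d$-gradient of $L_\alpha$, which carries an extra $(1-\gamma)\alpha(s)$ weight.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, eta = 4, 3, 0.9, 0.05
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
r = rng.uniform(size=(nS, nA))
alpha = np.full(nS, 1.0 / nS)

def advantage(v):                               # A = r + (gamma P - I) v, per (s, a)
    return r + gamma * np.einsum('sap,p->sa', P, v) - v[:, None]

def grad_v(v, d):                               # gradient of L_alpha(v, d) in v
    w = alpha[:, None] * d                      # alpha(s) d(a|s)
    return (1 - gamma) * (alpha + np.einsum('sa,sap->p', w, gamma * P) - w.sum(axis=1))

def mw(d, A):                                   # multiplicative-weights step on the policy
    d = d * np.exp(eta * A)
    return d / d.sum(axis=1, keepdims=True)

v = np.zeros(nS)
d = np.full((nS, nA), 1.0 / nA)
for _ in range(200):
    # prediction: gradients at (v_k, d_k)
    v_hat, d_hat = v - eta * grad_v(v, d), mw(d, advantage(v))
    # correction: gradients at the predicted pair, applied from (v_k, d_k)
    v, d = v - eta * grad_v(v_hat, d_hat), mw(d, advantage(v_hat))
print(d.round(2))                               # each d(.|s) tilts toward high-advantage actions
```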
**Note**
- Which norm should one pick?
- Is there a mismatch between the two objectives?
- Could one allow communication between $v$ and $d$ by sharing intermediate values?
### Ideally:
Ideally, choosing the KL divergence $B(\mu, \mu')=\sum_{s,a} \mu(s,a) \log \frac{\mu(s,a)}{\mu'(s,a)}$ should give us the following iterates:
**prediction**
\begin{align}
\hat{v}_{k+1} &= \arg\min_v L_\alpha(v, d_k) \\
\hat{d}_{k+1} &\propto d_k e^{\hat{A}_{k+1}} \\
\end{align}
**correction**
\begin{align}
v_{k+1} &= \arg\min_v L_{\alpha}(v, \hat{d}_{k+1}) \\
d_{k+1} &\propto d_k e^{A_{k+1}} \\
\end{align}
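The exact $v$-step is not pinned down here ($L_\alpha$ is linear in $v$); one possible reading, purely an assumption of this sketch, is that the critic best-responds by policy evaluation, $v_d = (I - \gamma P_d)^{-1} r_d$, so the prediction/correction pair becomes multiplicative weights on evaluated advantages:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, eta = 4, 3, 0.9, 1.0
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
r = rng.uniform(size=(nS, nA))

def evaluate(d):                                # assumed exact v-step: v_d = (I - gamma P_d)^{-1} r_d
    P_d = np.einsum('sa,sap->sp', d, P)
    r_d = np.einsum('sa,sa->s', d, r)
    return np.linalg.solve(np.eye(nS) - gamma * P_d, r_d)

def advantage(v):                               # A = r + (gamma P - I) v
    return r + gamma * np.einsum('sap,p->sa', P, v) - v[:, None]

def mw(d, A):                                   # d <- d * exp(eta * A), renormalized per state
    d = d * np.exp(eta * A)
    return d / d.sum(axis=1, keepdims=True)

d = np.full((nS, nA), 1.0 / nA)
for _ in range(500):
    v_hat = evaluate(d)                         # prediction
    d_hat = mw(d, advantage(v_hat))
    v = evaluate(d_hat)                         # correction
    d = mw(d, advantage(v))
print(advantage(evaluate(d)).max(axis=1))       # shrinks toward 0 as d becomes greedy w.r.t. its own value
```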
### Lagrangian Dynamics
#### Value functions:
**prediction**
$D\phi(v,d) = D \hat{v}_{k+1} = (-\eta (\gamma P - I), I)$
**correction**
\begin{equation}
D v_{k+1} =
\begin{cases}
(0, I) & \text{if } \hat{d}_{k+1} = d_k \\
(-\eta (\gamma P - I), I) & \text{otherwise}
\end{cases}
\end{equation}
With no communication between players, the extrapolation step does not carry new information to the value functions.
#### Policy
TODO
**prediction**
$D\psi(v,d) = D \hat{d}_{k+1} = (?,\; d_k^\top(G_i(\delta - G_j)))$
**correction**
### What if we solve the regularized Lagrangian?
One can introduce the regularizer directly in the dual formulation and construct a regularized Lagrangian, for instance with an entropy bonus $\tau H(d)$ on the policy (one natural choice; the exact placement and weighting of the regularizer is left open):
\begin{equation}
L^{\tau}_{\alpha}(v, d) = (1-\gamma)\alpha^\top \left[v + d^\top (r+ (\gamma P - I)v)\right] + \tau H(d)
\end{equation}
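A sketch of one such regularized Lagrangian, using an $\alpha$-weighted policy entropy; the weighting and the coefficient $\tau$ are assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, tau = 4, 3, 0.9, 0.1
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
r = rng.uniform(size=(nS, nA))
alpha = np.full(nS, 1.0 / nS)

def L_reg(v, d):
    """L_alpha(v, d) plus tau * alpha-weighted entropy of the policy."""
    delta = r + gamma * np.einsum('sap,p->sa', P, v) - v[:, None]
    entropy = -np.einsum('s,sa->', alpha, d * np.log(d + 1e-12))
    return (1 - gamma) * alpha @ (v + np.einsum('sa,sa->s', d, delta)) + tau * entropy

v0 = rng.normal(size=nS)
d0 = rng.dirichlet(np.ones(nA), size=nS)
print(L_reg(v0, d0))
```

The entropy term makes the objective strongly concave in $d$, which is the usual reason to expect less rotational, better-behaved saddle dynamics.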
#### Open questions:
- Note that the dynamics are subtle: the value function depends on the policy, so when we do a mirror-prox update of the value function, should we use the predicted policy $\hat{d}_{k+1}$ or not?
- What if one introduces some relation with the norm?
- What happens if we introduce the regularizer directly in the Lagrangian and work with the regularized game?
- Which predictor should one use: $\hat{z}_{k+1}$, the latest iterate, averaged iterates, or some filtered iterates?
- How does the off-policyness of the data affect the memory of the gradient predictor?
- Can we use samples instead of exact gradients?
- Does it converge?
- What's the rate?
- Can we do the same reasoning using only operators and the Legendre-Fenchel transform?
#### References
- [Optimism in games](https://arxiv.org/pdf/1311.1869.pdf)
- [Optimism in games 2](https://arxiv.org/pdf/2002.08493.pdf)
- [Optimism in regularized games](https://arxiv.org/pdf/1507.00407.pdf)
- [Optimism in MDPs](https://arxiv.org/pdf/1910.07072.pdf)
- [Optimism in optimization](http://proceedings.mlr.press/v119/joulani20a/joulani20a.pdf)
- [Optimism in MDPs 2](https://arxiv.org/pdf/1909.10904.pdf)
- [Optimism in GANs](https://arxiv.org/pdf/1807.02629.pdf)