# Extension to Bregman divergence
[Toc]
Other Hackmd: https://hackmd.io/6JxqDj-zSfOvCmPZzAZDcg?view
### Single-call extragradient methods
From https://hal.inria.fr/hal-02403555/document
$V : \mathbb R^d \rightarrow \mathbb R^d$ vector field, $\mathcal X$ closed convex set.
- PEG:
$$\left\{\begin{aligned}
&X_{t+1/2} = \Pi_{\mathcal X}(X_t-\gamma V_{t-1/2})\\
&X_{t+1} = \Pi_{\mathcal X}(X_t-\gamma V_{t+1/2})
\end{aligned}\right.$$
- RG:
$$\left\{
\begin{aligned}
&X_{t+1/2} = X_t - (X_{t-1} - X_t)\\
&X_{t+1} = \Pi_{\mathcal X}(X_t-\gamma V_{t+1/2})\\
\end{aligned}
\right.$$
- OG:
$$\left\{
\begin{aligned}
&X_{t+1/2} = \Pi_{\mathcal X}(X_t-\gamma V_{t-1/2})\\
&X_{t+1} = X_{t+1/2} - \gamma(V_{t+1/2} - V_{t-1/2})\\
\end{aligned}
\right.
$$
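For reference, a minimal numpy sketch of these three updates (the box projection, the field `V`, and the step size are placeholder choices of mine, not from the paper):
```python
import numpy as np

def proj_box(x, lo=-1.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^d, standing in for Pi_X."""
    return np.clip(x, lo, hi)

def peg_step(x, v_prev, V, gamma, proj=proj_box):
    """PEG: both half-steps are projection steps from X_t."""
    x_half = proj(x - gamma * v_prev)          # X_{t+1/2}, uses the past V_{t-1/2}
    v_half = V(x_half)                         # V_{t+1/2}, the only new evaluation
    return proj(x - gamma * v_half), v_half    # X_{t+1}, and V_{t+1/2} to carry over

def rg_step(x, x_prev, V, gamma, proj=proj_box):
    """RG: reflected (unprojected) extrapolation, projected update."""
    x_half = 2.0 * x - x_prev                  # X_{t+1/2} = X_t - (X_{t-1} - X_t)
    return proj(x - gamma * V(x_half)), x      # X_{t+1}, and X_t as the new X_{t-1}

def og_step(x, v_prev, V, gamma, proj=proj_box):
    """OG: projected extrapolation, unprojected correction step."""
    x_half = proj(x - gamma * v_prev)          # X_{t+1/2}
    v_half = V(x_half)
    return x_half - gamma * (v_half - v_prev), v_half   # X_{t+1}, V_{t+1/2}
```
Each step uses a single new evaluation of $V$, carried over to the next iteration.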
### Extension to Bregman divergences
Given a $K$-strongly convex DGF $h$ on $\mathcal X$, an initial point $X_1 \in \mathcal X$, and the associated prox-mapping $P_{x}(y)$,
Past Extra-Gradient (PEG):
: $$\left\{\begin{aligned}
&X_{t+1/2} = P_{X_t}(-\gamma_t V_{t-1/2})\\
&X_{t+1} = P_{X_t}(-\gamma_t V_{t+1/2})
\end{aligned}\right.
$$
Reflected Gradient (RG):
: see below
Optimistic Gradient (OG):
: $$ \begin{aligned}
&X_{t+1/2} = P_{X_{t-1/2}}\left(-\gamma_t V_{t-1/2} + \gamma_{t-1}(V_{t-3/2} - V_{t-1/2})\right)\\
\end{aligned}
$$
(and $X_{t+1} = X_{t+1/2} - \gamma(V_{t+1/2} - V_{t-1/2})$ ?)
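For concreteness, a minimal sketch of PEG with a generic Bregman prox-mapping, using the entropic prox on the simplex (MWU) as the concrete DGF; the function names and the initialisation $V_{1/2} = V(X_1)$ are my own choices:
```python
import numpy as np

def prox_entropy(x, y):
    """P_x(y) for h = negative entropy on the simplex: P_x(y)_i ∝ x_i exp(y_i)."""
    z = x * np.exp(y - np.max(y))        # max-shift for numerical stability
    return z / z.sum()

def peg(V, x1, gammas, prox=prox_entropy):
    """Past Extra-Gradient with a Bregman prox-mapping, as written above."""
    x, v_prev = x1, V(x1)                # v_prev plays the role of V_{1/2}
    for g in gammas:
        x_half = prox(x, -g * v_prev)    # X_{t+1/2} = P_{X_t}(-γ_t V_{t-1/2})
        v_prev = V(x_half)               # single new call to V per iteration
        x = prox(x, -g * v_prev)         # X_{t+1}   = P_{X_t}(-γ_t V_{t+1/2})
    return x
```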
## For the 10/06
### Local last-iterate convergence with local strong monotonicity on the boundary
Context:
: Local last-iterate convergence in the deterministic and stochastic settings with local strong monotonicity for PEG, i.e. with,
for $x \in \mathcal X \cap \bar B(x^*, r)$,
$$ \langle V(x), x - x^*\rangle \geq \alpha \|x^* - x\|^2$$
(and $V$ locally $\beta$-Lipschitz)
To relate $D(x^*, X_{t+1/2})$ to $D(x^*, X_t)$, as discussed last time, we used that, for $x \in \mathcal X \cap \bar B(x^*, r)$,
$$D(x^*, x) \leq \frac{L}{2}\|x^* - x\|^2\tag{*}$$
which holds if $h$ is locally $\mathcal C^2$ and $x^* \in \mathrm{ri}\, \mathcal X$, or for the simplex.
On the boundary, and with the entropy (not on the simplex), Hellinger, or fractional power DGFs, condition $(*)$ does not hold.
Indeed, they can be written as $h(x) = \sum_{i} \tilde h(x_i)$.
Then, with $$I =\{i: x^*_i \in \mathrm{dom}\, \partial \tilde h\}$$ and so $$I^C = \{i: x^*_i \in \mathrm{dom}\, \tilde h \setminus \mathrm{dom}\, \partial \tilde h\}\,,$$
$$D(x^*, x) = {D(x^*_I, x_I)} + \Theta\left(\sum_{i \notin I} |x_i^* - x_i|^p\right)$$
with $D(x^*_I, x_I) \leq \frac{L}{2}\|x_I^* - x_I\|^2$ and $p \in (0, 1]$.
Though I did not manage to show different convergence speeds for the coordinates in $I$ and $I^C$, as we briefly mentioned last time, I at least have last-iterate convergence.
More precisely, if instead of $(*)$ we assume that, locally,
$$D(x^*, x)^{1+\alpha} \leq \frac{L}{2}\|x - x^*\|^2\,, \tag{**}$$
we have that, locally, for PEG,
$$D(x^*, X_t) = O\left(\frac{1}{t^{1/\alpha}}\right)$$
For instance, if we have the decomposition above,
$$D(x^*, x) = {D(x^*_I, x_I)} + O\left(\sum_{i \notin I} |x_i^* - x_i|^p\right)$$
then $\alpha = \frac{2}{p} - 1 \geq 1$.
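As a quick sanity check of $(**)$ and of the value of $\alpha$ (my own computation, for the one-dimensional Hellinger DGF with $x^* = 1$ on the boundary of $[-1, 1]$):
```python
import numpy as np

# one-dimensional Hellinger DGF and its Bregman divergence D(x*, x)
h  = lambda x: -np.sqrt(1.0 - x**2)
dh = lambda x:  x / np.sqrt(1.0 - x**2)
D  = lambda xs, x: h(xs) - h(x) - dh(x) * (xs - x)

x_star = 1.0                                  # boundary point of [-1, 1]
x = 1.0 - np.logspace(-8, -1, 8)              # points approaching the boundary
d = D(x_star, x)

# D(x*, x) ~ |x - x*|^p: estimate p from the log-log slope (expect p ≈ 1/2)
p_hat = np.polyfit(np.log(np.abs(x - x_star)), np.log(d), 1)[0]
print("estimated p:", p_hat)

# condition (**) with alpha = 2/p - 1 = 3: the ratio below stays bounded
alpha = 3.0
print("D^(1+alpha) / |x - x*|^2 :", d**(1 + alpha) / np.abs(x - x_star)**2)
```
Here $D(1, x) = \sqrt{(1-x)/(1+x)}$, so $p = 1/2$, $\alpha = 3$, and the printed ratio tends to $1/4$.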
Caveat:
: In this case, $L$ depends on a constant $c$ such that $\|.\|_p \leq c\|.\|$. Depending on the relation between $\|.\|_p$ and $\|.\|$, this may not be dimension-free.
Next step?:
: Check in the stochastic setting? (WIP)
Plots:
: Here are various examples with the Hellinger geometry,
$$h(x) = \sum_i - \sqrt{1 - x_i^2}$$
on $[-1, 1]^d$. In the following, the blue curves correspond to coordinates $i$ such that $x_i^* \in (-1, 1)$ and the red ones to coordinates $i$ such that $x_i^* \in \{-1,+1\}$.
Each curve is $$|x_i(t) - x_i^*|\,.$$
Consider,
$$\min_{x \in [-1, 1]^d} \|x - x^*\|_2^2$$


Consider,
$$\min_{x \in [-1, 1]^d} (x - x^*)^TS(x-x^*)$$
where $S$ is a random positive-definite matrix.


Consider,
$$\min_{x \in [-1, 1]}\max_{y \in [-1, 1]} (x - x^*)^T(y-y^*)$$

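For reference, a sketch of how the first experiment (the quadratic $\min_x \|x - x^*\|_2^2$) can be reproduced; the dimension, the split between interior and boundary coordinates of $x^*$, the step size and the horizon are my own arbitrary choices:
```python
import numpy as np

rng = np.random.default_rng(0)
d, n_int = 6, 3                                  # 3 interior, 3 boundary coordinates
x_star = np.concatenate([rng.uniform(-0.8, 0.8, n_int),
                         rng.choice([-1.0, 1.0], d - n_int)])
V = lambda x: 2.0 * (x - x_star)                 # gradient of ||x - x*||_2^2

# Hellinger DGF h(x) = sum_i -sqrt(1 - x_i^2): mirror map and its inverse
grad_h     = lambda x: x / np.sqrt(1.0 - x**2)
grad_h_inv = lambda z: z / np.sqrt(1.0 + z**2)
prox       = lambda x, y: grad_h_inv(grad_h(x) + y)   # P_x(y); h is steep on [-1,1]^d

gamma, T = 0.1, 5000
x, v_prev = np.zeros(d), V(np.zeros(d))          # X_1 in the interior, V_{1/2} = V(X_1)
errs = np.empty((T, d))
for t in range(T):                               # PEG in the Hellinger geometry
    x_half = prox(x, -gamma * v_prev)            # extrapolation with the past gradient
    v_prev = V(x_half)
    x = prox(x, -gamma * v_prev)                 # update with the fresh gradient
    errs[t] = np.abs(x - x_star)
# errs[:, :n_int] -> "blue" curves (interior x_i*), errs[:, n_int:] -> "red" curves (boundary)
```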
### Error-bounds
There are two main problems with the EB I presented last time:
1. dependence on $\gamma$
2. not applicable to PEG
For the first point, Tseng actually solves this problem (in later papers) by using the following lemma (constrained Euclidean setting).
If $X$ is closed convex, then for any $x \in X$ and $y \in \mathbb R^d$, the map $$\alpha \mapsto \frac{\|x - \Pi_{X}(x + \alpha y)\|}{\alpha}$$
is non-increasing on $(0, +\infty)$.
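A quick numerical check of this lemma on a box (my own check, with the Euclidean projection computed by clipping):
```python
import numpy as np

proj = lambda z: np.clip(z, -1.0, 1.0)        # Euclidean projection onto X = [-1, 1]^d

rng = np.random.default_rng(0)
alphas = np.logspace(-3, 3, 25)
for _ in range(1000):
    x = rng.uniform(-1.0, 1.0, 5)             # x in X
    y = rng.normal(size=5)
    ratios = [np.linalg.norm(x - proj(x + a * y)) / a for a in alphas]
    assert np.all(np.diff(ratios) <= 1e-9)    # the ratio is non-increasing in alpha
```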
For now, I have not managed to adapt this to the Bregman setting... Do you think a similar result still holds ?
:::info
Details: what I would need is something like,
$$\langle \nabla h(x) - \nabla h (y), x - y\rangle \langle \nabla h(w) - \nabla h (z), w - z\rangle \geq c\langle \nabla h(x) - \nabla h (y), w - z\rangle ^2$$ **with $c=1$**
:::
For the second point, I have a partial answer (in the interior, to solve the step-size problem).
Assume $\mathrm{ri} \mathcal X \cap \mathcal X^* \neq \emptyset$. If one of the following holds:
1. $V$ (locally) strongly monotone
2. $\mathcal X$ polyhedral, $V$ affine
3. $\mathcal X$ polyhedral, $\nabla V(x^*)$ is "regular", i.e.,
$$\nabla V(x^*)^{-1}(\mathrm{aff} \mathcal X - x^*)^{\bot} \cap (\mathrm{aff} \mathcal X - x^*) = \emptyset$$
Then there exists $\tau = \tau(A, V) > 0$ such that,
$\forall x_1, x_2 \in \mathcal X,\ \gamma > 0$ s.t. $x^+ = P_{x_1}(-\gamma V(x_2)) \in \mathrm{ri}\, \mathcal X$, $$\exists x^* \in \mathcal X^*,\ \tau\|x_2 - x^*\| \leq \|x_2 - x^+\| + \frac{1}{\gamma}\|\nabla h(x_1) - \nabla h(x^+)\|_*$$
(only locally for 1. and 3.)
Does this seem reasonable ?
**Consequence**
In the deterministic case, if (locally),
- $V$ is $\beta$-Lipschitz,
- $V$ satisfies the EB with $\tau > 0$,
- $V$ is $\alpha$-strongly monotone,
- $h$ is $L$-smooth (and $K$-strongly convex),

then PEG converges geometrically with $\gamma = O(K/\beta)$ and rate,
$$1 - \frac{\alpha \gamma}{L} - \frac{K \tau^2 \gamma^2}{16 L^3}$$
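A quick numerical illustration of this geometric decay in the Euclidean special case ($h = \tfrac12\|\cdot\|^2$, so $K = L = 1$, no constraint); the strongly monotone affine field and all constants are my own choices:
```python
import numpy as np

rng = np.random.default_rng(1)
d, mu = 10, 0.5
B = rng.normal(size=(d, d))
A = mu * np.eye(d) + (B - B.T)               # symmetric part = mu*I: mu-strongly monotone
b = rng.normal(size=d)
V = lambda x: A @ x + b                      # beta-Lipschitz with beta = ||A||_2
x_star = np.linalg.solve(A, -b)              # V(x*) = 0

gamma = 0.1 / np.linalg.norm(A, 2)           # step of order K / beta
x, v_prev = np.zeros(d), V(np.zeros(d))
for t in range(1, 3001):                     # PEG, Euclidean and unconstrained
    x_half = x - gamma * v_prev
    v_prev = V(x_half)
    x = x - gamma * v_prev
    if t % 500 == 0:
        print(t, np.linalg.norm(x - x_star)) # decays geometrically
```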
## For the 4/06
### Local last-iterate convergence with local strong monotonicity
Done:
: Local last-iterate convergence in the deterministic and stochastic settings with local strong monotonicity for PEG, i.e. with,
for $x \in \mathcal X \cap \bar B(x^*, r)$,
$$ \langle V(x), x - x^*\rangle \geq \alpha \|x^* - x\|^2$$
(and $V$ locally $\beta$-Lipschitz)
To relate $D(x^*, X_{t+1/2})$ to $D(x^*, X_t)$, as discussed last time, I used that, for $x \in \mathcal X \cap \bar B(x^*, r)$,
$$D(x^*, x) \leq \frac{L}{2}\|x^* - x\|^2\tag{*}$$
which holds if $h$ locally $\mathcal C^2$ and $x^* \in \mathrm{ri} \mathcal X$.
Question 1:
: Is it interesting to look at what happens when $x^*$ is on the boundary? If it is, I was wondering if you had any advice, in particular with the recurrence (see below).
More exactly, assume that $\mathrm{dom}\, \partial h \subset \mathcal X = \mathrm{dom}\, h$ and we look at $x^* \in \mathrm{dom}\, h \setminus \mathrm{dom}\, \partial h$.
- Euclidean constrained case: no problem for $(*)$
- Simplex: no problem as we can restrict to non-zero coordinates:
$$D(x^*, x) = \sum_{i}x^*_i \log\frac{x^*_i}{x_i} = \sum_{i: x_i > 0} x^*_i\log\frac{x^*_i}{x_i}$$
So, if $I = \{i: x^*_i > 0\}$, on $\bar B(x^*, r)$,
$$D(x^*, x) \leq \frac{1}{2}\max_{i \in I} \frac{1}{x^*_i - r}\left(\sum_{i \in I} |x_i^* - x_i|\right)^2$$
- Entropy (not on the simplex), Hellinger, Fractional Power:
They can be written as $h(x) = \sum_{i} \tilde h(x_i)$.
Then, with $$I =\{i: x^*_i \in \mathrm{dom}\, \partial \tilde h\}$$ and so $$I^C = \{i: x^*_i \in \mathrm{dom}\, \tilde h \setminus \mathrm{dom}\, \partial \tilde h\}\,,$$
$$D(x^*, x) = {D(x^*_I, x_I)} + \Theta\left(\sum_{i \notin I} |x_i^* - x_i|^p\right)$$
with $D(x^*_I, x_I) \leq \frac{L}{2}\|x_I^* - x_I\|^2$ and $p \in (0, 1]$.
One can then show inequalities of the form,
$$
D(x^*, X_{t+1}) + \mu_{t+1} \leq \left(1 - \frac{\gamma \alpha}{L}\right)\left(D(x^*_I, X_{t,I}) + \mu_t\right) + \lambda\sum_{i \notin I} |x_i^* - X_{t,i}|^p
$$
(where $\mu_t = \frac{\gamma^2}{2K}\|V_{t+1/2} - V_{t-1/2}\|^2$)
$$
D(x^*, X_{t+1}) + \mu_{t+1} \leq
D(x^*, X_{t}) + \mu_{t} - \frac{\alpha\gamma}{L}\|X_t - x^*\|^2$$
However, I do not know where to go from here...
Question 2:
: For the local stochastic setting, we get a sublinear rate with a vanishing step-size. Can I try to extend the stochastic analysis of YG to handle the case of a fixed step-size ?
### Error-bounds
Done: Adapted a local Error-bound from [Luo, Tseng, 1992](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.158.6238&rep=rep1&type=pdf) to the Bregman case.
If
- $X = \{x: Ax \leq b\}$ polyhedral,
- $K \subset dom\ \partial h \subset X$ compact,
- $V$ affine
- $\gamma > 0$ **fixed**,
there exists $\delta, \tau > 0$, such that, if $x \in K$ and $x^+ = P_x(-\gamma V(x))$,
$$\|x - x^+\| \leq \delta \implies d(x, X^* \cap K) \leq \tau(\|x - x^+\| + \|\nabla h(x) - \nabla h(x^+)\|)$$
Questions:
: Can I continue in this direction? For instance, what is the dependence on $\gamma$? Can we prove a similar bound with strong monotonicity (instead of $V$ affine)?
Caveat:
: However, this is not the right quantity for PEG... Indeed, we want a lower bound on
$$D(X_{t+1/2}, X_t) = D(P_{X_t}(-\gamma V(X_{t-1/2})), X_t)$$
On the discussion on Slack:
: However, for PEG, in the interior, $X_{t+1/2} = X_t$ still implies that $X_{t-1/2}$ is a solution. The "weak" sense is also true:
if $(X_t)_t$ is a sequence whose cluster points are in $\mathrm{dom}\, \partial h$, $\gamma > 0$, and $X_{t+1/2} - X_t \rightarrow 0$, then there is a subsequence which converges to a solution.
### Generalization of RG
Update:
: There was a mistake in my original proof. There are two possible fixes for now.
Question:
: Is one of these fixes still interesting to you ?
1. Do not use PEG but MP. The proof of ergodic convergence works in this case.
2. Essentially, "treat the difference between MP and PEG as noise", but a vanishing step size is needed. For the simplex the result is still dimension-free however.
Details for 2.:
: The resulting bound is the following, for $p \in \mathcal X_2$, $X_{1/2} \in \mathcal X$, $X_1 \in \mathrm{dom} \partial h_2$,
$$
\langle V(p), \overline{X_t} - p \rangle \leq \frac{D(p, X_1)}{\sum_{s=1}^t \gamma_s} + \frac{3\beta^2(K_1 + K_2)^2}{K_1^2K_2^2} \frac{\sum_{s=1}^t \gamma_s^2\gamma_{s-1}^2 \|V_{s-3/2}\|^2_*}{\sum_{s=1}^t \gamma_s}\,,
$$ where $\overline{X_t} = \frac{\sum_{s=1}^t \gamma_s X_{s+1/2}}{\sum_{s=1}^t \gamma_s}$
(I refined the noise term compared to last week, in $O(\gamma_t^4)$ instead of $O(\gamma_t^2)$ )
Details about the method:
: Consider $h_1, h_2$ dgf (strongly convex, with a continuous selection of subgradients), defined respectively on $\mathcal X_1$ and $\mathcal X_2$, s.t.
- $h_2 - K_{2}h_1$ is convex, $h_1$ is $K_1$-strongly convex
- $\mathcal X_2\subset \mathcal X_1$ closed convex sets.
- $dom \partial h_2 \subset dom \partial h_1$.
For $i \in \{1, 2\}$, define the prox-mapping,
$$ P_x^i(y) = \mathrm{argmin}_{x' \in \mathcal X_i} \langle y, x - x'\rangle + D_{h_i}(x', x) $$
Consider the method,
$$\left\{\begin{aligned}
&X_{t+1/2} = P_{X_t}^1\left(-\frac{\gamma_t}{K_2} V_{t-1/2}\right)\\
&X_{t+1} = P_{X_t}(-\gamma_t V_{t+1/2})
\end{aligned}\right.
$$
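A minimal sketch of one instantiation of this method (my own choices: $h_1 = \tfrac12\|\cdot\|^2$ so $K_1 = 1$ and the extrapolation is unprojected, $h_2$ = negative entropy on the simplex so $K_2 = 1$; I also assume the update step, written $P$ without superscript above, is the $h_2$-prox on $\mathcal X_2$):
```python
import numpy as np

def prox_entropy(x, y):
    """h_2-prox on the simplex (negative entropy): P^2_x(y)_i ∝ x_i exp(y_i)."""
    z = x * np.exp(y - np.max(y))
    return z / z.sum()

def two_geometry_peg(V, x1, gammas, K2=1.0):
    """Extrapolation = plain Euclidean step (h_1), update = entropic prox (h_2)."""
    x, v_prev = x1, V(x1)
    num, den = np.zeros_like(x1), 0.0
    for g in gammas:
        x_half = x - (g / K2) * v_prev        # X_{t+1/2} = P^1_{X_t}(-γ_t/K_2 V_{t-1/2})
        v_prev = V(x_half)                    # V may be evaluated outside the simplex
        x = prox_entropy(x, -g * v_prev)      # X_{t+1}: entropic prox, stays in ri(Δ)
        num, den = num + g * x_half, den + g  # ergodic average of the half-iterates
    return x, num / den
```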
## For the 28/05
### Local last-iterate convergence with local strong monotonicity
Done:
: Local last-iterate convergence in the deterministic setting with local strong monotonicity for PEG, i.e. with,
for $x \in \mathcal X \cap B(x^*, r)$,
$$ \alpha D(x^*, x) \leq \langle V(x), x - x^*\rangle $$
(and locally $\beta$-Lipschitz)
Question:
: For this I used that, locally, there exists $\theta > 0$ such that, for $x, y, z \in B(x^*, r) \cap \mathcal X$,
$$ \theta(D(x, y) + D(y, z)) \geq D(x, z)$$
(analogue of the Young inequality)
This is true if, for instance, locally, $D(x^*, z) \leq M\|x^* - z\|^2$.
As it is only local, it seems acceptable to me. Do you find this acceptable?
Why:
: I need it to relate $D(x^*, X_{t+1/2})$ to $D(x^*, X_t)$. Indeed, the strong monotonicity gives that,
$$ \alpha D(x^*, X_{t+1/2}) \leq \langle V(X_{t+1/2}), X_{t+1/2} - x^*\rangle $$
With this assumption, we get
$$\langle V(X_{t+1/2}), X_{t+1/2} - x^*\rangle \geq \frac{\alpha}{\theta} D(x^*, X_t) - \alpha D(X_{t+1/2}, X_t)$$
Impact on the result:
: no dependence on $\theta$ in the step-size, only in the rate of convergence, which is,
$$D(x^*, X_t) \leq (1 - \alpha \gamma/\theta)^t D(x^*, X_1)$$
Details:
: Like in the 1EG paper, this is done in two steps,
1. If $X_1$ is close enough to $x^*$ and if $\gamma \leq \frac{K}{2\sqrt 2 \sqrt{\beta^2 + \|V(x)\|_*^2/r^2}}$, the iterates stay in $B(x^*, r)$.
2. Geometric convergence as described above.
Side benefit:
: Assuming that, locally, $D(x^*, z) \leq M\|x^*-z\|^2$, the local strong monotonicity with Bregman, i.e.,
for $x \in \mathcal X \cap B(x^*, r)$,
$$ \alpha D(x^*, x) \leq \langle V(x), x - x^*\rangle $$
is equivalent to the classical local strong monotonicity assumption.
### Next direction
Question:
: Are we still focusing first on strong-monotonicity-like conditions? Or are we also moving towards more general error-bounds?
Difficulties with the error-bound we discussed last time:
: Last time, we discussed error-bounds involving the quantity $D(x^+, x)$ where $x^+ = P_x(-\gamma V(x))$. While assumptions on this quantity lead to last-iterate convergence for MP, this is not obvious for PEG. Indeed, in both cases, the quantity we want to use in the proof is
$$D(X_{t+1/2}, X_t)$$
For MP, $D(X_{t+1/2}, X_t) = D(P_{X_t}(-\gamma V(X_t)), X_t)$ so this fits.
For PEG, $D(X_{t+1/2}, X_t) = D(P_{X_t}(-\gamma V(X_{t-1/2})), X_t)$, which is not of the required form. Moreover, this does not seem to be easily usable as an error-bound, as it seems to me that $X_{t+1/2} = X_t$ does not necessarily imply that $X_t$ is a solution.
Another problem is that $D(x^+, x)$ may not behave well at the boundary. For the simplex, $D(x^+, x) \rightarrow 0$ when $x$ gets closer to an extremal point (see the graphics [here](#Update-2605)).
### Generalization of RG
Update:
: There was a mistake in my original proof. There are two possible fixes for now.
Question:
: Is one of these fixes still interesting to you ?
1. Do not use PEG but MP. The proof of ergodic convergence works in this case.
2. Essentially, "treat the difference between MP and PEG as noise", but a vanishing step size is needed. For the simplex the result is still dimension-free however.
Details for 2.:
: The resulting bound is the following, for $p \in \mathcal X_2$, $X_{1/2} \in \mathcal X$, $X_1 \in \mathrm{dom} \partial h_2$,
$$
\langle V(p), \overline{X_t} - p \rangle \leq \frac{D(p, X_1)}{\sum_{s=1}^t \gamma_s} + \frac{\beta^2 \mathrm{diam}(\mathcal X_1)}{K_1K_2} \frac{\sum_{s=1}^t \gamma_s^2}{\sum_{s=1}^t \gamma_s}\,,
$$ where $\overline{X_t} = \frac{\sum_{s=1}^t \gamma_s X_{s+1/2}}{\sum_{s=1}^t \gamma_s}$
(currently trying to refine the noise term, probably in $O(\gamma_t^4)$ instead of $O(\gamma_t^2)$ )
Details about the method:
: Consider $h_1, h_2$ dgf (strongly convex, with a continuous selection of subgradients), defined respectively on $\mathcal X_1$ and $\mathcal X_2$, s.t.
- $h_2 - K_{2}h_1$ is convex, $h_1$ is $K_1$-strongly convex
- $\mathcal X_2\subset \mathcal X_1$ closed convex sets.
- $dom \partial h_2 \subset dom \partial h_1$.
For $i \in \{1, 2\}$, define the prox-mapping,
$$ P_x^i(y) = \mathrm{argmin}_{x' \in \mathcal X_i} \langle y, x - x'\rangle + D_{h_i}(x', x) $$
Consider the method,
$$\left\{\begin{aligned}
&X_{t+1/2} = P_{X_t}^1\left(-\frac{\gamma_t}{K_2} V_{t-1/2}\right)\\
&X_{t+1} = P_{X_t}(-\gamma_t V_{t+1/2})
\end{aligned}\right.
$$
### Less important: (Quasi-) Fejer analysis with Bregman
Question:
: Given $\mathcal X$ closed convex, $h : \mathcal X \rightarrow \mathbb R$ with a continuous selection of subgradients on $\mathrm{dom}\, \partial h$, does it always hold that,
if $x_t \in \mathrm{dom}\, \partial h$ and $x_t \rightarrow p$ with $p \in \mathcal X \setminus \mathrm{dom}\, \partial h$,
$$\forall x \in \mathrm{dom}\, \partial h,\ D(x, x_t) \rightarrow \infty$$
Why:
: This would guarantee that subsequences of the sequence of iterates cannot go to the boundary. But I am not sure yet we will need it.
## For the 26/05
### Summary/Todo
- Application: routing with polynomial loss (see Pan's paper or chap. 18 of the book we mentioned)
- Error bound: replace the strict positivity with something like,
$$
D(x^+, x) \geq \epsilon\quad\text{if}\quad d(\mathcal X^*, x) \geq \delta
$$ (we were not sure whether this is equivalent to requiring only the strict positivity if $\mathcal X$ is compact). We could also assume that the strong monotonicity assumption holds only locally (to avoid the problem with boundary).
(See (quasi-) Fejer analysis)
- Generalization of RG: can this give a **dimension-free** algorithm for a sub-polyhedron of the simplex? I.e., the extrapolation step would be a MW update and only the update step would also involve the projection on the sub-polyhedron (this would be cool if it worked, according to Pan').
Also, can $K_2$ be chosen $\leq 0$ (similar to non-convex prox methods)?
### Update 26/05
Extension of RG:
: See https://hackmd.io/sUkLJF2IT4CCEUizg5Y6jA#Generalization-of-RG1 for details.
Actually there is currently a mistake in my proof, but I may be able to fix it (as it still works for MP).
Error-bound $D(x^+, x)$:
: Strange behavior on the boundary.
For the simplex/MWU, $\partial h$ is only defined on $\mathrm{ri} \mathcal X$ so that $P_x(y)$ as an argmin is only defined for $x \in \mathrm{ri} \mathcal X$. If $x \in \mathrm{dom} h$, the following equivalence is true,
$$x \in X^* \iff x = P_x(-\gamma V(x))$$
However, for the simplex/MWU, the expression of $P_x(y)$ can actually be extended to the boundary and the extremal points $(0,\dots,0,1,0\dots,0)$ are fixed points! This explains why $D(x^+, x)$ goes to zero near these vertices.
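Concretely, with the closed-form MWU prox (a small check; the variable names are mine):
```python
import numpy as np

def mwu_prox(x, y):
    """Closed-form P_x(y) for the entropic DGF on the simplex: P_x(y)_i ∝ x_i e^{y_i}.
    The formula stays well defined when some x_i = 0 (boundary of the simplex)."""
    z = x * np.exp(y - np.max(y))
    return z / z.sum()

e1 = np.array([1.0, 0.0, 0.0])                 # extremal point of the simplex
y  = np.array([-3.0, 5.0, 1.0])                # an arbitrary dual step, e.g. -gamma*V(e1)
print(mwu_prox(e1, y))                         # -> [1, 0, 0]: the vertex is a fixed point
```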
As an illustration, consider $\mathcal X = \{(p, 1-p): p \in [0,1]\}$ and the problem of minimizing $\|.\|^2_2$ on $\mathcal X$, with $x^* = (1/2, 1/2)$.

With a two-player game with a unique equilibrium at $x^* = (1/2, 1/2)$, $y^*=(1/2, 1/2)$,

Last-iterate convergence:
: It is a bit confusing so
- Using $D(x^+, x) \geq \tau D(x^*, x)$, geometric convergence for *MP*. :arrow_right: Extend to PEG (?)
(same for $D(x^+, x) \geq \delta$)
- Using local strong mon. I am currently working on it for MP/PEG similarly to the 1EG paper. (I already have that the iterates stay in a neighborhood of $x^*$ if initialized close enough.)
Maybe try to extend the stochastic analysis with strong mon first ?
On last-iterate convergence, maybe https://arxiv.org/pdf/2002.00057.pdf is relevant.
(We already have ergodic convergence in the stochastic setting for PEG).
## Details
### Questions (for another Thursday)
- Bias ?
### Generalization of RG
I did not manage to get a method looking like RG, but I can at least "weaken" the first projection step.
Consider $h_1, h_2$ dgf (strongly convex, with a continuous selection of subgradients), defined respectively on $\mathcal X_1$ and $\mathcal X_2$, s.t.
- $h_2 - K_{2}h_1$ is convex, $h_1$ is $K_1$-strongly convex
- $\mathcal X_2\subset \mathcal X_1$ closed convex sets.
- $dom \partial h_2 \subset dom \partial h_1$.
For $i \in \{1, 2\}$, define the prox-mapping,
$$ P_x^i(y) = \mathrm{argmin}_{x' \in \mathcal X_i} \langle y, x - x'\rangle + D_{h_i}(x', x) $$
Consider the method,
$$\left\{\begin{aligned}
&X_{t+1/2} = P_{X_t}^1\left(-\frac{\gamma_t}{K_2} V_{t-1/2}\right)\\
&X_{t+1} = P_{X_t}(-\gamma_t V_{t+1/2})
\end{aligned}\right.
$$
Then, we have ergodic convergence (deterministic), for $p \in \mathcal X_2$, $X_{1/2} \in \mathcal X$, $X_1 \in \mathrm{dom} \partial h_2$,
$$
\langle V(p), \overline{X_t} - p \rangle \leq \frac{\frac{K_1K_2}{2}\|X_1 - X_{1/2}\| + D(p, X_1)}{\sum_{s=1}^t \gamma_s}\,,
$$ where $\overline{X_t} = \frac{\sum_{s=1}^t \gamma_s X_{s+1/2}}{\sum_{s=1}^t \gamma_s}$
For instance, if $h_1 = \frac{1}{2}\|.\|^2_2$, this becomes,
$$\left\{\begin{aligned}
&X_{t+1/2} = {X_t} - \frac{\gamma_t}{K_2} V_{t-1/2}\\
&X_{t+1} = P_{X_t}(-\gamma_t V_{t+1/2})
\end{aligned}\right.
$$
:arrow_right: as RG, no projection for the extrapolation step.
Can this approach be useful in another setting ?
### Error-bound/last-iterate
- Analogue of strong monotonicity: for $x^* \in \mathcal X^*$, $x \in \mathcal X$,
$$ \mu D(x^*, x) \leq \langle V(x), x - x^*\rangle $$
For instance, this ensures geometric convergence of the last iterate of MD. But it is not satisfied by linear games, even in the simplex setting (if $x^* \in \mathrm{ri}\, \mathcal X$).
- Analogue of Tseng's error bound
Original bound from [Tseng, 1995](https://core.ac.uk/download/pdf/82171354.pdf):
$$\|x - \Pi(x - \gamma V(x))\| \geq \tau d(\mathcal X^*, x)$$
for $x$ sufficiently close to $\mathcal X^*$.
It has been shown to hold, for instance, when $V$ is affine and $\mathcal X$ is polyhedral, see [Luo, Tseng, 1992](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.158.6238&rep=rep1&type=pdf).
A natural extension, if $x^+ = P_x(-\gamma V(x))$,
$$D(x^+, x) \geq \tau^2 D(\mathcal X^*, x)$$
(This would easily imply geometric convergence of Mirror-prox, see Lemma D.2 of (Panayotis, 2018))
I tried, unsuccessfully, to prove such a bound when $V$ is linear and $\mathcal X$ a polyhedron (compact or not). The proof of Luo & Tseng heavily uses that $\nabla h$ is linear, so that the KKT conditions for $$\min_{x' \in \mathcal X} \|x - \gamma V(x) - x'\|$$
are linear. However, in the Bregman setting, this condition involves $\nabla h$. Indeed, $x^+ = P_x(-\gamma V(x))$ is characterized by,
$$\nabla h(x) - \gamma V(x) - \nabla h(x^+) \in N_{\mathcal X}(x^+)$$
For $\mathcal X = \{x: Ax \leq b\}$, if $I(x^+)$ denotes the set of indices of the active constraints,
$$\nabla h(x) - \gamma V(x) - \nabla h(x^+) = A^T\lambda$$
with $\lambda_i = 0$ for $i \notin I(x^+)$.
I also tried to disprove this bound with a two-player linear game on the simplex but without success for now...
- Maybe we can use something weaker like,
$$D(x^+, x) > 0 \text{ if } x \notin \mathcal X^*$$
- Relation to https://arxiv.org/abs/1807.04252 and https://arxiv.org/abs/2002.06768 ? (ie encompass their setting ?).
### OG: weaken steepness to encompass the Euclidean setup (less important)
I did not manage to do this. The proof of convergence of OG in the Euclidean setting by YG, for instance, heavily uses the Euclidean structure (subtraction, ...).
Optimistic Gradient (OG):
: $$ \begin{aligned}
&X_{t+1/2} = P_{X_{t-1/2}}\left(-\gamma_t V_{t-1/2} + \gamma_{t-1}(V_{t-3/2} - V_{t-1/2})\right)\\
&X_{t+1} = P_{X_t}(-\gamma_t V_{t+1/2})\\
\end{aligned}
$$
The difficulty in trying to analyse OG as a perturbation of PEG is that the "base points" of the prox steps for $X_{t+1/2}$ and $X_{t+1}$ differ.
Steepness (from Pan'): $h$ is **steep** if $\mathrm{dom}\, \partial h = \mathrm{ri}(\mathcal{X})$. In this case,
$$P_x(y_1 + y_2) = P_{P_x(y_1)}(y_2)$$
Under this assumption, for PEG,
$$
\begin{aligned}
P_{X_t}(-\gamma_t V_{t-1/2}) &= P_{X_{t-1}}(-\gamma_{t-1}V_{t-3/2}-\gamma_t V_{t-1/2})\\
&=P_{X_{t-1/2}}\left(-\gamma_t V_{t-1/2} + \gamma_{t-1}(V_{t-3/2} - V_{t-1/2})\right)
\end{aligned}
$$
where in the last line we used that $X_{t-1/2} = P_{X_{t-1}}(-\gamma_{t-1}V_{t-3/2})$ so that $X_{t-1} = P_{X_{t-1/2}}(+\gamma_{t-1}V_{t-3/2})$ (with the steepness assumption).
## Old
Surprisingly, for MWU, i.e. with the simplex as $\mathcal X$ and the negative entropy as $h$, OG and PEG coincide. This is due to the following relation, for MWU,
$$P_x(y_1 + y_2) = P_{P_x(y_1)}(y_2)$$
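A quick numerical check of this composition rule for the entropic prox on the simplex (my own check):
```python
import numpy as np

def mwu_prox(x, y):
    """Entropic prox on the simplex: P_x(y)_i ∝ x_i e^{y_i}."""
    z = x * np.exp(y - np.max(y))
    return z / z.sum()

rng = np.random.default_rng(0)
x = rng.dirichlet(np.ones(5))                  # interior point of the simplex
y1, y2 = rng.normal(size=5), rng.normal(size=5)
lhs = mwu_prox(x, y1 + y2)
rhs = mwu_prox(mwu_prox(x, y1), y2)
assert np.allclose(lhs, rhs)                   # P_x(y1 + y2) = P_{P_x(y1)}(y2)
```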
### Ergodic stochastic convergence for PEG (adapted from A. Juditsky)
(cf https://projecteuclid.org/euclid.ssy/1393252123)
$V : \mathbb R^d \rightarrow \mathbb R^d$ vector field, $\mathcal X$ closed convex subset of $\mathbb R^d$, $\|.\|$ norm on $\mathbb R^d$, and denote by $\|.\|_*$ its dual norm, defined by,
$$
\|y\|_* = \sup \{\langle y, x \rangle : \|x\| \leq 1\}\,.
$$
Assumptions:
1. $h$ $1$-strongly convex DGF (with a continuous selection of subgradients).
2. $V$ satisfies,
$$\|V(x) - V(x')\|_* \leq \beta\|x - x'\| + M$$
3. $V$ is monotone
4. $\hat V_t = V(X_t) + Z_t$ where $Z_t$ satisfies,
$$
\mathbb E(Z_t | \mathcal F_t) = 0 \quad \text{and} \quad \mathbb E(\|Z_t\|_*^2 | \mathcal F_t) \leq \sigma^2
$$
where $\mathcal F_t$ is the natural filtration associated with $(X_t)_t$.
Then, for $p \in \mathcal X$, $X_{1/2} \in \mathcal X$, $X_1 \in \mathrm{dom}\, \partial h$,
$$
\langle V(p), \overline{X_t} - p \rangle \leq \frac{\frac{1}{2}\|X_1 - X_{1/2}\| + D(p, X_1) + 3(M^2 + 4\sigma^2)\sum_{s = 1}^t \gamma_s^2}{\sum_{s=1}^t \gamma_s}\,,
$$
where $\overline{X_t} = \frac{\sum_{s=1}^t \gamma_s X_{s+1/2}}{\sum_{s=1}^t \gamma_s}$
Descent inequality:
For any $p \in \mathcal X$, $X_1 \in dom\ \partial h$, $t \geq 1$, and $\gamma \leq \frac{K}{2\beta}$,
$$
D(p, X_{t+1}) \leq D(p, X_t) - \gamma_t \langle \hat V_{t+1/2}, X_{t+1/2} - p \rangle + \mu_t - \mu_{t+1} + \delta_t + \delta_{t-1} \,,
$$
where,
$$
\left\{\begin{aligned}
\mu_1 &= \frac{3\gamma_{1}^2\beta^2}{2}\|X_1 - X_{1/2}\|^2\\
\mu_t &= {6\gamma_{t-1}^2\beta^2}\|X_{t-1/2} - X_{t-3/2}\|^2\,,\quad \forall t \geq 2\,.
\end{aligned}\right.
$$
$$
\left\{\begin{aligned}
\delta_0 &= 0\\
\delta_t &= \frac{3\gamma_{t-1}^2}{2}\left(M^2 + 2\|Z_{t+1/2}\|_*^2 + 2\|Z_{t-1/2}\|_*^2\right)\,,
\end{aligned}\right.
$$
### Ergodic deterministic convergence for PEG
$V : \mathbb R^d \rightarrow \mathbb R^d$ vector field, $\mathcal X$ closed convex subset of $\mathbb R^d$, $\|.\|$ norm on $\mathbb R^d$, and denote by $\|.\|_*$ its dual norm, defined by,
$$
\|y\|_* = \sup \{\langle y, x \rangle : \|x\| \leq 1\}\,.
$$
Assumptions:
1. $h$ $K$-strongly convex DGF (with a continuous selection of subgradients).
2. $V$ is $\beta$ Lipschitz, i.e.,
$$\|V(x) - V(x')\|_* \leq \beta\|x - x'\|$$
3. $V$ is monotone
Descent inequality:
For any $p \in \mathcal X$, $X_1 \in dom\ \partial h$, $t \geq 1$, and $\gamma \leq \frac{K}{2\beta}$,
$$
D(p, X_{t+1}) \leq D(p, X_t) - \gamma \langle V(X_{t+1/2}), X_{t+1/2} - p \rangle + \mu_t - \mu_{t+1} \,,
$$
where,
$$
\left\{\begin{aligned}
\mu_1 &= \frac{2\gamma^2\beta^2}{K}\|X_1 - X_{1/2}\|^2\\
\mu_t &= \frac{\gamma^2\beta^2}{2K}\|X_{t-1/2} - X_{t-3/2}\|^2\,,\quad \forall t \geq 2\,.
\end{aligned}\right.
$$
## Less important questions
- Use of biased stochastic oracles ?
- Use of the $y_\tau$ sequence in A. Juditsky, 2011
- Bounded moment of gradient vs bounded moment of noise
- Compactness assumption