# [Public] MWU Proof
By Danupon Nanongkai
Notes:
* While I tried to make this note self-contained, its main purpose is to supplement the lecture.
* I usually start with precise mathematical statements, followed by the discussions. If you are lost in the math, look at the discussions for some help.
## Notations
* For any vector $x$, let $\max(x):=\max_i x_i$ and $\|x\|_\infty:=\max_i |x_i|$.
* See the previous lecture for some more notations.
* For any vector $x$, define $x^2$ as $(x^2)_i:=x_i^2$.
## 1. Recall: The MWU algorithm and its guarantee
Recall that $g^t\in [0, \rho]^n$.
**Algorithm MWU($n,T,\epsilon, \rho, \{g^t\}_{t=1}^T$):**
* Let $w^1\in \mathbb{R}_+^n$ be $(1,1,1,...)$
* For $t=1, 2, ..., T$:
1. Let $p_i^t=\frac{w_i^t}{\sum_j w_j^t}$
2. For all $i$, $w^{t+1}_i\leftarrow w_i^t \cdot \exp(\epsilon g^t_i/\rho)$.
**Definitions:**
* $G:=\sum_t g^t$, i.e., $G_i=\sum_t g_i^t$ for all $i$.
* $G_*:=\max_i G_i = \max_i \sum_t g_i^t$.
* $G_A:=\sum_{t=1}^T \langle p^t, g^t\rangle$.
**Theorem 1.** For any $T\geq \frac{\rho\ln(n)}{\epsilon^2}$, we have $\frac{G_*}{T}\leq (1+\epsilon) \frac{G_A}{T}+\epsilon$.
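To make the recap concrete, below is a minimal NumPy sketch of the algorithm above, followed by an empirical check of Theorem 1's guarantee on random gains. This is only an illustration under my own naming (`mwu`, `gains`), not code from the lecture.

```python
import numpy as np

def mwu(gains, eps, rho):
    """Run MWU on gains g^1..g^T (each in [0, rho]^n); return the distributions p^1..p^T."""
    T, n = gains.shape
    w = np.ones(n)                                # w^1 = (1, 1, ..., 1)
    ps = []
    for t in range(T):
        ps.append(w / w.sum())                    # p^t_i = w^t_i / sum_j w^t_j
        w = w * np.exp(eps * gains[t] / rho)      # w^{t+1}_i = w^t_i * exp(eps * g^t_i / rho)
    return np.array(ps)

rng = np.random.default_rng(0)
n, eps, rho = 10, 0.1, 5.0
T = int(np.ceil(rho * np.log(n) / eps**2))        # T >= rho * ln(n) / eps^2, as in Theorem 1
gains = rho * rng.random((T, n))                  # random g^t in [0, rho]^n

ps = mwu(gains, eps, rho)
G_star = gains.sum(axis=0).max()                  # G_* = max_i sum_t g^t_i
G_A = (ps * gains).sum()                          # G_A = sum_t <p^t, g^t>
print(G_star / T <= (1 + eps) * G_A / T + eps)    # Theorem 1's guarantee; prints True
```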
## 2. Our main theorem today
**Theorem 2.** For $\rho=1$ and any $T$, we have $G_*\leq (1+\epsilon) G_A + \frac{\ln n}{\epsilon}$.
**Claim:** Theorem 2 implies Theorem 1
**Proof:**
*Comment:* It might be easier to prove this yourself than to read the proof.
* Let $\rho$ be as in Theorem 1. Let $g^1, g^2, ..., g^T$ be the vectors in $[0,\rho]^n$.
* Define $h^t=g^t/\rho$ for all $t=1, ..., T$. So, $h^t\in [0,1]^n$ for all $t$.
* Consider the following two calls of MWU:
* Call #1: MWU($n,T,\epsilon, \rho, \{g^t\}_{t=1}^T$)
* Call #2: MWU($n,T,\epsilon, 1, \{h^t\}_{t=1}^T$)
* Observe that both calls return the same probability distributions $p^1, p^2, ..., p^T$.
* This is because the two calls differ only in step 2 of the for-loop, but this step gets the same result for both calls because $w_i^t \cdot \exp(\epsilon g^t_i/\rho)=w_i^t \cdot \exp(\epsilon h^t_i)$.
* By Theorem 2, Call #2 gives $H_*\leq (1+\epsilon) H_A+ \frac{\ln n}{\epsilon}$, where $H_*:= \max_i \sum_t h_i^t$ and $H_A:=\sum_{t=1}^T \langle p^t, h^t\rangle$.
* So, $G_*\leq (1+\epsilon)G_A+\frac{\rho \ln n}{\epsilon}$
* by multiplying the previous inequality by $\rho$ and noting that $G_*=\rho H_*$ and $G_A=\rho H_A$ (the latter holds because both calls use the same $p^t$).
* So, for $T\geq \frac{\rho\ln(n)}{\epsilon^2}$, we have $\frac{G_*}{T}\leq (1+\epsilon)\frac{G_A}{T}+\epsilon$
* by dividing the previous inequality by $T$ and using $\frac{\rho \ln n}{\epsilon T}\leq \epsilon$.
**QED.**
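The heart of the proof above is the observation that Call #1 and Call #2 produce identical distributions $p^1, ..., p^T$. Here is a small sketch that confirms this numerically (the helper name `mwu_distributions` is mine):

```python
import numpy as np

def mwu_distributions(gains, eps, rho):
    # Returns the distributions p^1, ..., p^T produced by MWU(n, T, eps, rho, {g^t}).
    w = np.ones(gains.shape[1])
    ps = []
    for g in gains:
        ps.append(w / w.sum())
        w = w * np.exp(eps * g / rho)
    return np.array(ps)

rng = np.random.default_rng(1)
rho, eps = 3.0, 0.1
g = rho * rng.random((50, 8))              # g^t in [0, rho]^n
p1 = mwu_distributions(g, eps, rho)        # Call #1: gains g^t, width rho
p2 = mwu_distributions(g / rho, eps, 1.0)  # Call #2: gains h^t = g^t / rho, width 1
print(np.allclose(p1, p2))                 # prints True
```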
==**In the remainder of this lecture, we focus on proving Theorem 2**.==
## 3. [RealSoftMax](https://en.wikipedia.org/wiki/LogSumExp#cite_note-1) function
*Comments:*
* You can easily find proofs that do not use RealSoftMax and continuous analysis elsewhere. I am providing a different proof because it helps explain why MWU works. Feel free to read other proofs instead of this one.
* Disclaimer: I rarely need continuous analysis and RealSoftMax in my work, so I myself have a weak mathematical background relevant to this proof.
---
**Theorem 3:** For any $\epsilon>0$, there is a function $rsmax_\epsilon:\mathbb{R}^n\rightarrow \mathbb{R}$ (read: *RealSoftMax*) with the following properties for any $x\in \mathbb{R}^n$.
1. $\max(x)\leq rsmax_\epsilon(x)\leq \max(x)+\frac{\ln n}{\epsilon}$
2. $rsmax'_\epsilon(x)_i = \frac{\exp(\epsilon x_i)}{\sum_j \exp(\epsilon x_j)}$
* $rsmax'_\epsilon:\mathbb{R}^n \rightarrow \mathbb{R}^n$ denotes the [gradient](https://en.wikipedia.org/wiki/Gradient) of $rsmax_\epsilon$. (You don't need to know the definition of gradient to follow the proof of Theorem 2. All you need to know is the next fact. Also see the discussion below for some intuition.)
3. For any $\epsilon\leq 1/2$ and $\|\delta\|_\infty\leq 1$, $rsmax_\epsilon(x+\delta)\leq rsmax_\epsilon(x)+\langle rsmax_\epsilon'(x),\delta\rangle + \epsilon \frac{\sum_i\exp(\epsilon x_i)\delta_i^2}{\sum_j \exp(\epsilon x_j)}$
---
**Proof sketch (can skip):** We will not prove this theorem today, but let me provide some references so you can find out more if you're interested. The [RealSoftMax](https://en.wikipedia.org/wiki/LogSumExp#cite_note-1) function is defined for a vector $x=(x_1, x_2, ..., x_n)$ as
$$rsmax(x)=\ln\Big(\sum_i \exp(x_i)\Big)\quad\text{and}\quad rsmax_\epsilon(x)=\frac{1}{\epsilon}rsmax(\epsilon x).$$
Theorem 3(1) follows from a claim on [Wikipedia](https://en.wikipedia.org/wiki/LogSumExp#cite_note-1).

Theorem 3(2) also follows from a claim on [Wikipedia](https://en.wikipedia.org/wiki/LogSumExp#cite_note-1).
The proof of Theorems 3(2) and 3(3) can also be derived from Sections 10.2-10.3 in [Saranurak's lecture note](https://drive.google.com/file/d/1jkKU-Vw6DF-tFDw32Qe36sH_qcHulP6m/view?usp=sharing). (There, he proved similar properties for $smin_\epsilon$ defined as $smin_\epsilon(x)=-\frac{1}{\epsilon}rsmax(-\epsilon x)$.)
**QED**
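Since we skip the proof, here is a quick numerical sanity check of the three properties on a random input (an illustration, not a proof; the helpers `rsmax` and `rsmax_grad` are my own names):

```python
import numpy as np

def rsmax(x, eps):
    # rsmax_eps(x) = (1/eps) * ln(sum_i exp(eps * x_i))
    return np.log(np.sum(np.exp(eps * x))) / eps

def rsmax_grad(x, eps):
    # The gradient from Theorem 3(2): the softmax of eps * x
    e = np.exp(eps * x)
    return e / e.sum()

rng = np.random.default_rng(2)
n, eps = 6, 0.4                            # eps <= 1/2, as required by Theorem 3(3)
x = rng.normal(size=n)
delta = rng.random(n)                      # ||delta||_inf <= 1

# Property 1: max(x) <= rsmax_eps(x) <= max(x) + ln(n)/eps
print(x.max() <= rsmax(x, eps) <= x.max() + np.log(n) / eps)

# Property 2: the closed-form gradient matches a finite-difference estimate
i, h = 0, 1e-6
finite_diff = (rsmax(x + h * np.eye(n)[i], eps) - rsmax(x, eps)) / h
print(np.isclose(finite_diff, rsmax_grad(x, eps)[i]))

# Property 3: the first-order bound with the quadratic error term
e = np.exp(eps * x)
rhs = rsmax(x, eps) + rsmax_grad(x, eps) @ delta + eps * (e @ delta**2) / e.sum()
print(rsmax(x + delta, eps) <= rhs)
```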
---
### Discussions for Theorem 3 (can skip)
1. $rsmax_\epsilon(x)\approx \max(x)$, where higher $\epsilon$ gives a more accurate approximation of $\max(x)$.
2. Observe that $rsmax'_\epsilon(x)$ is a *probability distribution*, whose $i$-th coordinate depends on the exponential of $x_i$. This is similar to $p^t_i$ from the MWU algorithm. In fact, we'll show in Observation 2 below that $rsmax'_\epsilon(\sum_{t'=1}^{t-1} g^{t'})=p^t.$
3. The gradient $rsmax'_\epsilon$ is a good tool for calculating (roughly) how $rsmax_\epsilon(x)$ changes when we change $x$ by some $\delta\in \mathbb{R}^n$: $rsmax_\epsilon(x+\delta) \approx rsmax_\epsilon(x)+\langle rsmax_\epsilon'(x),\delta\rangle$, up to a small error of $\epsilon \frac{\sum_i\exp(\epsilon x_i)\delta_i^2}{\sum_j \exp(\epsilon x_j)}$. Lower $\epsilon$ gives a smaller error.
**Tug-of-war:** Observe that a bigger $\epsilon$ makes the error term in Theorem 3(1) ($\frac{\ln n}{\epsilon}$) smaller but makes the error term in Theorem 3(3) ($\epsilon \frac{\sum_i\exp(\epsilon x_i)\delta_i^2}{\sum_j \exp(\epsilon x_j)}$) bigger.
**More discussions on Theorem 3(1)**
Let's talk about $rsmax(x)=\ln(\sum_i \exp(x_i))$. Observe that the sum contains the term $\exp(\max(x))$, so $rsmax(x)\geq \ln(\exp(\max(x)))=\max(x)$. This is where the lower bound $\max(x)\leq rsmax(x)$ comes from.
When $x=(1,1,1,...)$, $rsmax(x)=\ln(ne)=1+\ln n=\max(x)+\ln n$. Theorem 3(1) says that nothing worse than this will happen, i.e. $rsmax(x)\leq \max(x)+\ln n$.
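For completeness, here is the short standard argument behind Theorem 3(1), using only the definition of $rsmax$:
$$\exp(\max(x))\;\leq\;\sum_i \exp(x_i)\;\leq\; n\exp(\max(x))
\quad\Longrightarrow\quad
\max(x)\;\leq\; rsmax(x)\;\leq\; \max(x)+\ln n.$$
Applying this to the vector $\epsilon x$ and dividing by $\epsilon$ gives $\max(x)\leq rsmax_\epsilon(x)\leq \max(x)+\frac{\ln n}{\epsilon}$.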
**More discussions on Theorem 3(3)**
* You may recall from a calculus course that for a smooth function $f:\mathbb{R}\rightarrow \mathbb{R}$, its derivative $f':\mathbb{R}\rightarrow \mathbb{R}$ tells us how $f(x)$ would change when we change $x$ by a small $\delta$. A particularly useful approximation is the Taylor expansion (a tiny worked example follows this list):
$$f(x+\delta)=f(x)+f'(x)\delta+\frac{1}{2}f''(x)\delta^2 + ...$$
When $\delta$ is tiny, this is
$$f(x+\delta)\approx f(x)+f'(x)\delta.$$
* For $f:\mathbb{R}^n\rightarrow \mathbb{R}$, Theorem 3(3) states that the gradient $f'$ works the same way as the derivative; i.e. when we change $x$ by "tiny" $\delta\in \mathbb{R}^n$, $$f(x+\delta)\approx f(x)+\langle f'(x),\delta\rangle.$$
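As a tiny worked example of the one-dimensional statement above (my own example, not from the lecture), take $f(x)=e^x$, so $f'(x)=e^x$:
$$f(x+\delta)=e^{x}e^{\delta}=e^x\Big(1+\delta+\tfrac{\delta^2}{2}+\cdots\Big)\approx e^x+e^x\delta=f(x)+f'(x)\delta\quad\text{for tiny }\delta.$$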
## 4. Proof of Theorem 2
**Definitions:** For any $t\leq T$, let $G^t=\sum_{t'=1}^t g^{t'}.$ Let $G^0=(0,0,...)$.
### 4.1 Lemma 1
**Lemma 1:** $G_*\leq rsmax_\epsilon(G^T)-rsmax_\epsilon(G^0)+\frac{\ln n}{\epsilon}$
**Proof:**
\begin{align}
G_*&=\max(G^T) &\mbox{by definition}\\
&=\max(G^T)-\max(G^0) &\mbox{since $\max(G^0)=0$}\\
&\leq rsmax_\epsilon(G^T)-rsmax_\epsilon(G^0)+\frac{\ln n}{\epsilon}
\end{align}
where the last inequality is because of Theorem 3(1): $\max(G^T)\leq rsmax_\epsilon(G^T)$ and $-\max(G^0)\leq -rsmax_\epsilon(G^0)+\frac{\ln n}{\epsilon}.$ **QED**
---
### 4.2 Lemma 2
**Lemma 2:** If $g^{t} \in [0,1]^{n}$ for all $t$ then, for any $t\leq T$ and $\epsilon\leq 1/2$, $$rsmax_\epsilon(G^t)-rsmax_\epsilon(G^{t-1})\leq (1+\epsilon) \langle p^t,g^t\rangle.$$
---
We will need these simple observations for proving Lemma 2.
**Observation 1:** $w^{t}_i=\exp(\epsilon \sum_{t'=1}^{t-1} g^{t'}_i)=\exp(\epsilon G^{t-1}_i)$.
**Proof:** Step 2 of the MWU algorithm states that $w^{t}_i\leftarrow w_i^{t-1} \cdot \exp(\epsilon g^{t-1}_i)$ when $\rho=1$. Together with $w^1_i=1$, this implies the first equality. The second equality is by the definition of $G^{t-1}_i$. **QED**
**Observation 2:** $rsmax'_\epsilon(G^{t-1})=p^t.$
**Proof:**
$rsmax'_\epsilon(G^{t-1})_i=\frac{\exp(\epsilon G^{t-1}_i)}{\sum_j \exp(\epsilon G^{t-1}_j)} =\frac{w^t_i}{\sum_j w^t_j}=p_i^t$ where the second equality is by Observation 1. **QED**
---
**Proof of Lemma 2:**
* By Thm 3(3), $rsmax_\epsilon(G^{t})=rsmax_\epsilon(G^{t-1}+g^t)\leq rsmax_\epsilon(G^{t-1})+\langle rsmax_\epsilon'(G^{t-1}),g^t\rangle+ \epsilon \frac{\sum_i\exp(\epsilon G^{t-1}_i)(g^t_i)^2}{\sum_j \exp(\epsilon G^{t-1}_j)}$
* We plug in Thm 3(3) with $x=G^{t-1}$ and $\delta=g^t$.
* So, $rsmax_\epsilon(G^t)-rsmax_\epsilon(G^{t-1})\leq \langle rsmax_\epsilon'(G^{t-1}),g^t\rangle + \epsilon \frac{\sum_i\exp(\epsilon G^{t-1}_i)(g^t_i)^2}{\sum_j \exp(\epsilon G^{t-1}_j)}$
* By Observation 1, $w^t_i=\exp(\epsilon G^{t-1}_i)$, so
$$rsmax_\epsilon(G^t)-rsmax_\epsilon(G^{t-1})\leq \langle rsmax_\epsilon'(G^{t-1}),g^t\rangle + \epsilon \frac{\sum_i w^t_i (g^t_i)^2}{\sum_j w^t_j}.$$
* Since $p_i^t=\frac{w_i^t}{\sum_j w_j^t}$, we get
\begin{align}
rsmax_\epsilon(G^t)-rsmax_\epsilon(G^{t-1})&\leq \langle rsmax_\epsilon'(G^{t-1}),g^t\rangle + \epsilon \sum_i p^t_i(g_i^t)^2 \\
&= \langle rsmax_\epsilon'(G^{t-1}),g^t\rangle + \epsilon \langle p^t,(g^t)^2\rangle.
\end{align}
* By Observation 2, $rsmax'_\epsilon(G^{t-1})=p^t$; so,
\begin{align}
rsmax_\epsilon(G^t)-rsmax_\epsilon(G^{t-1})&\leq \langle p^t,g^t\rangle + \epsilon \langle p^t,(g^t)^2\rangle.
\end{align}
* Since $(g^t)^2\leq g^t$ coordinate-wise for $g^t\in [0,1]^n$, $$rsmax_\epsilon(G^t)-rsmax_\epsilon(G^{t-1})\leq \langle p^t,g^t\rangle + \epsilon \langle p^t,g^t\rangle = (1+\epsilon)\langle p^t,g^t\rangle. $$ **QED**
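As with Theorem 3, Lemma 2's per-step inequality is easy to sanity-check numerically on random gains in $[0,1]^n$ (an illustration only; the names are mine):

```python
import numpy as np

def rsmax(x, eps):
    return np.log(np.sum(np.exp(eps * x))) / eps

rng = np.random.default_rng(3)
n, T, eps = 8, 30, 0.3                     # eps <= 1/2
gains = rng.random((T, n))                 # g^t in [0, 1]^n, as Lemma 2 requires

w, G, ok = np.ones(n), np.zeros(n), True   # w^1 = (1,...,1), G^0 = (0,...,0)
for g in gains:
    p = w / w.sum()                        # p^t from the MWU algorithm
    lhs = rsmax(G + g, eps) - rsmax(G, eps)        # rsmax_eps(G^t) - rsmax_eps(G^{t-1})
    ok &= bool(lhs <= (1 + eps) * (p @ g))         # Lemma 2's bound
    w, G = w * np.exp(eps * g), G + g              # rho = 1 update and G^t = G^{t-1} + g^t
print(ok)                                  # prints True
```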
### 4.3 Finishing the proof of Theorem 2
* By Lemma 2, we have
\begin{align}
rsmax_\epsilon(G^T)-rsmax_\epsilon(G^0) &= \sum_{t=1}^{T} \left(rsmax_\epsilon(G^{t})-rsmax_\epsilon(G^{t-1})\right) &\mbox{telescoping sum}\\
&\leq \sum_{t=1}^{T} (1+\epsilon)\langle p^t,g^t\rangle &\mbox{by Lemma 2}\\
&=(1+\epsilon)G_A &\mbox{since $G_A=\sum_{t=1}^{T} \langle p^t,g^t\rangle$}\\
\end{align}
* Combining this with Lemma 1, we have $G_*\leq rsmax_\epsilon(G^T)-rsmax_\epsilon(G^0)+\frac{\ln n}{\epsilon}\leq (1+\epsilon)G_A+\frac{\ln n}{\epsilon}.$ **QED**
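The final bound can also be checked end to end for $\rho=1$; note that, unlike Theorem 1, Theorem 2 needs no lower bound on $T$. A small sketch (names are mine):

```python
import numpy as np

rng = np.random.default_rng(4)
n, T, eps = 20, 7, 0.25                    # T can be small: Theorem 2 holds for any T
gains = rng.random((T, n))                 # g^t in [0, 1]^n (rho = 1)

w, G_A = np.ones(n), 0.0
for g in gains:
    G_A += (w / w.sum()) @ g               # accumulate G_A = sum_t <p^t, g^t>
    w = w * np.exp(eps * g)                # w^{t+1}_i = w^t_i * exp(eps * g^t_i)
G_star = gains.sum(axis=0).max()           # G_* = max_i sum_t g^t_i
print(G_star <= (1 + eps) * G_A + np.log(n) / eps)   # Theorem 2's bound; prints True
```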
### 4.4 Discussions (can skip)
Here is my interpretation of the proof. (I'm interested in knowing yours too.)
Recall that Theorem 2 is about an upper bound of $G_*$. One way to compute an upper bound of $G_*$ is the strategy below. The proof above shows that the MWU algorithm gives exactly the upper bound from this strategy.
* **Strategy:** Given a smooth function $f:\mathbb{R}^n\rightarrow\mathbb{R}$ and a sequence of vectors $x^0, x^1, \ldots, x^T$, one strategy to approximate $f(x^T)-f(x^0)$ is this: $$f(x^T)-f(x^0)= \sum_{t=1}^T \left(f(x^{t})-f(x^{t-1})\right) \leq \sum_{t=1}^T \langle f'(x^{t-1}),\delta^t\rangle + error,$$ where $\delta^t=x^t-x^{t-1}.$ The error is small when $\delta^1, ..., \delta^T$ are all small.
* We would like to use this strategy to compute $G_*=\max(G^T)=\max(G^T)-\max(G^0).$ Since $\max$ is not smooth, we use $rsmax_\epsilon$ instead. Lemma 1 (which follows from Theorem 3(1)) tells us exactly the error when we use $rsmax_\epsilon$ instead of $\max$, i.e. $$G_*=\max(G^T)-\max(G^0)\leq rsmax_\epsilon(G^T)-rsmax_\epsilon(G^0)+\frac{\ln n}{\epsilon}.$$ (Recall that we're interested in an upper bound of $G_*$.)
* To continue the strategy, we need to give an upper bound for $f(x^{t})-f(x^{t-1})$ when $f=rsmax_\epsilon$ and $x^t=G^t$. Lemma 2 gives us an upper bound that is pretty much the $\langle p^t,g^t\rangle$ term that the MWU algorithm gives; more precisely,
$$rsmax_\epsilon(G^t)-rsmax_\epsilon(G^{t-1})\leq (1+\epsilon) \langle p^t,g^t\rangle$$
* To prove Lemma 2, we bound the error in $f(x^{t})-f(x^{t-1}) \leq \langle f'(x^{t-1}),\delta^t\rangle+error$ (for $\delta^t=g^t$) using Theorem 3(3), and plug in the value of $f'(x^{t-1})$ from Theorem 3(2). Then, we observe that the term $\exp(\epsilon G_i^{t-1})$, which appears often, is equal to $w^t_i$. This is how it connects to the MWU algorithm!
### 4.5 Discussion: Rewriting the algorithm
Rewriting the MWU algorithm when $\rho=1$ as below might be another way to see how $rsmax$ plays a role in the proof.
**Algorithm MWU($n,T,\epsilon, \rho=1, \{g^t\}_{t=1}^T$):**
* Let $p^1=(1/n,1/n,1/n,...)$
* For $t=2, 3, ..., T$:
* Let $p^t=rsmax'_\epsilon(G^{t-1})$
* This is due to Observation 2
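Here is a small numerical confirmation (illustration only; `rsmax_grad` is my own helper name) that the rewritten algorithm produces exactly the same distributions as the original MWU, as Observation 2 promises:

```python
import numpy as np

def rsmax_grad(x, eps):
    # Theorem 3(2): the gradient of rsmax_eps is the softmax of eps * x
    e = np.exp(eps * x)
    return e / e.sum()

rng = np.random.default_rng(5)
n, T, eps = 8, 40, 0.2
gains = rng.random((T, n))                  # g^t in [0, 1]^n (rho = 1)

w, G, same = np.ones(n), np.zeros(n), True  # w^1 = (1,...,1), G^0 = (0,...,0)
for g in gains:
    p_weights = w / w.sum()                 # p^t from the original MWU algorithm
    p_gradient = rsmax_grad(G, eps)         # p^t from the rewritten algorithm (Observation 2)
    same &= bool(np.allclose(p_weights, p_gradient))
    w, G = w * np.exp(eps * g), G + g
print(same)                                 # prints True
```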
## Acknowledgement
The proof is derived from [Saranurak's lecture note](https://drive.google.com/file/d/1jkKU-Vw6DF-tFDw32Qe36sH_qcHulP6m/view?usp=sharing), which was inspired by the paper of [Quanrud](https://epubs.siam.org/doi/abs/10.1137/1.9781611976014.11) and the blog of [Zuzic](https://zuza.github.io/Intuitive-Multiplicative-Weights/).