# 02/09 Lecture 2: Sub-Gaussian variables and Hoeffding bound
[toc]
> Martin discouraged people from taking this class this year because a strict prerequisite covering basic probability and statistics will be offered next year.
> He also claimed to use high-end Japanese chalk recommended by Sasha.
## 1. Sub-Gaussian tail bounds
(*Chapter 2.1.2*)
> Generate tail bounds whose strength depends on the amount of moment control one has.
**Definition:**
A scalar random variable $X$ with $\mu = \mathbb{E}[X]$ is **$\sigma$-sub-Gaussian** if
$$\mathbb{E}[e^{\lambda (X-\mu)}] \leq e^{\frac{\lambda^2 \sigma^2}{2}} \quad \forall \lambda \in \mathbb{R}$$
For a Gaussian variable with variance $\sigma^2$, equality holds.
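For completeness (this computation is not written out in the lecture), completing the square for $X \sim N(\mu, \sigma^2)$ shows why the Gaussian case is tight:
$$\mathbb{E}[e^{\lambda(X-\mu)}] = \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{\infty} e^{\lambda u - \frac{u^2}{2\sigma^2}}\, du = e^{\frac{\lambda^2\sigma^2}{2}} \cdot \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{\infty} e^{-\frac{(u-\lambda\sigma^2)^2}{2\sigma^2}}\, du = e^{\frac{\lambda^2\sigma^2}{2}}$$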
> Let's consider sub-Gaussian variables that are non-Gaussian.
#### Rademacher variables
(*Example 2.3*)
> **Defn:** a Rademacher R.V. $\epsilon$ takes values in $\{-1, +1\}$ equiprobably.
> It is sub-Gaussian with parameter $\sigma = 1$.
A Rademacher variable $\epsilon$ takes values in $\{-1, +1\}$ with equal probability, so that
$$P(\epsilon = +1) = P(\epsilon = -1) = \frac 1 2$$
Then, by taking expectations and playing with Taylor expansions, we have that for all $\lambda \in \mathbb{R}$,
$$\begin{align*}
\mathbb{E}[e^{\lambda \epsilon}]
&= \frac 1 2 (e^\lambda + e^{-\lambda}) \\
&= \frac 1 2 \left\{\sum_{k=0}^\infty \frac{(-\lambda)^k}{k!} +\sum_{k=0}^\infty \frac{\lambda^k}{k!} \right\} \\
&= \sum_{k=0}^\infty \frac {\lambda^{2k}}{(2k)!} && \text{(only the even powers remain)}\\
&\leq 1+\sum_{k=1}^\infty \frac{\lambda^{2k}}{2^k k!} && \text{(since } (2k)! \geq 2^k k!\text{)}\\
&= e^{\frac {\lambda^2} 2}
\end{align*}$$
Therefore, $\epsilon$ is sub-Gaussian with parameter $\sigma=1$.
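A quick numerical sanity check of this bound (not from the lecture; a minimal sketch using `numpy`, with the grid of $\lambda$ values chosen arbitrarily):

```python
import numpy as np

# Exact MGF of a Rademacher variable is cosh(lambda); compare it against the
# sub-Gaussian bound exp(lambda^2 / 2) on a grid.
lambdas = np.linspace(-5, 5, 101)
mgf = np.cosh(lambdas)
bound = np.exp(lambdas ** 2 / 2)

assert np.all(mgf <= bound + 1e-12)   # the bound holds at every grid point
print(f"largest ratio MGF / bound on the grid: {np.max(mgf / bound):.4f}")
```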
> More generally, any bounded R.V. is also sub-Gaussian.
#### Bounded random variables
Let $X$ be a zero-mean random variable supported on a bounded interval $[a, b]$.
> We use the technique of **symmetrization** here.
Letting $X'$ be an independent copy of $X$ (_symmetry_), for any $\lambda \in \mathbb{R}$ we have
$$\mathbb{E}_X[e^{\lambda X}] = \mathbb{E}_X[e^{\lambda(X-\mathbb{E}_{X'}[X'])}] \leq \mathbb{E}_{X, X'} [e^{\lambda(X-X')}]$$ where the equality uses $\mathbb{E}_{X'}[X'] = 0$ and the inequality is by _Jensen's inequality_.
> Reminder: **Jensen's Inequality**
$$\phi(\mathbb{E}[X]) \leq \mathbb{E}[\phi(X)]$$ for convex function $\phi$.
Here the inequality holds by convexity of the exponential function, applied to the inner expectation over $X'$.
Let $\epsilon$ be an independent Rademacher variable, and note that $\epsilon(X-X')$ has the same distribution as $X-X'$.
> This is because $X$ and $X'$ are i.i.d., so $X - X'$ has the same distribution as $X' - X$; multiplying by an independent sign $\epsilon$ therefore does not change the distribution.
Then, we have
$$ \mathbb{E}_{X, X'} [e^{\lambda(X-X')}] = \mathbb{E}_{X, X'} [E_{\epsilon}[e^{\lambda\epsilon(X-X')}]]$$
Conditioning on $X$ and $X'$, we can evaluate the inner expectation. By the previous conclusion (applied with $\lambda(X-X')$ in place of $\lambda$),
$$E_{\epsilon}[e^{\lambda\epsilon(X-X')}] \leq e^{\frac{\lambda^2(X-X')^2}{2}}$$
and since $|X-X'| \leq b-a$ we have that
$$\mathbb{E}_{X, X'} [E_{\epsilon}[e^{\lambda\epsilon(X-X')}]] \leq E_{X, X'} [e^{\frac{\lambda^2(X-X')^2}{2}}] \leq e^{\frac {\lambda^2}{2}(b-a)^2}$$
Hence $X$ is sub-Gaussian with parameter $b-a$.
> **Is this the "best" bound we can get?**
No! We can actually improve it to $\sigma = \frac{b-a}{2}$. For details, see Exercise 2.4.
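As an illustration (my own example, not from the lecture), here is a Monte Carlo comparison of the MGF of a zero-mean uniform variable on $[-1, 1]$ against both the symmetrization bound ($\sigma = b-a$) and the improved bound ($\sigma = \frac{b-a}{2}$); the sample size and $\lambda$ values are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-mean bounded example: X uniform on [a, b] = [-1, 1] (my choice of distribution).
a, b = -1.0, 1.0
x = rng.uniform(a, b, size=1_000_000)

for lam in [-3.0, -1.0, 0.5, 2.0]:
    mgf_hat = np.mean(np.exp(lam * x))              # Monte Carlo estimate of E[exp(lam * X)]
    improved = np.exp(lam ** 2 * (b - a) ** 2 / 8)  # bound with sigma = (b - a) / 2
    loose = np.exp(lam ** 2 * (b - a) ** 2 / 2)     # symmetrization bound with sigma = b - a
    print(f"lam={lam:+.1f}  MGF~{mgf_hat:8.3f}  (b-a)/2 bound={improved:10.3f}  (b-a) bound={loose:14.3f}")
```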
## 2. Hoeffding bound
(*Chapter 2.1.2*)
> The property of Gaussianity is preserved by linear operations, and so is the property of sub-Gaussianity!
If $X_1$ and $X_2$ are independent sub-Gaussian variables with parameters $\sigma_1$ and $\sigma_2$, then $X_1 + X_2$ is sub-Gaussian with parameter $\sqrt{\sigma_1^2 + \sigma_2^2}$.
> The *Hoeffding bound* is applicable to sums of independent sub-Gaussian R.V.s.
Suppose we have independent variables $X_i$, $i=1, ..., n$, with $\mu_i = E[X_i]$, and each $X_i$ is $\sigma_i$-sub-Gaussian. Then for all $t \geq 0$,
$$\mathbb{P}[\sum_{i=1}^n (X_i-\mu_i) \geq t] \leq e^\frac{-t^2}{2 \sum_{i=1}^n \sigma_i^2}$$
> Deviation from sample average to population average
> Note from Yan: What he puts on the board doesn't agree with the textbook. I went with the textbook version.
The general technique we use is to bound the MGF and apply Chernoff bound.
> Note: This proof is not stated in the textbook.
$$\mathbb{E}[e^{\lambda\sum_{i=1}^n(X_i-\mu_i)}] = \mathbb{E}\left[\prod_{i=1}^n e^{\lambda(X_i-\mu_i)}\right] = \prod_{i=1}^n \mathbb{E}[e^{\lambda(X_i-\mu_i)}]$$ The first equality is by the property of exponentials, and the second equality is by the independence of the $X_i$'s.
Each $X_i$ is $\sigma_i$-sub-Gaussian so we have
$$\prod_{i=1}^n \mathbb{E}[e^{\lambda(X_i-\mu_i)}] \leq \prod_{i=1}^n e^{\frac{\lambda^2\sigma_i^2} 2} = e^{\frac {\lambda^2}{2}\sum_{i=1}^n \sigma_i^2}$$
The key take-away is that for $\sigma_i$-sub-Gaussian variables $X_i$, $\sum_{i=1}^n X_i$ is another sub-Gaussian variable with parameter $\sqrt{\sum_{i=1}^n \sigma_i^2}$.
> We could use independence to manipulate MGFs like this.
After this, we complete the proof by applying the Chernoff technique starting from $e^{\frac {\lambda^2}{2}\sum_{i=1}^n \sigma_i^2}$. He skipped this step as an exercise.
By the Chernoff bound, for any $\lambda > 0$,
$P((X-\mu) \geq t) = P(e^{\lambda(X-\mu)} \geq e^{\lambda t}) \leq \frac{E[e^{\lambda(X-\mu)}]}{e^{\lambda t}}$
$$\mathbb{P}\left[\sum_{i=1}^n (X_i-\mu_i) \geq t\right] \leq \frac{E[e^{\lambda\sum_{i=1}^n (X_i-\mu_i)}]}{e^{\lambda t}} \leq e^{\frac {\lambda^2}{2}\sum_{i=1}^n \sigma_i^2 - \lambda t}$$
Since this holds for every $\lambda > 0$, we can minimize the exponent $\frac {\lambda^2}{2}\sum_{i=1}^n \sigma_i^2 - \lambda t$ over $\lambda$. Taking the derivative, the minimum is attained at $\lambda=\frac{t}{\sum_{i=1}^n \sigma_i^2}$; plugging this back in gives the desired bound $e^\frac{-t^2}{2 \sum_{i=1}^n \sigma_i^2}$.
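A small simulation of the bound (not from the lecture; a sketch using Rademacher summands, so $\mu_i = 0$, $\sigma_i = 1$, and the bound is $e^{-t^2/(2n)}$; the values of $n$, $t$, and the number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sum of n independent Rademacher variables; each Rademacher is 2*Bernoulli(1/2) - 1,
# so S_n = 2*Binomial(n, 1/2) - n.  The Hoeffding bound on P[S_n >= t] is exp(-t^2 / (2n)).
n, trials = 100, 200_000
sums = 2 * rng.binomial(n, 0.5, size=trials) - n

for t in [10, 20, 30]:
    empirical = np.mean(sums >= t)
    bound = np.exp(-t ** 2 / (2 * n))
    print(f"t={t:2d}  empirical tail = {empirical:.4f}   Hoeffding bound = {bound:.4f}")
```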
> Martin bashing computer scientists on unbounded random variables.
> Martingale difference sequences (Section 2.3): "beautiful inequalities", "applications to random graphs". I'm not gonna dig into that.
## 3. Application: Multiple Testing
(*Not found in the textbook, so these notes might be messier*)
Suppose we have random variables $Y_i = \theta_i + W_i$.
> $\theta_i$: unknown scalars
> $W_i$: independent, zero-mean, and $\sigma$-sub-Gaussian.
The problem is that we would like to differentiate between these two scenarios:
- $H_0$: $\theta_i=0$ for all i.
- $H_1$: there exists some $j \in \{1, 2, ..., n\}$ with $\theta_j \geq t > 0$.
> - Useful for drug companies running treatment experiments, etc. They observe the outcomes and try to judge whether the drugs are trash or $t$-effective.
> - Another application is ads at large corporations like Facebook/Google, which estimate click-through rates for products. Martin doesn't like that 👎
**Class Question:** *What if we want to figure out which $j$ is actually effective and how effective?*
**Answer:** *Localization (find $j$) is harder. Estimation (how effective) is even harder.*
> He started drawing a graph and I couldn't follow. (Pls send help)
> Basically two graphs of the $H_0$ and $H_1$ scenarios and how we tell them apart.
**Idea**: Make the threshold high enough that the noise won't exceed it.
To accomplish this, we want to estimate the probability that the noise is greater than the threshold $t$. Essentially, <u>we want $t$ high enough that this probability is small enough</u>.
**Here is a reasonable testing procedure:**
Declare $H_0$ if $\max_{i=1, ..., n} Y_i \leq t$, and $H_1$ otherwise.
> *He uses $\tau$ and $t$ interchangeably from now on. I will use $t$ for typing simplicity.*
$P_{H_0}[\text{failing}] = P[\max_{i=1, ..., n} W_i \geq t] \leq \sum_{i=1}^n P(W_i \geq t) = nP(W_1 \geq t) \leq ne^{\frac{-t^2}{2\sigma^2}}$ by the union bound and i.i.d. assumptions.
> A student pointed out that the $W_i$ are not assumed to be i.i.d.
So the correction is
$P_{H_0}[\text{failing}] = P[\max_{i=1, ..., n} W_i \geq t] \leq \sum_{i=1}^n P(W_i \geq t) \leq \sum_{i=1}^n e^{\frac{-t^2}{2\sigma^2}} = ne^{\frac{-t^2}{2\sigma^2}}$
The conclusion is
$$P_{H_0}[\text{failing}] \leq ne^{\frac{-t^2}{2\sigma^2}}$$
In practice, if the desired failure rate is some $\delta \in (0, 1)$, we can set $ne^{\frac{-t^2}{2\sigma^2}} \leq \delta$ and solve for $t$.
By algebra, we see that $t=\sqrt{2\sigma^2 \log(\frac n \delta)}$ is the threshold that keeps the failure rate within $\delta$.
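A quick simulation of this testing procedure under $H_0$ (my own sketch, with Gaussian noise so that the $\sigma$-sub-Gaussian assumption holds; the values of $n$, $\sigma$, $\delta$, and the number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Under H_0, Y_i = W_i with W_i ~ N(0, sigma^2).  With the threshold
# t = sqrt(2 sigma^2 log(n / delta)), the union bound guarantees
# P_H0[max_i W_i >= t] <= delta.
n, sigma, delta, trials = 1_000, 1.0, 0.05, 10_000
t = np.sqrt(2 * sigma ** 2 * np.log(n / delta))

maxima = rng.normal(0.0, sigma, size=(trials, n)).max(axis=1)
false_alarm_rate = np.mean(maxima >= t)

print(f"threshold t = {t:.3f}")
print(f"empirical false-alarm rate = {false_alarm_rate:.4f}  (guaranteed <= {delta})")
```

The empirical rate is typically well below $\delta$, reflecting the slack in the union bound.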
> Intuition: a heavier tail makes the term inside the log larger (although there are people claiming that log terms are always small). That's why we want sub-Gaussianity.
> Comment from Martin: The union bound is for kids. Real mathematicians use something better --- the chaining argument --- which involves partitioning the space or something (I lost track).
> He also mentions something about homework here but I didn't understand.
> "Gaussian is goofy" -- Martin
Now we refine our zoology 🐒 of random variables.
## 4. What is not sub-Gaussian? Sub-exponentials!
(*Chapter 2.1.3*)
Let $X \sim N(0, 1)$. Then the zero-mean random variable $Z=X^2-1$ is not sub-Gaussian. Squaring makes the tail heavier, but by how much?
It turns out we can compute the MGF explicitly (Example 2.8):
$$E[e^{\lambda Z}] = \frac {e^{-\lambda}} {\sqrt{1- 2\lambda}}$$ for $\lambda < \frac 1 2$, and $+\infty$ otherwise.
It's actually a **sub-exponential** variable. Any squared sub-Gaussian is sub-exponential.
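A Monte Carlo check of this closed-form MGF (not from the lecture; a minimal sketch with $\lambda$ kept well below $\frac 1 2$ so the plug-in average is stable; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Check E[exp(lambda * (X^2 - 1))] = exp(-lambda) / sqrt(1 - 2*lambda) for X ~ N(0, 1);
# the MGF blows up as lambda approaches 1/2.
x = rng.standard_normal(2_000_000)
z = x ** 2 - 1

for lam in [-0.4, -0.2, 0.1, 0.2]:
    mc = np.mean(np.exp(lam * z))
    exact = np.exp(-lam) / np.sqrt(1 - 2 * lam)
    print(f"lambda={lam:+.1f}   Monte Carlo ~ {mc:.4f}   closed form = {exact:.4f}")
```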
### Sub-Exponential
**Def**: $X$ is $(\nu, \alpha)$-**sub-exponential** if
$$E[e^{\lambda(X-\mu)}] \leq e^{\frac{\lambda^2\nu^2}{2}} \quad \forall |\lambda| < \frac 1 \alpha$$
> Random comments from Martin: Sasha makes Martin buy expensive chalk but then the classroom moves to a room with whiteboards. Now he has to buy Sasha-recommended expensive markers again. He claims that Sasha should pay for them.
**Corollary**: Any $\sigma$-sub-Gaussian variable is $(\sigma, 0)$-sub-exponential (with the convention $\frac 1 0 = +\infty$).
**Claim**: $\frac {e^{-\lambda}} {\sqrt{1- 2\lambda}} \leq e^{\frac {\lambda^2}{2}(4)} = e^{2\lambda^2} \; \forall |\lambda| < \frac 1 4$, so $Z$ is $(2, 4)$-sub-exponential. See Chapter 2.
> He wrote 4 the same way as I do and it took me 30 sec to decipher that.
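A numerical check of the claim on a grid of $\lambda$ values in $(-\frac 1 4, \frac 1 4)$ (my own sketch; the grid resolution is arbitrary):

```python
import numpy as np

# Deterministic grid check: exp(-lam) / sqrt(1 - 2*lam) <= exp(2 * lam^2) for |lam| < 1/4.
lams = np.linspace(-0.249, 0.249, 1_001)
lhs = np.exp(-lams) / np.sqrt(1 - 2 * lams)
rhs = np.exp(2 * lams ** 2)

assert np.all(lhs <= rhs + 1e-12)   # the claim holds at every grid point
print(f"largest ratio lhs / rhs on the grid: {np.max(lhs / rhs):.6f}")
```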
Warm-up for next lecture:
Example ($\chi^2$-variables):
$$Z=\sum_{j=1}^d X_j^2= ||(X_1, ..., X_d)||_2^2$$ for independent random variables $X_j \sim N(0, 1)$, so that $Z \sim \chi_d^2$.
Each individual $X_j^2$ is sub-exponential, and so is the sum.
We will find: $||(X_1, ..., X_d)||_2^2 \approx d$ with extremely high probability, on the order of $1-e^{-d}$!
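A simulation preview of this concentration (my own sketch; the dimensions, the 20% deviation threshold, and the number of trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# ||(X_1, ..., X_d)||_2^2 concentrates around d: the chance of a 20% relative
# deviation collapses as d grows.
trials = 10_000
for d in [10, 100, 1_000]:
    z = (rng.standard_normal((trials, d)) ** 2).sum(axis=1)
    far = np.mean(np.abs(z / d - 1) > 0.2)
    print(f"d={d:5d}   mean of Z/d = {np.mean(z / d):.3f}   P[|Z/d - 1| > 0.2] ~ {far:.4f}")
```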