# 02/09 Lecture 2: Sub-Gaussian variables and Hoeffding bound
[toc]
> Martin discouraged people from taking this class this year because a strict prerequisite covering basic probability and statistics will be offered next year.
> He also claimed to use high-end Japanese chalk recommended by Sasha.
## 1. Sub-Gaussian tail bounds
(*Chapter 2.1.2*)
> Generate tail bounds whose strength depends on the amount of moment control one has.
**Definition:**
A scalar random variable $X$ with $\mu = \mathbb{E}[X]$ is **$\sigma$-sub-Gaussian** if
$$\mathbb{E}[e^{\lambda (X-\mu)}] \leq e^{\frac{\lambda^2 \sigma^2}{2}} \quad \forall \lambda \in \mathbb{R}$$
For a Gaussian variable with variance $\sigma^2$, equality holds.
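For completeness (this computation is not written out in the lecture), completing the square for $X \sim N(\mu, \sigma^2)$ shows why the Gaussian case is tight:
$$\mathbb{E}[e^{\lambda(X-\mu)}] = \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{\infty} e^{\lambda u - \frac{u^2}{2\sigma^2}}\, du = e^{\frac{\lambda^2\sigma^2}{2}} \cdot \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{\infty} e^{-\frac{(u-\lambda\sigma^2)^2}{2\sigma^2}}\, du = e^{\frac{\lambda^2\sigma^2}{2}}$$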
> Let's consider sub-Gaussian variables that are non-Gaussian.
#### Rademacher variables
(*Example 2.3*)
> **Defn:** a Rademacher R.V. $\epsilon$ takes values in $\{-1, +1\}$ equiprobably.
> It is sub-Gaussian with parameter $\sigma = 1$.
A Rademacher variable $\epsilon$ takes values in $\{-1, +1\}$ with equal probability, so that
$$P(\epsilon = +1) = P(\epsilon = -1) = \frac 1 2$$
Then, by taking expectations and playing with Taylor expansions, we have that for all $\lambda \in \mathbb{R}$,
$$\begin{align*}
\mathbb{E}[e^{\lambda \epsilon}]
&= \frac 1 2 (e^\lambda + e^{-\lambda}) \\
&= \frac 1 2 \left\{\sum_{k=0}^\infty \frac{(-\lambda)^k}{k!} +\sum_{k=0}^\infty \frac{\lambda^k}{k!} \right\} \\
&= \sum_{k=0}^\infty \frac {\lambda^{2k}}{(2k)!} && \text{(only the even powers remain)}\\
&\leq 1+\sum_{k=1}^\infty \frac{\lambda^{2k}}{2^k k!} && \text{(since } (2k)! \geq 2^k k!\text{)}\\
&= e^{\frac {\lambda^2} 2}
\end{align*}$$
Therefore, $\epsilon$ is sub-Gaussian with parameter $\sigma=1$.
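A quick numerical sanity check of this bound (not from the lecture; a minimal sketch using `numpy`, with the grid of $\lambda$ values chosen arbitrarily):

```python
import numpy as np

# Exact MGF of a Rademacher variable is cosh(lambda); compare it against the
# sub-Gaussian bound exp(lambda^2 / 2) on a grid.
lambdas = np.linspace(-5, 5, 101)
mgf = np.cosh(lambdas)
bound = np.exp(lambdas ** 2 / 2)

assert np.all(mgf <= bound + 1e-12)   # the bound holds at every grid point
print(f"largest ratio MGF / bound on the grid: {np.max(mgf / bound):.4f}")
```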
> More generally, any bounded R.V. is also sub-Gaussian.
#### Bounded random variables
Let $X$ be a zero-mean random variable supported on a bounded interval $[a, b]$.
> We use the technique of **symmetrization** here.
Letting $X'$ be an independent copy of $X$ (_symmetry_), for any $\lambda \in \mathbb{R}$ we have
$$\mathbb{E}_X[e^{\lambda X}] = \mathbb{E}_X[e^{\lambda(X-\mathbb{E}_{X'}[X'])}] \leq \mathbb{E}_{X, X'} [e^{\lambda(X-X')}]$$ where the equality uses $\mathbb{E}_{X'}[X'] = 0$ and the inequality is by _Jensen's inequality_.
> Reminder: **Jensen's Inequality**
$$\phi(\mathbb{E}[X]) \leq \mathbb{E}[\phi(X)]$$ for convex function $\phi$.
Here the inequality holds by convexity of the exponential function, applied to the inner expectation over $X'$.
Let $\epsilon$ be an independent Rademacher variable, and note that $\epsilon(X-X')$ has the same distribution as $X-X'$.
> This is because $X$ and $X'$ are i.i.d., so $X - X'$ has the same distribution as $X' - X$; multiplying by an independent sign $\epsilon$ therefore does not change the distribution.
Then, we have
$$ \mathbb{E}_{X, X'} [e^{\lambda(X-X')}] = \mathbb{E}_{X, X'} [E_{\epsilon}[e^{\lambda\epsilon(X-X')}]]$$
Conditioning on $X$ and $X'$, we can evaluate the inner expectation. By the previous conclusion (applied with $\lambda(X-X')$ in place of $\lambda$),
$$E_{\epsilon}[e^{\lambda\epsilon(X-X')}] \leq e^{\frac{\lambda^2(X-X')^2}{2}}$$
and since $|X-X'| \leq b-a$ we have that
$$\mathbb{E}_{X, X'} [E_{\epsilon}[e^{\lambda\epsilon(X-X')}]] \leq E_{X, X'} [e^{\frac{\lambda^2(X-X')^2}{2}}] \leq e^{\frac {\lambda^2}{2}(b-a)^2}$$
Hence $X$ is sub-Gaussian with parameter $b-a$.
> **Is this the "best" bound we can get?**
No! We can actually improve it to $\sigma = \frac{b-a}{2}$. For details, see Exercise 2.4.
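As an illustration (my own example, not from the lecture), here is a Monte Carlo comparison of the MGF of a zero-mean uniform variable on $[-1, 1]$ against both the symmetrization bound ($\sigma = b-a$) and the improved bound ($\sigma = \frac{b-a}{2}$); the sample size and $\lambda$ values are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-mean bounded example: X uniform on [a, b] = [-1, 1] (my choice of distribution).
a, b = -1.0, 1.0
x = rng.uniform(a, b, size=1_000_000)

for lam in [-3.0, -1.0, 0.5, 2.0]:
    mgf_hat = np.mean(np.exp(lam * x))              # Monte Carlo estimate of E[exp(lam * X)]
    improved = np.exp(lam ** 2 * (b - a) ** 2 / 8)  # bound with sigma = (b - a) / 2
    loose = np.exp(lam ** 2 * (b - a) ** 2 / 2)     # symmetrization bound with sigma = b - a
    print(f"lam={lam:+.1f}  MGF~{mgf_hat:8.3f}  (b-a)/2 bound={improved:10.3f}  (b-a) bound={loose:14.3f}")
```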
## 2. Hoeffding bound
(*Chapter 2.1.2*)
> The property of Gaussianity is preserved by linear operations, and so is the property of sub-Gaussianity!
If $X_1$ and $X_2$ are independent sub-Gaussian variables with parameters $\sigma_1$ and $\sigma_2$, then $X_1 + X_2$ is sub-Gaussian with parameter $\sqrt{\sigma_1^2 + \sigma_2^2}$.
> The *Hoeffding bound* is applicable to sums of independent sub-Gaussian R.V.s.
Suppose we have independent variables $X_i$, $i=1, ..., n$, with $\mu_i = E[X_i]$, and each $X_i$ is $\sigma_i$-sub-Gaussian. Then for all $t \geq 0$,
$$\mathbb{P}[\sum_{i=1}^n (X_i-\mu_i) \geq t] \leq e^\frac{-t^2}{2 \sum_{i=1}^n \sigma_i^2}$$
> Deviation from sample average to population average
> Note from Yan: What he puts on the board doesn't agree with the textbook. I went with the textbook version.
The general technique we use is to bound the MGF and apply Chernoff bound.
> Note: This proof is not stated in the textbook.
$$\mathbb{E}[e^{\lambda\sum_{i=1}^n(X_i-\mu_i)}] = \mathbb{E}\left[\prod_{i=1}^n e^{\lambda(X_i-\mu_i)}\right] = \prod_{i=1}^n \mathbb{E}[e^{\lambda(X_i-\mu_i)}]$$ The first equality is by the property of exponentials, and the second equality is by the independence of the $X_i$'s.
Each $X_i$ is $\sigma_i$-sub-Gaussian so we have
$$\prod_{i=1}^n \mathbb{E}[e^{\lambda(X_i-\mu_i)}] \leq \prod_{i=1}^n e^{\frac{\lambda^2\sigma_i^2} 2} = e^{\frac {\lambda^2}{2}\sum_{i=1}^n \sigma_i^2}$$
The key take-away is that for $\sigma_i$-sub-Gaussian variables $X_i$, $\sum_{i=1}^n X_i$ is another sub-Gaussian variable with parameter $\sqrt{\sum_{i=1}^n \sigma_i^2}$.
> We could use independence to manipulate MGFs like this.
After this, we complete the proof by applying the Chernoff technique starting from $e^{\frac {\lambda^2}{2}\sum_{i=1}^n \sigma_i^2}$. He skipped this step as an exercise.
By the Chernoff bound, for any $\lambda > 0$,
$P((X-\mu) \geq t) = P(e^{\lambda(X-\mu)} \geq e^{\lambda t}) \leq \frac{E[e^{\lambda(X-\mu)}]}{e^{\lambda t}}$
$$\mathbb{P}\left[\sum_{i=1}^n (X_i-\mu_i) \geq t\right] \leq \frac{E[e^{\lambda\sum_{i=1}^n (X_i-\mu_i)}]}{e^{\lambda t}} \leq e^{\frac {\lambda^2}{2}\sum_{i=1}^n \sigma_i^2 - \lambda t}$$
Since this holds for every $\lambda > 0$, we can minimize the exponent $\frac {\lambda^2}{2}\sum_{i=1}^n \sigma_i^2 - \lambda t$ over $\lambda$. Taking the derivative, the minimum is attained at $\lambda=\frac{t}{\sum_{i=1}^n \sigma_i^2}$; plugging this back in gives the desired bound $e^\frac{-t^2}{2 \sum_{i=1}^n \sigma_i^2}$.
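A small simulation of the bound (not from the lecture; a sketch using Rademacher summands, so $\mu_i = 0$, $\sigma_i = 1$, and the bound is $e^{-t^2/(2n)}$; the values of $n$, $t$, and the number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sum of n independent Rademacher variables; each Rademacher is 2*Bernoulli(1/2) - 1,
# so S_n = 2*Binomial(n, 1/2) - n.  The Hoeffding bound on P[S_n >= t] is exp(-t^2 / (2n)).
n, trials = 100, 200_000
sums = 2 * rng.binomial(n, 0.5, size=trials) - n

for t in [10, 20, 30]:
    empirical = np.mean(sums >= t)
    bound = np.exp(-t ** 2 / (2 * n))
    print(f"t={t:2d}  empirical tail = {empirical:.4f}   Hoeffding bound = {bound:.4f}")
```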
> Martin bashing computer scientists on unbounded random variables.
> Martingale difference sequences (Section 2.3): "beautiful inequalities", "applications to random graphs". I'm not gonna dig into that.
## 3. Application: Multiple Testing
(*Not found in the textbook, so these notes might be messier*)
Suppose we have random variables $Y_i = \theta_i + W_i$.
> $\theta_i$: unknown scalars
> $W_i$: independent, zero-mean, and $\sigma$-sub-Gaussian.
The problem is that we would like to differentiate between these two scenarios:
- $H_0$: $\theta_i=0$ for all i.
- $H_1$: there exists some $j \in \{1, 2, ..., n\}$ with $\theta_j \geq t > 0$.
> - Useful for drug companies running treatment experiments, etc. They observe the outcomes and try to judge whether the drugs are trash or $t$-effective.
> - Another application is ads at large corporations like Facebook/Google, which estimate click-through rates for products. Martin doesn't like that 👎
**Class Question:** *What if we want to figure out which $j$ is actually effective and how effective?*
**Answer:** *Localization (find $j$) is harder. Estimation (how effective) is even harder.*
> He started drawing a graph and I couldn't follow. (Pls send help)
> Basically two graphs of the $H_0$ and $H_1$ scenarios and how we tell them apart.
**Idea**: Make the threshold high enough that the noise won't exceed it.
To accomplish this, we want to estimate the probability that the noise is greater than the threshold $t$. Essentially, <u>we want $t$ high enough that this probability is small enough</u>.
**Here is a reasonable testing procedure:**
Declare $H_0$ if $\max_{i=1, ..., n} Y_i \leq t$, and $H_1$ otherwise.
> *He uses $\tau$ and $t$ interchangeably from now on. I will use $t$ for typing simplicity.*
$P_{H_0}[\text{failing}] = P[\max_{i=1, ..., n} W_i \geq t] \leq \sum_{i=1}^n P(W_i \geq t) = nP(W_1 \geq t) \leq ne^{\frac{-t^2}{2\sigma^2}}$ by the union bound and i.i.d. assumptions.
> A student pointed out that the $W_i$ are not assumed to be i.i.d.
So the correction is
$P_{H_0}[\text{failing}] = P[\max_{i=1, ..., n} W_i \geq t] \leq \sum_{i=1}^n P(W_i \geq t) \leq \sum_{i=1}^n e^{\frac{-t^2}{2\sigma^2}} = ne^{\frac{-t^2}{2\sigma^2}}$
The conclusion is
$$P_{H_0}[\text{failing}] \leq ne^{\frac{-t^2}{2\sigma^2}}$$
In practice, if the desired failure rate is some $\delta \in (0, 1)$, we can set $ne^{\frac{-t^2}{2\sigma^2}} \leq \delta$ and solve for $t$.
By algebra, we see that $t=\sqrt{2\sigma^2 \log(\frac n \delta)}$ is the threshold that keeps the failure rate within $\delta$.
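A quick simulation of this testing procedure under $H_0$ (my own sketch, with Gaussian noise so that the $\sigma$-sub-Gaussian assumption holds; the values of $n$, $\sigma$, $\delta$, and the number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Under H_0, Y_i = W_i with W_i ~ N(0, sigma^2).  With the threshold
# t = sqrt(2 sigma^2 log(n / delta)), the union bound guarantees
# P_H0[max_i W_i >= t] <= delta.
n, sigma, delta, trials = 1_000, 1.0, 0.05, 10_000
t = np.sqrt(2 * sigma ** 2 * np.log(n / delta))

maxima = rng.normal(0.0, sigma, size=(trials, n)).max(axis=1)
false_alarm_rate = np.mean(maxima >= t)

print(f"threshold t = {t:.3f}")
print(f"empirical false-alarm rate = {false_alarm_rate:.4f}  (guaranteed <= {delta})")
```

The empirical rate is typically well below $\delta$, reflecting the slack in the union bound.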
> Intuition: a heavier tail makes the term inside the log larger (although there are people claiming that log terms are always small). That's why we want sub-Gaussianity.
> Comment from Martin: The union bound is for kids. Real mathematicians use something better --- the chaining argument --- which involves partitioning the space or something (I lost track).
> He also mentions something about homework here but I didn't understand.
> "Gaussian is goofy" -- Martin
Now we refine our zoology 🐒 of random variables.
## 4. What is not sub-Gaussian? Sub-exponentials!
(*Chapter 2.1.3*)
Let $X \sim N(0, 1)$. Then the zero-mean random variable $Z=X^2-1$ is not sub-Gaussian. Squaring makes the tail heavier, but by how much?
It turns out we can compute the MGF explicitly (Example 2.8):
$$E[e^{\lambda Z}] = \frac {e^{-\lambda}} {\sqrt{1- 2\lambda}}$$ for $\lambda < \frac 1 2$, and $+\infty$ otherwise.
It's actually a **sub-exponential** variable. Any squared sub-Gaussian is sub-exponential.
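A Monte Carlo check of this closed-form MGF (not from the lecture; a minimal sketch with $\lambda$ kept well below $\frac 1 2$ so the plug-in average is stable; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Check E[exp(lambda * (X^2 - 1))] = exp(-lambda) / sqrt(1 - 2*lambda) for X ~ N(0, 1);
# the MGF blows up as lambda approaches 1/2.
x = rng.standard_normal(2_000_000)
z = x ** 2 - 1

for lam in [-0.4, -0.2, 0.1, 0.2]:
    mc = np.mean(np.exp(lam * z))
    exact = np.exp(-lam) / np.sqrt(1 - 2 * lam)
    print(f"lambda={lam:+.1f}   Monte Carlo ~ {mc:.4f}   closed form = {exact:.4f}")
```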
### Sub-Exponential
**Def**: $X$ is $(\nu, \alpha)$-**sub-exponential** if
$$E[e^{\lambda(X-\mu)}] \leq e^{\frac{\lambda^2\nu^2}{2}} \quad \forall |\lambda| < \frac 1 \alpha$$
> Random comments from Martin: Sasha makes Martin buy expensive chalk but then the classroom moves to a room with whiteboards. Now he has to buy Sasha-recommended expensive markers again. He claims that Sasha should pay for them.
**Corollary**: Any $\sigma$-sub-Gaussian variable is $(\sigma, 0)$-sub-exponential (with the convention $\frac 1 0 = +\infty$).
**Claim**: $\frac {e^{-\lambda}} {\sqrt{1- 2\lambda}} \leq e^{\frac {\lambda^2}{2}(4)} = e^{2\lambda^2} \; \forall |\lambda| < \frac 1 4$, so $Z$ is $(2, 4)$-sub-exponential. See Chapter 2.
> He wrote 4 the same way as I do and it took me 30 sec to decipher that.
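A numerical check of the claim on a grid of $\lambda$ values in $(-\frac 1 4, \frac 1 4)$ (my own sketch; the grid resolution is arbitrary):

```python
import numpy as np

# Deterministic grid check: exp(-lam) / sqrt(1 - 2*lam) <= exp(2 * lam^2) for |lam| < 1/4.
lams = np.linspace(-0.249, 0.249, 1_001)
lhs = np.exp(-lams) / np.sqrt(1 - 2 * lams)
rhs = np.exp(2 * lams ** 2)

assert np.all(lhs <= rhs + 1e-12)   # the claim holds at every grid point
print(f"largest ratio lhs / rhs on the grid: {np.max(lhs / rhs):.6f}")
```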
Warm-up for next lecture:
Example ($\chi^2$-variables):
$$Z=\sum_{j=1}^d X_j^2= ||(X_1, ..., X_d)||_2^2$$ for independent random variables $X_j \sim N(0, 1)$, so that $Z \sim \chi_d^2$.
Each individual $X_j^2$ is sub-exponential, and so is the sum.
We will find: $||(X_1, ..., X_d)||_2^2 \approx d$ with extremely high probability, on the order of $1-e^{-d}$!
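A simulation preview of this concentration (my own sketch; the dimensions, the 20% deviation threshold, and the number of trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# ||(X_1, ..., X_d)||_2^2 concentrates around d: the chance of a 20% relative
# deviation collapses as d grows.
trials = 10_000
for d in [10, 100, 1_000]:
    z = (rng.standard_normal((trials, d)) ** 2).sum(axis=1)
    far = np.mean(np.abs(z / d - 1) > 0.2)
    print(f"d={d:5d}   mean of Z/d = {np.mean(z / d):.3f}   P[|Z/d - 1| > 0.2] ~ {far:.4f}")
```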