# This document has been deprecated. Please go to the workspace [https://hackmd.io/team/18656spring23?nav=overview] If you don't have access please let us know in the group chat.
## 02/07 Lecture 1: Introduction
### 1. Logistics
1. **Homework**: There will be 4 psets. They will start easy and get harder at the end.
2. **Background**: The class requires background in probability, statistics, linear algebra, and real analysis. Read Chapter 2 of the textbook for further reference. In general, some mathematical maturity is needed. Reach out if you have questions about your background.
3. **Materials**:
- Martin Wainwright, _High-Dimensional Statistics: A Non-Asymptotic Viewpoint_. Cambridge University Press.
- Philippe Rigollet and Jan-Christian Hütter, _High Dimensional Statistics_. Lecture Notes (MIT).
- Roman Vershynin, _High-Dimensional Probability: An Introduction with Applications in Data Science_. Cambridge University Press.
- Ramon van Handel, _Probability in High Dimensions_. Lecture Notes (Princeton).
4. **Extra help**: Some recitations will be scheduled at the beginning of the class to fill the gaps people might have. Office hours will also be scheduled.
### 2. What's the class about?
The class is mostly about non-traditional stats: machinery/theory/models in the high-dimensional or non-asymptotic regime.
### 3. A Motivating Example: Covariance Estimation
_(Chapter 1, Wainwright)_
Suppose $X_i \in \mathbb{R}^d, \mathbb{E}[X]=0, \Sigma = \operatorname{cov}(X) \in \mathbb{R}^{d \times d}$.
> Stats Reminder: The covariance matrix of a random vector $X$ is the matrix analogue of the variance of $X$: a square matrix containing all pairwise covariances between the entries of the vector.
If we want to estimate the unknown $\Sigma$ given random vectors $\{x_1, ..., x_n\}$ drawn i.i.d. from some distribution in $\mathbb{R}^d$, we could use the _sample covariance matrix_
$$\hat{\Sigma} = \frac 1 n \sum_{i=1}^n x_ix_i^T$$
a $d \times d$ random matrix corresponding to the sample average of the outer products $x_i x_i^T \in \mathbb{R}^{d \times d}$.
> Notation: a hat ($\hat{}$) denotes a quantity computed from data/observations, and $n$ denotes the size of the dataset from now on.
By construction, the sample covariance is unbiased, meaning that $\mathbb{E}[\hat{\Sigma}_n]=\Sigma$.
Three modes of analysis:
1. **Classical asymptotic analysis**: $\hat{\Sigma}_n \rightarrow \Sigma$ as $n \rightarrow \infty$ with $d$ fixed.
> This argument is not always informative since, for example, in some analyses we only care about a finite, small $n$ and a changing $d$.
2. **High-dimensional asymptotics**: What if we increase $d$ and $n$ but keep their ratio unchanged? To put it formally, $d \rightarrow \infty, n \rightarrow \infty, \frac d n \rightarrow \alpha \in (0, 1)$.
<u>Marchenko-Pastur law:</u>
Take a sequence of pairs $(\hat{\Sigma}, \Sigma)$ indexed by $(n,d)$ and compute the eigenvalues of the random sample covariance matrix. Since $\Sigma= I_{d\times d}$ has all eigenvalues equal to 1, one might expect the sample eigenvalues to cluster around 1.
The Marchenko-Pastur law says they do not: the eigenvalue distribution is not a bell shape centered around 1, but spreads out over an interval whose width depends on the aspect ratio $\alpha$ (see the simulation sketch after this list).
3. **Non-asymptotic analysis**:
What's different in our approach is that we want to provide exact, non-asymptotic bounds.
> Define: $l_2$-operator norm. For $M\in \mathbb{R}^{d \times d}$, the $l_2$-operator norm is $|||M|||_2 := \max_{||u||_2=1, u\in \mathbb{R}^d} ||Mu||_2$ (the maximum singular value; for symmetric positive semidefinite matrices, the maximum eigenvalue).
We could get some bound like this to measure exactly how much the sample deviates from the ground truth:
$$\mathbb{P}\Big[|||\hat{\Sigma}_n-\Sigma|||_2 \geq c_0 \max\Big\{\sqrt{\tfrac d n}, \tfrac{d}{n}\Big\}+ t\Big] \leq 2 e^{-c_1 n t^2} \quad \forall t \in [0,1]$$
In this approach, all terms matter; they don't disappear into a limit. The non-asymptotic approach implies all the classical asymptotic theory, but also gives us much more.
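To make both pictures concrete, here is a minimal numpy sketch (not from the lecture): it simulates the sample covariance of standard Gaussian data with $\Sigma = I$, first showing the Marchenko-Pastur spread of eigenvalues at a fixed aspect ratio, then checking that the operator-norm error shrinks roughly like $\sqrt{d/n}$ when $d$ is fixed. The dimensions and seed are arbitrary choices.
```python
# Minimal sketch (not from the lecture): sample covariance with Sigma = I.
import numpy as np

rng = np.random.default_rng(0)

# (i) High-dimensional asymptotics: eigenvalues do NOT cluster at 1.
n, d = 2000, 1000                        # aspect ratio alpha = d/n = 0.5
X = rng.standard_normal((n, d))
Sigma_hat = X.T @ X / n
eigs = np.linalg.eigvalsh(Sigma_hat)
print("eigenvalue range:", eigs.min(), eigs.max())
# Marchenko-Pastur predicts support near [(1 - sqrt(alpha))^2, (1 + sqrt(alpha))^2]

# (ii) Non-asymptotic view: operator-norm error vs sqrt(d/n) for fixed d.
d = 50
for n in [100, 400, 1600, 6400]:
    X = rng.standard_normal((n, d))
    Sigma_hat = X.T @ X / n
    err = np.linalg.norm(Sigma_hat - np.eye(d), 2)   # l2 operator norm
    print(f"n={n:5d}  error={err:.3f}  sqrt(d/n)={np.sqrt(d / n):.3f}")
```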
```
Overview of the class (Part I):
- Tail bounds and concentration inequalities
- Applications:
- mean estimation
    - least-squares estimators
- random sketching
- matrix estimation problems
- PCA
- spectral clustering
- touch on empirical processes (briefly covered today)
- structure and how to exploit it
```
#### Touch on empirical processes
How to analyze $|||\hat{\Sigma_n}-\Sigma|||_{l_2}$ ?
Unpacking a little bit:
By definition:
$$|||\hat{\Sigma}_n-\Sigma|||_2 = \max_{||u||_2=1, u\in \mathbb{R}^d} ||(\hat{\Sigma}_n-\Sigma)u||_2 \\
= \max_{||u||_2=1, u\in \mathbb{R}^d} \max_{||v||_2=1, v\in \mathbb{R}^d} v^T(\hat{\Sigma}_n-\Sigma)u$$
> Using the fact that $||a||_2=\max_{||b||=1} b^Ta$
Special case when $u=v$:
$$\max_{||u||_2=1, u\in \mathbb{R}^d} u^T(\hat{\Sigma}_n-\Sigma)u
= \max_{||u||_2=1, u\in \mathbb{R}^d} \left\{ \frac 1 n \sum_{i=1}^n \langle x_i, u\rangle ^2 - \mathbb{E}[\langle x_i ,u\rangle^2] \right\}$$
using that $u^T\Sigma u = \mathbb{E}[u^Tx_ix_i^Tu] = \mathbb{E}[\langle x_i, u\rangle^2]$. So the quantity is a supremum, over unit vectors $u$, of deviations between sample averages and their expectations — an empirical process.
### 4. Basic Tail Bounds
_(Chap 2, Wainwright)_
#### a) Some classical bounds:
- **Markov**: For a random variable $Z \geq 0$ and $t > 0$, $P(Z \geq t) \leq \frac{E[Z]}{t}$
- **Chebyshev**: For a random variable $X$ with $E[X]=\mu, \operatorname{Var}(X)=\sigma^2$, $P(|X-\mu| \geq t) \leq \frac{\sigma^2}{t^2}$
- **Super $p$-th Chebyshev**: $P(|X-\mu| \geq t) = P(|X-\mu|^{2p} \geq t^{2p}) \leq \frac{E[(X-\mu)^{2p}]}{t^{2p}}$
- **Chernoff bound**: $P((X-\mu) \geq t) = P(e^{\lambda(X-\mu)} \geq e^{\lambda t}) \leq \frac{E[e^{\lambda(X-\mu)}]}{e^{\lambda t}}$
Assuming the moment generating function exists, meaning that there is some constant $b > 0$ such that:
$$E[e^{\lambda(X-\mu)}] < \infty \quad \forall \lambda \in (0, b)$$
> Stats Reminder: The nth moment of $X$ is $E[X^n]$
> Stats Reminder: Moment-generating function (MGF) is denoted by $M_X(t) = E[e^{tX}]$, provided that this expectation exists for $t$ in some neighborhood of $0$. Otherwise, we say the MGF doesn't exist.
MGF generates moment by Taylor expansion:
$$M_X(t)=E[e^{tX}]=1+tE[X] + \frac{t^2E(X^2)}{2!}+...$$
#### Classification of random variables based on the tail bounds obtained via Chernoff
Ex. The Cauchy distribution has heavy tails (its MGF doesn't exist, so the Chernoff approach fails) 🐶
##### Gaussian tail bounds
Let $X \sim N(\mu, \sigma^2)$, by Chernoff bound, we have $$P[(X-\mu) \geq t]=P[e^{\lambda(X-\mu)} \geq e^{\lambda t}] \leq \frac{E[e^{\lambda(X-\mu)}]}{e^{\lambda t}}$$
$E[e^{\lambda (X-\mu)}] =$ ~~some integral, not even in the textbook, and the prof got it wrong three times during lecture~~ $e^{\frac{\lambda^2\sigma^2}{2}}$
We find the range of lambdas that result in finite MGF and optimize for the tightest bound.
$P((X-\mu) \geq t) \leq \inf_{\lambda \geq 0} [e^{\frac {\lambda^2\sigma^2}{2}-\lambda t}] = e^{-\frac{t^2}{2\sigma^2}}$ for all $t \geq 0$.
> _The derivation is not in the textbook, but should be easy_
Take the derivative of $\frac {\lambda^2\sigma^2}{2}-\lambda t$ with respect to $\lambda$ and set it to 0 to get $\lambda=\frac{t}{\sigma^2}$. Plugging this $\lambda$ back in gives the exponent $-\frac{t^2}{2\sigma^2}$.
Mills' ratio tells us that this bound is sharp up to polynomial factor corrections.
> In the textbook Exercise 2.2 they derive it. Feel like it's gonna be on the pset ngl
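As a quick numerical sanity check (not from the lecture, and assuming `scipy` is available for the exact Gaussian tail), one can compare the exact upper tail of $N(0,\sigma^2)$ with the Chernoff bound $e^{-t^2/(2\sigma^2)}$; the gap between them grows only polynomially in $t$, consistent with the Mills-ratio remark above.
```python
# Sketch: exact Gaussian tail vs. the Chernoff bound e^{-t^2 / (2 sigma^2)}.
import numpy as np
from scipy.stats import norm

sigma = 1.0
for t in [0.5, 1.0, 2.0, 3.0]:
    exact = norm.sf(t, loc=0.0, scale=sigma)          # P(X - mu >= t)
    chernoff = np.exp(-t**2 / (2 * sigma**2))
    print(f"t={t:.1f}  exact={exact:.5f}  Chernoff bound={chernoff:.5f}")
```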
#### b) Sub-Gaussian
**Definition:**
A scalar random variable $X$ with $\mu = \mathbb{E}[X]$ is **$\sigma$-sub-Gaussian** if
$$\mathbb{E}[e^{\lambda (X-\mu)}] \leq e^{\frac{\lambda^2 \sigma^2}{2}} \quad \forall \lambda \in \mathbb{R}$$
> The lecture ends with this definition, he'll probably pick this up in the next lecture and prove it.
---
## 02/09 Lecture 2: Sub-Gaussian variables and Hoeffding bound
> Martin discouraged people from taking this class this year because there will be a strict pre-req offered next year covering basic probability and statistics.
> He also claimed to use high-end Japanese chalk recommended by Sasha.
### 1. Sub-Gaussian tail bounds
(*Chapter 2.1.2*)
> We generate tail bounds from the moment sequence, depending on how much moment control one has.
**Definition:**
A scalar random variable $X$ with $\mu = \mathbb{E}[X]$ is **$\sigma$-sub-Gaussian** if
$$\mathbb{E}[e^{\lambda (X-\mu)}] \leq e^{\frac{\lambda^2 \sigma^2}{2}} \quad \forall \lambda \in \mathbb{R}$$
For a Gaussian variable, equality holds.
> Let's consider sub-Gaussian variables that are non-Gaussian.
#### Rademacher variables
(*Example 2.3*)
> **Defn:** a Rademacher R.V. $\epsilon$ takes the value $\{-1, +1\}$ equiprobably.
> It is sub-Gaussian with parameter $\sigma = 1$.
A Rademacher variable $\epsilon$ takes values in $\{-1, +1\}$ with equal probability, i.e.,
$$P(\epsilon = +1) = P(\epsilon = -1) = \frac 1 2$$
Then, by taking expectations and playing with Taylor expansions, we have that $\forall \lambda \in \mathbb{R}$
$$\begin{align*}
\mathbb{E}[e^{\lambda \epsilon}]
&= \frac 1 2 (e^\lambda + e^{-\lambda}) \\
&= \frac 1 2 \left\{\sum_{k=0}^\infty \frac{(-\lambda)^k}{k!} +\sum_{k=0}^\infty \frac{\lambda^k}{k!} \right\} \\
&= \sum_{k=0}^\infty \frac {\lambda^{2k}}{(2k)!}
\text{(Only the even powers remain)}\\
&\leq 1+\sum_{k=1}^\infty \frac{\lambda^{2k}}{2^k k!}
\text{ (since } (2k)! \geq 2^k k! \text{)}\\
&= e^{\frac {\lambda^2} 2}
\end{align*}$$
Therefore, $\epsilon$ is sub-Gaussian with parameter $\sigma=1$.
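A tiny numerical check of the inequality used above, $\cosh(\lambda) \leq e^{\lambda^2/2}$, over an arbitrary grid of $\lambda$ values (just a sketch, not part of the lecture):
```python
# Sketch: check cosh(lambda) <= exp(lambda^2 / 2) on a grid of lambda values.
import numpy as np

lam = np.linspace(-5, 5, 1001)
mgf = np.cosh(lam)                  # E[exp(lambda * eps)] for a Rademacher eps
bound = np.exp(lam**2 / 2)
print("bound holds everywhere on the grid:", bool(np.all(mgf <= bound)))
```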
> More generally, any bounded R.V. is also sub-Gaussian.
#### Bounded random variables
Let $X$ be zero-mean on some bounded interval $[a, b]$.
> We use the technique of **Symmetrization** in here.
Letting $X'$ be an independent copy of $X$ (_symmetrization_), for any $\lambda \in \mathbb{R}$, we have
$$\mathbb{E}_X[e^{\lambda X}] = \mathbb{E}_X[e^{\lambda(X-E_X'[X'])}] \leq \mathbb{E}_{X, X'} [e^{\lambda(X-X')}]$$ (by _Jensen's inequality_)
> Reminder: **Jensen's Inequality**
$$\phi(\mathbb{E}[X]) \leq \mathbb{E}[\phi(X)]$$ for convex function $\phi$.
Here the inequality holds by convexity of exponential functions.
Let $\epsilon$ be an independent Rademacher variable, note that $\epsilon(X-X')$ has the same distribution as $X-X'$.
> This is because $X$ and $X'$ are from the same distribution, and are independent, so the distribution of $X - X'$ is the same as that of $\epsilon(X' - X)$.
Then, we have
$$ \mathbb{E}_{X, X'} [e^{\lambda(X-X')}] = \mathbb{E}_{X, X'} [E_{\epsilon}[e^{\lambda\epsilon(X-X')}]]$$
Conditioning on $X$, $X'$, we can evaluate the inner expectation. By the previous conclusion (Rademacher variables are 1-sub-Gaussian),
$$E_{\epsilon}[e^{\lambda\epsilon(X-X')}] \leq e^{\frac{\lambda^2(X-X')^2}{2}}$$
and since $|X-X'| \leq b-a$, we have that
$$\mathbb{E}_{X, X'} [E_{\epsilon}[e^{\lambda\epsilon(X-X')}]] \leq E_{X, X'} [e^{\frac{\lambda^2(X-X')^2}{2}}] \leq e^{\frac {\lambda^2}{2}(b-a)^2}$$
Hence $X$ is $(b-a)$-sub-Gaussian.
> **Is this the "best" bound we can get?**
No! We can actually improve it to $\sigma = \frac{(b-a)}{2}$. For details, see exercise 2.4.
### 2. Hoeffding bound
(*Chapter 2.1.2*)
> The property of Gaussianity is preserved by linear operations, and so is sub-Gaussianity!
If $X_1$ and $X_2$ are independent sub-Gaussian variables with parameters $\sigma_1$ and $\sigma_2$, then $X_1 + X_2$ is sub-Gaussian with parameter $\sqrt{\sigma_1^2 + \sigma_2^2}.$
> The *Hoeffding bound* is applicable to sums of independent sub-Gaussian R.V.s.
Suppose we have variables $X_i, i=1, ..., n$ that are independent, $\mu_i = E[X_i]$, and each $X_i$ is $\sigma_i$-sub-Gaussian.
$$\mathbb{P}[\sum_{i=1}^n (X_i-\mu_i) \geq t] \leq e^\frac{-t^2}{2 \sum_{i=1}^n \sigma_i^2}$$
> Deviation from sample average to population average
> Note from Yan: What he puts on the board doesn't agree with the textbook. I went with the textbook version.
The general technique we use is to bound the MGF and apply Chernoff bound.
> Note: This proof is not stated in the textbook.
$$\mathbb{E}[e^{\lambda\sum_{i=1}^n(X_i-\mu_i)}] = \mathbb{E}[\Pi_{i=1}^n e^{\lambda(X_i-\mu_i)}] = \Pi_{i=1}^n \mathbb{E}[e^{\lambda(X_i-\mu_i)}]$$ The first equality is by the property of exponentials, and the second equality is by the independence of $X_i$'s.
Each $X_i$ is $\sigma_i$-sub-Gaussian so we have
$$\Pi_{i=1}^n \mathbb{E}[e^{\lambda(X_i-\mu_i)}] \leq \Pi_{i=1}^n e^{\frac{\lambda^2\sigma_i^2} 2} = e^{\frac {\lambda^2}{2}(\sum_{i=1}^n \sigma_i^2)}$$
The key take-away is that for $\sigma_i$-sub-Gaussian variables $X_i$, $\sum_{i=1}^n X_i$ is another sub-Gaussian variable with parameter $\sqrt{\sum_{i=1}^n \sigma_i^2}$.
> We could use independence to manipulate MGFs like this.
After this, we complete the proof by applying the Chernoff technique to $e^{\frac {\lambda^2}{2}(\sum_{i=1}^n \sigma_i^2)}$. He left this as an exercise.
By Chernoff's bound,
$P((X-\mu) \geq t) = P(e^{\lambda(X-\mu)} \geq e^{\lambda t}) \leq \frac{E[e^{\lambda(X-\mu)}]}{e^{\lambda t}}$
$$\mathbb{P}[\sum_{i=1}^n (X_i-\mu_i) \geq t] \leq \frac{E[e^{\lambda\sum_{i=1}^n (X_i-\mu_i)}]}{e^{\lambda t}} \leq e^{\frac {\lambda^2}{2}(\sum_{i=1}^n \sigma_i^2) - \lambda t} \leq e^\frac{-t^2}{2 \sum_{i=1}^n \sigma_i^2}$$
The last inequality comes from minimizing $\frac {\lambda^2}{2}(\sum_{i=1}^n \sigma_i^2) - \lambda t$ over $\lambda$. Taking the derivative, the upper bound is optimized at $\lambda=\frac{t}{\sum_{i=1}^n \sigma_i^2}$. We then plug it back into the expression and obtain the desired inequality.
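A minimal Monte Carlo sketch (not from the lecture) of the Hoeffding bound for uniform variables on $[0,1]$, using the sharper sub-Gaussian parameter $\sigma_i = (b-a)/2 = 1/2$ from Exercise 2.4; the sample sizes and threshold are arbitrary:
```python
# Sketch: empirical tail of sum(X_i - mu_i) vs. the Hoeffding bound,
# with X_i ~ Uniform[0, 1], so each X_i is (1/2)-sub-Gaussian.
import numpy as np

rng = np.random.default_rng(1)
n, trials, t = 50, 200_000, 5.0
sigma_i = 0.5

X = rng.uniform(0.0, 1.0, size=(trials, n))
deviations = X.sum(axis=1) - n * 0.5               # sum of (X_i - mu_i)
empirical = np.mean(deviations >= t)
bound = np.exp(-t**2 / (2 * n * sigma_i**2))
print(f"empirical tail={empirical:.2e}   Hoeffding bound={bound:.2e}")
```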
> Martin bashing computer scientists on unbounded random variables.
> Martingale difference sequences (Section 2.3): "beautiful inequalities", "applications on random graphs". I'm not gonna dig into that.
### 3. Application: Multiple Testing
(*Not found in textbook so the notes might be more messed up*)
Suppose we have random variables $Y_i = \theta_i + W_i$.
> $\theta_i$: unknown scalars
> $W_i$: independent, zero-mean, and $\sigma$-sub-Gaussian.
The problem is we would like to differentiate these two scenarios:
- $H_0$: $\theta_i=0$ for all $i$.
- $H_1$: There is at least one $j \in \{1, 2, ..., n\}$ with $\theta_j \geq t > 0$.
> - Useful for drug companies: treatments, experiments, etc. They see outcomes and try to judge whether the drugs are trash or $t$-effective.
> - Another application is ads from large corps like Facebook/Google. They figure out products and click through rates. Martin doesn't like that 👎
**Class Question:** *What if we want to figure out which $j$ is actually effective and how effective?*
**Answer:** *Localization (find $j$) is harder. Estimation (how effective) is even harder.*
> He started drawing graph and I can't follow. (Pls send help)
> Basically two graphs between $H_0$ and $H_1$ scenarios and how do we tell them apart.
**Idea**: Make the threshold high enough so that the noise won't rise above it.
To accomplish this, we want to estimate the probability that the noise is greater than the threshold ($t$). Essentially, <u>we want a high enough $t$ so that this probability is small enough</u>.
**Here is a reasonable testing procedure:**
Declare $H_0$ if $\max_{i=1, .., n} Y_i \leq t$, $H_1$ otherwise.
> *He uses $\tau$ and $t$ interchangeably from now on. I will use $t$ for typing simplicity.*
$P_{H_0}[\text{failing}] = P[\max_{i=1, ..., n} W_i \geq t] \leq \sum_{i=1}^n P(W_i \geq t) = nP(W_1 \geq t) \leq ne^{\frac{-t^2}{2\sigma^2}}$ by the union bound and an i.i.d. assumption.
> Student pointed out it's not assumed to be i.i.d.
So the correction is
$P_{H_0}[\text{failing}] = P[\max_{i=1, ..., n} W_i \geq t] \leq \sum_{i=1}^n P(W_i \geq t) \leq \sum_{i=1}^n e^{\frac{-t^2}{2\sigma^2}} = ne^{\frac{-t^2}{2\sigma^2}}$
The conclusion is
$$P_{H_0}[\text{failing}] \leq ne^{\frac{-t^2}{2\sigma^2}}$$
In practice, if the failing rate is some $\delta \in (0, 1)$, we could set $ne^{\frac{-t^2}{2\sigma^2}} \leq \delta$ and solve for $t$.
By algebra, we see that $t=\sqrt{2\sigma^2 \log(\frac n \delta)}$ is the threshold that keeps the failure probability below $\delta$.
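A small simulation sketch (not from the lecture) of the testing procedure under $H_0$, taking the noise to be Gaussian as a particular $\sigma$-sub-Gaussian case; with the threshold above, the empirical false-alarm rate should come out well below $\delta$ (the union bound is conservative).
```python
# Sketch: threshold t = sqrt(2 sigma^2 log(n / delta)) and the false-alarm rate under H_0.
import numpy as np

rng = np.random.default_rng(2)
n, sigma, delta, trials = 1000, 1.0, 0.05, 20_000
t = np.sqrt(2 * sigma**2 * np.log(n / delta))

W = rng.normal(0.0, sigma, size=(trials, n))        # H_0: Y_i = W_i, no signal
false_alarm_rate = np.mean(W.max(axis=1) >= t)
print(f"threshold={t:.3f}  empirical false-alarm rate={false_alarm_rate:.4f}  target delta={delta}")
```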
> Intuition: Heavier tails would make the threshold grow faster than the logarithmic term here (although some people claim that log terms are always small). That's why we want sub-Gaussian noise.
> Comment from Martin: The union bound is for kids. Real mathematicians use something better --- a chaining argument --- which involves partitioning the space or something (I lost track).
> He also mentions something about homework here but I didn't understand.
> "Gaussian is goofy" -- Martin
Let's refine our zoology 🐒 of random variables now.
### 4. What is not sub-Gaussian? Sub-exponentials!
(*Chapter 2.1.3*)
Let $X \sim N(0, 1)$; then the zero-mean random variable $Z=X^2-1$ is not sub-Gaussian. Squaring makes the tail heavier, but by how much?
Turns out we can compute explicitly (Example 2.8)
$$E[e^{\lambda Z}] = \frac {e^{-\lambda}} {\sqrt{1- 2\lambda}}$$ for $\lambda < \frac 1 2$, and $+\infty$ otherwise.
> Note from Yan: the board version didn't agree with the textbook; the formula above follows the textbook (Example 2.8).
It's actually a **sub-exponential** variable. Any squared sub-Gaussian is sub-exponential.
#### Sub-Exponential
**Def**: $X$ is $(\nu, \alpha)$-**sub-exponential** if
$$E[e^{\lambda(X-\mu)}] \leq e^{\frac{\lambda^2\nu^2}{2}} \forall |\lambda| < \frac 1 \alpha$$
> Random comments from Martin: Sasha makes Martin buy expensive chalk but then the classroom moves to a room with whiteboards. Now he has to buy Sasha-recommended expensive markers again. He claims that Sasha should pay for them.
**Corollary**: Any $\sigma$-sub-Gaussian is $(\sigma, 0)$-sub-exponential.
**Claim**: $\frac {e^{-\lambda}} {\sqrt{1- 2\lambda}} \leq e^{\frac {\lambda^2 \cdot 4}{2}} \ \ \forall |\lambda| < \frac 1 4$, so $Z$ is sub-exponential with parameters $(\nu, \alpha) = (2, 4)$. See Chapter 2.
> He wrote 4 the same way as I do and it took me 30 sec to decipher that.
> Notes from Yan: the LHS here follows the textbook (the board version disagreed).
Warmup for next Lecture:
Example: ($\chi^2$-variables)
$$Z=\sum_{j=1}^d X_j^2= ||(X_1, ..., X_d)||_2^2$$ for $X_j \sim N(0, 1)$ independent; then $Z \sim \chi_d^2$.
Each individual random variable is sub-exponential, and the sum is sub-exponential.
Will find: $||(X_1, ..., X_d)||_2^2 \approx d$ with extremely high probability! $(1-e^{-d})$.
## 02/14 Lecture 3: Sub-exponential variables, Bernstein and Randomized dimensionality reduction
### 1. Sub-exponential bounds
#### Motivating application
(_Example 2.12_)
Given a set of vectors, or dataset, $\{u^1, ..., u^N\}$ with $u^j \in \mathbb{R}^d$ and $N$ the number of vectors, we construct a mapping $F: \mathbb{R}^d \rightarrow \mathbb{R}^m$ with $m \ll d$.
Then we have $\{F(u^1), ..., F(u^N)\}$ with $F(u^j) \in \mathbb{R}^m$, where $m$ is the embedding dimension.
> **Why do we want to reduce the dimension?**
Storage and computing will be much cheaper. We could also produce better visualization.
Ideally, this reduction should preserve some "essential" features of the dataset. For example, we could consider preserving relative distance.
**$\delta$-faithful $l_2$ embedding condition**:
$$1-\delta \leq \frac{||F(u^i)-F(u^j)||_2^2}{||u^i-u^j||_2^2} \leq 1+\delta \ \ \ \ \text{ for all pairs } u^i \neq u^j.$$
$\delta \in (0, 1)$ is the tolerance. In words, the projected dataset preserves all pairwise squared distances up to a multiplicative factor of $1 \pm \delta$.
It turns out that we can construct a random linear mapping with $m > \frac{16}{\delta^2} \log(\frac{N}{\epsilon})$ that satisfies the above property with probability $\geq 1-\epsilon$.
> $m$ has a dependence on $N$ but not on $d$, which means this reduction is dimension independent. It might not be ideal because it's oblivious to the characteristics of the dataset itself, unlike other techniques such as PCA, but it also means that it is applicable to any kind of dataset.
> Martin: Let's eat our broccoli 🥦 and learn about the bounds now.
Recall our definition of **$(\nu, \alpha)$-sub-exponential**: Given an r.v. $X$ with $\mathbb{E}[X] = \mu$, we have
$$\mathbb{E}[e^{\lambda(X-\mu)}] \leq e^{\frac{\lambda^2 \nu^2}{2}} \ \ \ \ \forall |\lambda| \leq \frac{1}{\alpha}.$$
**Proposition**: (_Proposition 2.9_)
For $(\nu, \alpha)$-sub-exponential $X$:
$$\mathbb{P}(X-\mu \geq t) \leq \begin{cases}
e^{\frac{-t^2}{2\nu^2}} & t \in [0, \frac{\nu^2}{\alpha}] \\
e^{\frac{-t}{2\alpha}} & t > \frac{\nu^2}{\alpha}
\end{cases} $$
_**Proof**_ (skipped in class)
First, we combine the definition of sub-exponential and Chernoff's bound:
$$\mathbb{P}[X\geq t] \leq e^{-\lambda t} \mathbb{E}[e^{\lambda X}] \leq exp(-\lambda t+\frac{\lambda^2\nu^2}{2}) \text{ for all } \lambda \in [0, \alpha^{-1}).$$
Let $g(\lambda, t)=\exp(-\lambda t+\frac{\lambda^2\nu^2}{2})$; we compute the quantity $g^*(t) := \inf_{\lambda \in [0, \alpha^{-1})} g(\lambda, t)$.
1. When unconstrained, the minimum of the exponent is $-\frac{t^2}{2\nu^2}$, attained at $\lambda^*=\frac{t}{\nu^2}$. This is feasible only when $t \in [0, \frac{\nu^2}{\alpha}]$, because we need $\lambda^* \leq \frac 1 \alpha$.
2. Otherwise, we have $t \geq \frac{\nu^2}{\alpha}$. Since the function $g(\cdot, t)$ is monotonically decreasing on the interval $[0, \lambda^*)$, the constrained minimum is achieved at the boundary point $\lambda = \alpha^{-1}$, where the exponent is $$-\frac t \alpha + \frac{\nu^2}{2\alpha^2} \leq -\frac{t}{2\alpha}$$ given that $t \geq \frac{\nu^2}{\alpha}$.
> This class of sub-exponential random variables has less clean tail bounds than the sub-Gaussian class, but a lot more power: for sub-Gaussian variables, it's very easy to perform some operation and end up escaping the class.
**Fact** (left as an exercise by Martin):
Given independent $X_i \sim$ sub-exp($\nu_i, \alpha_i$), the sum $\sum_{i=1}^n X_i$ is sub-exp($\nu_*, \alpha_*$) with $\nu_* = \sqrt{\sum_{i=1}^n \nu_i^2}$ and $\alpha_* = \max_{i} \alpha_i$.
> Intuition: $\nu$ adds up like a variance (as in the sub-Gaussian case), while $\alpha$ takes the loosest bound.
$P(\sum_{i=1}^n (X_i-\mu_i) \geq t) \leq e^{\frac{-t^2}{2\nu_*^2}}$ for $t \in [0, \frac{\nu_*^2}{\alpha_*}]$ (optimized Chernoff bound), or $\leq e^{\frac{-t}{2\alpha_*}}$ for $t > \frac{\nu_*^2}{\alpha_*}$ (boundary point).
_**Proof**_ (_end of p. 28 & beginning of p.29_)
$$\mathbb{E}[e^{\lambda \sum_{i=1}^n (X_i-\mu_i)}] = \mathbb{E}[\Pi_{i=1}^n e^{\lambda(X_i-\mu_i)}]= \Pi_{i=1}^n \mathbb{E}[e^{\lambda(X_i-\mu_i)}] \leq \Pi_{i=1}^n e^{\lambda^2 \nu_i^2/2} = e^{\sum_{i=1}^n \lambda^2 \nu_i^2/2} = e^{\lambda^2 \sum_{i=1}^n \nu_i^2/2}$$ valid whenever $|\lambda| < \frac 1 {\alpha_i}$ for every $i$.
Since we need a single range of $\lambda$ that works for all $i$, we take the most restrictive constraint, i.e., the smallest $\frac 1 {\alpha_i}$, which corresponds to $\alpha_* = \max_i \alpha_i$.
Therefore, $\sum_{i=1}^n X_i$ is sub-exp($\sqrt{\sum_{i=1}^n \nu_i^2}$, $\max_{i} \alpha_i$).
> Note from Yan: I initially worked out the opposite conclusion for $\alpha$, but requiring the MGF bound to hold for a common $\lambda$ forces the most restrictive constraint, i.e., $\alpha_* = \max_i \alpha_i$.
**Special case**: i.i.d. with identical parameters, so $\nu_i = \nu, \alpha_i = \alpha, \mu_i = \mu$, which means that $\nu{}_{*} = \sqrt{n}\nu, \alpha_* = \alpha$. Writing $t = n\delta$, we plug in to get $\mathbb{P}(\sum_{i=1}^n (X_i-\mu_i) \geq n\delta) \leq
\begin{cases}e^{-n\delta^2/2\nu^2} & \delta \in [0,\nu^2/\alpha]\\
e^{-n\delta/2\alpha} & \delta > \nu^2/\alpha
\end{cases}$
**Example**: $\chi^2$ variable $Z=\sum_{k=1}^n X_k^2$, $X_k \sim N(0,1)$ i.i.d., $\mathbb{E}[Z]=n$.
Last time: $X_k^2 \sim$ sub-exp$(2, 4)$; by the fact above, the sum is sub-exp$(2\sqrt n, 4)$,
$$P(|(\frac{1}n \sum_{k=1}^n x_k^2)-1| \geq \delta) \leq 2e^{\frac{-n\delta^2}{8}}, \delta \in [0, 1]$$ (*) (will be used later!).
> High dimension Gaussian is like a ball 🏀 with radius $\sqrt n$ and all the mass is on a thin shell with outer radius $\sqrt n(1+\delta)$ and inner radius is $\sqrt n(1-\delta)$. The probability is given above.
> Asked: Is this like the CLT in high dimensions? Answer: kind of related; the CLT says the sum is approximately Gaussian, while this says the tail is Gaussian-like. The assumption here is stronger but the conclusion is non-asymptotic.
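A short numpy sketch (not from the lecture) of the thin-shell picture: the empirical frequency of $|\frac1n\sum_k x_k^2 - 1| \geq \delta$ sits far below the sub-exponential bound (*); the dimension and $\delta$ are arbitrary choices.
```python
# Sketch: concentration of the squared norm of a standard Gaussian vector,
# compared with the bound 2 exp(-n delta^2 / 8) from (*).
import numpy as np

rng = np.random.default_rng(3)
n, trials, delta = 500, 100_000, 0.2
X = rng.standard_normal((trials, n))
ratio = (X**2).sum(axis=1) / n                      # (1/n) sum_k x_k^2, mean 1
empirical = np.mean(np.abs(ratio - 1.0) >= delta)
bound = 2 * np.exp(-n * delta**2 / 8)
print(f"empirical={empirical:.2e}   bound={bound:.2e}")
```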
### 2. Bernstein and consequences
**Bernstein's condition**:
(_Equation 2.15_)
Given a random variable $X$ with $\mu = E(X)$, $\sigma^2 = \operatorname{Var}(X)$, and some parameter $b > 0$, the Bernstein condition $BC(\sigma, b)$ holds if
$$|E[(X-\mu)^k]| \leq \frac 1 2 k!\,\sigma^2b^{k-2}, \quad k=2,3,4,\ldots$$
**Idea**: If $|X-\mu|\leq b$, then $|E[(X-\mu)^k]| \leq E[|X-\mu|^{k-2}|X-\mu|^2] \leq b^{k-2}\sigma^2$.
Bernstein's condition links to MGFs via $E[e^{\lambda(X-\mu)}] = \sum_{k=0}^\infty \frac{\lambda^kE[(X-\mu)^k]}{k!}$.
> This condition basically requires the moment generating functions to not blow up too fast.
**Proposition**(Bernstein):
Under $BC(\sigma, b)$, we have
$$\mathbb{E}[e^{\lambda (X-\mu)}] \leq e^{\frac{\lambda^2 \sigma^2/2}{1-b|\lambda|}} \ \ \ \ for \ all \ |\lambda|<\frac 1 b$$
Moreover, the concentration inequality
$$P(X-\mu \geq t) \leq \exp\left\{-\frac{t^2}{2(\sigma^2+bt)}\right\}$$ $(\leq \max\{e^{\frac{-t^2}{2\sigma^2}}, e^{\frac{-t}{2b}}\})$
_He erased the () part and made it two regions._
_Sasha said the max is correct._
_Martin thinks he's spiritually correct but isn't sure about the details._
> Special case of sub-exponential
> Truncating the tails if they vanish quickly.
**Remark**: _Why is Bernstein better than Hoeffding if a bounded variable, i.e., $|X-\mu| \leq b$, is also sub-Gaussian?_
Example of a _bounded variable_: Given $X$ with $P(X=1) = \mu, P(X=0) = 1-\mu$,
we have $\mu \in [0, 1]$ and $\sigma^2=\mu(1-\mu)$, which is maximized at $\mu=\frac 1 2$, so $\sigma^2 \leq \frac 1 4$. Let $b=1 \ \ (|X| \leq 1)$.
Hoeffding: $P(|X-\mu| \geq t) \leq 2e^{-2t^2}$, treating $X$ as sub-Gaussian($\frac 1 2$) (not capturing the finer details).
Bernstein: $P(|X-\mu| \geq t) \leq 2e^{\frac{-t^2}{2\mu(1-\mu)+2t}}$. As $\mu \rightarrow 0$ or $1$, this bound is very tight (it carries the variance information).
> With the Bernstein bound, variables often converge at a slow rate or a fast rate. The slow rate usually carries the variance information in the denominator of the exponent. However, if the variance is small, the slow rate actually dominates and the bound is tighter.
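To see the variance information pay off, here is a sketch (not from the lecture) comparing the two bounds for the sample mean of $n$ Bernoulli($\mu$) variables with small $\mu$. The sum form of Bernstein used here, $P(\bar X - \mu \geq t) \leq \exp\big(-\frac{nt^2}{2(\sigma^2 + bt)}\big)$, is the standard extension of the single-variable proposition above to independent sums and is an assumption of this sketch; the Hoeffding side uses only boundedness.
```python
# Sketch: Hoeffding vs. Bernstein bounds for the sample mean of n Bernoulli(mu) variables.
import numpy as np

n, mu, b = 1000, 0.01, 1.0
sigma2 = mu * (1 - mu)                                    # small variance
for t in [0.01, 0.02, 0.05]:
    hoeffding = np.exp(-2 * n * t**2)                     # uses only 0 <= X <= 1
    bernstein = np.exp(-n * t**2 / (2 * (sigma2 + b * t)))  # also uses the variance
    print(f"t={t:.2f}  Hoeffding={hoeffding:.2e}  Bernstein={bernstein:.2e}")
```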
### 3. Randomized dimensionality reduction
**Def**: a random matrix $X \in \mathbb{R}^{m\times d}$ with $X_{ij} \sim N(0, 1)$ i.i.d.
($m$ = embedding dimension, $d$ = ambient dimension, $m \ll d$)
We could construct $F:R^d\rightarrow R^m$,
$$F(u)=\frac{Xu}{\sqrt m}$$
for data vector $u \in R^d$.
> _$X$ is data oblivious, meaning that its value doesn't depend on the actual data._
>**Why rescale by $\sqrt{m}$?**
$\mathbb{E}[||F(u)||_2^2] = u^T \mathbb{E}[\frac{X^TX}{m}]u=||u||_2^2$, so the squared norm is preserved in expectation.
Now, we check that this mapping satisfies the $\delta$-faithful $l_2$ embedding condition:
Denote $\Delta=\frac{u^i-u^j}{||u^i-u^j||_2}$, and let $x_k^T$ denote the $k$-th row of $X$. Then
$$\frac{||F(u^i)-F(u^j)||_2^2}{||u^i-u^j||_2^2} = \frac{1}{m}\left\|X\Big(\frac{u^i-u^j}{||u^i-u^j||_2}\Big)\right\|^2_2=\frac{1}{m}\sum_{k=1}^m(x_k^T\Delta)^2$$
This is a (rescaled) $\chi^2$ variable because $x_k^T\Delta \sim N(0, 1)$ i.i.d. (given that the $X_{ij}$ are i.i.d. standard Gaussian and $||\Delta||_2 = 1$).
$P(|\frac 1 m \sum_{k=1}^m (x_k^T\Delta)^2 - 1| \geq \delta) \leq 2e^{-m\delta^2/8} \ \ \ \forall \delta \in (0,1)$ from (*), the $\chi^2$ tail bound we had before.
So $P( \text{not }\delta\text{-faithful for some pair}) \leq {N \choose 2}\, 2e^{\frac{-m\delta^2}{8}}$ by the union bound over all ${N \choose 2}$ pairs; setting the right-hand side to $\epsilon$ and solving for $m$ gives the required embedding dimension, for any $\delta \in (0, 1)$.
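Putting the whole argument together, a minimal numpy sketch (not from the lecture; the dataset, $\delta$, and $\epsilon$ are arbitrary choices) that builds $F(u) = Xu/\sqrt m$ and checks the $\delta$-faithful condition on every pair:
```python
# Sketch: random projection F(u) = X u / sqrt(m) and the pairwise distortions.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
N, d, delta, eps = 50, 5000, 0.25, 0.05
m = int(np.ceil(16 / delta**2 * np.log(N / eps)))    # embedding dimension

U = rng.standard_normal((N, d))                      # dataset u^1, ..., u^N (rows)
X = rng.standard_normal((m, d))
F = U @ X.T / np.sqrt(m)                             # rows are F(u^j)

ratios = np.array([np.sum((F[i] - F[j])**2) / np.sum((U[i] - U[j])**2)
                   for i, j in combinations(range(N), 2)])
print(f"m={m}, distortion range [{ratios.min():.3f}, {ratios.max():.3f}],"
      f" target [{1 - delta:.2f}, {1 + delta:.2f}]")
```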
## 02/16 Lecture 4: Tail bounds for more general functions
### Thus far
We have talked about sums of independent random variables
$$f(X_1, ..., X_n) = \frac 1 n\sum_{i=1}^n X_i$$
where $X_i$ are independent.
### Question: Can we move to more general functions f?
(_not found in textbook_)
**Example**: $f(X_1, .., X_n) = \max_{i=1, ...,n} X_i$ with $X_i \sim N(0, \sigma^2)$.
In homework #1, we need to show $\mathbb{E}[f(X)] \leq \sqrt{2\sigma^2 \log n}$.
In addition, $(f(X)-\mathbb{E}[f(x)]) \sim$ sub-Gaussian($\sigma$), which means that the tail is like $N(0, \sigma^2)$.
**Example**: (Rademacher and Gaussian complexity)
Let $A \subset \mathbb{R}^n$ be bounded.
> Measure the size of a set by randomness.
$Z(A) = \sup_{a \in A} \sum_{i=1}^n \epsilon_ia_i$, where the $\epsilon_i$ are i.i.d. with $P(\epsilon_i = +1)=P(\epsilon_i = -1)=\frac 1 2$.
Rademacher complexity is $R(A)=\mathbb{E}[Z(A)]$.
### Bounded differences inequality
(_2.2 second half_)
> Azuma-Hoeffding (dependent variables and general Gaussian structure)
→ (independence + sums) Hoeffding
→ (bounded difference property) bounded difference inequality
> Martin is not covering Azuma-Hoeffding during lecture, but he recommended reading about it.
Given $X = (X_1, ..., X_n)$, define $\tilde{X}^k=(X_1, ..., X_{k-1}, \tilde{X}_k, X_{k+1}, ..., X_n)$.
> Note from Yan: Basically, the $k$-th entry of the vector is replaced.
With this notation, $f: \mathbb{R}^n \rightarrow \mathbb{R}$ satisfies the **bounded differences property** if, for each index $k=1, 2, ..., n$,
$$|f(X)-f(\tilde{X}^k)| \leq L_k \ \ \ \ \ \forall X, \tilde{X}^k \in \mathbb{R}^n$$
Bounded differences for $f(X_1, .., X_n) = \max_{i=1, ...,n} X_i$: say $|X_k| \leq b_k$; then $L_k=2b_k$.
> Note from Yan: Intuitively, the worst case is when the other coordinates are all very negative and $X_k$ is changed from $b_k$ to $-b_k$, which moves the max by $2b_k$.
**Proposition**:
For $X_k$ independent, if $f$ satisfies $BD(L_k) \ \forall k=1, ..., n$:
$$P(|f(X)-\mathbb{E}[f(X)]| \geq t) \leq 2e^{-\frac{2t^2}{\sum_{k=1}^n L_k^2}} \ \ \ \forall t \geq 0$$
> Behaving like a sub-Gaussian with parameter $\frac{\sqrt{\sum_{k=1}^n L_k^2}}{2}$. The function shouldn't depend on any one coordinate too strongly.
> Note from Yan: He didn't prove it during class. It is probably because the proof involves martingale difference sequence, which was not covered during class.
> Application in graph theory (_textbook 2.24_): clique number in random graph.
**Example** (Rademacher complexity concentration)
Let $\{\epsilon_k\}_{k=1}^n$ be an i.i.d. sequence of Rademacher variables.
> Notes from Yan: Rademacher variables take values -1 or +1 with probability 1/2.
Given a collection of vectors $A \subset \mathbb{R}^n$, define the function
$$f(\epsilon_1, ..., \epsilon_n) = \sup_{a \in A} \sum_{i=1}^n \epsilon_ia_i$$
The expectation $R(A) := \mathbb{E}[f]$ is the **Rademacher complexity** of the set $A$.
Then, we claim that such an $f$ satisfies the bounded differences property. Consider the difference
$$f(\epsilon)-f(\tilde{\epsilon}^k) = \sup_{a \in A} \sum_{i=1}^n \epsilon_ia_i-\sup_{a \in A} \sum_{i=1}^n \tilde{\epsilon}_ia_i $$
where $\tilde{\epsilon}_i = \epsilon_i \ \ \forall i \neq k$.
Let $a^*$ be a maximizer of the first supremum, so that $\sup_{a \in A} \sum_{i=1}^n \epsilon_ia_i = \epsilon^T a^*$.
Then,
$$f(\epsilon)-f(\tilde{\epsilon}^k) = \sup_{a \in A} \sum_{i=1}^n \epsilon_ia_i-\sup_{a \in A} \sum_{i=1}^n \tilde{\epsilon}_ia_i \\ \leq \epsilon^Ta^*-\tilde{\epsilon}^Ta^* = (\epsilon_k - \tilde{\epsilon}_k)a_k^* \leq 2\sup_{a \in A}|a_k|=L_k$$
(A small Monte Carlo check of the resulting concentration appears after the notation notes below.)
> Note from Yan: Here are some notations that I tried to decipher. Hope they clear up some confusion.
- $\epsilon$ vs. $\epsilon_i$: $\epsilon_i$ is an individual Rademacher variable and $\epsilon$ is the vector containing all the $\epsilon_i$.
- $\tilde{\epsilon}^k$ vs. $\tilde{\epsilon}_k$: $\tilde{\epsilon}^k$ is the vector with the $k$-th coordinate replaced (the bounded-differences notation), while $\epsilon_k$ and $\tilde{\epsilon}_k$ are the $k$-th coordinates of $\epsilon$ and $\tilde{\epsilon}^k$.
- $a^*$ vs. $a_i$ vs. $a_k^*$: $a$ is a vector in the set $A$, $a_i$ is its $i$-th coordinate, $a^*$ is the maximizing vector, and $a_k^*$ is the $k$-th coordinate of $a^*$.
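Here is the Monte Carlo check referenced above: a sketch (not from the lecture) that estimates $R(A)$ for a small random finite set $A$ and compares the spread of $Z(A)$ with the sub-Gaussian parameter $\frac{\sqrt{\sum_k L_k^2}}{2}$ coming from the bounded differences; the set and dimensions are arbitrary.
```python
# Sketch: Monte Carlo estimate of the Rademacher complexity R(A) of a finite set A,
# plus the spread of Z(A) vs. the bounded-differences sub-Gaussian parameter.
import numpy as np

rng = np.random.default_rng(5)
n, num_vectors, trials = 20, 30, 50_000
A = rng.uniform(-1.0, 1.0, size=(num_vectors, n))    # rows are the vectors a in A

eps = rng.choice([-1.0, 1.0], size=(trials, n))      # Rademacher vectors
Z = (eps @ A.T).max(axis=1)                          # Z(A) = sup_{a in A} <eps, a>
L = 2 * np.abs(A).max(axis=0)                        # L_k = 2 sup_{a in A} |a_k|
print(f"R(A) ~ {Z.mean():.3f},  std of Z(A) = {Z.std():.3f},  "
      f"sqrt(sum L_k^2)/2 = {np.sqrt((L**2).sum()) / 2:.3f}")
```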
**Example**: (U-statistics of order 2)
(_textbook 2.23_)
A **pairwise U-statistic** is defined as $$U=\frac{1}{\binom{n}{2}} \sum_{i=1}^n \sum_{j < i} g(X_i, X_j)$$
for some symmetric $g: \mathbb{R}^2 \rightarrow \mathbb{R}$, i.e., $g(s, t)=g(t, s)$.
$$E[U]=E[g(X_1, X_2)]$$
> For instance, if $g(s, t) = |s-t|$, then $E[U] = E_{X_1, X_2}[|X_1-X_2|]$
If we impose the condition that $\|g\|_\infty \leq b$, we can obtain $BD$ with $L_k=\frac{4b}{n}$. (_Textbook Example 2.23_)
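A quick sketch (not from the lecture) of the pairwise U-statistic with $g(s,t) = |s-t|$ for standard Gaussian data; for this choice $E[g(X_1, X_2)] = 2/\sqrt{\pi}$, which the estimate should track.
```python
# Sketch: pairwise U-statistic with g(s, t) = |s - t| on Gaussian data.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
n = 500
X = rng.standard_normal(n)
U = np.mean([abs(X[i] - X[j]) for i, j in combinations(range(n), 2)])
print(f"U = {U:.4f},  E[g(X1, X2)] = 2/sqrt(pi) = {2 / np.sqrt(np.pi):.4f}")
```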
### Lipschitz functions of Gaussian variables
**Def**: $f:\mathbb{R}^n \rightarrow \mathbb{R}$ is **L-Lipschitz** w.r.t. some metric $\rho$ if
$$|f(X)-f(\tilde{X})| \leq L\,\rho(X, \tilde{X})$$
If we use the Euclidean norm $\rho(X, \tilde{X}) = ||X-\tilde{X}||_2$, the following result holds:
**Theorem**: Suppose $f:\mathbb{R}^n \rightarrow \mathbb{R}$ is L-Lipschitz in $l_2$: $|f(X) - f(\tilde{X})| \leq L||X-\tilde{X}||_2 \ \ \ \forall X, \tilde{X} \in \mathbb{R}^n$. Then for $X_i \sim N(0, 1)$ i.i.d. we have
$$E[e^{\lambda(f(X)-E[f(X)])}] \leq e^{\frac{\lambda^2 L^2}{2}} \quad \forall \lambda \in \mathbb{R}$$
> This is the MGF of a Gaussian with variance $L^2$, so the tail behaves like $N(0, L^2)$.
**Examples** for Lipschitz functions:
(a) $f(X_1, ..., X_n) = \max_{i=1, ..., n} X_i$ is 1-Lipschitz, since $|\max_i X_i - \max_i \tilde{X}_i| \leq \max_i |X_i - \tilde{X}_i| \leq ||X-\tilde{X}||_2$.
$E[f(x)] \approx \sqrt{2\sigma^2\log n}$
(b) $f(X_1, ..., X_n) = ||X||_2$ is also a 1-Lipschitz function (reverse triangle inequality), so
$$P\big(\big|\,||X||_2-E[||X||_2]\,\big| \geq t\big) \leq 2e^{\frac{-t^2}{2}} \quad \forall t \geq 0$$
**Example** (Singular values of random matrices) (_Textbook Example 2.32_)
$X \in \mathbb{R}^{m \times d}$, $X_{ij} \sim N(0, 1)$ i.i.d., $n=md$
$$f(X) = \frac{|||X|||_2}{\sqrt{m}} = \max_{||u||_2 = 1, u \in \mathbb{R}^d} \frac{||Xu||_2} {\sqrt{m}}$$
> max singular value of $\frac{X}{\sqrt{m}}$
What's the Lipschitz constant? $\frac 1 {\sqrt{m}}$
_Proof_:
$f(X) - f(\tilde{X}) = \frac {1}{\sqrt{m}} \left(|||X|||_2 - |||\tilde{X}|||_2\right)$
> Need to study: operator norm, linear norm, nuclear norm.
> By the reverse triangle inequality,
$\leq \frac {1}{\sqrt{m}} |||X-\tilde{X}|||_2$
We'd like to show that $f(X) - f(\tilde{X}) \leq \frac {1}{\sqrt{m}} \sqrt{\sum_{i=1}^m\sum_{j=1}^d (X_{ij} - \tilde{X}_{ij})^2}$,
i.e., that the operator norm of the difference is bounded by its Frobenius norm.
> Left as an exercise: $|||X-\tilde{X}|||_2 \leq |||X-\tilde{X}|||_F$
$P\left(\frac{|||X|||_2}{\sqrt{m}} \geq \frac{E[|||X|||_2]}{\sqrt{m}} +t\right) \leq e^{\frac{-mt^2}{2}} \quad \forall t \geq 0$
Why rescale by $\sqrt{m}$?
The sample covariance is $\hat{\Sigma} = \frac{X^TX}{m}$.
With this rescaling, the behavior is very close to Gaussian.
> Q: What's the expectation?
> A: $E\left[\frac{|||X|||_2}{\sqrt{m}}\right] \leq 1 + \sqrt{\frac d m}$ (aspect ratio)
Strongly recommend trying this out in numpy (a sketch follows below).
First develop concentration, then find the expectation.
$X$ has rows $X_1^T, ..., X_m^T$ with $X_i \sim N(0, I_d)$ i.i.d., and $\hat{\Sigma} = \frac 1 m \sum_{i=1}^m X_iX_i^T = \frac{X^TX}{m}$.
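Following that recommendation, a minimal numpy sketch (dimensions and trial count are arbitrary) that simulates the maximum singular value of $X/\sqrt m$ and compares its mean and spread with $1 + \sqrt{d/m}$ and the $\frac{1}{\sqrt m}$-Lipschitz concentration:
```python
# Sketch: max singular value of X / sqrt(m) for a Gaussian random matrix.
import numpy as np

rng = np.random.default_rng(7)
m, d, trials = 400, 100, 200
vals = np.array([np.linalg.norm(rng.standard_normal((m, d)), 2) / np.sqrt(m)
                 for _ in range(trials)])     # largest singular value / sqrt(m)
print(f"mean = {vals.mean():.3f},  std = {vals.std():.4f},  "
      f"1 + sqrt(d/m) = {1 + np.sqrt(d / m):.3f}")
```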
Next time he wants to do more proofs.