# 02/07 Lecture 1: Introduction
[toc]
## 1. Logistics
1. **Homework**: There will be 4 psets. They start easy and get harder toward the end.
2. **Background**: The class requires background in probability, statistics, linear algebra, and real analysis; read Chapter 2 of the textbook for a refresher. In general, some mathematical maturity is expected. Reach out if you have questions about your background.
3. **Materials**:
- Martin Wainwright, _High-Dimensional Statistics: A Non-Asymptotic Viewpoint_. Cambridge University Press.
- Philippe Rigollet and Jan-Christian Hütter, _High Dimensional Statistics_. Lecture Notes (MIT).
- Roman Vershynin, _High-Dimensional Probability: An Introduction with Applications in Data Science_. Cambridge University Press.
- Ramon van Handel, _Probability in High Dimensions_. Lecture Notes (Princeton).
4. **Extra help**: Some recitations will be scheduled early in the course to fill gaps in background. Office hours will also be scheduled.
## 2. What's the class about?
The class is mostly about non-traditional statistics: the machinery, theory, and models of the high-dimensional, non-asymptotic approach.
## 3. A Motivating Example: Covariance Estimation
_(Chapter 1, Wainwright)_
Suppose $X \in \mathbb{R}^d$ is a random vector with $\mathbb{E}[X]=0$ and $\Sigma = \operatorname{cov}(X) \in \mathbb{R}^{d \times d}$.
> Stats Reminder: The covariance matrix of a random vector $X$ is the matrix analogue of its variance: a square matrix containing all the covariances between the entries of the vector.
If we want to estimate the unknown $\Sigma$, given random vectors $\{x_1, ..., x_n\}$ drawn i.i.d. from this distribution in $\mathbb{R}^d$, we could use the _sample covariance matrix_
$$\hat{\Sigma} = \frac 1 n \sum_{i=1}^n x_ix_i^T$$
a $d \times d$ random matrix corresponding to the sample average of the outer products $x_i x_i^T \in \mathbb{R}^{d \times d}$.
> Notation: a hat ($\hat{\ }$) denotes a quantity computed from data/observations, and $n$ denotes the size of the dataset from now on.
By construction, the sample covariance is unbiased, meaning that $\mathbb{E}[\hat{\Sigma}_n]=\Sigma$.
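(Not from the lecture: a minimal numpy sketch of the estimator, just to fix ideas. The choice of $\Sigma$ and the sizes $n, d$ below are arbitrary.)
```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 10                     # illustrative sample size and dimension

# Pick some positive semi-definite ground-truth covariance Sigma.
A = rng.standard_normal((d, d))
Sigma = A @ A.T / d

# Draw n i.i.d. mean-zero samples x_i in R^d; the rows of X are the x_i.
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

# Sample covariance: (1/n) * sum_i x_i x_i^T, written as one matrix product.
Sigma_hat = X.T @ X / n
print(Sigma_hat.shape)             # (d, d)
```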
Three modes of analysis:
1. **Classical asymptotic analysis**: $\hat{\Sigma}_n \rightarrow \Sigma$ as $n \rightarrow \infty$ with $d$ fixed.
> This argument is not very informative since, for example, in some analyses we only have a finite (small) $n$ while $d$ changes with it.
2. **High-dimensional asymptotics**: What if we increase $d$ and $n$ but keep their ratio unchanged? To put it formally, $d \rightarrow \infty$, $n \rightarrow \infty$, $\frac d n \rightarrow \alpha \in (0, 1)$.
<u>Marcenko-Pastur law:</u>
Take a sequence of pairs $(\hat{\Sigma}, \Sigma)$ indexed by $(n,d)$ with $\Sigma= I_{d\times d}$ (all eigenvalues equal to 1), and compute the eigenvalues of the random sample covariance matrix; naively, they should cluster around 1.
The MP law says they do not: the empirical eigenvalue distribution is not a bell shape centered at 1, but spreads over an interval, roughly $[(1-\sqrt{\alpha})^2, (1+\sqrt{\alpha})^2]$, determined by the ratio $\alpha$.
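(A quick simulation of this effect, my own sketch with numpy: hold $\alpha = d/n$ fixed, take $\Sigma = I$, and look at the extreme eigenvalues of $\hat{\Sigma}$ as $n$ grows.)
```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.5                          # fixed aspect ratio d/n

for n in [200, 800, 3200]:           # grow n and d together
    d = int(alpha * n)
    X = rng.standard_normal((n, d))  # true covariance is I_{d x d}
    Sigma_hat = X.T @ X / n
    eig = np.linalg.eigvalsh(Sigma_hat)
    # Marcenko-Pastur: the eigenvalues spread over roughly
    # [(1 - sqrt(alpha))^2, (1 + sqrt(alpha))^2] ~= [0.086, 2.914] for alpha = 0.5,
    # instead of concentrating at 1.
    print(n, d, round(eig.min(), 3), round(eig.max(), 3))
```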
3. **Non-asymptotic analysis**:
What's different in our approach is that we want to provide *exact*, non-asymptotic bounds that hold for every finite $n$ and $d$.
> Define: the $l_2$-operator norm. For $M\in \mathbb{R}^{d \times d}$, $|||M|||_2 := \max_{\|u\|_2=1,\, u\in \mathbb{R}^d} \|Mu\|_2$, i.e., the largest singular value of $M$ (the largest absolute eigenvalue when $M$ is symmetric).
We could get a bound like the following, which measures exactly how much the sample covariance deviates from the ground truth:
$$\mathbb{P}\left[|||\hat{\Sigma}_n-\Sigma|||_{2} \geq c_0 \max\left\{\sqrt{\tfrac{d}{n}}, \tfrac{d}{n}\right\}+ t\right] \leq 2 e^{-c_1 n t^2} \quad \forall t \in [0,1]$$
In this approach, all the terms matter; nothing is sent off to a limit. The non-asymptotic approach implies all the classical asymptotic theory, but also gives us much more.
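(To get a feel for the $\sqrt{d/n}$ term in the bound above, here is a small Monte Carlo sketch of my own, with $\Sigma = I$ and standard Gaussian data; it just compares the typical operator-norm error to $\sqrt{d/n}$.)
```python
import numpy as np

rng = np.random.default_rng(1)
d = 50
for n in [100, 400, 1600, 6400]:
    errs = []
    for _ in range(20):                      # a few Monte Carlo repetitions
        X = rng.standard_normal((n, d))      # true covariance is I_d
        Sigma_hat = X.T @ X / n
        errs.append(np.linalg.norm(Sigma_hat - np.eye(d), ord=2))  # l2-operator norm
    # The average error should shrink roughly like a constant times sqrt(d/n).
    print(n, round(float(np.mean(errs)), 3), round(np.sqrt(d / n), 3))
```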
**Overview of the class (Part I):**
- Tail bounds and concentration inequalities
- Applications:
  - mean estimation
  - least-squares estimators
  - random sketching
  - matrix estimation problems
  - PCA
  - spectral clustering
- Empirical processes (briefly touched on today)
- Structure and how to exploit it
#### Touch on empirical processes
How do we analyze $|||\hat{\Sigma}_n-\Sigma|||_{2}$?
Unpacking a little bit:
By definition:
$$|||\hat{\Sigma}_n-\Sigma|||_{2} = \max_{\|u\|_2=1,\, u\in \mathbb{R}^d} \|(\hat{\Sigma}_n-\Sigma)u\|_2 = \max_{\|u\|_2=1,\, u\in \mathbb{R}^d}\ \max_{\|v\|_2=1,\, v\in \mathbb{R}^d} v^T(\hat{\Sigma}_n-\Sigma)u$$
> Using the fact that $\|a\|_2=\max_{\|b\|_2=1} b^Ta$
Special case when $u=v$:
$$\max_{\|u\|_2=1,\, u\in \mathbb{R}^d} u^T(\hat{\Sigma}_n-\Sigma)u = \max_{\|u\|_2=1,\, u\in \mathbb{R}^d} \left\{ \frac 1 n \sum_{i=1}^n \langle x_i, u\rangle ^2 - \mathbb{E}[\langle X, u\rangle^2] \right\}$$
using $u^T\hat{\Sigma}_n u = \frac 1 n \sum_{i=1}^n \langle x_i, u\rangle^2$ and $u^T\Sigma u = \mathbb{E}[u^T X X^T u] = \mathbb{E}[\langle X, u\rangle^2]$. For each fixed $u$ this is a sample average minus its expectation; taking the maximum over all unit vectors $u$ is exactly what makes it an _empirical process_.
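(A numerical sanity check of this identity, my own sketch: for a fixed unit vector $u$, the quadratic form $u^T(\hat{\Sigma}_n-\Sigma)u$ is exactly a sample average of $\langle x_i, u\rangle^2$ minus its expectation $u^T\Sigma u$.)
```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 5
A = rng.standard_normal((d, d))
Sigma = A @ A.T / d                               # arbitrary PSD ground truth
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
Sigma_hat = X.T @ X / n

u = rng.standard_normal(d)
u /= np.linalg.norm(u)                            # a fixed unit vector

lhs = u @ (Sigma_hat - Sigma) @ u                 # quadratic form in the deviation
rhs = np.mean((X @ u) ** 2) - u @ Sigma @ u       # sample avg of <x_i,u>^2 minus E[<X,u>^2]
print(np.isclose(lhs, rhs))                       # True: they are the same quantity
```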
## 4. Basic Tail Bounds
_(Chapter 2, Wainwright)_
### a) Some classical bounds
- **Markov**: For a random variable $Z \geq 0$ and any $t > 0$, $P(Z \geq t) \leq \frac{E[Z]}{t}$
- **Chebyshev**: For a random variable $X$ with $E[X]=\mu$ and $\operatorname{Var}(X)=\sigma^2$, $P(|X-\mu| \geq t) \leq \frac{\sigma^2}{t^2}$
- **Super $p$-th Chebyshev** (higher moments): $P(|X-\mu| \geq t) = P(|X-\mu|^{2p} \geq t^{2p}) \leq \frac{E[(X-\mu)^{2p}]}{t^{2p}}$
- **Chernoff bound**: for any $\lambda > 0$, $P(X-\mu \geq t) = P(e^{\lambda(X-\mu)} \geq e^{\lambda t}) \leq \frac{E[e^{\lambda(X-\mu)}]}{e^{\lambda t}}$
This assumes the moment generating function (MGF) exists, meaning that there is some constant $b > 0$ such that:
$$E[e^{\lambda(X-\mu)}] < \infty \quad \forall \lambda \in (0, b)$$
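(A concrete comparison, not from the lecture: for $X \sim \mathrm{Exp}(1)$ we have $\mu = 1$, $\operatorname{Var}(X) = 1$, the MGF $E[e^{\lambda(X-\mu)}] = e^{-\lambda}/(1-\lambda)$ is finite for $\lambda \in (0,1)$, and the exact tail is $P(X - 1 \geq t) = e^{-(1+t)}$. Optimizing the Chernoff bound over a grid of $\lambda$ already beats Markov and Chebyshev by a lot for large $t$.)
```python
import numpy as np

t = 10.0                                   # deviation level for X - mu
lam = np.linspace(1e-3, 1 - 1e-3, 10000)   # MGF of Exp(1) is finite for lambda in (0, 1)

markov    = 1.0 / (1.0 + t)                # P(X >= 1 + t) <= E[X] / (1 + t)
chebyshev = 1.0 / t**2                     # Var(X) = 1:  P(|X - 1| >= t) <= 1 / t^2
mgf       = np.exp(-lam) / (1.0 - lam)     # E[exp(lambda * (X - 1))]
chernoff  = np.min(mgf * np.exp(-lam * t)) # optimize the Chernoff bound over lambda
exact     = np.exp(-(1.0 + t))             # true tail: P(X >= 1 + t) = e^{-(1+t)}

print(markov, chebyshev, chernoff, exact)  # ~0.091, 0.01, ~5e-4, ~1.7e-5
```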
> Stats Reminder: The $n$-th moment of $X$ is $E[X^n]$
> Stats Reminder: Moment-generating function (MGF) is denoted by $M_X(t) = E[e^{tX}]$, provided that this expectation exists for $t$ in some neighborhood of $0$. Otherwise, we say the MGF doesn't exist.
The MGF generates moments via its Taylor expansion:
$$M_X(t)=E[e^{tX}]=1+tE[X] + \frac{t^2E[X^2]}{2!}+\cdots$$
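(A quick symbolic check of this, my own sketch assuming sympy is available: for $X \sim \mathrm{Exp}(1)$ the MGF is $M_X(t) = \frac{1}{1-t}$ for $t < 1$, and its $k$-th derivative at $0$ recovers the $k$-th moment $E[X^k] = k!$.)
```python
import sympy as sp

t = sp.symbols('t')
M = 1 / (1 - t)                             # MGF of Exp(1), valid for t < 1

# The k-th moment is the k-th derivative of the MGF at t = 0; for Exp(1) it is k!.
for k in range(1, 5):
    print(k, sp.diff(M, t, k).subs(t, 0))   # prints 1, 2, 6, 24
```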
#### Classification of random variables by their Chernoff tail bounds
Ex. The Cauchy distribution is heavy-tailed: it has no finite moments and no MGF, so the Chernoff approach doesn't even apply. 🐶
##### Gaussian tail bounds
Let $X \sim N(\mu, \sigma^2)$. By the Chernoff bound, for any $\lambda > 0$, $$P[(X-\mu) \geq t]=P[e^{\lambda(X-\mu)} \geq e^{\lambda t}] \leq \frac{E[e^{\lambda(X-\mu)}]}{e^{\lambda t}}$$
$E[e^{\lambda (X-\mu)}] =$ ~~some integral, not even in the textbook, and the prof got it wrong three times during lecture~~ $= e^{\frac{\lambda^2\sigma^2}{2}}$ (the Gaussian MGF; complete the square in the exponent).
For a Gaussian the MGF is finite for every $\lambda$, so we optimize over all $\lambda \geq 0$ for the tightest bound:
$P((X-\mu) \geq t) \leq \inf_{\lambda \geq 0} [e^{\frac {\lambda^2\sigma^2}{2}-\lambda t}] = e^{-\frac{t^2}{2\sigma^2}}$ for all $t \geq 0$.
> _The derivation is not in the textbook, but should be easy_
Take the derivative of $\frac {\lambda^2\sigma^2}{2}-\lambda t$ with respect to $\lambda$ and set it to 0: $\lambda\sigma^2 - t = 0$, so $\lambda^*=\frac{t}{\sigma^2}$. Plugging this back in gives the exponent $\frac{t^2}{2\sigma^2}-\frac{t^2}{\sigma^2}=-\frac{t^2}{2\sigma^2}$.
The Mills ratio tells us that this bound is sharp up to polynomial (in $t$) correction factors.
> In the textbook Exercise 2.2 they derive it. Feel like it's gonna be on the pset ngl
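(A quick numerical look at the sharpness claim, my own sketch assuming scipy is available for the exact Gaussian tail: the Chernoff bound has the right exponential rate, and its ratio to the true tail grows only polynomially, roughly like $t$.)
```python
import numpy as np
from scipy.stats import norm

sigma = 1.0
for t in [1.0, 2.0, 4.0, 8.0]:
    chernoff = np.exp(-t**2 / (2 * sigma**2))   # optimized Chernoff bound
    exact    = norm.sf(t, scale=sigma)          # true P(X - mu >= t)
    # Mills ratio: the exact tail behaves like exp(-t^2/2) / (t * sqrt(2*pi)),
    # so chernoff / exact grows only polynomially (roughly linearly) in t.
    print(t, chernoff, exact, round(chernoff / exact, 2))
```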
### b) Sub-Gaussian
**Definition:**
A scalar random variable $X$ with $\mu = \mathbb{E}[X]$ is **$\sigma$-sub-Gaussian** if
$$\mathbb{E}[e^{\lambda (X-\mu)}] \leq e^{\frac{\lambda^2 \sigma^2}{2}} \quad \forall \lambda \in \mathbb{R}$$
> The lecture ends with this definition; he'll probably pick it up in the next lecture and develop its properties.
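(A standard example to keep in mind, my own addition: a Rademacher variable $\varepsilon \in \{-1, +1\}$ with equal probability is $1$-sub-Gaussian, since $E[e^{\lambda \varepsilon}] = \cosh(\lambda) \leq e^{\lambda^2/2}$. A quick numerical check of that inequality:)
```python
import numpy as np

lam = np.linspace(-10, 10, 2001)
mgf = np.cosh(lam)                 # E[exp(lambda * eps)] for Rademacher eps (mu = 0)
bound = np.exp(lam**2 / 2)         # sub-Gaussian MGF bound with sigma = 1
print(bool(np.all(mgf <= bound)))  # True: Rademacher is 1-sub-Gaussian
```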