# 02/07 Lecture 1: Introduction

###### tags: `Yan`

[toc]

## 1. Logistics

1. **Homework**: There will be 4 psets. They will start easy and get harder toward the end.
2. **Background**: The class needs background in probability, statistics, linear algebra, and real analysis. Read Chapter 2 of the textbook for further reference. In general, some mathematical maturity is needed. Reach out if you have questions about background.
3. **Materials**:
    - Martin Wainwright, _High-Dimensional Statistics: A Non-Asymptotic Viewpoint_. Cambridge University Press.
    - Philippe Rigollet and Jan-Christian Hütter, _High Dimensional Statistics_. Lecture Notes (MIT).
    - Roman Vershynin, _High-Dimensional Probability: An Introduction with Applications in Data Science_. Cambridge University Press.
    - Ramon van Handel, _Probability in High Dimensions_. Lecture Notes (Princeton).
4. **Extra help**: Some recitations will be scheduled at the beginning of the class to fill gaps people might have. Office hours will also be scheduled.

## 2. What's the class about?

The class is mostly about non-traditional statistics: the machinery, theory, and models of the high-dimensional, non-asymptotic approach.

## 3. A Motivating Example: Covariance Estimation

_(Chapter 1, Wainwright)_

Suppose $X_i \in \mathbb{R}^d$, $\mathbb{E}[X]=0$, $\Sigma = \operatorname{cov}(X) \in \mathbb{R}^{d \times d}$.

> Stats Reminder: The covariance matrix of a random vector $X$ is the matrix analogue of the variance of $X$: a square matrix containing all the covariances between the entries of the vector.

If we want to estimate the unknown $\Sigma$, given random vectors $\{x_1, ..., x_n\}$ drawn i.i.d. from some distribution in $\mathbb{R}^d$, we can use the _sample covariance matrix_
$$\hat{\Sigma} = \frac 1 n \sum_{i=1}^n x_ix_i^T,$$
a $d \times d$ random matrix corresponding to the sample average of the outer products $x_i x_i^T \in \mathbb{R}^{d \times d}$.

> Notation: from now on, $\hat{\ }$ denotes a quantity computed from the data (an estimator) and $n$ denotes the size of the dataset.

By construction, the sample covariance is unbiased, meaning that $\mathbb{E}[\hat{\Sigma}_n]=\Sigma$.

Three modes of analysis:

1. **Classical asymptotic analysis**: $\hat{\Sigma}_n \rightarrow \Sigma$ as $n$ goes to infinity with $d$ held fixed.
    > This statement is not very informative since, for example, in some analyses we only care about a finite, small $n$ together with a changing $d$.
2. **High-dimensional asymptotics**: What if we increase $d$ and $n$ together but keep their ratio unchanged? Formally, $d \rightarrow \infty$, $n \rightarrow \infty$, $\frac d n \rightarrow \alpha \in (0, 1)$.
    <u>Marchenko-Pastur law:</u> Take a sequence of $(\hat{\Sigma}, \Sigma)$ indexed by $(n,d)$ and compute the eigenvalues of the random sample covariance matrix. If $\Sigma= I_{d\times d}$, its eigenvalues are all 1, so one might expect the eigenvalues of $\hat{\Sigma}$ to cluster around 1. The Marchenko-Pastur law says they do not: their empirical distribution converges to a fixed distribution spread around 1 (supported on $[(1-\sqrt{\alpha})^2, (1+\sqrt{\alpha})^2]$), not a spike at 1.
3. **Non-asymptotic analysis**: What's different in our approach is that we want to provide *exact*, non-asymptotic bounds.
    > Define the $\ell_2$-operator norm: for $M\in \mathbb{R}^{d \times d}$, $|||M|||_2 := \max_{\|u\|_2=1,\, u\in \mathbb{R}^d} \|Mu\|_2$. This is the maximum singular value of $M$ (for symmetric $M$, the maximum eigenvalue in absolute value).

    We can get a bound of the following form, measuring exactly how much the sample covariance deviates from the ground truth:
    $$P\left[|||\hat{\Sigma}_n-\Sigma|||_{2} \geq c_0 \max\left\{\sqrt{\tfrac d n}, \tfrac{d}{n}\right\}+ t\right] \leq 2 e^{-c_1nt^2} \quad \forall t \in [0,1].$$
    In this approach, all terms matter; they don't go away. The non-asymptotic approach implies all of the classical asymptotic theory, but also gives us much more.
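To make the three regimes concrete, here is a minimal numerical sketch (not from the lecture; it assumes NumPy is available, uses Gaussian data with $\Sigma = I_d$, and the helper name `opnorm_error` is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def opnorm_error(n, d):
    """l2-operator-norm error of the sample covariance when Sigma = I_d."""
    X = rng.standard_normal((n, d))        # rows x_i ~ N(0, I_d), so Sigma = I
    Sigma_hat = (X.T @ X) / n              # sample covariance (1/n) sum_i x_i x_i^T
    return np.linalg.norm(Sigma_hat - np.eye(d), ord=2)  # largest singular value

# Classical regime: d fixed, n grows -> the error shrinks, roughly at the sqrt(d/n) scale.
for n in [100, 1_000, 10_000]:
    print(f"d=10,   n={n:6d}: error = {opnorm_error(n, 10):.3f}  (sqrt(d/n) = {np.sqrt(10 / n):.3f})")

# High-dimensional regime: d/n held at alpha = 0.5 -> the error does NOT vanish;
# the eigenvalues of Sigma_hat spread out as the Marchenko-Pastur law predicts.
for n in [200, 1_000, 5_000]:
    d = n // 2
    print(f"d={d:5d}, n={n:6d}: error = {opnorm_error(n, d):.3f}")
```

The non-asymptotic bound above captures both behaviors at once: the $\sqrt{d/n}$ term drives the first regime, and the fixed ratio $d/n$ keeps the error bounded away from zero in the second.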
```
Overview of the class (Part I):
- Tail bounds and concentration inequalities
- Applications:
    - mean estimation
    - least-squares estimators
    - random sketching
    - matrix estimation problems
    - PCA
    - spectral clustering
- touch on empirical process (briefly cover it today)
- structure and how to exploit it
```

#### Touch on empirical process

How do we analyze $|||\hat{\Sigma}_n-\Sigma|||_{2}$? Unpacking a little bit, by definition:
$$|||\hat{\Sigma}_n-\Sigma|||_{2} = \max_{\|u\|_2=1,\, u\in \mathbb{R}^d} \|(\hat{\Sigma}_n-\Sigma)u\|_2 = \max_{\|u\|_2=1} \max_{\|v\|_2=1} v^T(\hat{\Sigma}_n-\Sigma)u$$

> Using the fact that $\|a\|_2=\max_{\|b\|_2=1} b^Ta$.

Special case when $u=v$ (since $\hat{\Sigma}_n-\Sigma$ is symmetric, quadratic forms with $u=v$, up to sign, already determine the operator norm):
$$\max_{\|u\|_2=1} u^T(\hat{\Sigma}_n-\Sigma)u = \max_{\|u\|_2=1} \left\{ \frac 1 n \sum_{i=1}^n \langle x_i, u\rangle ^2 - \mathbb{E}[\langle x_i ,u\rangle^2] \right\},$$
using $u^T\hat{\Sigma}_n u = \frac 1 n \sum_{i=1}^n \langle x_i, u\rangle^2$ and $u^T\Sigma u = \mathbb{E}[u^Tx_ix_i^Tu]=\mathbb{E}[\langle x_i ,u\rangle^2]$. This is the supremum of an *empirical process*: a maximum of centered sample averages indexed by the unit sphere.

## 4. Basic Tail Bounds

_(Chap 2, Wainwright)_

### a) Some classical bounds

- **Markov**: For a random variable $Z \geq 0$, $P(Z \geq t) \leq \frac{\mathbb{E}[Z]}{t}$ for all $t > 0$.
- **Chebyshev**: For a random variable $X$ with $\mathbb{E}[X]=\mu$, $\operatorname{Var}(X)=\sigma^2$, $P(|X-\mu| \geq t) \leq \frac{\sigma^2}{t^2}$.
- **Super $p$-th Chebyshev**: $P(|X-\mu| \geq t) = P(|X-\mu|^{2p} \geq t^{2p}) \leq \frac{\mathbb{E}[(X-\mu)^{2p}]}{t^{2p}}$.
- **Chernoff bound**: For any $\lambda > 0$, $P(X-\mu \geq t) = P(e^{\lambda(X-\mu)} \geq e^{\lambda t}) \leq \frac{\mathbb{E}[e^{\lambda(X-\mu)}]}{e^{\lambda t}}$.

This assumes the moment generating function exists, meaning that there is some constant $b > 0$ such that
$$\mathbb{E}[e^{\lambda(X-\mu)}] < \infty \quad \forall\, \lambda \in (0, b).$$

> Stats Reminder: The $n$-th moment of $X$ is $\mathbb{E}[X^n]$.

> Stats Reminder: The moment-generating function (MGF) is $M_X(t) = \mathbb{E}[e^{tX}]$, provided that this expectation exists for $t$ in some neighborhood of $0$. Otherwise, we say the MGF doesn't exist.

The MGF generates moments via its Taylor expansion:
$$M_X(t)=\mathbb{E}[e^{tX}]=1+t\,\mathbb{E}[X] + \frac{t^2\,\mathbb{E}[X^2]}{2!}+...$$

#### Classification of random variables based on their Chernoff tail bound

Ex. The Cauchy distribution is heavy-tailed: its MGF is infinite for every $\lambda \neq 0$, so the Chernoff method does not apply. 🐶

##### Gaussian tail bounds

Let $X \sim N(\mu, \sigma^2)$. By the Chernoff bound, we have
$$P[X-\mu \geq t]=P[e^{\lambda(X-\mu)} \geq e^{\lambda t}] \leq \frac{\mathbb{E}[e^{\lambda(X-\mu)}]}{e^{\lambda t}}$$

$\mathbb{E}[e^{\lambda (X-\mu)}] = e^{\frac{\lambda^2\sigma^2}{2}}$, a Gaussian integral computed by completing the square ~~(not in the textbook, and the prof got it wrong three times during lecture)~~.

The Gaussian MGF is finite for every $\lambda$, so we optimize over $\lambda \geq 0$ for the tightest bound:
$$P(X-\mu \geq t) \leq \inf_{\lambda \geq 0} e^{\frac {\lambda^2\sigma^2}{2}-\lambda t} = e^{-\frac{t^2}{2\sigma^2}} \quad \text{for all } t \geq 0.$$

> _The derivation is not in the textbook, but should be easy._ Take the derivative of $\frac {\lambda^2\sigma^2}{2}-\lambda t$ with respect to $\lambda$ and set it to 0 to get $\lambda=\frac{t}{\sigma^2}$. Plugging this $\lambda$ back in gives the exponent $-\frac{t^2}{2\sigma^2}$.

The Mills ratio tells us that this bound is sharp up to polynomial-factor corrections.

> In the textbook, Exercise 2.2 derives it. Feel like it's gonna be on the pset ngl.

### b) Sub-Gaussian

**Definition:** A scalar random variable $X$ with $\mu = \mathbb{E}[X]$ is **$\sigma$-sub-Gaussian** if
$$\mathbb{E}[e^{\lambda (X-\mu)}] \leq e^{\frac{\lambda^2 \sigma^2}{2}} \quad \forall \lambda \in \mathbb{R}.$$

> The lecture ends with this definition; he'll probably pick it up in the next lecture and prove more about it.
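As a quick numerical sanity check of the Gaussian tail bound above (a minimal sketch, not from the lecture; it assumes NumPy and the standard library's `math.erfc`): for $X \sim N(\mu, \sigma^2)$ the sub-Gaussian condition holds with equality, and the optimized Chernoff bound is $e^{-t^2/(2\sigma^2)}$. The snippet compares the exact tail, the closed-form bound, and a brute-force infimum over $\lambda$.

```python
import numpy as np
from math import erfc, sqrt

sigma = 2.0
lams = np.linspace(0.0, 10.0, 100_001)  # grid over lambda >= 0 for the Chernoff infimum

for t in [1.0, 2.0, 4.0, 8.0]:
    exact = 0.5 * erfc(t / (sigma * sqrt(2)))            # P(X - mu >= t) for X ~ N(mu, sigma^2)
    chernoff = np.exp(-t**2 / (2 * sigma**2))            # optimized bound e^{-t^2 / (2 sigma^2)}
    grid_inf = np.exp(lams**2 * sigma**2 / 2 - lams * t).min()  # inf over lambda on the grid
    print(f"t = {t:3.1f}: exact = {exact:.3e}  <=  chernoff = {chernoff:.3e}  (grid inf = {grid_inf:.3e})")
```

The bound holds at every $t$ but overshoots the exact tail by a factor of order $t/\sigma$ for large $t$, which is the polynomial (Mills-ratio) correction mentioned above.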