# 02/21 Lecture 5: Constrained least-squares, examples and analysis
###### tags: `Yan`
[toc]
## 1. Gaussian Lipschitz Concentration (Theorem 2.26)
**Definition (L-Lipschitz)**: a function $f:\mathbb{R}^n \rightarrow \mathbb{R}$ is $L$-Lipschitz with respect to the Euclidean norm $||\cdot||_2$ if
$$|f(x)-f(y)| \leq L||x-y||_2 \ \ \ \forall x, y \in \mathbb{R}^n$$
In addition, for a differentiable function, the Lipschitz property guarantees that $||\nabla f(x)||_2 \leq L \ \ \forall x \in \mathbb{R}^n$.
We would then like to prove that:
Let $X \sim N(0, I_{n \times n})$. The variable $f(X)-\mathbb{E}[f(X)]$ is sub-Gaussian with parameter at most $L$, and hence
$$\mathbb{P}[|f(X)- \mathbb{E}[f(X)]|\geq t] \leq 2e^{-\frac{t^2}{2L^2}} \ \ \ \forall t \geq 0$$
> This result guarantees that any $L$-Lipschitz function of a standard Gaussian random vector, regardless of the dimension, exhibits concentration like a scalar Gaussian variable with variance $L^2$.
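To see the dimension-free nature of this bound, here is a minimal simulation sketch (not from the text): $f(x) = ||x||_2$ is $1$-Lipschitz, so the fluctuations of $f(X)$ around its mean should stay $O(1)$ no matter how large $n$ is. The sample sizes and dimensions below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (10, 100, 1000):
    X = rng.standard_normal((10_000, n))   # 10,000 draws of X ~ N(0, I_n)
    f = np.linalg.norm(X, axis=1)          # f(X) = ||X||_2 is 1-Lipschitz
    dev = f - f.mean()                     # f(X) - E[f(X)] (empirical mean)
    # the spread of dev should stay O(1) (sub-Gaussian with parameter <= 1)
    print(f"n = {n:5d}   std(dev) = {dev.std():.3f}   P(|dev| >= 2) = {(np.abs(dev) >= 2).mean():.4f}")
```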
-------------------------------------------------------------
**Lemma**: Suppose $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is differentiable. Then for any convex function $\phi: \mathbb{R}\rightarrow \mathbb{R}$, we have
$$\mathbb{E}[\phi(f(X)-\mathbb{E}[f(X)])] \leq \mathbb{E}_{X, Y}\left[\phi\left(\tfrac{\pi}{2} \langle\nabla f(X), Y\rangle\right)\right]$$
where $X, Y \sim N(0, I_{n \times n})$ are independent.
_Proof Sketch_:
> an interpolation method that exploits the rotation invariance of the Gaussian distribution.
For each $\theta \in [0, \frac \pi 2]$, consider the random vector $Z(\theta) \in \mathbb{R}^n$
$$Z_k(\theta) = X_k \sin(\theta)+Y_k \cos(\theta), \ \ k=1, 2, ..., n$$
> $Z(0) = Y, \ Z(\frac \pi 2)= X$
By Jensen's inequality (applying the convexity of $\phi$ to the expectation over $Y$), we have
$$\mathbb{E}_X[\phi(f(X) - \mathbb{E}_Y[f(Y)])] \leq \mathbb{E}_{X, Y}[\phi(f(X)- f(Y))]$$
Since the $X_k, Y_k \sim N(0, 1)$ are i.i.d., each $Z_k(\theta)$ is also standard Gaussian: its variance is $\sin^2\theta+\cos^2 \theta=1$.
The elementwise derivative is
$$Z'_k(\theta) = X_k\cos(\theta) - Y_k \sin(\theta)$$
For each $k$, $Z_k(\theta)$ and $Z'_k(\theta)$ are uncorrelated, and uncorrelated jointly Gaussian variables are independent.
Indeed, a direct calculation gives
$$\mathrm{Cov}(Z_k(\theta), Z'_k(\theta)) = \mathbb{E}[(X_k\sin\theta + Y_k\cos\theta)(X_k\cos\theta - Y_k\sin\theta)] = \sin\theta\cos\theta\left(\mathbb{E}[X_k^2] - \mathbb{E}[Y_k^2]\right) + (\cos^2\theta - \sin^2\theta)\,\mathbb{E}[X_kY_k] = 0$$
Let $\psi(\theta) = f(Z(\theta))$, so that $\psi'(\theta) = \langle \nabla f(Z(\theta)), Z'(\theta)\rangle$. Then
$$f(X)-f(Y) = f(Z(\tfrac \pi 2))-f(Z(0)) = \int_{0}^{\frac \pi 2} \psi'(\theta) d\theta$$
Applying Jensen's inequality to the average $\frac{2}{\pi}\int_0^{\pi/2}(\cdot)\,d\theta$ and using that $(Z(\theta), Z'(\theta))$ has the same distribution as $(X, Y)$ for every $\theta$ yields
$$\mathbb{E}_{X,Y}[\phi(f(X)-f(Y))] \leq \frac{2}{\pi}\int_0^{\frac \pi 2}\mathbb{E}\left[\phi\left(\tfrac{\pi}{2}\langle\nabla f(Z(\theta)), Z'(\theta)\rangle\right)\right]d\theta = \mathbb{E}_{X,Y}\left[\phi\left(\tfrac{\pi}{2}\langle\nabla f(X), Y\rangle\right)\right]$$
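As a quick numerical sanity check of the interpolation step (illustrative only, not part of the proof), one can verify that for a fixed $\theta$ the pair $(Z_k(\theta), Z'_k(\theta))$ behaves like two independent standard Gaussians:

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 0.7                                   # any fixed angle in [0, pi/2]
X = rng.standard_normal(1_000_000)
Y = rng.standard_normal(1_000_000)
Z  = X * np.sin(theta) + Y * np.cos(theta)    # Z_k(theta)
Zp = X * np.cos(theta) - Y * np.sin(theta)    # Z'_k(theta)
# both should have variance ~ 1 and covariance ~ 0
print("Var(Z) =", round(Z.var(), 3), " Var(Z') =", round(Zp.var(), 3),
      " Cov(Z, Z') =", round(np.cov(Z, Zp)[0, 1], 4))
```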
> Martin: Extra randomness is your friend. Like Gaussian and Rademacher.
-------------------------------------------------------------
We now use the lemma to prove sub-Gaussianity.
Applying the lemma with the convex function $t \mapsto e^{\lambda t}$ yields
$$\mathbb{E}[e^{\lambda (f(X)-\mathbb{E}[f(X)])}] \leq \mathbb{E}_{X, Y}\left[e^{\lambda\frac \pi 2 \langle\nabla f(X), Y\rangle}\right]$$
With $X$ fixed, $\langle\nabla f(X), Y\rangle \sim N(0, ||\nabla f(X)||_2^2)$. Then, using the MGF of a Gaussian variable,
$$\mathbb{E}_{X, Y}\left[e^{\lambda\frac \pi 2 \langle\nabla f(X), Y\rangle}\right] = \mathbb{E}_{X}\left[e^{\frac{\lambda^2\pi^2}{8} ||\nabla f(X)||_2^2}\right] \leq e^{\frac{\lambda^2\pi^2 L^2}{8}}$$
The last inequality holds because the gradient norm is bounded by $L$.
Therefore $f(X)-\mathbb{E}[f(X)]$ is sub-Gaussian with parameter at most $\frac{\pi L}{2}$. This already gives dimension-free concentration of the stated form, with the slightly worse constant $\frac{\pi}{2}L$ in place of $L$; a sharper argument is needed to obtain the parameter $L$ claimed in the theorem.
> Martin: Proofs are like broccoli. Tastes bad but useful.
----
### Example: Gaussian Complexity
Let $A \subset \mathbb{R}^n$ be a compact collection of vectors. Define the random variable
$$z(A) := \sup_{a \in A} \sum_{i=1}^n a_iW_i = \sup_{a \in A} \langle a, W\rangle, \ \ W_i \overset{i.i.d.}{\sim} N(0,1)$$
Define a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ by
$$f: (w_1, ..., w_n) \mapsto \sup_{a \in A} \langle a, w\rangle$$
One can show (exercise) that
$f$ is $L$-Lipschitz with $L=\sup_{a\in A} ||a||_2$.
> Concentration, Lugosi et al.
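Below is a small simulation sketch of this example (the finite set $A$ and all sizes are illustrative choices, not from the lecture): since $z(A)$ is an $L$-Lipschitz function of $W$ with $L = \sup_{a\in A}||a||_2$, its deviations should obey the sub-Gaussian tail bound above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 50, 200                                   # dimension and |A|
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)    # normalize rows so L = sup_a ||a||_2 = 1

W = rng.standard_normal((20_000, n))             # 20,000 draws of W ~ N(0, I_n)
z = (W @ A.T).max(axis=1)                        # z(A) = sup_{a in A} <a, W> per draw
dev = z - z.mean()

t = 1.5
empirical = (np.abs(dev) >= t).mean()
bound = 2 * np.exp(-t**2 / 2)                    # 2 exp(-t^2 / (2 L^2)) with L = 1
print(f"P(|z(A) - E z(A)| >= {t}) = {empirical:.4f}   (sub-Gaussian bound: {bound:.4f})")
```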
## 2. Constrained least-squares
> Note: This part lists a few motivating examples of constrained least-squares problems; the main results will be presented in later lectures.
Let $\theta^* \in \mathbb{R}^d$ be an unknown vector, referred to as the regression vector. Suppose we observe $Y = (y_1, ..., y_d)^T = \theta^* + \frac{\sigma}{\sqrt{n}}(w_1, ..., w_d)^T$, where $(w_1, ..., w_d)^T$ is the normalized noise with the $w_i$ i.i.d., $\mathbb{E}[w_i]=0$, $\mathrm{Var}(w_i)=1$.
The task is to estimate $\theta^*$ under the constraint $\theta^* \in H \subseteq \mathbb{R}^d$, where $H$ is some constraint set.
----
Example: (Gaussian sequence model) Observe i.i.d. samples $Z_i \in \mathbb{R}^d, i=1,...,n$, with mean $\theta^*$ and $\mathrm{cov}(Z_i) = \sigma^2I_{d\times d}$; we want to estimate $\theta^* \in \mathbb{R}^d$.
The sample mean is
$$Y = \frac{1}{n} \sum_{i=1}^n Z_i = \theta^* + \frac{\sigma}{\sqrt{n}} w, \quad w= \frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n(Z_i-\theta^*) \in \mathbb{R}^d$$
If the $Z_i$ are Gaussian, then $w \sim N(0, I_{d \times d})$, i.e., each coordinate $w_j \sim N(0,1)$; this is the Gaussian sequence model:
$$Y \sim N\left(\theta^*, \frac{\sigma^2}{n}I_{d \times d}\right)$$
The constrained least-squares estimator is $\hat{\theta} = \arg\min_{\theta \in H} ||Y-\theta||_2^2$.
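A minimal simulation sketch of this example (the dimensions, noise level, and the choice of constraint set $H$ as an $\ell_2$-ball of radius $R$ are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, sigma, R = 20, 50, 1.0, 2.0

theta_star = rng.standard_normal(d)
theta_star *= R / np.linalg.norm(theta_star)           # keep theta* inside H

Z = theta_star + sigma * rng.standard_normal((n, d))   # Z_i ~ N(theta*, sigma^2 I_d)
Y = Z.mean(axis=0)                                      # Y ~ N(theta*, sigma^2/n I_d)

# Constrained LS over H = {theta : ||theta||_2 <= R} is the projection of Y onto the ball.
theta_hat = Y * min(1.0, R / np.linalg.norm(Y))

print("||Y - theta*||_2         =", np.linalg.norm(Y - theta_star))
print("||theta_hat - theta*||_2 =", np.linalg.norm(theta_hat - theta_star))
```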
----
Example: (Linear Regression)
Given $X \in \mathbb{R}^{n \times p}$ with rows $x_i^T$ and responses $(y_1, ..., y_n)^T \in \mathbb{R}^n$,
$$(y_1, ..., y_n)^T = (x_1^T, ..., x_n^T)^T\beta^*+\sigma (w_1, ..., w_n)^T, \quad \beta^* \in \mathbb{R}^p$$
We convert this to the definition above by rescaling:
$$(y_1, ..., y_n)^T/\sqrt{n} = (x_1^T, ..., x_n^T)^T\beta^*/\sqrt{n}+\frac{\sigma}{\sqrt{n}} (w_1, ..., w_n)^T$$
Let $Y = (y_1, ..., y_n)^T/\sqrt{n}$ and $\theta^* = (x_1^T, ..., x_n^T)^T\beta^*/\sqrt{n} = X\beta^*/\sqrt{n}$.
Constraint sets (a numerical sketch follows this list):
1. Standard OLS: $H = \{\theta \in \mathbb{R}^n \mid \theta \in \mathrm{Range}(X)\}$
> What is the range set of $X$? It is the column space of $X$, a subspace of dimension $\mathrm{Rank}(X)$.
2. Ridge OLS: $H = \{\theta = X\beta/\sqrt{n} \mid \beta \in \mathbb{R}^p, ||\beta||_2 \leq R\}$.
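Here is a minimal numerical sketch of both constraint sets (the dimensions, noise level, and radius $R$ are hypothetical choices; the ridge-constrained case is solved here by projected gradient descent, one of several possible approaches):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, R = 100, 10, 0.5, 1.0

X = rng.standard_normal((n, p))
beta_star = rng.standard_normal(p)
beta_star *= R / np.linalg.norm(beta_star)        # make the true beta feasible
theta_star = X @ beta_star / np.sqrt(n)           # theta* = X beta* / sqrt(n)
Y = theta_star + sigma / np.sqrt(n) * rng.standard_normal(n)

# 1. Standard OLS: H = Range(X), so theta_hat is the orthogonal projection
#    of Y onto the column space of X.
Q, _ = np.linalg.qr(X)
theta_ols = Q @ (Q.T @ Y)

# 2. Ridge OLS: minimize ||Y - X beta / sqrt(n)||_2^2 subject to ||beta||_2 <= R,
#    via projected gradient descent on beta.
A = X / np.sqrt(n)
step = 1.0 / np.linalg.norm(A, 2) ** 2            # 1 / lambda_max(A^T A)
beta = np.zeros(p)
for _ in range(2000):
    beta = beta - step * (A.T @ (A @ beta - Y))   # gradient step for 0.5 * ||Y - A beta||^2
    nrm = np.linalg.norm(beta)
    if nrm > R:                                   # project back onto the l2-ball
        beta *= R / nrm
theta_ridge = A @ beta

print("OLS   error ||theta_hat - theta*||_2 :", np.linalg.norm(theta_ols - theta_star))
print("Ridge error ||theta_hat - theta*||_2 :", np.linalg.norm(theta_ridge - theta_star))
```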
Example (Fixed design non-parametric regression)
$$\tilde{y_i}/\sqrt{n} = f^*(x_i)/\sqrt{n} + \frac{\sigma}{\sqrt{n}} w_i, \ \ i=1, ..., n, \qquad f^*: \mathcal{X} \rightarrow \mathbb{R}$$
This fits the model $Y=\theta^* + \frac{\sigma}{\sqrt{n}}w$ with $d=n$ and $\theta_i^* = \frac{f^*(x_i)}{\sqrt{n}}$.
The design points $\{x_i\}_{i=1}^n$ are fixed (not random); only the $w_i$ are random.
Here $f^* \in F$, where $F$ could be any function class, e.g., neural networks, boosted regression, or kernel methods.
> Question: Why is this non-parametric?
It's hard to answer precisely; one definition is in terms of the growth rate of the complexity of the function class $F$.
The induced constraint set is
$$F(x_1^n) = H = \left\{ \theta \in \mathbb{R}^n \ \middle|\ \theta_i = \frac{f(x_i)}{\sqrt{n}}, \ i=1, ..., n, \ \text{for some } f\in F \right\}$$
for $x_1^n = \{x_i\}_{i=1}^n$.
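A minimal fixed-design sketch (the function class $F$, here cubic polynomials, and the choice of $f^*$ are illustrative assumptions; for a linear function class, constrained LS over $F(x_1^n)$ reduces to an orthogonal projection):

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 200, 0.3
x = np.linspace(0, 1, n)                          # fixed (non-random) design points

def f_star(t):
    return np.sin(2 * np.pi * t)                  # unknown regression function f*

theta_star = f_star(x) / np.sqrt(n)               # theta*_i = f*(x_i) / sqrt(n)
Y = theta_star + sigma / np.sqrt(n) * rng.standard_normal(n)

# For F = {cubic polynomials}, F(x_1^n) is the span of the (scaled) Vandermonde
# columns, and constrained LS over this linear class is an orthogonal projection.
B = np.vander(x, 4, increasing=True) / np.sqrt(n)
Q, _ = np.linalg.qr(B)
theta_hat = Q @ (Q.T @ Y)

print("||theta_hat - theta*||_2 =", np.linalg.norm(theta_hat - theta_star))
```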