# 02/21 Lecture 5: Constrained least-squares, examples and analysis
###### tags: `Yan`
[toc]
## 1. Gaussian Lipschitz Concentration (Theorem 2.26)
**Definition (L-Lipschitz)**: a function $f:\mathbb{R}^n \rightarrow \mathbb{R}$ is $L$-Lipschitz with respect to the Euclidean norm $||\cdot||_2$ if
$$|f(x)-f(y)| \leq L||x-y||_2 \ \ \ \forall x, y \in \mathbb{R}^n$$
In addition, for a differentiable function, the Lipschitz property guarantees that $||\nabla f(x)||_2 \leq L \ \ \forall x \in \mathbb{R}^n$.
We would then like to prove that:
Let $X \sim N(0, I_{n \times n})$. The variable $f(X)-\mathbb{E}[f(X)]$ is sub-Gaussian with parameter at most $L$, and hence
$$\mathbb{P}[|f(X)- \mathbb{E}[f(X)]|\geq t] \leq 2e^{-\frac{t^2}{2L^2}} \ \ \ \forall t \geq 0$$
> This result guarantees that any $L$-Lipschitz function of a standard Gaussian random vector, regardless of the dimension, exhibits concentration like a scalar Gaussian variable with variance $L^2$.
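To see the dimension-free nature of this bound, here is a minimal simulation sketch (not from the text): $f(x) = ||x||_2$ is $1$-Lipschitz, so the fluctuations of $f(X)$ around its mean should stay $O(1)$ no matter how large $n$ is. The sample sizes and dimensions below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (10, 100, 1000):
    X = rng.standard_normal((10_000, n))   # 10,000 draws of X ~ N(0, I_n)
    f = np.linalg.norm(X, axis=1)          # f(X) = ||X||_2 is 1-Lipschitz
    dev = f - f.mean()                     # f(X) - E[f(X)] (empirical mean)
    # the spread of dev should stay O(1) (sub-Gaussian with parameter <= 1)
    print(f"n = {n:5d}   std(dev) = {dev.std():.3f}   P(|dev| >= 2) = {(np.abs(dev) >= 2).mean():.4f}")
```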
-------------------------------------------------------------
**Lemma**: Suppose $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is differentiable. Then for any convex function $\phi: \mathbb{R}\rightarrow \mathbb{R}$, we have
$$\mathbb{E}[\phi(f(X)-\mathbb{E}[f(X)])] \leq \mathbb{E}_{X, Y}\left[\phi\left(\tfrac{\pi}{2} \langle\nabla f(X), Y\rangle\right)\right]$$
where $X, Y \sim N(0, I_{n \times n})$ are independent.
_Proof Sketch_:
> an interpolation method that exploits the rotation invariance of the Gaussian distribution.
For each $\theta \in [0, \frac \pi 2]$, consider the random vector $Z(\theta) \in \mathbb{R}^n$
$$Z_k(\theta) = X_k \sin(\theta)+Y_k \cos(\theta), \ \ k=1, 2, ..., n$$
> $Z(0) = Y, \ Z(\frac \pi 2)= X$
By Jensen's inequality (applying the convexity of $\phi$ to the expectation over $Y$), we have
$$\mathbb{E}_X[\phi(f(X) - \mathbb{E}_Y[f(Y)])] \leq \mathbb{E}_{X, Y}[\phi(f(X)- f(Y))]$$
Since the $X_k, Y_k \sim N(0, 1)$ are i.i.d., each $Z_k(\theta)$ is also standard Gaussian: its variance is $\sin^2\theta+\cos^2 \theta=1$.
The elementwise derivative is
$$Z'_k(\theta) = X_k\cos(\theta) - Y_k \sin(\theta)$$
For each $k$, $Z_k(\theta)$ and $Z'_k(\theta)$ are uncorrelated, and uncorrelated jointly Gaussian variables are independent.
Indeed, a direct calculation gives
$$\mathrm{Cov}(Z_k(\theta), Z'_k(\theta)) = \mathbb{E}[(X_k\sin\theta + Y_k\cos\theta)(X_k\cos\theta - Y_k\sin\theta)] = \sin\theta\cos\theta\left(\mathbb{E}[X_k^2] - \mathbb{E}[Y_k^2]\right) + (\cos^2\theta - \sin^2\theta)\,\mathbb{E}[X_kY_k] = 0$$
Let $\psi(\theta) = f(Z(\theta))$, so that $\psi'(\theta) = \langle \nabla f(Z(\theta)), Z'(\theta)\rangle$. Then
$$f(X)-f(Y) = f(Z(\tfrac \pi 2))-f(Z(0)) = \int_{0}^{\frac \pi 2} \psi'(\theta) d\theta$$
Applying Jensen's inequality to the average $\frac{2}{\pi}\int_0^{\pi/2}(\cdot)\,d\theta$ and using that $(Z(\theta), Z'(\theta))$ has the same distribution as $(X, Y)$ for every $\theta$ yields
$$\mathbb{E}_{X,Y}[\phi(f(X)-f(Y))] \leq \frac{2}{\pi}\int_0^{\frac \pi 2}\mathbb{E}\left[\phi\left(\tfrac{\pi}{2}\langle\nabla f(Z(\theta)), Z'(\theta)\rangle\right)\right]d\theta = \mathbb{E}_{X,Y}\left[\phi\left(\tfrac{\pi}{2}\langle\nabla f(X), Y\rangle\right)\right]$$
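As a quick numerical sanity check of the interpolation step (illustrative only, not part of the proof), one can verify that for a fixed $\theta$ the pair $(Z_k(\theta), Z'_k(\theta))$ behaves like two independent standard Gaussians:

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 0.7                                   # any fixed angle in [0, pi/2]
X = rng.standard_normal(1_000_000)
Y = rng.standard_normal(1_000_000)
Z  = X * np.sin(theta) + Y * np.cos(theta)    # Z_k(theta)
Zp = X * np.cos(theta) - Y * np.sin(theta)    # Z'_k(theta)
# both should have variance ~ 1 and covariance ~ 0
print("Var(Z) =", round(Z.var(), 3), " Var(Z') =", round(Zp.var(), 3),
      " Cov(Z, Z') =", round(np.cov(Z, Zp)[0, 1], 4))
```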
> Martin: Extra randomness is your friend. Like Gaussian and Rademacher.
-------------------------------------------------------------
We now use the lemma to prove sub-Gaussianity.
Applying the lemma with the convex function $t \mapsto e^{\lambda t}$ yields
$$\mathbb{E}[e^{\lambda (f(X)-\mathbb{E}[f(X)])}] \leq \mathbb{E}_{X, Y}\left[e^{\lambda\frac \pi 2 \langle\nabla f(X), Y\rangle}\right]$$
With $X$ fixed, $\langle\nabla f(X), Y\rangle \sim N(0, ||\nabla f(X)||_2^2)$. Then, using the MGF of a Gaussian variable,
$$\mathbb{E}_{X, Y}\left[e^{\lambda\frac \pi 2 \langle\nabla f(X), Y\rangle}\right] = \mathbb{E}_{X}\left[e^{\frac{\lambda^2\pi^2}{8} ||\nabla f(X)||_2^2}\right] \leq e^{\frac{\lambda^2\pi^2 L^2}{8}}$$
The last inequality holds because the gradient norm is bounded by $L$.
Therefore $f(X)-\mathbb{E}[f(X)]$ is sub-Gaussian with parameter at most $\frac{\pi L}{2}$. This already gives dimension-free concentration of the stated form, with the slightly worse constant $\frac{\pi}{2}L$ in place of $L$; a sharper argument is needed to obtain the parameter $L$ claimed in the theorem.
> Martin: Proofs are like broccoli. Tastes bad but useful.
----
### Example: Gaussian Complexity
Let $A \subset \mathbb{R}^n$ be a compact collection of vectors. Define the random variable
$$z(A) := \sup_{a \in A} \sum_{i=1}^n a_iW_i = \sup_{a \in A} \langle a, W\rangle, \ \ W_i \overset{i.i.d.}{\sim} N(0,1)$$
Define a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ by
$$f: (w_1, ..., w_n) \mapsto \sup_{a \in A} \langle a, w\rangle$$
One can show (exercise) that
$f$ is $L$-Lipschitz with $L=\sup_{a\in A} ||a||_2$.
> Concentration, Lugosi et al.
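Below is a small simulation sketch of this example (the finite set $A$ and all sizes are illustrative choices, not from the lecture): since $z(A)$ is an $L$-Lipschitz function of $W$ with $L = \sup_{a\in A}||a||_2$, its deviations should obey the sub-Gaussian tail bound above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 50, 200                                   # dimension and |A|
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)    # normalize rows so L = sup_a ||a||_2 = 1

W = rng.standard_normal((20_000, n))             # 20,000 draws of W ~ N(0, I_n)
z = (W @ A.T).max(axis=1)                        # z(A) = sup_{a in A} <a, W> per draw
dev = z - z.mean()

t = 1.5
empirical = (np.abs(dev) >= t).mean()
bound = 2 * np.exp(-t**2 / 2)                    # 2 exp(-t^2 / (2 L^2)) with L = 1
print(f"P(|z(A) - E z(A)| >= {t}) = {empirical:.4f}   (sub-Gaussian bound: {bound:.4f})")
```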
## 2. Constrained least-squares
> Note: This part lists a few motivating examples of constrained least-squares problems; the main results will be presented in later lectures.
Let $\theta^* \in \mathbb{R}^d$ be an unknown vector, referred to as the regression vector. Suppose we observe $Y = (y_1, ..., y_d)^T = \theta^* + \frac{\sigma}{\sqrt{n}}(w_1, ..., w_d)^T$, where $(w_1, ..., w_d)^T$ is the normalized noise with the $w_i$ i.i.d., $\mathbb{E}[w_i]=0$, $\mathrm{Var}(w_i)=1$.
The task is to estimate $\theta^*$ under the constraint $\theta^* \in H \subseteq \mathbb{R}^d$, where $H$ is some constraint set.
----
Example: (Gaussian sequence model) Observe i.i.d. samples $Z_i \in \mathbb{R}^d, i=1,...,n$, with mean $\theta^*$ and $\mathrm{cov}(Z_i) = \sigma^2I_{d\times d}$; we want to estimate $\theta^* \in \mathbb{R}^d$.
The sample mean is
$$Y = \frac{1}{n} \sum_{i=1}^n Z_i = \theta^* + \frac{\sigma}{\sqrt{n}} w, \quad w= \frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n(Z_i-\theta^*) \in \mathbb{R}^d$$
If the $Z_i$ are Gaussian, then $w \sim N(0, I_{d \times d})$, i.e., each coordinate $w_j \sim N(0,1)$; this is the Gaussian sequence model:
$$Y \sim N\left(\theta^*, \frac{\sigma^2}{n}I_{d \times d}\right)$$
The constrained least-squares estimator is $\hat{\theta} = \arg\min_{\theta \in H} ||Y-\theta||_2^2$.
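A minimal simulation sketch of this example (the dimensions, noise level, and the choice of constraint set $H$ as an $\ell_2$-ball of radius $R$ are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, sigma, R = 20, 50, 1.0, 2.0

theta_star = rng.standard_normal(d)
theta_star *= R / np.linalg.norm(theta_star)           # keep theta* inside H

Z = theta_star + sigma * rng.standard_normal((n, d))   # Z_i ~ N(theta*, sigma^2 I_d)
Y = Z.mean(axis=0)                                      # Y ~ N(theta*, sigma^2/n I_d)

# Constrained LS over H = {theta : ||theta||_2 <= R} is the projection of Y onto the ball.
theta_hat = Y * min(1.0, R / np.linalg.norm(Y))

print("||Y - theta*||_2         =", np.linalg.norm(Y - theta_star))
print("||theta_hat - theta*||_2 =", np.linalg.norm(theta_hat - theta_star))
```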
----
Example: (Linear Regression)
Given $X \in \mathbb{R}^{n \times p}$ with rows $x_i^T$ and responses $(y_1, ..., y_n)^T \in \mathbb{R}^n$,
$$(y_1, ..., y_n)^T = (x_1^T, ..., x_n^T)^T\beta^*+\sigma (w_1, ..., w_n)^T, \quad \beta^* \in \mathbb{R}^p$$
We convert this to the definition above by rescaling:
$$(y_1, ..., y_n)^T/\sqrt{n} = (x_1^T, ..., x_n^T)^T\beta^*/\sqrt{n}+\frac{\sigma}{\sqrt{n}} (w_1, ..., w_n)^T$$
Let $Y = (y_1, ..., y_n)^T/\sqrt{n}$ and $\theta^* = (x_1^T, ..., x_n^T)^T\beta^*/\sqrt{n} = X\beta^*/\sqrt{n}$.
Constraint sets (a numerical sketch follows this list):
1. Standard OLS: $H = \{\theta \in \mathbb{R}^n \mid \theta \in \mathrm{Range}(X)\}$
> What is the range set of $X$? It is the column space of $X$, a subspace of dimension $\mathrm{Rank}(X)$.
2. Ridge OLS: $H = \{\theta = X\beta/\sqrt{n} \mid \beta \in \mathbb{R}^p, ||\beta||_2 \leq R\}$.
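Here is a minimal numerical sketch of both constraint sets (the dimensions, noise level, and radius $R$ are hypothetical choices; the ridge-constrained case is solved here by projected gradient descent, one of several possible approaches):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, R = 100, 10, 0.5, 1.0

X = rng.standard_normal((n, p))
beta_star = rng.standard_normal(p)
beta_star *= R / np.linalg.norm(beta_star)        # make the true beta feasible
theta_star = X @ beta_star / np.sqrt(n)           # theta* = X beta* / sqrt(n)
Y = theta_star + sigma / np.sqrt(n) * rng.standard_normal(n)

# 1. Standard OLS: H = Range(X), so theta_hat is the orthogonal projection
#    of Y onto the column space of X.
Q, _ = np.linalg.qr(X)
theta_ols = Q @ (Q.T @ Y)

# 2. Ridge OLS: minimize ||Y - X beta / sqrt(n)||_2^2 subject to ||beta||_2 <= R,
#    via projected gradient descent on beta.
A = X / np.sqrt(n)
step = 1.0 / np.linalg.norm(A, 2) ** 2            # 1 / lambda_max(A^T A)
beta = np.zeros(p)
for _ in range(2000):
    beta = beta - step * (A.T @ (A @ beta - Y))   # gradient step for 0.5 * ||Y - A beta||^2
    nrm = np.linalg.norm(beta)
    if nrm > R:                                   # project back onto the l2-ball
        beta *= R / nrm
theta_ridge = A @ beta

print("OLS   error ||theta_hat - theta*||_2 :", np.linalg.norm(theta_ols - theta_star))
print("Ridge error ||theta_hat - theta*||_2 :", np.linalg.norm(theta_ridge - theta_star))
```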
Example (Fixed design non-parametric regression)
$$\tilde{y_i}/\sqrt{n} = f^*(x_i)/\sqrt{n} + \frac{\sigma}{\sqrt{n}} w_i, \ \ i=1, ..., n, \qquad f^*: \mathcal{X} \rightarrow \mathbb{R}$$
This fits the model $Y=\theta^* + \frac{\sigma}{\sqrt{n}}w$ with $d=n$ and $\theta_i^* = \frac{f^*(x_i)}{\sqrt{n}}$.
The design points $\{x_i\}_{i=1}^n$ are fixed (not random); only the $w_i$ are random.
Here $f^* \in F$, where $F$ could be any function class, e.g., neural networks, boosted regression, or kernel methods.
> Question: Why is this non-parametric?
It's hard to answer precisely; one definition is in terms of the growth rate of the complexity of the function class $F$.
The induced constraint set is
$$F(x_1^n) = H = \left\{ \theta \in \mathbb{R}^n \ \middle|\ \theta_i = \frac{f(x_i)}{\sqrt{n}}, \ i=1, ..., n, \ \text{for some } f\in F \right\}$$
for $x_1^n = \{x_i\}_{i=1}^n$.
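A minimal fixed-design sketch (the function class $F$, here cubic polynomials, and the choice of $f^*$ are illustrative assumptions; for a linear function class, constrained LS over $F(x_1^n)$ reduces to an orthogonal projection):

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 200, 0.3
x = np.linspace(0, 1, n)                          # fixed (non-random) design points

def f_star(t):
    return np.sin(2 * np.pi * t)                  # unknown regression function f*

theta_star = f_star(x) / np.sqrt(n)               # theta*_i = f*(x_i) / sqrt(n)
Y = theta_star + sigma / np.sqrt(n) * rng.standard_normal(n)

# For F = {cubic polynomials}, F(x_1^n) is the span of the (scaled) Vandermonde
# columns, and constrained LS over this linear class is an orthogonal projection.
B = np.vander(x, 4, increasing=True) / np.sqrt(n)
Q, _ = np.linalg.qr(B)
theta_hat = Q @ (Q.T @ Y)

print("||theta_hat - theta*||_2 =", np.linalg.norm(theta_hat - theta_star))
```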