# 02/28: Lecture 6: Analysis of Least Squares
###### tags: `Yan`
[toc]
1. Analysis framework for constrained least squares.
2. Examples (related to Gaussian complexity).
## Recap
Vector $(y_1,..., y_d)^T = y = \theta^* + \frac{\sigma}{\sqrt{n}} (w_1, ..., w_d)^T$, where $\theta^* \in H \subseteq \mathbb{R}^d$, $E[w]=0$, $\mathrm{Cov}(w) = I_{d\times d}$.
Special case: $w \sim N(0, I_{d\times d})$.
Notations:
- d = dimension
- n = sample size
- $\sigma$ = noise std.
Constrained least-squares estimator: $\hat{\theta} \in \arg\min_{\theta \in H} ||y-\theta||^2_2$.
If the set $H$ is closed, this projection always exists; if $H$ is also convex, it is unique.
> Martin drew a picture illustrating projection onto $H$
$H$ convex $\rightarrow \hat{\theta} = \Pi_H(y)$, where $\Pi_H : \mathbb{R}^d \rightarrow H$ is the Euclidean projection onto $H$: $\Pi_H(u) = \arg\min_{\theta \in H} ||u-\theta||^2_2$.
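As a concrete illustration (not from the lecture), the projection $\Pi_H$ has a simple closed form when $H$ is an $\ell_2$ ball: rescale the point to the boundary if it lies outside. The helper name `project_l2_ball` is mine:

```python
import numpy as np

def project_l2_ball(u, R):
    """Euclidean projection of u onto H = {theta : ||theta||_2 <= R}.

    For this convex set the projection has a closed form:
    scale u back to the boundary whenever it lies outside H.
    """
    norm = np.linalg.norm(u)
    return u if norm <= R else (R / norm) * u

# A point outside the unit ball is scaled onto the boundary.
u = np.array([3.0, 4.0])            # ||u||_2 = 5
theta_hat = project_l2_ball(u, R=1.0)
print(theta_hat)                     # -> [0.6 0.8]
print(np.linalg.norm(theta_hat))     # -> 1.0
```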
> What about the noise? The variances are all equal: homoskedasticity.
**M-estimator**: $\hat{\theta} \in \arg\min_{\theta \in H} L(\theta;y)$.
**Z-estimator**: find $\hat{\theta}$ by solving a fixed-point (estimating) equation, e.g. $\nabla_\theta L(\theta; y) = 0$ when $L$ is convex and differentiable. In general, solve $\psi(\theta; y) = 0$ for some $\psi: H \rightarrow \mathbb{R}^d$.
## Analysis
**Proposition (Basic Inequality):** For any compact $H$ containing $\theta^*$ and any minimizer $\hat{\theta}$, the error $\hat{\Delta} = \hat{\theta} - \theta^*$ satisfies:
$$\frac 1 2 ||\hat{\Delta}||^2_2 \leq \frac{\sigma}{\sqrt{n}} \sum_{j=1}^d w_j\hat{\Delta}_j$$
If $H$ is convex, the same holds without the factor $\frac{1}{2}$.
In inner-product notation, $\langle w, \hat{\Delta}\rangle = \sum_{j=1}^d w_j \hat{\Delta}_j$.
-------------------------------------
Example: $H = \mathbb{R}^d$ (no constraint).
$H$ is convex $\Rightarrow ||\hat{\Delta}||_2^2 \leq \frac{\sigma}{\sqrt{n}} \langle w, \hat{\Delta}\rangle$.
> **Is it true that $\langle w, u\rangle \sim N(0, ||u||_2^2)$ for $w \sim N(0, I_{d\times d})$?**
Only for fixed $u$. Here $\hat{\Delta}$ is not fixed: it is a function of the noise $w$, so this distributional claim does not apply directly.
By Cauchy–Schwarz, $\langle w, \hat{\Delta}\rangle \leq ||w||_2||\hat{\Delta}||_2$, so
$$||\hat{\Delta}||_2^2 \leq \frac{\sigma}{\sqrt{n}} ||w||_2||\hat{\Delta}||_2.$$
Dividing by $||\hat{\Delta}||_2$ and then squaring:
$$||\hat{\Delta}||_2 \leq \frac{\sigma}{\sqrt{n}} ||w||_2, \qquad ||\hat{\Delta}||^2_2 \leq \frac{\sigma^2}{n} ||w||^2_2.$$
In this special case $\hat{\theta} = y$, so $\hat{\Delta} = \hat{\theta} - \theta^* = y -\theta^* = \frac{\sigma}{\sqrt{n}}w$ and the bound is tight:
$$E[||\hat{\theta} - \theta^*||_2^2] = \frac{\sigma^2}{n} E\left[\sum_{j=1}^d w_j^2\right] = \frac{\sigma^2d}{n}.$$
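The $\frac{\sigma^2 d}{n}$ risk in the unconstrained case is easy to verify by simulation. A minimal sketch (my own choices of $d$, $n$, $\sigma$, and the number of trials):

```python
import numpy as np

# Monte Carlo check of E||theta_hat - theta*||_2^2 = sigma^2 * d / n
# in the unconstrained case H = R^d, where theta_hat = y.
rng = np.random.default_rng(0)
d, n, sigma = 50, 100, 2.0
theta_star = rng.normal(size=d)

trials = 20000
w = rng.normal(size=(trials, d))             # w ~ N(0, I_d), one row per trial
y = theta_star + (sigma / np.sqrt(n)) * w    # observation model
err = np.sum((y - theta_star) ** 2, axis=1)  # ||theta_hat - theta*||_2^2

print(err.mean())   # ~ sigma^2 * d / n = 4 * 50 / 100 = 2.0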
-------------------------------------
**Proof for the basic inequality:**
General strategy: find a set $A \subseteq \mathbb{R}^d$ such that $\hat{\Delta} \in A$. Then $\langle w, \hat{\Delta}\rangle \leq \sup_{\Delta \in A}\langle w, \Delta\rangle$.
> i.e. bound by the worst case over that set.
> The sup is the Gaussian complexity of $A$ when $w$ is Gaussian.
Proof: First, by optimality of $\hat{\theta}$, and assuming $\theta^* \in H$ (the well-specified case),
$$||y-\hat{\theta}||^2_2 \leq ||y-\theta^*||^2_2$$
Known:
- $\hat{\theta} = \theta^* + \hat{\Delta}$
- $y = \theta^* + \frac{\sigma}{\sqrt{n}} w$
Then,
$$\left\|\left(\theta^* + \frac{\sigma}{\sqrt{n}} w\right) - (\theta^* + \hat{\Delta})\right\|^2_2 \leq \left\| \frac{\sigma}{\sqrt{n}} w\right\|^2_2$$
Expanding the square on the left, $||\frac{\sigma}{\sqrt{n}} w||^2_2 - \frac{2\sigma}{\sqrt{n}}\langle w, \hat{\Delta}\rangle + ||\hat{\Delta}||^2_2 \leq ||\frac{\sigma}{\sqrt{n}} w||^2_2$, which rearranges to
$$\frac 1 2 ||\hat{\Delta}||^2_2 \leq \frac{\sigma}{\sqrt{n}} \sum_{j=1}^d w_j\hat{\Delta}_j.$$
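The basic inequality can also be checked numerically. A sketch (my own setup, not from the lecture): take $H$ to be an $\ell_2$ ball containing $\theta^*$, so the estimator is the closed-form projection, and test the convex version of the inequality (no factor $\frac{1}{2}$) across many noise draws:

```python
import numpy as np

# Sanity check of the basic inequality for convex H = {||theta||_2 <= R}:
#   ||Delta_hat||_2^2 <= (sigma/sqrt(n)) * <w, Delta_hat>
rng = np.random.default_rng(1)
d, n, sigma, R = 20, 50, 1.0, 1.0

theta_star = rng.normal(size=d)
theta_star *= R / (2 * np.linalg.norm(theta_star))  # put theta* strictly inside H

holds = True
for _ in range(1000):
    w = rng.normal(size=d)
    y = theta_star + (sigma / np.sqrt(n)) * w
    # Euclidean projection onto the l2 ball: rescale y if it falls outside
    norm_y = np.linalg.norm(y)
    theta_hat = y if norm_y <= R else (R / norm_y) * y
    delta = theta_hat - theta_star
    # small tolerance guards against floating-point rounding
    holds &= delta @ delta <= (sigma / np.sqrt(n)) * (w @ delta) + 1e-12

print(holds)  # True
```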
----
Example: $H = B_1(R) = \{\theta \in \mathbb{R}^d \mid ||\theta||_1 \leq R \}$, the $\ell_1$ ball.
> Tail bound (exercise): $P(||\hat{\theta} - \theta^*||_2^2 \geq \frac{\sigma^2 d}{n} + t) \leq ?$
$||\theta||_1 = \sum_{j=1}^d |\theta_j|$.
A picture: the $\ell_1$ ball is a diamond in 2-d: $\{|\theta_1|+|\theta_2| \leq 1\}$.
Since $||\hat{\theta}||_1 \leq R$ and $||\theta^*||_1 \leq R$, the triangle inequality gives $||\hat{\Delta}||_1 \leq 2R$ for $\hat{\Delta} = \hat{\theta} - \theta^*$.
Hence $\langle w, \hat{\Delta}\rangle \leq \sup_{||\Delta ||_1\leq 2R} \langle \Delta, w\rangle \leq \sup_{||\Delta ||_1\leq 2R} ||\Delta||_1 ||w||_\infty = 2R ||w||_\infty$ (by Hölder's inequality with the pair $(1, \infty)$).
Combining with our Prop (convex case):
$$||\hat{\theta} - \theta^*||_2^2 \leq \frac{2R\sigma}{\sqrt{n}}||w||_\infty$$
So, $$E[||\hat{\theta}-\theta^*||_2^2] \leq \frac{2R\sigma}{\sqrt{n}} E[||w||_\infty] \leq \frac{2R\sigma}{\sqrt{n}} \sqrt{2 \log d},$$
where the last inequality holds when $w$ is Gaussian.
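The $\sqrt{2\log d}$ bound on $E[||w||_\infty]$ can be eyeballed empirically. A sketch (my own choice of $d$ and trial count; the bound is a high-dimensional statement, so this is only illustrative):

```python
import numpy as np

# Empirical check that E||w||_inf stays below sqrt(2 log d)
# for w ~ N(0, I_d) when d is moderately large.
rng = np.random.default_rng(2)
d, trials = 1000, 5000
w = rng.normal(size=(trials, d))
max_abs = np.abs(w).max(axis=1)   # ||w||_inf for each trial

print(max_abs.mean())             # empirical E||w||_inf
print(np.sqrt(2 * np.log(d)))     # the bound, ~3.717 for d = 1000
```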
----
Example: (weak $l_q$-sparsity)
$$B_0(R) = \{\theta \in \mathbb{R}^d \mid ||\theta||_0 = \sum_{j=1}^d \mathbb{1}(\theta_j \neq 0) \leq R\}$$
"Hard-Sparsity" model.
For $q \in [0, 1]$:
$$B_q(R) = \{\theta \in \mathbb{R}^d \mid \sum_{j=1}^d |\theta_j|^q \leq R\}$$
With the convention $0^0 = 0$: $q=0$ recovers $B_0$, and $q=1$ recovers $B_1$.
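To make the definition concrete, here is a small membership check for $B_q(R)$; the helper name `in_Bq` is mine, and the $q=0$ branch implements the $0^0 = 0$ convention:

```python
import numpy as np

def in_Bq(theta, q, R):
    """Membership check for the weak-sparsity ball B_q(R), q in [0, 1].

    For q = 0 the convention 0^0 = 0 means the sum counts the nonzero
    coordinates (||theta||_0); for q = 1 it is the l1 norm.
    """
    if q == 0:
        return np.count_nonzero(theta) <= R
    return np.sum(np.abs(theta) ** q) <= R

theta = np.array([0.5, 0.0, -0.25, 0.0])
print(in_Bq(theta, q=0, R=2))    # True: two nonzeros <= 2
print(in_Bq(theta, q=1, R=1))    # True: ||theta||_1 = 0.75 <= 1
print(in_Bq(theta, q=0.5, R=1))  # False: 0.5^0.5 + 0.25^0.5 ~ 1.207 > 1
```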
> What can be said about constrained least-squares over $B_q(R)$?
$$E[||\hat{\theta}-\theta^*||_2^2] \leq c\, R \left(\frac{\sigma^2\log d}{n}\right)^{1-\frac q 2}, \quad q \in [0, 1].$$