# 02/28: Lecture 6: Analysis of Least Squares
###### tags: `Yan`
[toc]

1. Analysis framework for constrained least squares.
2. Examples (related to Gaussian complexity).

## Recap

Vector $(y_1, \ldots, y_d)^T = y = \theta^* + \frac{\sigma}{\sqrt{n}} (w_1, \ldots, w_d)^T$, with $\theta^* \in H \subseteq \mathbb{R}^d$, $E[w] = 0$, $\mathrm{Cov}(w) = I_{d\times d}$.

Special case: $w \sim N(0, I_{d\times d})$.

Notation:
- $d$ = dimension
- $n$ = sample size
- $\sigma$ = noise standard deviation

Constrained least-squares estimator: $\hat{\theta} \in \arg\min_{\theta \in H} \|y - \theta\|^2_2$. If the set $H$ is closed and convex, this minimizer always exists.

> Martin drew a picture illustrating projection onto $H$.

$H$ convex $\Rightarrow \hat{\theta} = \Pi_H(y)$, where $\Pi_H : \mathbb{R}^d \to H$ is the Euclidean projection onto $H$: $\Pi_H(u) = \arg\min_{\theta \in H} \|u - \theta\|^2_2$.

> What about the noise? All variances are equal: homoskedasticity.

**M-estimator**: $\hat{\theta} \in \arg\min_{\theta \in H} L(\theta; y)$.

**Z-estimator**: find $\hat{\theta}$ by solving a fixed-point equation, e.g. $\nabla L(\theta; y) = 0$ when $L$ is convex and differentiable. In general, solve $\psi(\theta; y) = 0$ for some $\psi: H \rightarrow \mathbb{R}^d$.

## Analysis

**Proposition (Basic Inequality):** For any compact $H$ and any minimizer $\hat{\theta}$, the error $\hat{\Delta} = \hat{\theta} - \theta^*$ satisfies
$$\frac{1}{2} \|\hat{\Delta}\|^2_2 \leq \frac{\sigma}{\sqrt{n}} \sum_{j=1}^d w_j \hat{\Delta}_j.$$
If $H$ is convex, the same bound holds without the factor $1/2$. In inner-product notation, $\langle w, \hat{\Delta} \rangle = \sum_{j=1}^d w_j \hat{\Delta}_j$.

-------------------------------------

Example: $H = \mathbb{R}^d$, no constraint. $H$ is convex, so
$$\|\hat{\Delta}\|_2^2 \leq \frac{\sigma}{\sqrt{n}} \langle w, \hat{\Delta} \rangle.$$

> **Is it true that $\langle w, u \rangle \sim N(0, \|u\|_2^2)$ for $w \sim N(0, I_{d\times d})$?** Only for fixed $u$. Here $\hat{\Delta}$ is not fixed: it is a function of the noise $w$.

By Cauchy–Schwarz, $\langle w, \hat{\Delta} \rangle \leq \|w\|_2 \|\hat{\Delta}\|_2$, so $\|\hat{\Delta}\|_2^2 \leq \frac{\sigma}{\sqrt{n}} \|w\|_2 \|\hat{\Delta}\|_2$. Dividing through by $\|\hat{\Delta}\|_2$,
$$\|\hat{\Delta}\|_2 \leq \frac{\sigma}{\sqrt{n}} \|w\|_2, \qquad \|\hat{\Delta}\|^2_2 \leq \frac{\sigma^2}{n} \|w\|^2_2.$$

In this special case $\hat{\theta} = y$, so $\hat{\Delta} = \hat{\theta} - \theta^* = y - \theta^* = \frac{\sigma}{\sqrt{n}} w$, and
$$E[\|\hat{\theta} - \theta^*\|_2^2] = \frac{\sigma^2}{n} E\Big[\sum_{j=1}^d w_j^2\Big] = \frac{\sigma^2 d}{n}.$$

> Tail bound: $P\big(\|\hat{\theta} - \theta^*\|_2^2 \geq \frac{\sigma^2 d}{n} + t\big) \leq\ ?$ Exercise.

-------------------------------------

**Proof of the basic inequality:**

General strategy: find some set $A \subseteq \mathbb{R}^d$ such that $\hat{\Delta} \in A$. Then $\langle w, \hat{\Delta} \rangle \leq \sup_{\Delta \in A} \langle w, \Delta \rangle$.

> Take the worst case over that set.
> The sup is a Gaussian complexity when $w$ is Gaussian.

Proof: Since $\hat{\theta}$ minimizes $\|y - \theta\|_2^2$ over $H$ and $\theta^* \in H$ (well-specified case),
$$\|y - \hat{\theta}\|^2_2 \leq \|y - \theta^*\|^2_2.$$
Known:
- $\hat{\theta} = \theta^* + \hat{\Delta}$
- $y = \theta^* + \frac{\sigma}{\sqrt{n}} w$

Then
$$\Big\|\big(\theta^* + \tfrac{\sigma}{\sqrt{n}} w\big) - (\theta^* + \hat{\Delta})\Big\|^2_2 \leq \Big\|\tfrac{\sigma}{\sqrt{n}} w\Big\|^2_2.$$
Expanding the square on the left gives $\frac{\sigma^2}{n}\|w\|_2^2 - \frac{2\sigma}{\sqrt{n}} \langle w, \hat{\Delta} \rangle + \|\hat{\Delta}\|_2^2 \leq \frac{\sigma^2}{n}\|w\|_2^2$, which rearranges to
$$\frac{1}{2} \|\hat{\Delta}\|^2_2 \leq \frac{\sigma}{\sqrt{n}} \sum_{j=1}^d w_j \hat{\Delta}_j.$$
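To make the convex case concrete, here is a small numerical check of the basic inequality (my own sketch, not from the lecture). It takes $H$ to be the Euclidean ball of radius $R$, a closed convex set whose projection has the closed form $\Pi_H(y) = y \cdot \min(1, R/\|y\|_2)$; the dimensions, radius, and seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma, R = 50, 100, 1.0, 2.0             # arbitrary illustrative values

# H = {theta : ||theta||_2 <= R} is closed and convex, so theta_hat = Pi_H(y).
theta_star = rng.normal(size=d)
theta_star *= R / np.linalg.norm(theta_star)   # place theta* on the boundary of H

for trial in range(5):
    w = rng.normal(size=d)                     # noise w ~ N(0, I_d)
    y = theta_star + sigma / np.sqrt(n) * w
    theta_hat = y * min(1.0, R / np.linalg.norm(y))  # projection onto the l2 ball
    delta = theta_hat - theta_star
    lhs = np.linalg.norm(delta) ** 2           # convex H: no factor 1/2
    rhs = sigma / np.sqrt(n) * (w @ delta)
    print(f"trial {trial}: ||Delta||_2^2 = {lhs:.5f} <= (sigma/sqrt(n))<w,Delta> = {rhs:.5f}")
```

Every trial should print a left-hand side no larger than the right-hand side, matching the convex-case proposition.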
----

Example: $H = B_1(R) = \{\theta \in \mathbb{R}^d : \|\theta\|_1 \leq R\}$, where $\|\theta\|_1 = \sum_{j=1}^d |\theta_j|$. A picture: in 2-d, the $\ell_1$ ball $\{|\theta_1| + |\theta_2| \leq 1\}$ is a diamond.

Since $\|\hat{\theta}\|_1 \leq R$ and $\|\theta^*\|_1 \leq R$, the triangle inequality gives $\|\hat{\Delta}\|_1 \leq 2R$ for $\hat{\Delta} = \hat{\theta} - \theta^*$. Hence
$$\langle w, \hat{\Delta} \rangle \leq \sup_{\|\Delta\|_1 \leq 2R} \langle \Delta, w \rangle \leq \sup_{\|\Delta\|_1 \leq 2R} \|\Delta\|_1 \|w\|_\infty = 2R \|w\|_\infty$$
(by Hölder's inequality with the pair $(1, \infty)$). Combined with our proposition,
$$\|\hat{\theta} - \theta^*\|_2^2 \leq \frac{2R\sigma}{\sqrt{n}} \|w\|_\infty.$$
So
$$E[\|\hat{\theta} - \theta^*\|_2^2] \leq \frac{2R\sigma}{\sqrt{n}} E\|w\|_\infty \leq \frac{2R\sigma}{\sqrt{n}} \sqrt{2 \log d}.$$
The last inequality holds if $w$ is Gaussian.

----

Example: (weak $\ell_q$-sparsity)
$$B_0(R) = \Big\{\theta \in \mathbb{R}^d : \|\theta\|_0 = \sum_{j=1}^d I(\theta_j \neq 0) \leq R\Big\}$$
is the "hard-sparsity" model. For $q \in [0, 1]$,
$$B_q(R) = \Big\{\theta \in \mathbb{R}^d : \sum_{j=1}^d |\theta_j|^q \leq R\Big\}.$$
$q = 0$ recovers $B_0$; $q = 1$ gives $B_1$.

> What can be said about constrained least squares over $B_q(R)$?

$$E[\|\hat{\theta} - \theta^*\|_2^2] \leq c\, R \Big(\frac{\sigma^2 \log d}{n}\Big)^{1 - \frac{q}{2}}, \quad q \in [0, 1].$$
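As a quick numerical illustration (my addition, not part of the lecture), the following Monte Carlo sketch checks the Gaussian maxima bound $E\|w\|_\infty \leq \sqrt{2 \log d}$ that drives the $\log d$ factors in the rates above; the number of draws and the dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo estimate of E||w||_inf for w ~ N(0, I_d), compared against
# the bound sqrt(2 log d) used in the l1-ball example above.
for d in [10, 100, 1000]:
    w = rng.normal(size=(5000, d))      # 5000 independent draws of w
    emp = np.abs(w).max(axis=1).mean()  # empirical estimate of E||w||_inf
    print(f"d = {d:>4}: E||w||_inf ~ {emp:.3f}   vs   sqrt(2 log d) = {np.sqrt(2 * np.log(d)):.3f}")
```

The empirical values sit below the bound at every $d$, and the gap narrows as $d$ grows, consistent with the bound being asymptotically sharp.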