# 03/07 Lecture 8: Noise Complexity: Adaptivity, Sparsity And Ellipses

###### tags: `Yan`

Outline:
1. Adaptation to unknown sparsity with $\ell_1$-relaxation.
2. Sparse linear regression and restricted eigenvalues.

## Adaptation in sparse sequence estimation

Given
$$y = \theta^* + \frac{\sigma}{\sqrt{n}} w, \qquad \theta^* \in \mathbb{R}^d,$$
we can compute $\hat{\theta} = \operatorname*{argmin}_{\|\theta\|_1 \leq R} \|y - \theta\|_2$.

**Prop:** If $\|\theta^*\|_1 = R$ (cheating, because the radius shouldn't be determined by the unknown $\theta^*$) and $\|\theta^*\|_0 = k$, then
$$\mathbb{E}\big[\|\hat{\theta} - \theta^*\|_2^2\big] \leq c_1 \sigma^2 \frac{k \log d}{n}.$$

**Proof:** Write $\hat{\Delta} = \hat{\theta} - \theta^*$ and claim that $\|\hat{\Delta}\|_1 \leq 2\sqrt{k}\,\|\hat{\Delta}\|_2$. Then we have
$$\frac{\sigma}{\sqrt{n}} \langle w, \hat{\Delta}\rangle \leq \frac{\sigma}{\sqrt{n}} \|w\|_\infty \|\hat{\Delta}\|_1 \leq \frac{\sigma}{\sqrt{n}} \|w\|_\infty \, 2\sqrt{k}\,\|\hat{\Delta}\|_2.$$
Plugging this into the basic inequality and rearranging gives
$$\mathbb{E}\|\hat{\Delta}\|_2^2 \leq c\, \sigma^2 \frac{k}{n}\, \mathbb{E}\|w\|_\infty^2,$$
and $\mathbb{E}\|w\|_\infty^2 \lesssim \log d$ (e.g. for standard Gaussian $w$) yields the stated rate.

**Proof of Claim:** Define $S = \{j : \theta_j^* \neq 0\}$, where $|S| = k$ by assumption, so that
$$\|\theta^*\|_1 = \sum_{j \in S} |\theta^*_j| =: \|\theta^*_S\|_1.$$
Consider
$$\|\theta^* + \hat{\Delta}\|_1 = \sum_{j \in S} |\theta^*_j + \hat{\Delta}_j| + \sum_{j \in S^c} |\theta^*_j + \hat{\Delta}_j| = \|\theta^*_S + \hat{\Delta}_S\|_1 + \|\hat{\Delta}_{S^c}\|_1 = \|\hat{\theta}\|_1.$$
By the cheat we can assert that $\|\hat{\theta}\|_1 \leq R = \|\theta^*\|_1$; the inequality holds because $\hat{\theta}$ is feasible. Then
$$\|\theta^*_S + \hat{\Delta}_S\|_1 + \|\hat{\Delta}_{S^c}\|_1 \leq \|\theta_S^*\|_1.$$
By the reverse triangle inequality,
$$\|\theta_S^*\|_1 - \|\hat{\Delta}_S\|_1 + \|\hat{\Delta}_{S^c}\|_1 \leq \|\theta^*_S + \hat{\Delta}_S\|_1 + \|\hat{\Delta}_{S^c}\|_1 \leq \|\theta_S^*\|_1.$$
Then
$$\|\hat{\Delta}_{S^c}\|_1 \leq \|\hat{\Delta}_S\|_1,$$
which is a very strong constraint when $S$ is small. We can finish up with $\|\hat{\Delta}\|_1 = \|\hat{\Delta}_S\|_1 + \|\hat{\Delta}_{S^c}\|_1$, so that
$$\|\hat{\Delta}\|_1 \leq 2\|\hat{\Delta}_S\|_1 \leq 2\sqrt{k}\,\|\hat{\Delta}_S\|_2 \leq 2\sqrt{k}\,\|\hat{\Delta}\|_2.$$
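The constrained estimator above is nothing more than the Euclidean projection of $y$ onto the $\ell_1$-ball of radius $R$, so the proposition is easy to check in simulation. Below is a minimal numpy sketch (not from the lecture): `project_l1_ball` is the standard sorting-based projection routine, and the problem sizes `d`, `k`, `n`, `sigma` are made-up values chosen only to eyeball the simulated risk against the $\sigma^2 k \log d / n$ rate.

```python
import numpy as np

def project_l1_ball(v, R):
    """Euclidean projection of v onto {x : ||x||_1 <= R} (standard sorting-based algorithm)."""
    u = np.abs(v)
    if u.sum() <= R:
        return v.copy()
    s = np.sort(u)[::-1]                      # sorted magnitudes, descending
    cssv = np.cumsum(s)
    ks = np.arange(1, len(u) + 1)
    rho = np.nonzero(s - (cssv - R) / ks > 0)[0][-1]
    tau = (cssv[rho] - R) / (rho + 1.0)       # soft-threshold level
    return np.sign(v) * np.maximum(u - tau, 0.0)

rng = np.random.default_rng(0)
d, k, n, sigma = 1000, 10, 100, 1.0           # made-up problem sizes, not from the lecture

theta_star = np.zeros(d)
theta_star[:k] = 1.0                          # k-sparse truth
R = np.abs(theta_star).sum()                  # radius chosen using the unknown theta* (the "cheat")

errs = []
for _ in range(200):
    y = theta_star + sigma / np.sqrt(n) * rng.standard_normal(d)
    theta_hat = project_l1_ball(y, R)         # argmin_{||theta||_1 <= R} ||y - theta||_2
    errs.append(np.sum((theta_hat - theta_star) ** 2))

print("simulated risk     :", np.mean(errs))
print("sigma^2 k log d / n:", sigma**2 * k * np.log(d) / n)
```

The line `R = np.abs(theta_star).sum()` is exactly the "cheat" in the proposition: the radius is set using the unknown $\theta^*$.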
## Lagrangian/regularized estimator

$$\hat{\theta}_\lambda = \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \Big\{ \tfrac{1}{2} \|y - \theta\|_2^2 + \lambda \|\theta\|_1 \Big\} = \operatorname*{argmin}_{\theta \in \mathbb{R}^d} f(\theta)$$

> There is a one-to-one mapping between the constrained problem and the regularized problem.

**Prop:** If $\lambda \geq 2\frac{\sigma}{\sqrt{n}} \|w\|_\infty$, then $\|\hat{\Delta}_\lambda\|_1 \leq c \sqrt{k}\, \|\hat{\Delta}_\lambda\|_2$ (with $c \approx 3$?).

> Martin has PTSD about constants now after HW1, and he removed all of the constants to just be $c$ in HW2.

Then, as in the constrained case, $\mathbb{E}\|\hat{\Delta}_\lambda\|_2^2 \leq c' \frac{\sigma^2}{n} k \log d$.

**Proof:** Write $\hat{\Delta} = \hat{\theta}_\lambda - \theta^*$. The basic inequality says
$$f(\hat{\theta}_\lambda) \leq f(\theta^*).$$
Unpacking this,
$$\frac{1}{2} \|y - (\theta^* + \hat{\Delta})\|_2^2 + \lambda \|\theta^* + \hat{\Delta}\|_1 \leq \frac{1}{2} \|y - \theta^*\|_2^2 + \lambda \|\theta^*\|_1.$$
Then
$$0 \leq \frac{1}{2} \|\hat{\Delta}\|_2^2 \leq \frac{\sigma}{\sqrt{n}} \langle w, \hat{\Delta}\rangle + \lambda \big(\|\theta^*\|_1 - \|\theta^* + \hat{\Delta}\|_1\big) \leq \frac{\sigma}{\sqrt{n}} \|w\|_\infty \|\hat{\Delta}\|_1 + \lambda \big(\|\theta^*\|_1 - \|\theta^* + \hat{\Delta}\|_1\big).$$
By the previous proof, $\|\theta^*\|_1 = \|\theta_S^*\|_1$, so
$$\|\theta^*\|_1 - \|\theta^* + \hat{\Delta}\|_1 = \|\theta_S^*\|_1 - \big(\|\theta_S^* + \hat{\Delta}_S\|_1 + \|\hat{\Delta}_{S^c}\|_1\big) \leq \|\hat{\Delta}_S\|_1 - \|\hat{\Delta}_{S^c}\|_1.$$
Using this together with $\frac{\sigma}{\sqrt{n}} \|w\|_\infty \leq \frac{\lambda}{2}$,
$$\frac{\sigma}{\sqrt{n}} \|w\|_\infty \|\hat{\Delta}\|_1 + \lambda \big(\|\theta^*\|_1 - \|\theta^* + \hat{\Delta}\|_1\big) \leq \frac{\lambda}{2} \big(\|\hat{\Delta}_S\|_1 + \|\hat{\Delta}_{S^c}\|_1\big) + \lambda \big(\|\hat{\Delta}_S\|_1 - \|\hat{\Delta}_{S^c}\|_1\big).$$
Putting everything together,
$$0 \leq \frac{3}{2} \lambda \|\hat{\Delta}_S\|_1 - \frac{\lambda}{2} \|\hat{\Delta}_{S^c}\|_1, \qquad \text{so} \qquad \|\hat{\Delta}_{S^c}\|_1 \leq 3 \|\hat{\Delta}_S\|_1$$
for $\lambda > 0$.

> How does this serve as a conclusion? Need to read more.

## Sparse linear regression

Observe $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ is the covariate and $y_i \in \mathbb{R}$ is the response. We want to predict $y$ with a linear function $x \mapsto \langle \theta, x\rangle = \sum_{j=1}^d \theta_j x_j$.

> Maybe there is a sparse $\theta$ with $\|\theta\|_0 = k \ll d$ that works very well. Almost a needle-in-a-haystack problem.

The model is $y_i = \langle x_i, \theta^*\rangle + \sigma w_i$ with $\mathbb{E}[w_i] = 0$ and $\operatorname{Var}(w_i) = 1$, so $\mathbb{E}[y \mid x] = \langle \theta^*, x\rangle$. In matrix form, $(y_1, \ldots, y_n)^T = X \theta^* + \sigma w$ with $X = (x_1, \ldots, x_n)^T$. The results to date would allow us to bound the error in two senses:

1. Prediction error: $\frac{1}{n} \|X(\hat{\theta} - \theta^*)\|_2^2 = \frac{1}{n} \sum_{i=1}^n \big(\langle x_i, \hat{\theta}\rangle - \langle x_i, \theta^*\rangle\big)^2$.
2. Parameter error: $\|\hat{\theta} - \theta^*\|_2^2 = \sum_{j=1}^d (\hat{\theta}_j - \theta^*_j)^2$.

From 1., $\frac{1}{n} \|X(\hat{\theta} - \theta^*)\|_2^2 = (\hat{\theta} - \theta^*)^T \hat{\Sigma}_n (\hat{\theta} - \theta^*)$, where $\hat{\Sigma}_n = \frac{1}{n} \sum_{i=1}^n x_i x_i^T$ is the empirical covariance.

From 2., $\|\hat{\theta} - \theta^*\|_2^2 = (\hat{\theta} - \theta^*)^T I (\hat{\theta} - \theta^*)$.
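Returning to the Lagrangian/regularized estimator in the sequence model: because the objective decouples across coordinates, $\hat{\theta}_\lambda$ is obtained by soft-thresholding $y$ at level $\lambda$. Here is a minimal numpy sketch (again not from the lecture, with made-up problem sizes); since the simulation generates $w$ itself, we can set $\lambda$ to exactly $2\frac{\sigma}{\sqrt{n}} \|w\|_\infty$, the threshold required by the proposition.

```python
import numpy as np

def soft_threshold(y, lam):
    """Closed form of argmin_theta 0.5 * ||y - theta||_2^2 + lam * ||theta||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

rng = np.random.default_rng(1)
d, k, n, sigma = 1000, 10, 100, 1.0           # made-up problem sizes, not from the lecture

theta_star = np.zeros(d)
theta_star[:k] = 1.0                          # k-sparse truth

errs, supports = [], []
for _ in range(200):
    w = rng.standard_normal(d)
    y = theta_star + sigma / np.sqrt(n) * w
    lam = 2 * sigma / np.sqrt(n) * np.abs(w).max()   # oracle choice: lam = 2*(sigma/sqrt(n))*||w||_inf
    theta_hat = soft_threshold(y, lam)               # the Lagrangian/regularized estimator
    errs.append(np.sum((theta_hat - theta_star) ** 2))
    supports.append(np.count_nonzero(theta_hat))

print("simulated risk     :", np.mean(errs))
print("sigma^2 k log d / n:", sigma**2 * k * np.log(d) / n)
print("avg support size   :", np.mean(supports))
```

In practice $w$ is not observed, so for Gaussian noise one would typically replace $\|w\|_\infty$ with a high-probability bound of order $\sqrt{2 \log d}$.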