# 03/07 Lecture 8: Noise Complexity: Adaptivity, Sparsity And Ellipses
###### tags: `Yan`
Outline:
1. Adaptation to unknown sparsity with $l_1$-relaxation.
2. Sparse linear regression and restricted eigenvalues.
## Adaptation in sparse sequence estimation
Given
$$y = \theta^* + \frac{\sigma}{\sqrt{n}} w, \theta^* \in R^d$$
we can find $\hat{\theta} = \arg\min_{||\theta||_1 \leq R} ||y-\theta||_2$.
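This constrained estimator is just the Euclidean projection of $y$ onto the $l_1$-ball of radius $R$. Below is a minimal numpy sketch (not from the lecture; the function name and the use of the standard sort-and-threshold projection algorithm are my own choices):

```python
import numpy as np

def project_l1_ball(y, R):
    """Euclidean projection of y onto {theta : ||theta||_1 <= R}."""
    if np.abs(y).sum() <= R:
        return y.copy()                               # already feasible
    u = np.sort(np.abs(y))[::-1]                      # magnitudes, decreasing
    css = np.cumsum(u)
    ks = np.arange(1, len(u) + 1)
    rho = np.nonzero(u - (css - R) / ks > 0)[0][-1]   # last coordinate kept
    tau = (css[rho] - R) / (rho + 1)                  # soft-threshold level
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)
```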
**Prop:**
If $||\theta^*||_1 = R$ (a cheat, since in practice the radius $R$ would not be known in advance) and $||\theta^*||_0 = k$,
then $E[||\hat{\theta}-\theta^*||_2^2] \leq c_1 \sigma^2 \frac{k \log d}{n}$.
Proof: Write $\hat{\Delta} = \hat{\theta} - \theta^*$. Claim: $||\hat{\Delta}||_1 \leq 2\sqrt{k}||\hat{\Delta}||_2$.
Then we have
$$\frac{\sigma}{\sqrt{n}} <w, \hat{\Delta}> \leq \frac{\sigma}{\sqrt{n}} ||w||_\infty ||\hat{\Delta}||_1 \leq \frac{\sigma}{\sqrt{n}} ||w||_\infty 2\sqrt{k}||\hat{\Delta}||_2$$
Combining this with the basic inequality $||\hat{\Delta}||_2^2 \leq 2\frac{\sigma}{\sqrt{n}} <w, \hat{\Delta}>$ (which holds because $\hat{\theta}$ minimizes $||y-\theta||_2$ over a set containing $\theta^*$) and rearranging, we have
$$E||\hat{\Delta}||_2^2 \leq c \sigma^2 \frac k n E||w||_\infty^2 \leq c_1 \sigma^2 \frac{k \log d}{n},$$
since $E||w||_\infty^2 \lesssim \log d$ for Gaussian (or sub-Gaussian) $w$.
**Proof of Claim**:
Define $S = \{j : \theta_j^* \neq 0\}$, where $|S|=k$ by assumption.
$$||\theta^*||_1 = \sum_{j \in S} |\theta^*_j| := ||\theta^*_S||_1$$
Consider
$$||\hat{\theta}||_1 = ||\theta^* + \hat{\Delta} ||_1 = \sum_{j \in S} |\theta^*_j + \hat{\Delta}_j| + \sum_{j \in S^c} |\theta^*_j + \hat{\Delta}_j| = ||\theta^*_S + \hat{\Delta}_S||_1 + ||\hat{\Delta}_{S^c}||_1,$$
using $\theta^*_j = 0$ for $j \in S^c$.
Since $\hat{\theta}$ is feasible, $||\hat{\theta}||_1 \leq R$, and by the cheat $R = ||\theta^*||_1$.
Then,
$$||\theta^*_S + \hat{\Delta_S}||_1 + ||\hat{\Delta_{S^c}}||_1 \leq ||\theta_S^*||_1$$
By the reverse triangle inequality,
$$||\theta_S^*||_1 - ||\hat{\Delta}_S||_1 + ||\hat{\Delta}_{S^c}||_1 \leq ||\theta^*_S + \hat{\Delta}_S||_1 + ||\hat{\Delta}_{S^c}||_1 \leq ||\theta_S^*||_1$$
Then,
$$||\hat{\Delta}_{S^c} ||_1 \leq ||\hat{\Delta}_S||_1$$
This is a very strong constraint when $S$ is small.
We finish with $||\hat{\Delta} ||_1 = ||\hat{\Delta}_S||_1 + ||\hat{\Delta}_{S^c}||_1$.
Then,
$$||\hat{\Delta} ||_1 \leq 2 ||\hat{\Delta}_S||_1 \\
\leq 2\sqrt{k} ||\hat{\Delta}_S||_2 \\
\leq 2\sqrt{k} ||\hat{\Delta}||_2,$$
where the middle step is Cauchy–Schwarz ($\hat{\Delta}_S$ has at most $k$ nonzero coordinates).
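As a quick sanity check (not part of the lecture), one can simulate the sequence model with the "cheating" radius $R = ||\theta^*||_1$ and compare the squared error of the projection estimator to the $\sigma^2 k \log d / n$ scale; this reuses `project_l1_ball` from the sketch above, with arbitrary illustrative values of $n, d, k, \sigma$:

```python
rng = np.random.default_rng(0)
n, d, k, sigma = 100, 1000, 10, 1.0           # illustrative sizes

theta_star = np.zeros(d)
theta_star[:k] = 1.0                          # k-sparse truth
y = theta_star + sigma / np.sqrt(n) * rng.standard_normal(d)

theta_hat = project_l1_ball(y, R=np.abs(theta_star).sum())   # cheat: R = ||theta*||_1
print(np.sum((theta_hat - theta_star) ** 2))  # squared l2 error
print(sigma**2 * k * np.log(d) / n)           # the k log d / n scale (up to constants)
```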
## Lagrangian/regularized estimator
$\hat{\theta}_\lambda = \arg\min_{\theta \in R^d} \{ \frac 1 2 ||y-\theta||_2^2 + \lambda ||\theta||_1\} = \arg\min_{\theta \in R^d} f(\theta)$
> There is a one-to-one mapping between the constrained problem and the regularized problem.
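Because this objective separates across coordinates, $\hat{\theta}_\lambda$ has a closed form here: coordinatewise soft-thresholding of $y$ at level $\lambda$. A minimal sketch, continuing the numpy examples above:

```python
def soft_threshold(y, lam):
    """Coordinatewise minimizer of 0.5*(y_j - t)^2 + lam*|t| over t."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)
```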
Prop:
If $\lambda \geq 2\frac{\sigma}{\sqrt{n}} ||w||_\infty$, then $||\hat{\Delta}_\lambda ||_1 \leq c \sqrt{k} ||\hat{\Delta}_\lambda||_2$, where $\hat{\Delta}_\lambda = \hat{\theta}_\lambda - \theta^*$.
$c \approx 3$? (The argument below gives $c = 4$.)
> Martin has PTSD about constants after HW1, so he replaced all of the constants with just $c$ in HW2.
Then $E[||\hat{\Delta}_\lambda||_2^2] \leq c'\frac{\sigma^2}{n} k \log d$.
The basic inequality says
$$f(\hat{\theta}_\lambda) \leq f(\theta^*)$$
Unpacking this (writing $\hat{\Delta} = \hat{\theta}_\lambda - \theta^*$ and $y = \theta^* + \frac{\sigma}{\sqrt{n}} w$):
$$\frac 1 2 ||y-(\theta^*+\hat{\Delta})||_2^2 + \lambda ||\theta^*+\hat{\Delta}||_1 \leq \frac 1 2 ||y-\theta^*||_2^2 + \lambda ||\theta^*||_1$$
Then
$$0 \leq \frac 1 2 ||\hat{\Delta}||_2^2 \\
\leq \frac{\sigma}{\sqrt{n}} <w, \hat{\Delta}> + \lambda (||\theta^*||_1 - ||\theta^*+\hat{\Delta}||_1) \\
\leq \frac{\sigma}{\sqrt{n}}||w||_\infty ||\hat{\Delta}||_1 +\lambda (||\theta^*||_1 - ||\theta^*+\hat{\Delta}||_1) $$
By the previous proof, $||\theta^*||_1 = ||\theta_S^*||_1$, so
$$||\theta^*||_1 - ||\theta^* + \hat{\Delta}||_1 = ||\theta_S^*||_1 - \big( ||\theta_S^*+\hat{\Delta}_S||_1 + ||\hat{\Delta}_{S^c}||_1 \big) \leq ||\hat{\Delta}_S||_1 - ||\hat{\Delta}_{S^c}||_1$$
Then, using $\frac{\sigma}{\sqrt{n}}||w||_\infty \leq \frac \lambda 2$ and $||\hat{\Delta}||_1 = ||\hat{\Delta}_S||_1 + ||\hat{\Delta}_{S^c}||_1$,
$$\frac{\sigma}{\sqrt{n}}||w||_\infty ||\hat{\Delta}||_1 +\lambda (||\theta^*||_1 - ||\theta^*+\hat{\Delta}||_1) \\
\leq \frac \lambda 2 (||\hat{\Delta}_S||_1 + ||\hat{\Delta}_{S^c}||_1) + \lambda (||\hat{\Delta}_S||_1 - ||\hat{\Delta}_{S^c}||_1)
$$
Putting this together,
$$0 \leq \frac 3 2 \lambda ||\hat{\Delta}_S||_1 - \frac \lambda 2 ||\hat{\Delta}_{S^c}||_1, \quad \text{so} \quad ||\hat{\Delta}_{S^c}||_1 \leq 3 ||\hat{\Delta}_S||_1$$ for $\lambda > 0$.
> How does this serve as a conclusion? This cone condition gives $||\hat{\Delta}||_1 \leq 4||\hat{\Delta}_S||_1 \leq 4\sqrt{k}||\hat{\Delta}||_2$ exactly as in the first proof, which is the claimed bound (with $c = 4$); plugging it back into the basic inequality then yields the rate.
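A quick numerical illustration (not from the lecture): take $\lambda = 2\frac{\sigma}{\sqrt{n}}\sqrt{2\log d}$, which exceeds $2\frac{\sigma}{\sqrt{n}}||w||_\infty$ with probability approaching one for Gaussian noise, and check the cone condition and the error scale. This reuses `soft_threshold` and the simulated `y`, `theta_star`, `n`, `d`, `k`, `sigma` from the sketches above.

```python
lam = 2 * sigma / np.sqrt(n) * np.sqrt(2 * np.log(d))  # dominates 2*(sigma/sqrt(n))*||w||_inf w.h.p.
theta_lam = soft_threshold(y, lam)
delta = theta_lam - theta_star

S = theta_star != 0                                            # true support
print(np.abs(delta[~S]).sum() <= 3 * np.abs(delta[S]).sum())   # cone condition (holds on the good event)
print(np.sum(delta**2))                                        # squared error
print(sigma**2 * k * np.log(d) / n)                            # k log d / n scale (up to constants)
```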
## Sparse linear regression
Observe $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in R^d$ = covariate, $y_i \in R$ = response.
Want to predict $y$ with linear function $x \rightarrow <\theta, x> = \sum_{j=1}^d \theta_j x_j$.
> Maybe there is a sparse $\theta$ with $||\theta||_0 = k \ll d$ that works very well. Almost a needle-in-a-haystack problem.
Observe $y_i = <x_i, \theta^*> + \sigma w_i$, $E[w_i] = 0, Var(w_i) = 1$.
$E[y | x] = <\theta^*, x>$.
Results to date would allow us to bound the error. In matrix form,
$$(y_1, ..., y_n)^T = X \theta^* + \sigma w, \qquad X = (x_1 \cdots x_n)^T \in R^{n \times d}.$$
Two error metrics of interest:
1. Prediction error: $\frac 1 n ||X(\hat{\theta} - \theta^*)||_2^2 = \frac 1 n \sum_{i=1}^n (<x_i, \hat{\theta}> - <x_i, \theta^*>)^2$
2. Estimation error: $||\hat{\theta} - \theta^*||_2^2 = \sum_{j=1}^d (\hat{\theta}_j -\theta^*_j)^2$.
From 1., $\frac 1 n ||X(\hat{\theta} - \theta^*)||_2^2 = (\hat{\theta} - \theta^*)^T \hat{\Sigma}_n (\hat{\theta} - \theta^*)$ where $\hat{\Sigma}_n = \frac 1 n \sum_{i=1}^n x_i x_i^T$.
From 2., $||\hat{\theta} - \theta^*||_2^2 = (\hat{\theta} - \theta^*)^T I (\hat{\theta} - \theta^*)$.
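To make this concrete, here is a minimal proximal-gradient (ISTA) sketch of the Lasso for the regression objective $\frac{1}{2n}||y - X\theta||_2^2 + \lambda ||\theta||_1$, followed by the two error metrics above. This is not from the lecture; the Gaussian design, the choice of $\lambda$, and the iteration count are illustrative assumptions (the `soft_threshold` helper is repeated so the block stands alone).

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=1000):
    """Proximal gradient (ISTA) for (1/(2n))||y - X theta||_2^2 + lam*||theta||_1."""
    n, d = X.shape
    step = n / np.linalg.norm(X, 2) ** 2        # 1/L with L = ||X||_op^2 / n
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n        # gradient of the smooth part
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

rng = np.random.default_rng(0)
n, d, k, sigma = 200, 500, 5, 1.0               # illustrative sizes
theta_star = np.zeros(d); theta_star[:k] = 1.0  # k-sparse truth
X = rng.standard_normal((n, d))                 # assumed Gaussian design
y = X @ theta_star + sigma * rng.standard_normal(n)

theta_hat = lasso_ista(X, y, lam=2 * sigma * np.sqrt(2 * np.log(d) / n))
delta = theta_hat - theta_star
Sigma_n = X.T @ X / n                           # empirical covariance \hat{\Sigma}_n
print(delta @ Sigma_n @ delta)                  # metric 1: prediction error
print(delta @ delta)                            # metric 2: estimation error
```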