# 03/02 Lecture 7: Cont. on Constraints on Least Squares
###### tags: `Yan`
## Outline
1. $l_0$-constraints for sparsity (Chap. 7, 7.1)
2. Fast rates for $l_1$-relaxation (Chap. 7, 7.1)
3. Touch on ellipses (didn't get to)
> **Before lecture:**
Martin: The HW is meant to be hard. Don't be discouraged if you find it challenging. Hopefully it's not challenging in an annoying way, but actually interesting.
`they are thinking very carefully about the problems` (not like they change the problem on the day it's due. Anyway.)
If you find the homework too easy or too hard, you don't belong to this class :anguished:
We got an additional TA named Williams for the class, so more OH (yay!)
Martin: I'm happy to give you hints, but I won't hold your hand 😲
## Constrained Least Squares
$H \subseteq R^d$, $H$ compact
$$y = \Theta^* + \frac{\sigma}{\sqrt{n}} w, \quad \Theta^* \in H, \ w \in R^d$$
$\hat{\Theta} \in \arg\min_{\Theta \in H} ||y-\Theta||_2^2$.
### Basic inequality
$\frac{1}{2}||\hat{\Delta}||_2^2 \leq \frac{\sigma}{\sqrt{n}} \langle w, \hat{\Delta}\rangle$, where $\hat{\Delta} := \hat{\Theta} - \Theta^*$. Deterministic (stability analysis in optimization).
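Filling in the algebra (it follows directly from optimality of $\hat{\Theta}$ and the model above): since $\Theta^* \in H$ is feasible,
$$||y - \hat{\Theta}||_2^2 \leq ||y - \Theta^*||_2^2 = \frac{\sigma^2}{n}||w||_2^2.$$
Substituting $y = \Theta^* + \frac{\sigma}{\sqrt{n}} w$ on the left and expanding,
$$||\hat{\Delta}||_2^2 - \frac{2\sigma}{\sqrt{n}}\langle w, \hat{\Delta}\rangle + \frac{\sigma^2}{n}||w||_2^2 \leq \frac{\sigma^2}{n}||w||_2^2,$$
which rearranges to the basic inequality.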
### Concentration (HW#2, $w \sim N(0, I_{d \times d})$)
Show $$||\hat{\Theta}-\Theta^*||_2^2 \leq c \{E[||\hat{\Theta}-\Theta^*||_2^2] + \frac{\sigma^2}{n}\log (1/\delta)\}$$
with probability at least $1-\delta$.
> Big picture: the estimate is close to its expectation with high probability ($1-\delta$), so the expectation is representative of the typical behavior.
> Martin tried to make an analogy to lottery but failed.
### Analysis of $E||\hat{\Delta}||_2^2$
----
Example ($l_0$-ball)
$$B_0(k) = \{\theta \in R^d \mid \ ||\theta||_0 := \sum_{j=1}^d I(\theta_j \neq 0) \leq k\}$$
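Concretely, over $B_0(k)$ the constrained LS estimate in this sequence model is hard thresholding: keep the $k$ largest-magnitude coordinates of $y$, zero out the rest (for any fixed support the optimum just copies $y$ on that support). A minimal numerical sketch, with variable names my own:

```python
import numpy as np

def l0_constrained_ls(y, k):
    """Least squares over B_0(k) in the sequence model:
    keep the k largest-magnitude coordinates of y, zero out the rest."""
    theta_hat = np.zeros_like(y)
    keep = np.argsort(np.abs(y))[-k:]   # indices of the k largest |y_j|
    theta_hat[keep] = y[keep]
    return theta_hat

# Tiny demo: d = 1000, k = 10, noise level sigma / sqrt(n) = 0.1
rng = np.random.default_rng(0)
d, k, noise = 1000, 10, 0.1
theta_star = np.zeros(d)
theta_star[:k] = 1.0                     # k-sparse ground truth
y = theta_star + noise * rng.standard_normal(d)
print(np.sum((l0_constrained_ls(y, k) - theta_star) ** 2))
```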
> $\hat{\Delta}$ is very dependent on $w$.
$$|\langle\hat{\Delta}, w\rangle| \leq ||\hat{\Delta}||_1||w||_\infty$$ by Hölder
> l1 vs. l2 norms
$$||\Delta||_1 \leq \sqrt{d} ||\Delta||_2$$ by Cauchy-Schwarz. In high-dimensional analysis, we care about the dependence on $d$, since $d$ can be large.
$\hat{\Theta}$ and $\Theta^*$ are each at most $k$-sparse, so $\hat{\Delta}$ is at most $2k$-sparse, and hence $||\hat{\Delta}||_1 \leq \sqrt{2k}\,||\hat{\Delta}||_2$.
So we get $$\frac 1 2 ||\hat{\Delta}||_2^2 \leq \frac{\sigma}{\sqrt{n}}||\hat{\Delta}||_1 ||w||_\infty \leq \frac{\sigma}{\sqrt{n}}\sqrt{2k}\, ||\hat{\Delta}||_2 ||w||_\infty.$$
Dividing through by $||\hat{\Delta}||_2$ and taking expectations, $$E[||\hat{\Delta}||_2] \leq \frac{2\sqrt{2k}\,\sigma}{\sqrt{n}} E[||w||_\infty]$$
Each $w_i$ is zero-mean and 1-sub-Gaussian.
From HW#1, we have that $E||w||_\infty \leq 2\sqrt{2 \log d}$
$$E||\hat{\Delta}||_2 \leq c \sigma \sqrt{\frac{k \log d}{n}}$$
Can show $E[||\hat{\Delta}||_2^2] \leq c' \sigma^2 \frac{k \log d}{n}$
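A quick Monte Carlo sanity check of this scaling (my own, not from lecture), averaging $||\hat{\Delta}||_2^2$ over repeated draws of $w$ and comparing to $\sigma^2 k \log d / n$:

```python
import numpy as np

# Check that E||theta_hat - theta*||_2^2 is on the order of
# sigma^2 * k * log(d) / n for the l0-constrained estimator (hard thresholding).
rng = np.random.default_rng(1)
d, k, sigma, n, trials = 2000, 5, 1.0, 100, 200
theta_star = np.zeros(d)
theta_star[:k] = 5.0
errs = []
for _ in range(trials):
    y = theta_star + (sigma / np.sqrt(n)) * rng.standard_normal(d)
    keep = np.argsort(np.abs(y))[-k:]    # hard thresholding = l0-constrained LS
    theta_hat = np.zeros(d)
    theta_hat[keep] = y[keep]
    errs.append(np.sum((theta_hat - theta_star) ** 2))
print(np.mean(errs), sigma**2 * k * np.log(d) / n)   # empirical vs. predicted scale
```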
> If there is no structure on $\Theta^*$, we can prove a bound of $\frac{\sigma^2 d}{n}$. Is our new bound better?
If $\Theta^*$ is sparse, i.e. $k \ll d$: good!
Otherwise, i.e. $k = O(d)$: not good!
> Technique >> result (though the result isn't bad).
> examples of sparse vectors: music, image processing
> Soft sparsity ($l_q$-balls):
$$B_q(R) = \{\theta \in R^d \mid \sum_{j=1}^d |\theta_j|^q \leq R\}$$
----
Hölder is pretty crude.
$$|\langle\hat{\Delta}, w\rangle| = ||\hat{\Delta}||_2 \,\Big|\Big\langle \frac{\hat{\Delta}}{||\hat{\Delta}||_2}, w\Big\rangle\Big|$$
We know $\hat{\Delta}/||\hat{\Delta}||_2 \in B_0(2k) \cap B_2(1)$, so
$$\max_{a \in B_0(2k), ||a||_2 = 1} \langle a, w\rangle = \max_{|S|=2k} ||w_S||_2 = \max_{|S|=2k} \sqrt{\sum_{j \in S} w_j^2}$$
The number of such sets $S$ is ${d\choose 2k}$, because of the $2k$-sparsity.
$$E[\max_{|S|=2k} ||w_S||_2] = E[\max_{|S|=2k} \{||w_S||_2 - E[||w_S||_2]\}] + E[||w_S||_2]$$
(valid because $E[||w_S||_2]$ is the same for every $|S| = 2k$, by symmetry).
When $w$ is standard Gaussian, $||w_S||_2 - E[||w_S||_2]$ is 1-sub-Gaussian (the map $w \mapsto ||w_S||_2$ is 1-Lipschitz).
$E[||w_S||_2] \leq \sqrt{E[||w_S||_2^2]} = \sqrt{|S|} = \sqrt{2k}$ by Jensen.
From HW#1, the bound on the max of sub-Gaussian variables gives
$E[\max_{|S|=2k} \{||w_S||_2 - E[||w_S||_2]\}] \leq \sqrt{2 \log {d \choose 2k}} \leq c\{\sqrt{k \log (\frac{ed}{k})} + \sqrt{k} \}$
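The last step uses the standard bound on binomial coefficients; spelled out:
$$\binom{d}{2k} \leq \Big(\frac{ed}{2k}\Big)^{2k} \ \Longrightarrow\ \sqrt{2\log \binom{d}{2k}} \leq \sqrt{4k \log \tfrac{ed}{2k}} \leq 2\sqrt{k \log \tfrac{ed}{k}}.$$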
> Compared to the previous result, this one is sharp and cannot be improved. This bound applies even when the data is not sparse. Using a more careful analysis, we get a better result.
> Martin wants his twins in class every day. Kids are tiring, but when they're gone he misses them.
----
$$z = \max_{a \in A} X_a, \quad A \ \text{finite}, \quad E[X_a] = 0$$
$X_a$ is $\sigma$-sub-Gaussian for each $a$.
Then, we had $E[z] \leq \sqrt{2\sigma^2 \log |A|}$.
$\lambda E[z] = \log (e^{\lambda E[z]}) \leq \log (E[e^{\lambda z}])$ for $\lambda > 0$, where the inequality is by Jensen.
$E[e^{\lambda z}] \leq \sum_{a \in A} E[e^{\lambda X_a}]$ (max $\leq$ sum)
> Martin says it's not a good move, but I use it all the time :(
$E[e^{\lambda z}] \leq \sum_{a \in A} e^{\frac{\lambda^2\sigma^2}{2}} = |A|e^{\frac{\lambda^2\sigma^2}{2}}$.
$$\lambda E[z] \leq \log |A| + \frac{\lambda^2\sigma^2}{2}$$
Then divide through by $\lambda$:
$$ E[z] \leq \log |A|/\lambda + \frac{\lambda \sigma^2}{2} \ \forall \lambda > 0$$
> "OMG Do I need to get a sharp constant?" -- Martin Wainwright
Choose $\lambda = \sqrt{\frac{2\log |A|}{\sigma^2}}$, then
$$ E[z] \leq \log |A|/\lambda + \frac{\lambda \sigma^2}{2} = 2\sqrt{\frac{\sigma^2 \log |A|}{2}} = \sqrt{2\sigma^2 \log |A|},$$ which is what we wanted.
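A quick empirical illustration of this bound (my own check, not from lecture), using i.i.d. $N(0, \sigma^2)$ variables, which are $\sigma$-sub-Gaussian:

```python
import numpy as np

# Compare E[max_a X_a] (estimated over many trials) to sqrt(2 * sigma^2 * log|A|).
rng = np.random.default_rng(2)
sigma, card_A, trials = 1.0, 10_000, 500
z = rng.normal(0.0, sigma, size=(trials, card_A)).max(axis=1)   # one max per trial
print(z.mean(), np.sqrt(2 * sigma**2 * np.log(card_A)))          # empirical vs. bound
```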
## Fast rates on the $l_1$-ball
> Application in Adaptation
Example: $B_1(R) = \{\theta \in R^d | \sum_{j=1}^d |\theta_j| \leq R\}$
Last time: $E||\hat{\theta}-\theta^*||_2^2 \leq \sigma R \sqrt{\frac{\log d}{n}}$, the slow rate.
> Martin started drawing the diamond-shaped $l_1$-ball: if $\theta^*$ is at a vertex (sparse), the constraint is very active. See Figure 7.2 for something similar.
Will show: $\hat{\theta} = \arg\min_{||\theta||_1 \leq R} ||y-\theta||_2^2$, with $||\theta^*||_0 \leq k$ and $||\theta^*||_1 = R$.
If $\theta^*$ is on the boundary, we will get a fast rate: $E[||\hat{\theta} - \theta^*||_2^2] \leq c\, \frac{\sigma^2 k \log d}{n}$.
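In the sequence model, the $l_1$-constrained estimator is just the Euclidean projection of $y$ onto $B_1(R)$, computable by the standard sort-based soft-thresholding routine. A sketch (my own, not from lecture) to see the fast rate numerically when $\theta^*$ is $k$-sparse and sits exactly on the boundary $||\theta^*||_1 = R$:

```python
import numpy as np

def project_l1_ball(y, R):
    """Euclidean projection of y onto {theta : ||theta||_1 <= R}
    via the standard sort-based algorithm (soft-threshold at a data-driven level)."""
    if np.sum(np.abs(y)) <= R:
        return y.copy()
    u = np.sort(np.abs(y))[::-1]                       # |y| sorted in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(y) + 1) > css - R)[0][-1]
    tau = (css[rho] - R) / (rho + 1)                   # soft-threshold level
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

# theta* is k-sparse with ||theta*||_1 = R exactly (on the boundary).
rng = np.random.default_rng(3)
d, k, sigma, n = 1000, 5, 1.0, 400
theta_star = np.zeros(d)
theta_star[:k] = 2.0
R = np.sum(np.abs(theta_star))
errs = []
for _ in range(200):
    y = theta_star + (sigma / np.sqrt(n)) * rng.standard_normal(d)
    errs.append(np.sum((project_l1_ball(y, R) - theta_star) ** 2))
print(np.mean(errs), sigma**2 * k * np.log(d) / n)     # compare to the fast-rate scale
```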