# 03/02 Lecture 7: Cont. on Constraints on Least Squares

###### tags: `Yan`

## Outline

1. $l_0$-constraints for sparsity (Chap. 7, 7.1)
2. Fast rates for $l_1$-relaxation (Chap. 7, 7.1)
3. Touch on ellipses (didn't get to)

> **Before lecture:** Martin: the homeworks are meant to be hard. Don't be discouraged if you find them challenging. Hope they're not challenging in an annoying way and are actually interesting. `they are thinking very carefully about the problems` (not like they change the problem the day it's due. Anyway.) If you find the homework too easy or too hard, you don't belong in this class :anguished: We got an additional TA called Williams for the class, so more office hours (Yay). Martin: I'm happy to give you hints, but won't hold your hand 😲

## Constrained Least Squares

$H \subseteq R^d$, $H$ compact.
$$y = \Theta^* + \frac{\sigma}{\sqrt{n}} w, \quad \Theta^* \in H, \quad w \in R^d$$
$$\hat{\Theta} \in \text{argmin}_{\Theta \in H} ||y-\Theta||_2^2$$

### Basic inequality

With the error $\hat{\Delta} := \hat{\Theta} - \Theta^*$,
$$\frac{1}{2}||\hat{\Delta}||_2^2 \leq \frac{\sigma}{\sqrt{n}} <w, \hat{\Delta}>$$
Deterministic (stability analysis in optimization).

### Concentration (HW#2, $w \sim N(0, I_{d \times d})$)

Show
$$||\hat{\Theta}-\Theta^*||_2^2 \leq c \Big\{E[||\hat{\Theta}-\Theta^*||_2^2] + \frac{\sigma^2}{n}\log (1/\delta)\Big\}$$
with probability at least $1-\delta$.

> Big picture: the estimate is close to its expectation with high probability $1-\delta$, so the expectation is representative of the typical behavior.
> Martin tried to make an analogy to the lottery but failed.

### Analysis of $E||\hat{\Delta}||_2^2$

----

Example ($l_0$-ball):
$$B_0(k) = \{\theta \in R^d \ | \ ||\theta||_0 := \sum_{j=1}^d I(\theta_j \neq 0) \leq k\}$$

> $\hat{\Delta}$ is very dependent on $w$.

By Hölder,
$$|<\hat{\Delta}, w>| \leq ||\hat{\Delta}||_1 ||w||_\infty$$

> $l_1$ vs. $l_2$ norms: $||\Delta||_1 \leq \sqrt{d} ||\Delta||_2$ by Cauchy-Schwarz. In high-dimensional analysis we care about $d$, since it could be large.

$\hat{\Theta}$ and $\Theta^*$ are each at most $k$-sparse, so $\hat{\Delta}$ is at most $2k$-sparse, and Cauchy-Schwarz on its support gives $||\hat{\Delta}||_1 \leq \sqrt{2k} ||\hat{\Delta}||_2$. Combining with the basic inequality,
$$\frac 1 2 ||\hat{\Delta}||_2^2 \leq \frac{\sigma}{\sqrt{n}} ||\hat{\Delta}||_1 ||w||_\infty \leq \frac{\sigma}{\sqrt{n}} \sqrt{2k} \ ||\hat{\Delta}||_2 ||w||_\infty$$
Then
$$E[||\hat{\Delta}||_2] \leq 2\sqrt{2k} \ \frac{\sigma}{\sqrt{n}} E[||w||_\infty]$$
Each $w_i$ is 1-sub-Gaussian and zero-mean, so from HW#1 we have $E||w||_\infty \leq 2\sqrt{2 \log d}$, giving
$$E||\hat{\Delta}||_2 \leq c \sigma \sqrt{\frac{k \log d}{n}}$$
Can also show
$$E[||\hat{\Delta}||_2^2] \leq c' \frac{\sigma^2 k \log d}{n}$$

> With no structure on $\Theta^*$, we can prove the rate $\frac{\sigma^2 d}{n}$.

Is our new bound better? If $\Theta^*$ is sparse, i.e. $k \ll d$: good! Otherwise, i.e. $k = O(d)$: not good!

> Technique >> result, though the result is not bad.
> Examples of sparse vectors: music, image processing.
> Soft sparsity:
$$B_q(R) = \{\theta \in R^d \ | \ \sum_{j=1}^d |\theta_j|^q \leq R\}$$
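Before the sharper analysis, here is a minimal numerical sketch (mine, not from the lecture) of the crude $l_0$ rate above. In this sequence model the $l_0$-constrained least-squares estimate is just the projection of $y$ onto $B_0(k)$, i.e. keeping the $k$ largest-magnitude coordinates of $y$; the helper name `l0_least_squares` and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def l0_least_squares(y, k):
    """Constrained least squares over B_0(k) in the sequence model:
    the minimizer keeps the k largest-magnitude coordinates of y."""
    theta_hat = np.zeros_like(y)
    support = np.argsort(np.abs(y))[-k:]      # indices of the k largest |y_j|
    theta_hat[support] = y[support]
    return theta_hat

d, k, sigma = 2000, 10, 1.0
theta_star = np.zeros(d)
theta_star[:k] = 1.0                          # a k-sparse truth

for n in [100, 400, 1600]:
    errs = []
    for _ in range(200):
        w = rng.standard_normal(d)            # N(0, I_d) noise
        y = theta_star + (sigma / np.sqrt(n)) * w
        errs.append(np.sum((l0_least_squares(y, k) - theta_star) ** 2))
    predicted = sigma**2 * k * np.log(d) / n  # the sigma^2 k log(d) / n scaling
    print(f"n={n:5d}  mean error = {np.mean(errs):.4f}   sigma^2 k log(d)/n = {predicted:.4f}")
```

The measured squared error should track the $\frac{\sigma^2 k \log d}{n}$ prediction up to a constant as $n$ grows.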
----

Hölder is pretty crude. Write instead
$$|<\hat{\Delta}, w>| = ||\hat{\Delta}||_2 \ <\hat{\Delta}/||\hat{\Delta}||_2, \ w>$$
We know $\hat{\Delta}/||\hat{\Delta}||_2 \in B_0(2k) \cap B_2(1)$, and
$$\max_{a \in B_0(2k), ||a||_2 = 1} <a, w> = \max_{|S|=2k} ||w_S||_2 = \max_{|S|=2k} \sqrt{\sum_{j \in S} w_j^2}$$
The number of such subsets is ${d \choose 2k}$, because of the $2k$-sparsity.

$$E[\max_{|S|=2k} ||w_S||_2] = E[\max_{|S|=2k} \{||w_S||_2 - E[||w_S||_2]\}] + E[||w_S||_2]$$
When $w$ is standard Gaussian, $||w_S||_2 - E[||w_S||_2]$ is 1-sub-Gaussian, and $E[||w_S||_2] \leq \sqrt{|S|} = \sqrt{2k}$. HW#1 bounded the max of sub-Gaussian variables:
$$E[\max_{|S|=2k} \{||w_S||_2 - E[||w_S||_2]\}] \leq \sqrt{2 \log {d \choose 2k}}$$
Using $\log {d \choose 2k} \leq 2k \log(\frac{ed}{2k})$ and putting the pieces together,
$$E[\max_{|S|=2k} ||w_S||_2] \leq c\Big\{\sqrt{k \log \big(\tfrac{ed}{k}\big)} + \sqrt{k} \Big\}$$

> Compared to the previous result, this is sharp and cannot be improved. This bound stays useful even when the vector is not very sparse. Using a more careful analysis, we get a better result.
> Martin wants his twins in class every day. Kids are tiring, but when they're gone he misses them.

----

Recall the maximal inequality: let $A$ be a finite set and $z = \max_{a \in A} X_a$, where each $X_a$ is zero-mean and $\sigma$-sub-Gaussian. Then we had
$$E[z] \leq \sqrt{2\sigma^2 \log |A|}$$
Proof: for $\lambda > 0$,
$$\lambda E[z] = \log (e^{\lambda E[z]}) \leq \log (E[e^{\lambda z}])$$
by Jensen. Also
$$E[e^{\lambda z}] \leq \sum_{a \in A} E[e^{\lambda X_a}]$$
since the max is at most the sum.

> Martin says it's not a good move, but I use it all the time :(

By sub-Gaussianity,
$$E[e^{\lambda z}] \leq \sum_{a \in A} e^{\frac{\lambda^2\sigma^2}{2}} = |A|e^{\frac{\lambda^2\sigma^2}{2}}$$
so
$$\lambda E[z] \leq \log |A| + \frac{\lambda^2\sigma^2}{2}$$
Dividing by $\lambda$,
$$ E[z] \leq \frac{\log |A|}{\lambda} + \frac{\lambda \sigma^2}{2} \quad \forall \lambda > 0$$

> "OMG, do I need to get a sharp constant?" -- Martin Wainwright

Choose $\lambda = \sqrt{\frac{2\log |A|}{\sigma^2}}$; then
$$ E[z] \leq \frac{\log |A|}{\lambda} + \frac{\lambda \sigma^2}{2} = 2\sqrt{\frac{\sigma^2 \log |A|}{2}} = \sqrt{2\sigma^2 \log |A|}$$
which is what we wanted.

## Fast rate on the $l_1$-ball

> Application in Adaptation.

Example:
$$B_1(R) = \{\theta \in R^d \ | \ \sum_{j=1}^d |\theta_j| \leq R\}$$
Last time (slow rate):
$$E[||\hat{\theta}-\theta^*||_2^2] \leq \sigma R \sqrt{\frac{\log d}{n}}$$

> Martin started drawing the diamond-shaped $l_1$-ball: if $\theta^*$ is at a vertex (sparse), the constraint is very active. See Figure 7.2 for something similar.

Will show: for
$$\hat{\theta} = \text{argmin}_{\theta \in R^d, ||\theta||_1 \leq R} ||y-\theta||_2^2, \quad ||\theta^*||_0 \leq k, \quad ||\theta^*||_1 = R,$$
i.e. $\theta^*$ sparse and on the boundary of the ball, we get the fast rate
$$E[||\hat{\theta} - \theta^*||_2^2] \leq c \frac{\sigma^2 k \log d}{n}$$
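As a minimal sketch of this fast-rate claim (mine, not from the lecture): in the sequence model, the $l_1$-constrained least-squares estimate is exactly the Euclidean projection of $y$ onto $B_1(R)$, computed below with the standard sort-based projection. The helper name `project_l1_ball` and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def project_l1_ball(v, R):
    """Euclidean projection of v onto {x : ||x||_1 <= R} (sort-based soft thresholding)."""
    u = np.abs(v)
    if u.sum() <= R:
        return v.copy()
    s = np.sort(u)[::-1]                      # magnitudes in descending order
    cumsum = np.cumsum(s)
    rho = np.nonzero(s * np.arange(1, len(v) + 1) > (cumsum - R))[0][-1]
    tau = (cumsum[rho] - R) / (rho + 1.0)     # soft-threshold level
    return np.sign(v) * np.maximum(u - tau, 0.0)

d, k, n, sigma = 1000, 5, 400, 1.0
theta_star = np.zeros(d)
theta_star[:k] = 1.0
R = np.sum(np.abs(theta_star))                # theta* on the boundary: ||theta*||_1 = R

errs = []
for _ in range(200):
    y = theta_star + (sigma / np.sqrt(n)) * rng.standard_normal(d)
    theta_hat = project_l1_ball(y, R)         # l1-constrained least squares = projection of y
    errs.append(np.sum((theta_hat - theta_star) ** 2))

print("mean squared error:            ", np.mean(errs))
print("fast rate  sigma^2 k log(d)/n: ", sigma**2 * k * np.log(d) / n)
print("slow rate  sigma R sqrt(log(d)/n):", sigma * R * np.sqrt(np.log(d) / n))
```

With a sparse $\theta^*$ sitting on the boundary of the ball, the measured error should sit near the fast-rate scale rather than the slow-rate scale, illustrating the adaptation phenomenon the lecture is heading toward.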