# 03/09: Lecture 9: Random Matrices, Estimating Covariance Matrices

###### tags: `Yan`

1. Lasso ($l_1$-regularized least squares) and restricted eigenvalues
2. Low-rank matrix estimation, noise operator norm

> In HW2, we need to do low-rank matrix estimation and compute some norms.
> Needed techniques: controlling random matrices.

## Lasso and restricted eigenvalues

----

**Review:** $\hat{\theta}^\lambda = \arg\min_{\theta \in R^d} \{\frac{1}{2n} ||y-X\theta||_2^2 + \lambda||\theta||_1\}$.

> Model: $(y_1, ..., y_n)^T = (x_1, ..., x_n)^T\theta^* + \sigma w$

Let $F(\theta) = \frac{1}{2n} ||y-X\theta||_2^2 + \lambda||\theta||_1$. Since $\hat{\theta}^\lambda$ minimizes $F$, the basic inequality $F(\hat{\theta}^\lambda) \leq F(\theta^*)$ holds. Writing $\hat{\Delta} = \hat{\theta}^\lambda - \theta^*$ for the error we care about, this rearranges to
$$0 \leq \frac{1}{2n} ||X\hat{\Delta}||_2^2 \leq \sigma \langle \hat{\Delta}, \frac{X^Tw}{n} \rangle + \lambda \{||\theta^*||_1-||\theta^*+\hat{\Delta}||_1\}$$

Last time, we argued that $\lambda \{||\theta^*||_1-||\theta^*+\hat{\Delta}||_1\} \leq \lambda\{||\hat{\Delta}_S||_1 - ||\hat{\Delta}_{S^c}||_1 \}$.

With Hölder's inequality, we have
$$\sigma \langle \hat{\Delta}, \frac{X^Tw}{n} \rangle \leq ||\hat{\Delta}||_1 \cdot \sigma ||\frac{X^Tw}{n}||_\infty$$

Assume that $\sigma ||\frac{X^Tw}{n}||_\infty \leq \frac \lambda 2$ by an appropriate choice of $\lambda$. The error then belongs to the cone
$$C_3(S) = \{\Delta \in R^d : ||\Delta_{S^c}||_1 \leq 3 ||\Delta_S||_1\}$$
(C for cone, 3 for the constant).

----

We want a lower bound on $\frac{1}{2n} ||X\hat{\Delta}||_2^2$ in terms of the $l_2$-norm. After algebra (Section 7.2):
$$\frac{1}{2n} ||X\hat{\Delta}||_2^2 \leq c\lambda \sqrt{k} ||\hat{\Delta}||_2$$
where $k=|S|$ is the sparsity and $S = \{j : \theta_j^* \neq 0\}$.

We want
$$\frac{1}{n} ||X\Delta||_2^2 \geq \gamma_S ||\Delta||_2^2$$
for some $\gamma_S > 0$, for all $\Delta \in C_3(S)$, because we then get $||\hat{\Delta}||_2^2 \leq c'k\lambda^2/\gamma_S^2$.

Unpack:
$$\frac{1}{n} ||X\Delta||_2^2 = \frac 1 n \Delta^TX^TX\Delta = \Delta^T\hat{\Sigma}\Delta = \frac 1 n \sum_{i=1}^n \langle x_i, \Delta \rangle^2$$
where $\hat{\Sigma} = \frac 1 n X^TX$ is the sample covariance matrix; it is psd. Minimizing over all unit directions,
$$\min_{||\Delta||_2 = 1} \frac{||X\Delta||_2^2}{n} = \min_{||\Delta||_2 = 1} \Delta^T\hat{\Sigma}\Delta = \lambda_{min}(\hat{\Sigma})$$
Restricting to the cone gives the **restricted eigenvalue**: $\gamma_S = \min_{||\Delta||_2 = 1, \Delta \in C_3(S)} \Delta^T\hat{\Sigma}\Delta \geq \lambda_{min}(\hat{\Sigma})$.

----

**Theorem** (Uniform Law): Say $x_i \sim N(0, \Sigma)$, $i = 1, ..., n$, i.i.d., and write
- $\hat{\Sigma} = \frac 1 n \sum_{i=1}^n x_ix_i^T$, so $\Delta^T \hat{\Sigma} \Delta = \frac 1 n \sum_{i=1}^n \langle x_i, \Delta \rangle^2$
- $\Sigma = E(x_1x_1^T)$, so $\Delta^T\Sigma \Delta = E[\langle x_1, \Delta \rangle^2]$

Then with high probability:
$$|\Delta^T \hat{\Sigma} \Delta - \Delta^T\Sigma \Delta| \leq c (\max_{j} \Sigma_{jj}) ||\Delta||_1^2 \frac{\log d}{n} \ \ \ \forall \Delta \in R^d$$

Special case: $\Sigma = I_{d \times d}$, $X=(x_1, ..., x_n)^T$:
$$1 - c \frac{||\Delta||_1^2}{||\Delta||_2^2} \frac{\log d}{n} \leq \frac 1 n \frac{||X\Delta||_2^2}{||\Delta||_2^2} \leq 1 + c \frac{||\Delta||_1^2}{||\Delta||_2^2} \frac{\log d}{n} \ \ \ \forall \Delta \neq 0$$

Think of $c \frac{||\Delta||_1^2}{||\Delta||_2^2} \frac{\log d}{n} \approx \epsilon$. The ratio $\frac{||\Delta||_1^2}{||\Delta||_2^2}$ is not useful for dense vectors, since it can be as large as $d$. But on the cone we have $\frac{||\Delta||_1^2}{||\Delta||_2^2} \leq 16k$: since $||\Delta_{S^c}||_1 \leq 3||\Delta_S||_1$, we get $||\Delta||_1 \leq 4||\Delta_S||_1 \leq 4\sqrt{k}||\Delta_S||_2 \leq 4\sqrt{k}||\Delta||_2$.

So for all $\Delta \in C_3(S)$,
$$1 - 16ck \frac{\log d}{n} \leq \frac 1 n \frac{||X\Delta||_2^2}{||\Delta||_2^2} \leq 1 + 16ck \frac{\log d}{n}$$
and the lower bound is the one we need: it stays above a constant (e.g. $\gamma_S \geq \frac 1 2$) as soon as $n \gtrsim k \log d$.
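To build intuition for the special case $\Sigma = I$, here is a minimal numpy sketch (all sizes, `n`, `d`, `k`, and the 500 sampled directions, are made-up illustration values, not from the lecture or HW). It samples random $k$-sparse unit directions, which lie in $C_3(S)$ trivially and satisfy $||\Delta||_1^2/||\Delta||_2^2 \leq k$, and checks that $\Delta^T\hat{\Sigma}\Delta$ stays within the uniform-law scale of $\Delta^T\Sigma\Delta = 1$. Sampling directions only illustrates the concentration; it does not certify the minimum over the whole cone.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 1000, 5  # hypothetical sizes with k log d << n

# Rows x_i ~ N(0, I_d), so Sigma = I and Delta' Sigma Delta = ||Delta||_2^2 = 1.
X = rng.standard_normal((n, d))

# Random k-sparse unit directions Delta: these lie in C_3(S) for S = supp(Delta)
# (Delta_{S^c} = 0), and ||Delta||_1^2 / ||Delta||_2^2 <= k by Cauchy-Schwarz.
quad_forms = []
for _ in range(500):
    S = rng.choice(d, size=k, replace=False)
    delta = np.zeros(d)
    delta[S] = rng.standard_normal(k)
    delta /= np.linalg.norm(delta)
    quad_forms.append(np.linalg.norm(X @ delta) ** 2 / n)  # Delta' Sigma_hat Delta

quad_forms = np.array(quad_forms)
print(f"range of Delta' Sigma_hat Delta: [{quad_forms.min():.3f}, {quad_forms.max():.3f}]")
print(f"uniform-law scale 16 k log(d)/n: {16 * k * np.log(d) / n:.3f}")
```

The printed range should sit well inside $1 \pm 16k\frac{\log d}{n}$ (up to the unspecified constant $c$), consistent with the restricted eigenvalue being bounded away from zero at these sizes.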
## Low-rank Matrix Estimation

$Y = H^* + \frac{\sigma}{\sqrt{n}}W \in R^{d_1 \times d_2}$, where $H^*$ is an unknown matrix. We make the assumption that $H^*$ is low-rank: $rank(H^*) = r$.

> Motivation: PCA, recommender systems.
> Aside: $Y = H^* + \Gamma^* + \frac{\sigma}{\sqrt{n}}W$ with $\Gamma^*$ row-sparse gives robust PCA/SVD.

"Soft"-low-rank: write the SVD $H = UDV^T$ with $D = diag(\sigma_1(H), ..., \sigma_{d_1}(H))$, for $d_1 \leq d_2$, $H \in R^{d_1 \times d_2}$.

$\sigma(H) = (\sigma_1(H), ..., \sigma_{d_1}(H))$ is the vector of singular values, so $rank(H) = r \Leftrightarrow ||\sigma(H)||_0 = r$.

$B_q(R) = \{H \in R^{d_1 \times d_2} : \sum_{j=1}^{d_1} |\sigma_j(H)|^q \leq R\}$
- $q = \infty$ (as a limit): operator norm
- $q = 2$: Frobenius norm
- $q = 1$: nuclear norm
- $q \in (0, 1)$: non-convex

> In HW2: Basic Inequality: $trace(\hat{\Delta}^T W) \leq f(\hat{\Delta}) \, ||W||_{op}$, where $\hat{\Delta} = \hat{H} - H^*$ and $||W||_{op} = \max_j \sigma_j(W)$.
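As a concrete illustration of the model and the role of $||W||_{op}$, here is a minimal numpy sketch (the truncated-SVD estimator and all parameter values are my assumptions for illustration; HW2 may call for a different estimator, e.g. nuclear-norm regularization). For Gaussian $W$, $||W||_{op}$ concentrates around $\sqrt{d_1} + \sqrt{d_2}$, which is the scale one plugs into the basic inequality.

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, r, n, sigma = 100, 150, 3, 400, 1.0  # hypothetical sizes

# Rank-r ground truth H* and observation Y = H* + (sigma/sqrt(n)) W.
H_star = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))
W = rng.standard_normal((d1, d2))
Y = H_star + sigma / np.sqrt(n) * W

# One natural estimator: truncated SVD, keeping the top r singular values of Y.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
H_hat = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

print(f"||H_hat - H*||_F     = {np.linalg.norm(H_hat - H_star):.3f}")
print(f"||W||_op             = {np.linalg.norm(W, 2):.1f}")  # max singular value
print(f"sqrt(d1) + sqrt(d2)  = {np.sqrt(d1) + np.sqrt(d2):.1f}")
```

The last two printed values should be close, matching the standard bound on the operator norm of a Gaussian matrix; this is the kind of control of random matrices the lecture is building toward.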