# 03/09: Lecture 9: Random Matrices, Estimating Covariance Matrices
###### tags: `Yan`
1. Lasso ($l_1$-regularized least squares) and restricted eigenvalues
2. Low-rank matrix estimation, noise operator norm
> In HW2, we need to do low-rank matrix estimation and compute some norms
> We need techniques for controlling random matrices
## Lasso and restricted eigenvalues
----
**Review:**
$\hat{\theta}^\lambda = \argmin_{\theta \in R^d} \{\frac{1}{2n} ||y-X\theta||_2^2 + \lambda||\theta||_1\}$.
> Model: $y = X\theta^* + \sigma w$, where $y = (y_1, ..., y_n)^T$ and $X = (x_1, ..., x_n)^T \in R^{n \times d}$
Let $F(\theta) = \frac{1}{2n} ||y-X\theta||_2^2 + \lambda||\theta||_1$; then $F(\hat{\theta}^\lambda) \leq F(\theta^*)$, the basic inequality (it holds simply because $\hat{\theta}^\lambda$ is the minimizer).
Then, writing $\hat{\Delta} = \hat{\theta}^\lambda - \theta^*$ for the error we care about,
$$0 \leq \frac{1}{2n} ||X\hat{\Delta}||_2^2 \leq \sigma\langle\hat{\Delta}, \frac{X^Tw}{n}\rangle + \lambda \{||\theta^*||_1-||\theta^*+\hat{\Delta}||_1\}$$
Last time, we argued that $\lambda \{||\theta^*||_1-||\theta^*+\hat{\Delta}||_1\} \leq \lambda\{||\hat{\Delta}_S||_1 - ||\hat{\Delta}_{S^c}||_1\}$.
With Hölder's inequality, we have $$\sigma\langle\hat{\Delta}, \frac{X^Tw}{n}\rangle \leq ||\hat{\Delta}||_1 \cdot \sigma ||\frac{X^Tw}{n}||_\infty$$
Assume that $\sigma ||\frac{X^Tw}{n}||_\infty \leq \frac \lambda 2$, which we can ensure with high probability by an appropriate choice of $\lambda$.
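Why such a choice exists (a standard calculation, assuming $w \sim N(0, I_n)$ and columns normalized so $||X_j||_2 \leq \sqrt{n}$): each coordinate $(\frac{X^Tw}{n})_j$ is Gaussian with standard deviation at most $\frac{1}{\sqrt{n}}$, so a union bound over the $d$ coordinates gives, with high probability,
$$\sigma ||\frac{X^Tw}{n}||_\infty \lesssim \sigma\sqrt{\frac{\log d}{n}}, \quad \text{so } \lambda \asymp \sigma\sqrt{\frac{\log d}{n}} \text{ suffices}.$$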
On this event, the error belongs to the cone $C_3(S) = \{\Delta : ||\Delta_{S^c}||_1 \leq 3 ||\Delta_S||_1\}$ ("C" for cone, 3 for the constant): plugging the two bounds into the basic inequality forces $||\hat{\Delta}_{S^c}||_1 \leq 3||\hat{\Delta}_S||_1$.
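A minimal numerical sketch of this setup (my own illustration, not from the lecture; the constant in $\lambda$ is heuristic): simulate the sparse model, solve the Lasso by proximal gradient descent (ISTA with soft-thresholding), and look at how far inside the cone the error lands.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, sigma = 200, 500, 5, 0.5

# k-sparse ground truth supported on S = {0, ..., k-1}.
theta_star = np.zeros(d)
theta_star[:k] = 1.0
X = rng.standard_normal((n, d))
y = X @ theta_star + sigma * rng.standard_normal(n)

# Standard scaling lambda ~ sigma * sqrt(log d / n); the constant 3 is heuristic.
lam = 3 * sigma * np.sqrt(np.log(d) / n)

def soft_threshold(v, t):
    """Prox operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# ISTA: proximal gradient on F(theta) = ||y - X theta||_2^2 / (2n) + lam * ||theta||_1.
theta = np.zeros(d)
step = n / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the smooth part
for _ in range(2000):
    grad = X.T @ (X @ theta - y) / n
    theta = soft_threshold(theta - step * grad, step * lam)

# The cone condition says this ratio is at most 3; in practice it is much smaller.
delta = theta - theta_star
print("cone ratio ||delta_Sc||_1 / ||delta_S||_1:",
      np.abs(delta[k:]).sum() / np.abs(delta[:k]).sum())
print("l2 error:", np.linalg.norm(delta))
```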
----
We want a lower bound on $\frac{1}{2n} ||X\hat{\Delta}||_2^2$ in terms of the $l_2$-norm $||\hat{\Delta}||_2$.
After algebra (Section 7.2):
$$\frac{1}{2n} ||X\hat{\Delta}||_2^2 \leq c\lambda \sqrt{k} ||\hat{\Delta}||_2$$
$k=|S|$ is the sparsity. $S = \{j | \theta_j^* \neq 0\}$.
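For completeness, a sketch of that algebra (the constant works out to $c = 3/2$): on the event $\sigma ||\frac{X^Tw}{n}||_\infty \leq \frac{\lambda}{2}$, Hölder and the cone decomposition give
$$\frac{1}{2n} ||X\hat{\Delta}||_2^2 \leq \frac{\lambda}{2}||\hat{\Delta}||_1 + \lambda\{||\hat{\Delta}_S||_1 - ||\hat{\Delta}_{S^c}||_1\} \leq \frac{3\lambda}{2}||\hat{\Delta}_S||_1 \leq \frac{3\lambda}{2}\sqrt{k}\,||\hat{\Delta}||_2,$$
where the last step is Cauchy–Schwarz on the $k$ coordinates in $S$.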
Want
$$\frac{1}{n} ||X\Delta||_2^2 \geq \gamma_S ||\Delta||_2^2$$
for some $\gamma_S > 0$ for all $\Delta \in C_3(S)$, because we then get $||\hat{\Delta}||_2^2 \leq c'k\lambda^2/\gamma_S^2$.
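Filling in that step: chaining the two displays above gives
$$\frac{\gamma_S}{2}||\hat{\Delta}||_2^2 \leq \frac{1}{2n} ||X\hat{\Delta}||_2^2 \leq c\lambda \sqrt{k}\, ||\hat{\Delta}||_2 \implies ||\hat{\Delta}||_2 \leq \frac{2c\lambda\sqrt{k}}{\gamma_S} \implies ||\hat{\Delta}||_2^2 \leq \frac{c'k\lambda^2}{\gamma_S^2}, \quad c' = 4c^2.$$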
Unpack: for any $\Delta$,
$$\frac{1}{n} ||X\Delta||_2^2 = \frac 1 n \Delta^TX^TX\Delta = \Delta^T\hat{\Sigma}\Delta = \frac 1 n \sum_{i=1}^n \langle x_i, \Delta\rangle^2$$ where $\hat{\Sigma} = \frac{1}{n}X^TX$ is the sample covariance matrix. It's psd.
$$\min_{||\Delta||_2 = 1} \frac{||X\Delta||_2^2}{n} = \min_{||\Delta||_2 = 1} \Delta^T\hat{\Sigma}\Delta = \lambda_{min}(\hat{\Sigma})$$
Restricted eigenvalue: $\gamma_S = \min_{||\Delta||_2 = 1, \Delta \in C_3(S)} \Delta^T\hat{\Sigma}\Delta \geq \lambda_{min}(\hat{\Sigma})$. (When $d > n$, $\hat{\Sigma}$ is rank-deficient and $\lambda_{min}(\hat{\Sigma}) = 0$, so the unrestricted minimum is useless; restricting to the cone is what can give $\gamma_S > 0$.)
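A rough Monte Carlo illustration of this gap (sampling random cone directions only upper-bounds the minimum, so this is a sanity check, not a computation of $\gamma_S$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 100, 300, 5
X = rng.standard_normal((n, d))
Sigma_hat = X.T @ X / n

# d > n, so the sample covariance is rank-deficient: lambda_min = 0 (up to float error).
print("lambda_min(Sigma_hat):", np.linalg.eigvalsh(Sigma_hat).min())

# Sample random unit vectors in C_3(S) with S = {0, ..., k-1}:
# draw the support freely, rescale the tail so ||delta_Sc||_1 <= 3 ||delta_S||_1,
# then normalize to the unit sphere.
vals = []
for _ in range(1000):
    delta = np.zeros(d)
    delta[:k] = rng.standard_normal(k)
    tail = rng.standard_normal(d - k)
    tail *= rng.uniform() * 3 * np.abs(delta[:k]).sum() / np.abs(tail).sum()
    delta[k:] = tail
    delta /= np.linalg.norm(delta)
    vals.append(delta @ Sigma_hat @ delta)

# Stays bounded away from zero even though lambda_min(Sigma_hat) = 0.
print("min quadratic form over sampled cone directions:", min(vals))
```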
----
**Theorem** (Uniform Law): Say $x_i \sim N(0, \Sigma)$, $i = 1, ..., n$, i.i.d., and define:
- $\hat{\Sigma} = \frac 1 n \sum_{i=1}^n x_ix_i^T$, so that $\Delta^T \hat{\Sigma} \Delta = \frac 1 n \sum_{i=1}^n \langle x_i, \Delta\rangle^2$
- $\Sigma = E[x_1x_1^T]$, so that $\Delta^T\Sigma \Delta = E[\langle x_1, \Delta\rangle^2]$
Then, with high probability:
$$|\Delta^T \hat{\Sigma} \Delta - \Delta^T\Sigma \Delta| \leq c (\max_{j} \Sigma_{jj}) ||\Delta||_1^2 \frac{\log d}{n} \ \ \ \forall \Delta \in R^d$$
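A quick empirical look at the theorem's scaling, with $\Sigma = I$ so that $\max_j \Sigma_{jj} = 1$ (the constant $c$ is not pinned down here; we just check that the deviation-to-bound ratio stays bounded across sparsity levels):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 1000
X = rng.standard_normal((n, d))   # x_i ~ N(0, I_d), so Sigma = I
Sigma_hat = X.T @ X / n

# Random k-sparse test directions for a few sparsity levels.
for k in (1, 5, 20):
    ratios = []
    for _ in range(500):
        delta = np.zeros(d)
        delta[rng.choice(d, size=k, replace=False)] = rng.standard_normal(k)
        deviation = abs(delta @ Sigma_hat @ delta - delta @ delta)
        bound = np.abs(delta).sum() ** 2 * np.log(d) / n
        ratios.append(deviation / bound)
    print(f"k={k}: max deviation / bound = {max(ratios):.2f}")
```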
Special case: $\Sigma = I_{d \times d}$, $X=(x_1, ..., x_n)^T$. Then
$$1 - c \frac{||\Delta||_1^2}{||\Delta||_2^2} \frac{\log d}{n} \leq \frac 1 n \frac{||X\Delta||_2^2}{||\Delta||_2^2} \leq 1 + c \frac{||\Delta||_1^2}{||\Delta||_2^2} \frac{\log d}{n} \ \ \ \forall \Delta \neq 0$$
Think of the deviation term $c \frac{||\Delta||_1^2}{||\Delta||_2^2} \frac{\log d}{n}$ as a small $\epsilon$.
The ratio $\frac{||\Delta||_1^2}{||\Delta||_2^2}$ is not interesting for dense vectors, because it can be as large as $d$.
But for cone vectors it is small: for $\Delta \in C_3(S)$ we have $||\Delta_{S^c}||_1 \leq 3||\Delta_S||_1$, so $||\Delta||_1 \leq 4||\Delta_S||_1 \leq 4\sqrt{k}||\Delta||_2$, giving $\frac{||\Delta||_1^2}{||\Delta||_2^2} \leq 16k$.
So for all $\Delta \in C_3(S)$,
$$1 - 16ck \frac{\log d}{n} \leq \frac 1 n \frac{||X\Delta||_2^2}{||\Delta||_2^2} \leq 1 + 16ck \frac{\log d}{n}$$
In particular, the restricted eigenvalue satisfies $\gamma_S \geq 1 - 16ck\frac{\log d}{n}$, which is a positive constant once $n \gtrsim k \log d$.
## Low-rank Matrix Estimation
$Y = H^* + \frac{\sigma}{\sqrt{n}}W \in R^{d_1 \times d_2}$. $H^*$ is an unknown matrix, which we assume to be low-rank:
$rank(H^*) = r$
> Motivation: PCA, recommender systems
> Aside:
$Y = H^* + \Gamma^* + \frac{\sigma}{\sqrt{n}}W$, where $\Gamma^*$ is row-sparse. This gives robust PCA/SVD.
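A natural baseline estimator in this model (not necessarily the one HW2 asks for) is the best rank-$r$ approximation of $Y$, i.e. truncated SVD, which is optimal in Frobenius norm by Eckart–Young. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
d1, d2, r, n, sigma = 50, 80, 3, 100, 1.0

# Rank-r ground truth H* and the noisy observation Y.
H_star = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))
Y = H_star + sigma / np.sqrt(n) * rng.standard_normal((d1, d2))

# Keep only the top-r singular triples of Y.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
H_hat = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

print("relative Frobenius error:",
      np.linalg.norm(H_hat - H_star) / np.linalg.norm(H_star))
```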
"Soft"-low-rank:
$H = UD(\sigma_1(H)...0\\0 \sigma_2(H), ... \\ ... \\ ... \sigma_{d_1}(H))V^T, d_1 \leq d_2, H \in R^{d_1 * d_2}$
$\sigma(H) = \sigma_1(H), ..., \sigma_{d_1}(H)$ is the vector of singular values.
$rank(H) = r \rightarrow ||\sigma(H)||_0 = r$
$B_q(R) = \{H \in R^{d_1 \times d_2} | \sum_{j=1}^{d_1} |\sigma_j(H)|^q \leq R\}$, an $l_q$-ball on the singular values; the familiar cases (computed in the sketch after this list):
- $q = \infty$: operator norm
- $q = 2$: Frobenius norm
- $q = 1$: nuclear norm
- $q \in (0, 1)$: non-convex
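In code, all of these are functions of the singular-value vector; a small `numpy` sketch, cross-checked against the built-in matrix norms:

```python
import numpy as np

H = np.random.default_rng(4).standard_normal((5, 8))
sv = np.linalg.svd(H, compute_uv=False)   # sigma(H), length d_1 = min(5, 8)

op_norm   = sv.max()                  # q = infinity: operator norm
frob_norm = np.sqrt((sv ** 2).sum())  # q = 2: Frobenius norm
nuc_norm  = sv.sum()                  # q = 1: nuclear norm

# Cross-check against numpy's built-in matrix norms.
assert np.isclose(op_norm,   np.linalg.norm(H, 2))
assert np.isclose(frob_norm, np.linalg.norm(H, 'fro'))
assert np.isclose(nuc_norm,  np.linalg.norm(H, 'nuc'))
```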
> In HW2: Basic inequality: $\mathrm{trace}(\hat{\Delta}^T W) \leq f(\hat{\Delta})\, ||W||_{op}$, where $\hat{\Delta} = \hat{H} - H^*$
$||W||_{op} = \max_j \sigma_j(W)$, the largest singular value
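For scale: when $W$ has i.i.d. $N(0,1)$ entries, $||W||_{op} \approx \sqrt{d_1} + \sqrt{d_2}$ with high probability (a standard Gaussian random matrix fact, and the kind of control the HW2 bound feeds on). A quick check:

```python
import numpy as np

rng = np.random.default_rng(5)
d1, d2 = 200, 500

# ||W||_op = largest singular value, for W with i.i.d. N(0,1) entries.
draws = [np.linalg.norm(rng.standard_normal((d1, d2)), 2) for _ in range(20)]

print("mean ||W||_op over 20 draws:", np.mean(draws))
print("sqrt(d1) + sqrt(d2):        ", np.sqrt(d1) + np.sqrt(d2))
```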