# 02/14 Lecture 3: Sub-exponential variables, Bernstein and Randomized dimensionality reduction

###### tags: `Yan`

[toc]

## 1. Sub-exponential bounds

### A) Motivating application (_Example 2.12_)

Given a set of vectors, or dataset, $\{u^1, ..., u^N\}, u^j \in R^d$, where $N$ is the number of vectors, we construct a mapping $F: R^d \rightarrow R^m$ with $m \ll d$. We then work with $\{F(u^1), ..., F(u^N)\}, F(u^j) \in R^m$, where $m$ is the embedding dimension.

> **Why do we want to reduce the dimension?** Storage and computation become much cheaper, and we can also produce better visualizations.

Ideally, this reduction should preserve some "essential" features of the dataset. For example, we could ask it to preserve relative distances.

**$\delta$-faithful $l_2$ embedding condition**:
$$1-\delta \leq \frac{||F(u^i)-F(u^j)||_2^2}{||u^i-u^j||_2^2} \leq 1+\delta \ \ \ \ \text{ for all pairs } u^i \neq u^j,$$
where $\delta \in (0, 1)$ is the tolerance. In words, the projected dataset preserves all pairwise squared distances up to a multiplicative factor of $1 \pm \delta$.

It turns out that we can construct a random linear mapping with $m > \frac{16}{\delta^2} \log(\frac{N}{\epsilon})$ that satisfies the above property with probability $\geq 1-\epsilon$.

> $m$ depends on $N$ but not on $d$, so the required embedding dimension is independent of the ambient dimension. This mapping might not be ideal because it is oblivious to the characteristics of the dataset itself, unlike other techniques such as PCA, but that also means it is applicable to any kind of dataset.

> Martin: Let's eat our broccoli 🥦 and learn about the bounds now.

### B) Sub-exponential bounds

Recall our definition of **$(\nu, \alpha)$-sub-exponential**: given a R.V. $X$ with $\mathbb{E}[X] = \mu$, we require
$$\mathbb{E}[e^{\lambda(X-\mu)}] \leq e^{\frac{\lambda^2 \nu^2}{2}} \ \ \ \ \forall |\lambda| \leq \frac{1}{\alpha}.$$

**Proposition** (_Proposition 2.9_): For a $(\nu, \alpha)$-sub-exponential variable,
$$\mathbb{P}(X-\mu \geq t) \leq \begin{cases} e^{\frac{-t^2}{2\nu^2}} & t \in [0, \frac{\nu^2}{\alpha}] \\ e^{\frac{-t}{2\alpha}} & t > \frac{\nu^2}{\alpha} \end{cases}$$

_**Proof**_ (skipped in class)

First, we combine the definition of sub-exponential with Chernoff's bound:
> Recall Chernoff's bound: $\mathbb{P}[X - \mu \geq t] \leq e^{-\lambda t} \mathbb{E}[e^{\lambda (X-\mu)}]$.
> The definition of sub-exponential states that $\mathbb{E}[e^{\lambda (X-\mu)}] \leq \exp(\frac{\nu^2 \lambda^2}{2})$.

WLOG, assume $\mu = 0$. Together we have:
$$\mathbb{P}[X\geq t] \leq e^{-\lambda t} \mathbb{E}[e^{\lambda X}] \leq \exp(-\lambda t+\frac{\lambda^2\nu^2}{2}) \text{ for all } \lambda \in [0, \alpha^{-1}).$$

Let $g(\lambda, t)=\exp(-\lambda t+\frac{\lambda^2\nu^2}{2})$; we compute $g^*(t) := \inf_{\lambda \in [0, \alpha^{-1})} g(\lambda, t)$.

1. When unconstrained, the exponent $-\lambda t + \frac{\lambda^2\nu^2}{2}$ is minimized at $\lambda^*=\frac{t}{\nu^2}$, where it equals $-\frac{t^2}{2\nu^2}$, so $g^*(t) = e^{-\frac{t^2}{2\nu^2}}$. This is only valid when $t \in [0, \frac{\nu^2}{\alpha}]$, because we need $\lambda^* \leq \frac 1 \alpha$.
2. Otherwise, we have $t \geq \frac{\nu^2}{\alpha}$. Since the function $g(\cdot, t)$ is monotonically decreasing on the interval $[0, \lambda^*)$, the infimum is approached at the boundary point $\lambda = \alpha^{-1}$, where the exponent is
$$-\frac t \alpha + \frac{\nu^2}{2\alpha^2} \leq -\frac t \alpha + \frac{t}{2\alpha} = -\frac{t}{2\alpha},$$
using that $\frac{\nu^2}{\alpha} \leq t$. Hence $g^*(t) \leq e^{-\frac{t}{2\alpha}}$.

> This class of sub-exponential random variables has less clean tail bounds than the sub-Gaussian class, but it has a lot more power: for sub-Gaussian variables, it is very easy to perform some operations and end up escaping the class.
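To see the two regimes concretely, here is a minimal numerical sketch (my own addition, not from the lecture): it minimizes the Chernoff bound $\exp(-\lambda t + \lambda^2\nu^2/2)$ over $\lambda \in [0, \alpha^{-1})$ on a grid and compares it with the closed-form bound of Proposition 2.9. The values $\nu = 2, \alpha = 4$ are arbitrary illustrative choices; the closed form matches the optimized value exactly in the first regime and upper-bounds it in the second.

```python
import numpy as np

def chernoff_bound_numeric(t, nu, alpha, n_grid=100_000):
    """Numerically minimize exp(-lam*t + lam^2*nu^2/2) over lam in [0, 1/alpha)."""
    lam = np.linspace(0.0, 1.0 / alpha, n_grid, endpoint=False)
    return np.exp(-lam * t + 0.5 * lam**2 * nu**2).min()

def chernoff_bound_closed_form(t, nu, alpha):
    """Two-regime tail bound from Proposition 2.9."""
    if t <= nu**2 / alpha:
        return np.exp(-t**2 / (2 * nu**2))   # Gaussian-type regime
    return np.exp(-t / (2 * alpha))          # exponential-type regime

nu, alpha = 2.0, 4.0   # illustrative parameters; the crossover is at t = nu^2/alpha = 1
for t in [0.5, 1.0, 2.0, 5.0]:
    print(f"t = {t:3.1f}  optimized Chernoff = {chernoff_bound_numeric(t, nu, alpha):.4e}  "
          f"Prop 2.9 bound = {chernoff_bound_closed_form(t, nu, alpha):.4e}")
```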
### C) Preservation of sub-exponential property under summation

**Fact** (left as an exercise by Martin): Given independent $X_i \sim$ sub-exp($\nu_i, \alpha_i$), the sum $\sum_{i=1}^n X_i$ is sub-exp($\sqrt{\sum_{i=1}^n \nu_i^2}$, $\max_{i} \alpha_i$).

> Intuition: the $\nu_i$'s add up like variances (as in the sub-Gaussian case), while $\alpha$ takes the loosest bound.

$\mathbb{P}(\sum_{i=1}^n (X_i-\mu_i) \geq t) \leq e^{\frac{-t^2}{2\nu_*^2}}$ for $t \in [0, \frac{\nu_*^2}{\alpha_*}]$ (optimize the Chernoff bound), or $\leq e^{\frac{-t}{2\alpha_*}}$ for $t > \frac{\nu_*^2}{\alpha_*}$ (boundary point).

_**Proof**_ (_end of p. 28 & beginning of p. 29_)

By independence,
$$\mathbb{E}[e^{\lambda \sum_{i=1}^n (X_i-\mu_i)}] = \mathbb{E}[\prod_{i=1}^n e^{\lambda(X_i-\mu_i)}]= \prod_{i=1}^n \mathbb{E}[e^{\lambda(X_i-\mu_i)}] \leq \prod_{i=1}^n e^{\lambda^2 \nu_i^2/2} = e^{\lambda^2 \sum_{i=1}^n \nu_i^2/2}$$
whenever $|\lambda| < \frac 1 {\alpha_i}$ for every $i$. Since the MGF bound must hold for all $i$ simultaneously, we need $|\lambda| < \min_i \frac 1 {\alpha_i} = \frac{1}{\max_i \alpha_i}$, so we take $\alpha_* = \max_i \alpha_i$.

Therefore, $\sum_{i=1}^n (X_i - \mu_i)$ is sub-exp($\nu_* =\sqrt{\sum_{i=1}^n \nu_i^2}$, $\alpha_* = \max_{i} \alpha_i$).

By *Proposition 2.9 (sub-exponential tail bound)*, we have the upper tail bound *(equation 2.18)*
$$\mathbb{P}\Big(\frac{1}{n} \sum_{i=1}^n (X_i-\mu_i) \geq t\Big) \leq \begin{cases}e^{\frac{-nt^2}{2(\nu_*^2/n)}} & t \in [0,\frac{\nu_*^2}{n\alpha_*}]\\ e^{-\frac{nt}{2\alpha_*}} & t > \frac{\nu_*^2}{n\alpha_*} \end{cases}$$

> Note from Yan: I worked out the proof and got the opposite conclusion for $\alpha$.

**Special case**: i.i.d. with all the same parameters. By the definition of i.i.d. we have $\nu_i = \nu, \alpha_i = \alpha, \mu_i = \mu$, which means that $\nu_{*} = \sqrt{n}\nu, \alpha_* = \alpha$. Setting $t = n\delta$, we plug in to get
$$\mathbb{P}\Big(\sum_{i=1}^n (X_i-\mu_i) \geq n\delta\Big) \leq \begin{cases}e^{-n\delta^2/(2\nu^2)} & \delta \in [0,\nu^2/\alpha]\\ e^{-n\delta/(2\alpha)} & \delta > \nu^2/\alpha \end{cases}$$

**Example 2.11 (p. 29)**: $\chi^2$-variable $Z=\sum_{k=1}^n X_k^2, X_k \sim N(0,1)$ i.i.d., $\mathbb{E}[Z]=n$. Last time we showed $X_k^2 \sim$ sub-exp$(2, 4)$; the sum $Z$ is therefore (by the Fact above) sub-exp$(2\sqrt n, 4)$, so
$$\mathbb{P}\Big(\Big|\frac{1}n \sum_{k=1}^n X_k^2-1\Big| \geq \delta\Big) \leq 2e^{\frac{-n\delta^2}{8}}, \ \delta \in [0, 1] \ \ (*)$$
(will be used later!).

> A high-dimensional Gaussian is like a ball 🏀 with radius $\sqrt n$: almost all of its mass sits on a thin shell with outer radius roughly $\sqrt n(1+\delta)$ and inner radius roughly $\sqrt n(1-\delta)$. The probability of leaving the shell is bounded as above.

> Asked: Is this like the CLT in high dimensions? Answer: somewhat related. The CLT says the sum is approximately Gaussian, while this says the tail decays like a Gaussian tail. The assumptions here are stronger, but the conclusion is non-asymptotic (it holds for every finite $n$).
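Here is a small Monte Carlo sketch (my own illustration, not from the lecture) of the $\chi^2$ bound $(*)$ and the thin-shell picture: it draws standard Gaussian vectors in $R^n$ and compares the empirical frequency of $\{|\frac 1 n ||x||_2^2 - 1| \geq \delta\}$ with $2e^{-n\delta^2/8}$. The values $n = 500$, $\delta = 0.2$, and the number of trials are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta, trials = 500, 0.2, 20_000   # illustrative choices

# Each row is an n-dimensional standard Gaussian vector.
x = rng.standard_normal((trials, n))
sq_norm_avg = (x**2).mean(axis=1)     # (1/n) * ||x||_2^2, i.e. a chi^2_n variable divided by n

empirical = np.mean(np.abs(sq_norm_avg - 1.0) >= delta)
bound = 2 * np.exp(-n * delta**2 / 8)

print(f"empirical P(|chi^2_n/n - 1| >= {delta}) = {empirical:.2e}")
print(f"sub-exponential bound 2*exp(-n*delta^2/8) = {bound:.2e}")
# The norm ||x||_2 concentrates around sqrt(n): the 'thin shell' picture.
print(f"mean ||x||_2 = {np.sqrt((x**2).sum(axis=1)).mean():.2f}, sqrt(n) = {np.sqrt(n):.2f}")
```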
## 2. Bernstein and consequences

### A) **Bernstein's condition**

> **Motivation:** The sub-exponential property can be verified by explicitly computing and bounding the MGF, but this may not be practical in many settings. Bernstein's condition is an alternative route that, in particular, holds whenever $X$ is bounded.

(_Main result: Bernstein's condition, Equation 2.15_) Given a random variable $X$ with $\mu = \mathbb{E}(X)$, $\sigma^2 = \mathrm{Var}(X)$ and some parameter $b$, the Bernstein condition $BC(\sigma, b)$ holds if
$$|\mathbb{E}[(X-\mu)^k]| \leq \frac 1 2 k!\,\sigma^2 b^{k-2}, \ \ k=2,3,4,\ldots \ \ \ (\text{equation } 2.15)$$

**Idea**: If $|X-\mu|\leq b$, then it is straightforward to verify the above holds:
$$|\mathbb{E}[(X-\mu)^k]| \leq \mathbb{E}[|X-\mu|^{k-2}|X-\mu|^2] \leq b^{k-2}\sigma^2 \leq \frac 1 2 k!\,\sigma^2 b^{k-2},$$
where the last step uses $\frac{k!}{2} \geq 1$ for $k \geq 2$.

Bernstein's condition is linked to MGFs through the expansion $\mathbb{E}[e^{\lambda(X-\mu)}] = \sum_{k=0}^\infty \frac{\lambda^k\mathbb{E}[(X-\mu)^k]}{k!}$.
> The condition basically requires the moments not to grow too fast, so that the MGF does not blow up.

### B) Bernstein-type bound

**Main result: Proposition 2.10** (Bernstein-type bound): For any R.V. satisfying the Bernstein condition (equation 2.15), $BC(\sigma, b)$, we have *(equation 2.17a)*
$$\mathbb{E}[e^{\lambda (X-\mu)}] \leq e^{\frac{\lambda^2 \sigma^2/2}{1-b|\lambda|}} \ \ \ \ \text{for all } |\lambda|<\frac 1 b,$$
and, moreover, the concentration inequality *(equation 2.17b)*
$$\mathbb{P}(X-\mu \geq t) \leq \exp\Big(-\frac{t^2}{2(\sigma^2+bt)}\Big).$$

> $\big(\leq \max\{e^{\frac{-t^2}{2\sigma^2}}, e^{\frac{-t}{2b}}\}\big)$
>
> _He erased the () part and made it two regions._
> _Sasha said the max is correct._
> _Martin thinks it is spiritually correct but is not sure about the details._

> This makes $X$ a special case of a sub-exponential variable.
> One can also apply this to variables with unbounded support by truncating the tails, provided the tails vanish quickly.

### C) **Remark**

_Q: Why is Bernstein better than Hoeffding, if a bounded variable, i.e. $|X-\mu| \leq b$, is also sub-Gaussian with parameter $b$?_

**Example of a _bounded variable_:** Given $X$ with $\mathbb{P}(X=1) = \mu, \mathbb{P}(X=0) = 1-\mu$, we have $\mu \in [0, 1]$ and $\sigma^2=\mu(1-\mu)$, which is maximized at $\mu=\frac 1 2$, so $\sigma^2 \leq \frac 1 4$. Let $b=1 \ \ (|X-\mu| \leq 1)$.

**Hoeffding**: $\mathbb{P}(|X-\mu| \geq t) \leq 2e^{-2t^2}$, treating $X$ as sub-Gaussian($\frac 1 2$) (this does not capture the finer details).

**Bernstein**: $\mathbb{P}(|X-\mu| \geq t) \leq 2e^{\frac{-t^2}{2\mu(1-\mu)+2t}}$. As $\mu \rightarrow 0$ or $\mu \rightarrow 1$, this bound becomes very tight, because it retains the variance information.

> With Bernstein-type bounds, quantities often converge with a combination of a slow rate and a fast rate. The slow rate usually contains the variance information in the denominator of the exponent. When the variance is small, the slow-rate part dominates and the bound is tighter.
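As a quick numerical check of the remark above (my own sketch, not from the lecture), the snippet below compares the exponents in the two bounds for a Bernoulli($\mu$) variable with small $\mu$; a larger exponent means a tighter tail bound. The choices of $\mu$ and $t$ are arbitrary.

```python
mu = 0.01                       # illustrative choice: small mean => small variance
sigma2 = mu * (1 - mu)          # Var(X) for X ~ Bernoulli(mu)
b = 1.0                         # |X - mu| <= 1

for t in [0.05, 0.1, 0.2, 0.3]:
    hoeffding_exp = 2 * t**2                          # from 2*exp(-2 t^2), sub-Gaussian(1/2)
    bernstein_exp = t**2 / (2 * sigma2 + 2 * b * t)   # from 2*exp(-t^2 / (2 sigma^2 + 2 b t))
    winner = "Bernstein" if bernstein_exp > hoeffding_exp else "Hoeffding"
    print(f"t={t:4.2f}  Hoeffding exponent={hoeffding_exp:.4f}  "
          f"Bernstein exponent={bernstein_exp:.4f}  tighter: {winner}")
# With b = 1, Bernstein wins exactly when t < 1/4 - mu*(1 - mu), i.e. in the
# small-deviation regime, and the advantage grows as the variance shrinks.
```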
## 3. Randomized dimensionality reduction *(Example 2.12)*

**Def**: a random matrix $X \in R^{m\times d}$, filled with i.i.d. entries $X_{ij} \sim N(0, 1)$. ($d$ = ambient dimension, $m$ = embedding dimension, $m \ll d$.)

The idea of dimensionality reduction is to construct a linear mapping $F:R^d\rightarrow R^m$,
$$F(u)=\frac{Xu}{\sqrt m}$$
for a data vector $u \in R^d$, so that some "essential" features of the dataset are preserved; in this case, the pairwise distances (or, equivalently, norms and inner products).

> _$X$ is data oblivious, meaning that its value does not depend on the actual data._

> **Why rescale by $\sqrt{m}$?** $\mathbb{E}[||F(u)||_2^2] = u^T \mathbb{E}[\frac{X^TX}{m}]u=||u||_2^2$, so squared norms are preserved in expectation.

Now we check that this mapping satisfies the $\delta$-faithful $l_2$ embedding condition. Denote $\Delta=\frac{u^i-u^j}{||u^i-u^j||_2}$ and let $x_k^T$ be the $k$-th row of $X$; then
$$\frac{||F(u^i)-F(u^j)||_2^2}{||u^i-u^j||_2^2} = \frac{1}{m}\Big|\Big|X\Big(\frac{u^i-u^j}{||u^i-u^j||_2}\Big)\Big|\Big|^2_2=\frac{1}{m}\sum_{k=1}^m(x_k^T\Delta)^2.$$

This is a (rescaled) $\chi^2$-variable, because $x_k^T\Delta \sim N(0, 1)$ i.i.d. across rows (since $||\Delta||_2 = 1$ and the $X_{ij}$ are i.i.d.). Hence
$$\mathbb{P}\Big(\Big|\frac 1 m \sum_{k=1}^m (x_k^T\Delta)^2 - 1\Big| \geq \delta\Big) \leq 2e^{-m\delta^2/8} \ \ \ \forall \delta \in (0,1)$$
from the $\chi^2$ tail bound $(*)$ we derived before.

Note that we are preserving pairwise distances, so there are ${N \choose 2}$ distinct pairs of data points. Applying the union bound over all pairs, we conclude
$$\mathbb{P}( \text{not }\delta\text{-faithful uniformly}) \leq 2{N \choose 2} e^{\frac{-m\delta^2}{8}} \ \ \ \text{for all } \delta \in (0, 1).$$
Since $2{N \choose 2} \leq N^2$ and $\epsilon \leq 1$, this probability can be driven below $\epsilon$ by choosing $m > \frac{16}{\delta^2} \log(N/\epsilon)$.
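To make this concrete, here is a small simulation sketch (my own illustration; the dataset is just random Gaussian vectors, and all numbers are arbitrary choices): it draws the projection matrix $X$, forms $F(u) = Xu/\sqrt m$ with $m$ chosen according to the bound above, and reports the worst pairwise distortion over all ${N \choose 2}$ pairs.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

d, N = 2_000, 50                  # ambient dimension and number of data points (arbitrary)
delta, eps = 0.5, 0.01
m = int(np.ceil(16 / delta**2 * np.log(N / eps)))   # embedding dimension from the bound above

U = rng.standard_normal((N, d))   # placeholder dataset; any fixed vectors would do
X = rng.standard_normal((m, d))   # data-oblivious Gaussian projection matrix
F = U @ X.T / np.sqrt(m)          # row j holds F(u^j) = X u^j / sqrt(m)

worst = max(
    abs(np.sum((F[i] - F[j])**2) / np.sum((U[i] - U[j])**2) - 1.0)
    for i, j in combinations(range(N), 2)
)
print(f"m = {m}, worst pairwise distortion = {worst:.3f} (tolerance delta = {delta})")
```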