# 02/16 Lecture 4: Tail bounds for more general functions

###### tags: `Yan`

[toc]

## 1. Tail bounds on general functions (_not found in textbook_)

So far, we have talked about sums of independent random variables,
$$f(X_1, \dots, X_n) = \frac 1 n\sum_{i=1}^n X_i,$$
where the $X_i$ are independent and $f$ takes the (averaged) sum.

**<u>Question</u>: Can we move to more general functions $f$?**

**Example**:
$$f(X_1, \dots, X_n) = \max_{i=1, \dots, n} X_i, \qquad X_i \sim N(0, \sigma^2)$$
> In homework #1, we need to show $\mathbb{E}[f(X)] \leq \sqrt{2\sigma^2 \log n}$.
> In addition, $f(X)-\mathbb{E}[f(X)]$ is sub-Gaussian($\sigma$), which means that the tail behaves like $N(0, \sigma^2)$.

**Example**: *(Rademacher and Gaussian complexity)* Let $A \subset \mathbb{R}^n$ be bounded.
> We measure the size of the set by randomness.

$$Z(A) = \sup_{a \in A} \sum_{i=1}^n \epsilon_i a_i,$$
where the $\epsilon_i$ are i.i.d. with $P(\epsilon_i = +1)=P(\epsilon_i = -1)=\frac 1 2$. The Rademacher complexity is $R(A)=\mathbb{E}[Z(A)]$.

## 2. Bounded differences inequality (_Section 2.2, Corollary 2.21_)

> Azuma-Hoeffding (dependent variables and general martingale structure) &rarr; (independence + sums) Hoeffding &rarr; (bounded difference property) bounded differences inequality
> Martin is not covering Azuma-Hoeffding during lecture, but he recommended reading about it. It's in textbook Corollary 2.20.

Given $X = (X_1, \dots, X_n)$, define $\tilde{X}^k:=(X_1, \dots, X_{k-1}, \tilde{X}_k, X_{k+1}, \dots, X_n)$.
> Note from Yan: basically, one of the entries of the vector, $X_k$, is changed into $\tilde{X}_k$.

With this notation, $f: \mathbb{R}^n \rightarrow \mathbb{R}$ satisfies the **bounded difference property** with parameters $(L_1, \dots, L_n)$ if, for each index $k=1, 2, \dots, n$,
$$|f(X)-f(\tilde{X}^k)| \leq L_k \ \ \ \ \ \forall X, \tilde{X} \in \mathbb{R}^n.$$

Bounded differences for $f(X_1, \dots, X_n) = \max_{i=1, \dots, n} X_i$: if $|X_k| \leq b_k$, then $L_k = 2b_k$.
> Note from Yan: intuitively, the worst case is when $X_1, \dots, X_{k-1}, X_{k+1}, \dots, X_n = -b_k$ and $X_k$ is changed from $b_k$ to $-b_k$.

**Corollary 2.21 (Bounded Differences Inequality)**: If $f$ satisfies the bounded difference property with parameters $(L_1, \dots, L_n)$ and the random vector $X = (X_1, \dots, X_n)$ has independent components, then
$$P(|f(X)-\mathbb{E}[f(X)]| \geq t) \leq 2e^{-\frac{2t^2}{\sum_{k=1}^n L_k^2}} \ \ \ \forall t \geq 0.$$
> So $f(X)$ behaves like a sub-Gaussian with parameter $\frac{\sqrt{\sum_{k=1}^n L_k^2}}{2}$: $f$ should not depend on any one coordinate too strongly.
> Note from Yan: he didn't prove it during class, probably because the proof involves martingale difference sequences, which were not covered in class.

**Application in graph theory** (_textbook Example 2.24_): clique number in a random graph. (*Notes from Yan: including some details here because I'm interested.*)

In the *Erdos-Renyi* ensemble of random graphs, each of the ${d \choose 2}$ possible edges is included in the graph independently with probability $p$, where $d$ is the size of the vertex set $V = \{1, \dots, d\}$. Define the clique number $C(G)$ to be the size of the largest clique in $G$. If $G$ and $G'$ are two graphs that differ in only one edge, it is easy to see that $C(G)$ and $C(G')$ can differ by at most 1. So $C$ satisfies the bounded difference property with $L_k = 1$ over the $n = {d \choose 2}$ edge variables, and
$$\mathbb{P}\Big[\frac 1 n |C(G)-\mathbb{E}[C(G)]| \geq \delta\Big] \leq 2e^{-2n\delta^2}.$$
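A quick way to get a feel for this concentration is to simulate it. The sketch below is not from the lecture: it assumes `networkx` for generating Erdos-Renyi graphs and enumerating maximal cliques, and the parameter choices ($d = 25$, $p = 0.5$, 200 trials) are arbitrary.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
d, p, trials = 25, 0.5, 200
n_edges = d * (d - 1) // 2   # n = (d choose 2) independent edge indicators

def clique_number(G):
    # size of the largest clique, via enumeration of maximal cliques
    return max(len(c) for c in nx.find_cliques(G))

samples = np.array([
    clique_number(nx.erdos_renyi_graph(d, p, seed=int(s)))
    for s in rng.integers(0, 2**31 - 1, size=trials)
])

# Flipping a single edge changes C(G) by at most 1, so the bounded differences
# inequality applies with L_k = 1 for each of the n_edges coordinates.
print("mean clique number:", samples.mean())
print("std of clique number:", samples.std())
print("P(|C - E[C]| >= 2):", np.mean(np.abs(samples - samples.mean()) >= 2))
# Theoretical bound at t = 2 (vacuous at this small scale, but it shrinks as d grows):
print("BD bound 2*exp(-2*t^2/n) at t=2:", 2 * np.exp(-2 * 2**2 / n_edges))
```

At this small scale the bound $2e^{-2t^2/n}$ is quite loose, but the simulation already shows that $C(G)$ rarely strays more than a unit or two from its mean.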
**Example** (Rademacher complexity concentration): Let $\{\epsilon_k\}_{k=1}^n$ be an i.i.d. sequence of Rademacher variables.
> Notes from Yan: Rademacher variables take values $-1$ or $+1$, each with probability $1/2$.

Given a collection of vectors $A \subset \mathbb{R}^n$, define the function
$$f(\epsilon_1, \dots, \epsilon_n) = \sup_{a \in A} \sum_{i=1}^n \epsilon_i a_i.$$
The expectation $R(A) := \mathbb{E}[f]$ is the **Rademacher complexity** of the set $A$. We claim that such an $f$ satisfies the bounded difference property. Consider the difference
$$f(\epsilon)-f(\tilde{\epsilon}^k) = \sup_{a \in A} \sum_{i=1}^n \epsilon_i a_i-\sup_{a \in A} \sum_{i=1}^n \tilde{\epsilon}_i a_i,$$
where $\tilde{\epsilon}_i = \epsilon_i \ \ \forall i \neq k$. Let $a^* \in A$ attain the first supremum, i.e. $\sup_{a \in A} \sum_{i=1}^n \epsilon_i a_i = \epsilon^T a^*$. Then
$$f(\epsilon)-f(\tilde{\epsilon}^k) \leq \epsilon^Ta^*-\tilde{\epsilon}^Ta^* = (\epsilon_k - \tilde{\epsilon}_k)a_k^* \leq 2\sup_{a \in A}|a_k|=L_k.$$

> Note from Yan: here are some notations that I tried to decipher. Hope they clear up some of the confusion.
- $\epsilon$ vs. $\epsilon_i$: $\epsilon_i$ is an individual Rademacher variable and $\epsilon$ is the vector containing all these $\epsilon_i$.
- $\tilde{\epsilon}^k$: the modified vector from the bounded difference notation; it agrees with $\epsilon$ except in coordinate $k$, and $\epsilon_k$, $\tilde{\epsilon}_k$ are the $k$th coordinates of $\epsilon$ and $\tilde{\epsilon}^k$.
- $a$ vs. $a_i$ vs. $a^*$ vs. $a_k^*$: $a$ is a vector drawn from the set $A$, $a_i$ is its $i$th coordinate, $a^*$ is the vector in $A$ attaining the supremum, and $a_k^*$ is the $k$th coordinate of $a^*$.

**Example** (U-statistics of order 2) (_textbook Example 2.23_): A **pairwise U-statistic** is defined as
$$U=\frac{1}{{n \choose 2}} \sum_{i < j} g(X_i, X_j)$$
for some symmetric $g: \mathbb{R}^2 \rightarrow \mathbb{R}$, i.e. $g(s, t)=g(t, s)$.
> Notes from Yan: intuitively, I think of it as the average distance between pairs of points under some distance metric $g$.

$$\mathbb{E}[U]=\mathbb{E}_{X_1, X_2}[g(X_1, X_2)]$$
> For instance, if $g(s, t) = |s-t|$, then $\mathbb{E}[U] = \mathbb{E}_{X_1, X_2}[|X_1-X_2|]$.

If we impose the condition that $\|g\|_\infty \leq b$, we obtain the bounded difference property with $L_k=\frac{4b}{n}$. (_Textbook Example 2.23_)

## 3. Lipschitz functions of Gaussian variables *(Section 2.3)*

> Here we present the classical result on the concentration properties of Lipschitz functions of Gaussian variables. These functions exhibit a particularly attractive form of *dimension-free* concentration.

**Def**: A function $f:\mathbb{R}^n \rightarrow \mathbb{R}$ is **L-Lipschitz** w.r.t. some metric $g$ if
$$|f(X)-f(\tilde{X})| \leq L\,g(X, \tilde{X}).$$
If we use the Euclidean norm $g(X, \tilde{X}) = \|X-\tilde{X}\|_2$, the following result holds.

**Theorem**: Suppose $f:\mathbb{R}^n \rightarrow \mathbb{R}$ is L-Lipschitz in $\ell_2$, i.e. $|f(X) - f(\tilde{X})| \leq L\|X-\tilde{X}\|_2 \ \ \ \forall X, \tilde{X} \in \mathbb{R}^n$. Then, for $X_i \sim N(0, 1)$ i.i.d.,
$$\mathbb{E}[e^{\lambda(f(X)-\mathbb{E}[f(X)])}] \leq e^{\frac{\lambda^2 L^2}{2}}.$$
> This is the MGF of a Gaussian with variance $L^2$, so the tails behave like $N(0, L^2)$.

**Examples** of Lipschitz functions:
* (a) $f(X_1, \dots, X_n) = \max_{i=1, \dots, n} X_i$ is 1-Lipschitz, since $|\max_i X_i - \max_i \tilde{X}_i| \leq \max_i |X_i - \tilde{X}_i| \leq \|X-\tilde{X}\|_2$.
$$\mathbb{E}[f(X)] \approx \sqrt{2\sigma^2\log n}$$
* (b) $f(X_1, \dots, X_n) = \|X\|_2$ is also a 1-Lipschitz function (by the triangle inequality), so
$$P\big(\big|\|X\|_2-\mathbb{E}[\|X\|_2]\big| \geq t\big) \leq 2e^{\frac{-t^2}{2}} \ \ \ \forall t \geq 0$$
(see the numerical sketch below).
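To see the dimension-free concentration in example (b) numerically, here is a minimal numpy sketch (the sample sizes and dimensions are my own choices, not from the lecture): as $n$ grows, $\mathbb{E}\|X\|_2$ grows like $\sqrt{n}$, but the fluctuations of $\|X\|_2$ around its mean stay $O(1)$.

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 2000

for n in [10, 100, 1000, 5000]:
    X = rng.standard_normal((trials, n))   # each row is one draw of X ~ N(0, I_n)
    norms = np.linalg.norm(X, axis=1)      # f(X) = ||X||_2
    dev = np.abs(norms - norms.mean())
    # f is 1-Lipschitz, so the fluctuations around the mean stay O(1) in every
    # dimension, even though the mean itself grows like sqrt(n).
    print(f"n={n:5d}  E||X||_2 ~ {norms.mean():7.2f}  std ~ {norms.std():.3f}  "
          f"P(|f - Ef| >= 2) ~ {np.mean(dev >= 2):.4f}")
```

For comparison, the tail bound with $t = 2$ gives $2e^{-2} \approx 0.27$ for every $n$; the empirical frequency is much smaller since the bound is not tight, but the key point is that neither depends on $n$.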
**Example** (Singular values of random matrices) (_Textbook Example 2.32_): Let $X \in \mathbb{R}^{m \times d}$ with $X_{ij} \sim N(0, 1)$ i.i.d., and set $n=md$. Consider
$$f(X) = \frac{|||X|||_2}{\sqrt{m}} = \max_{\|u\|_2 = 1,\ u \in \mathbb{R}^d} \frac{\|Xu\|_2} {\sqrt{m}},$$
> i.e. the maximum singular value of $\frac{X}{\sqrt{m}}$.

What's the Lipschitz constant? It is $\frac 1 {\sqrt{m}}$ (in $\ell_2$ over the $n = md$ entries).

_Proof_: By the (reverse) triangle inequality,
$$f(X) - f(\tilde{X}) = \frac {1}{\sqrt{m}} \big(|||X|||_2 - |||\tilde{X}|||_2\big) \leq \frac {1}{\sqrt{m}} |||X-\tilde{X}|||_2.$$
> Need to study: operator norm, nuclear norm, Frobenius norm.

We would like to show that
$$f(X) - f(\tilde{X}) \leq \frac {1}{\sqrt{m}} \sqrt{\sum_{i=1}^m\sum_{j=1}^d (X_{ij} - \tilde{X}_{ij})^2} = \frac{1}{\sqrt{m}}\,|||X-\tilde{X}|||_F,$$
where $|||\cdot|||_F$ is the Frobenius norm.
> Left as an exercise: $|||X-\tilde{X}|||_2 \leq |||X-\tilde{X}|||_F$.

The Gaussian Lipschitz concentration theorem then gives
$$P\left(\frac{|||X|||_2}{\sqrt{m}} \geq \frac{\mathbb{E}[|||X|||_2]}{\sqrt{m}} +t\right) \leq e^{\frac{-mt^2}{2}} \ \ \ \forall t \geq 0.$$

Why rescale by $\sqrt{m}$? Think of the sample covariance $\hat{\Sigma} = \frac{X^TX}{m}$; with this rescaling, the behavior is very close to Gaussian.
> Q: What's the expectation? A: $\mathbb{E}\big[\frac{|||X|||_2}{\sqrt{m}}\big] \leq 1 + \sqrt{\frac d m}$ (the aspect ratio).

He strongly recommends trying this out in numpy: first develop the concentration, then find the expectation.

Here $X$ has rows $X_1^T, \dots, X_m^T$ with $X_j \sim N(0, I_d)$, and
$$\hat{\Sigma} = \frac 1 m \sum_{i=1}^m X_iX_i^T.$$

Next time he wants to do more proofs.
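Following up on the numpy suggestion above, here is a minimal sketch (the choices $m = 500$, $d = 100$, and 200 trials are mine) that checks both the concentration of $\frac{|||X|||_2}{\sqrt{m}}$ across draws and the expectation bound $1 + \sqrt{d/m}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, trials = 500, 100, 200

# For each draw of X with i.i.d. N(0,1) entries, compute the largest singular
# value of X / sqrt(m); np.linalg.norm(..., ord=2) is the operator norm |||X|||_2.
vals = np.array([
    np.linalg.norm(rng.standard_normal((m, d)), ord=2) / np.sqrt(m)
    for _ in range(trials)
])

print(f"empirical mean of |||X|||_2/sqrt(m): {vals.mean():.3f}")
print(f"expectation bound 1 + sqrt(d/m):     {1 + np.sqrt(d / m):.3f}")
print(f"empirical std: {vals.std():.4f}   (sub-Gaussian parameter 1/sqrt(m) = {1 / np.sqrt(m):.4f})")
```

The empirical mean should sit below the aspect-ratio bound, and the spread across draws should be on the order of $1/\sqrt{m}$, matching the Lipschitz constant found above.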