# 02/16 Lecture 4: Tail bounds for more general functions
[toc]
## Thus far
We have talked about sums of independent random variables
$$f(X_1, ..., X_n) = \frac 1 n\sum_{i=1}^n X_i$$
where $X_i$ are independent.
## Question: Can we move to more general functions f?
(_not found in textbook_)
**Example**: $f(X_1, ..., X_n) = \max_{i=1, ...,n} X_i$, where $X_i \sim N(0, \sigma^2)$.
In Homework #1, we need to show $\mathbb{E}[f(X)] \leq \sqrt{2\sigma^2 \log n}$.
In addition, $f(X)-\mathbb{E}[f(X)]$ is sub-Gaussian with parameter $\sigma$, which means that its tails behave like those of $N(0, \sigma^2)$.
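A quick numpy sanity check (my own addition, not from the lecture or homework): simulate the maximum of $n$ i.i.d. $N(0, \sigma^2)$ draws, compare its average to $\sqrt{2\sigma^2 \log n}$, and note that its standard deviation stays of order $\sigma$, consistent with the sub-Gaussian($\sigma$) claim.
```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, trials = 10_000, 2.0, 500

# Each row is one realization of (X_1, ..., X_n); take the max of each row.
maxes = rng.normal(0.0, sigma, size=(trials, n)).max(axis=1)

print("empirical E[max]      :", maxes.mean())
print("sqrt(2 sigma^2 log n) :", np.sqrt(2 * sigma**2 * np.log(n)))
print("std of max (stays O(sigma), does not grow with n):", maxes.std())
```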
**Example**: (Rademacher and Gaussian complexity)
$A \subset \mathbb{R}^n$ bounded
> measure the size of a set by randomness
$Z(A) = \sup_{a \in A} \sum_{i=1}^n \epsilon_i a_i$, where the $\epsilon_i$ are i.i.d. with $P(\epsilon_i = +1)=P(\epsilon_i = -1)=\frac 1 2$.
Rademacher complexity is $R(A)=\mathbb{E}[Z(A)]$.
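A minimal Monte Carlo sketch (my own illustration, using an arbitrary toy set $A$): estimate $R(A)$ by averaging $Z(A)$ over many draws of the sign vector.
```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 50, 5_000

# Toy choice of A: a finite set of 8 fixed vectors in R^n (the rows of this matrix).
A = rng.normal(size=(8, n))

eps = rng.choice([-1.0, 1.0], size=(trials, n))  # i.i.d. Rademacher signs
Z = (eps @ A.T).max(axis=1)                      # Z(A) = sup_{a in A} <eps, a>
print("Monte Carlo estimate of R(A):", Z.mean())
```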
## Bounded differences inequality
(_2.2 second half_)
> Azuma-Hoeffding (dependent variables and general martingale structure)
> → (independence + sums) Hoeffding
> → (bounded difference property) bounded differences inequality
> Martin is not covering Azuma-Hoeffding during lecture, but he recommended reading about it.
Given $X = (X_1, ..., X_n)$, define $\tilde{X}^k=(X_1, ..., X_{k-1}, \tilde{X}_k, X_{k+1}, ..., X_n)$.
> Note from Yan: Basically, only the $k$-th entry of the vector is changed.
With this notation, $f: \mathbb{R}^n \rightarrow \mathbb{R}$ satisfies the **bounded differences property** with constants $(L_1, ..., L_n)$ if, for each index $k=1, 2, ..., n$,
$$|f(X)-f(\tilde{X^k})| \leq L_k \ \ \ \ \ \forall X, \tilde{X} \in \mathbb{R}^n$$
Bounded differences for $f(X_1, ..., X_n) = \max_{i=1, ...,n} X_i$: if $|X_k| \leq b_k$, we can take $L_k = 2b_k$.
> Note from Yan: Intuitively, the worst case is when $X_1, ..., X_{k-1}, X_{k+1}, ..., X_n = -b_k$ and $X_k$ is changed from $b_k$ to $-b_k$, so the maximum changes by $2b_k$.
**Proposition**:
For $X_k$ independent, if $f$ satisfies the bounded differences property with constants $L_k$, $k=1, ..., n$, then
$$P(|f(X)-\mathbb{E}[f(X)]| \geq t) \leq 2e^{-\frac{2t^2}{\sum_{k=1}^n L_k^2}} \ \ \ \forall t \geq 0$$
> Behaving like a sub-Gaussian with parameter $\frac{1}{2}\sqrt{\sum_{k=1}^n L_k^2}$: $f$ must not depend on any one coordinate too strongly.
> Note from Yan: He didn't prove it during class, probably because the proof involves martingale difference sequences, which were not covered.
> Application in graph theory (_textbook 2.24_): clique number in random graph.
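As a concrete check of the proposition (my own sketch, not from lecture): take $f$ to be the sample mean of bounded i.i.d. variables, for which $L_k = \frac{2b}{n}$, and compare the empirical two-sided tail with the bound $2e^{-2t^2/\sum_k L_k^2}$; this recovers Hoeffding, as in the chain above.
```python
import numpy as np

rng = np.random.default_rng(2)
n, b, trials = 200, 1.0, 50_000

X = rng.uniform(-b, b, size=(trials, n))   # |X_k| <= b
f = X.mean(axis=1)                         # f(X_1, ..., X_n) = sample mean
L = 2 * b / n                              # bounded-difference constant L_k

for t in [0.05, 0.10, 0.15]:
    empirical = np.mean(np.abs(f - f.mean()) >= t)
    bound = 2 * np.exp(-2 * t**2 / (n * L**2))
    print(f"t={t:.2f}  empirical tail={empirical:.4f}  BD bound={bound:.4f}")
```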
**Example** (Rademacher complexity concentration)
Let $\{\epsilon_k\}_{k=1}^n$ be an i.i.d. sequence of Rademacher variables.
> Notes from Yan: Rademacher variables take values -1 or +1 with probability 1/2.
Given a collection of vectors $A \subset \mathbb{R}^n$, define the function
$$f(\epsilon_1, ..., \epsilon_n) = \sup_{a \in A} \sum_{i=1}^n \epsilon_i a_i$$
The expectation $R(A) := \mathbb{E}[f]$ is the **Rademacher complexity** of the set $A$.
Then, we claim that such an $f$ satisfies the bounded differences property. Consider the following difference:
$$f(\epsilon)-f(\tilde{\epsilon}^k) = \sup_{a \in A} \sum_{i=1}^n \epsilon_i a_i - \sup_{a \in A} \sum_{i=1}^n \tilde{\epsilon}_i a_i$$
where $\tilde{\epsilon}_i = \epsilon_i$ for all $i \neq k$.
Let $a^* \in A$ attain the supremum: $\sup_{a \in A} \sum_{i=1}^n \epsilon_i a_i = \epsilon^T a^*$.
Then,
$$f(\epsilon)-f(\tilde{\epsilon}^k) = \sup_{a \in A} \sum_{i=1}^n \epsilon_i a_i - \sup_{a \in A} \sum_{i=1}^n \tilde{\epsilon}_i a_i \\ \leq \epsilon^T a^* - \tilde{\epsilon}^T a^* = (\epsilon_k - \tilde{\epsilon}_k)a_k^* \leq 2\sup_{a \in A}|a_k| = L_k$$
> Note from Yan: Here are some notations that I tried to decipher. Hope they could clear some confusions.
- $\epsilon$ vs. $\epsilon_i$: $\epsilon_i$ is an individual Rademacher variable and $\epsilon$ is the vector containing all these $\epsilon_i$.
- $\tilde{\epsilon}^k$ vs. $\epsilon_k$ and $\tilde{\epsilon}_k$: $\tilde{\epsilon}^k$ is the vector $\epsilon$ with its $k$-th coordinate replaced (the bounded differences notation), while $\epsilon_k$ and $\tilde{\epsilon}_k$ are the $k$-th coordinates of $\epsilon$ and $\tilde{\epsilon}^k$.
- $a^*$ vs. $a_i$ vs. $a_k^*$: $a$ is a vector in the set $A$ with coordinates $a_i$; $a^*$ is the vector in $A$ that attains the supremum, and $a_k^*$ is its $k$-th coordinate.
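A brief numerical check of this bound (my own code, with an arbitrary finite set $A$): flipping a single sign $\epsilon_k$ never changes $f(\epsilon)$ by more than $2\sup_{a \in A}|a_k|$.
```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
A = rng.normal(size=(10, n))          # rows are the vectors a in A

def f(eps):
    return (A @ eps).max()            # sup over the finite set A of <eps, a>

eps = rng.choice([-1.0, 1.0], size=n)
for k in range(n):
    eps_tilde = eps.copy()
    eps_tilde[k] = -eps_tilde[k]      # change only the k-th coordinate
    L_k = 2 * np.abs(A[:, k]).max()   # 2 * sup_{a in A} |a_k|
    assert abs(f(eps) - f(eps_tilde)) <= L_k + 1e-12
print("bounded differences verified for every coordinate k")
```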
**Example**: (U-statistics of order 2)
(_textbook 2.23_)
A **pairwise U-statistic** is defined as $$U=\frac{1}{\binom{n}{2}} \sum_{i=1}^n \sum_{j < i} g(X_i, X_j)$$
for some symmetric $g: \mathbb{R}^2 \rightarrow \mathbb{R}$ with $g(s, t)=g(t, s)$.
$$\mathbb{E}[U]=\mathbb{E}_{X_1, X_2}[g(X_1, X_2)]$$
> For instance, if $g(s, t) = |s-t|$, then $E[U] = E_{X_1, X_2}[|X_1-X_2|]$
If we impose the condition that $|g|_\infty \leq b$, we obtain the bounded differences property with $L_k=\frac{4b}{n}$. (_Textbook Example 2.23_)
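A small numerical illustration (my own, not the textbook's): with $g(s,t)=|s-t|$ and $X_i \sim \mathrm{Uniform}(0,1)$, so that $b=1$ and $\mathbb{E}[U] = \mathbb{E}|X_1 - X_2| = \frac 1 3$, the U-statistic concentrates tightly around $\frac 1 3$.
```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 100, 2_000

def pairwise_U(x):
    # average of g(x_i, x_j) = |x_i - x_j| over all n-choose-2 pairs
    diffs = np.abs(x[:, None] - x[None, :])
    iu = np.triu_indices(len(x), k=1)
    return diffs[iu].mean()

U = np.array([pairwise_U(rng.uniform(0, 1, size=n)) for _ in range(trials)])
print("mean of U:", U.mean(), " (E[U] = 1/3)")
print("std of U :", U.std(), " (small, i.e. U concentrates)")
```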
## Lipschitz functions of Gaussian variables
**Def**: $f:\mathbb{R}^n \rightarrow \mathbb{R}$ is **$L$-Lipschitz** w.r.t. a metric $g$ if
$$|f(X)-f(\tilde{X})| \leq L\, g(X, \tilde{X}) \ \ \ \ \ \forall X, \tilde{X} \in \mathbb{R}^n$$
If we use the Euclidean norm $g(X, \tilde{X}) = |X-\tilde{X}|_2$, the following result holds:
**Theorem**: Suppose $f:\mathbb{R}^n \rightarrow \mathbb{R}$ is $L$-Lipschitz in $\ell_2$: $|f(X) - f(\tilde{X})| \leq L|X-\tilde{X}|_2 \ \ \ \forall X, \tilde{X} \in \mathbb{R}^n$. Then, for $X_i \sim N(0, 1)$ i.i.d., we have
$$\mathbb{E}[e^{\lambda(f(X)-\mathbb{E}[f(X)])}] \leq e^{\frac{\lambda^2 L^2}{2}} \ \ \ \forall \lambda \in \mathbb{R}$$
> This is the MGF bound of a Gaussian with variance $L^2$, which governs the tail behavior.
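Filling in a standard Chernoff step that was not written out in lecture: the MGF bound gives, for any $t \geq 0$,
$$P\big(f(X)-\mathbb{E}[f(X)] \geq t\big) \leq \inf_{\lambda > 0} e^{-\lambda t}\,\mathbb{E}\big[e^{\lambda(f(X)-\mathbb{E}[f(X)])}\big] \leq \inf_{\lambda > 0} e^{-\lambda t + \frac{\lambda^2 L^2}{2}} = e^{-\frac{t^2}{2L^2}}$$
with the infimum attained at $\lambda = \frac{t}{L^2}$; applying the same argument to $-f$ gives the two-sided bound $2e^{-\frac{t^2}{2L^2}}$.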
**Examples** for Lipschitz functions:
(a) $f(X_1, ..., X_n) = \max_{i=1, ..., n} X_i$ is 1-Lipschitz, since $|\max_i X_i - \max_i \tilde{X}_i| \leq \max_i |X_i - \tilde{X}_i| \leq |X-\tilde{X}|_2$.
$\mathbb{E}[f(X)] \approx \sqrt{2\sigma^2\log n}$
(b) $f(X_1, ..., X_n) = |X|_2$ is also a 1-Lipschitz function, by the reverse triangle inequality: $\big||X|_2 - |\tilde{X}|_2\big| \leq |X-\tilde{X}|_2$.
$$P\big(\big||X|_2-\mathbb{E}[|X|_2]\big| \geq t\big) \leq 2e^{-\frac{t^2}{2}} \ \ \ \forall t \geq 0$$
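A quick numpy check of this dimension-free concentration (my own addition): for $X \sim N(0, I_n)$ with $n=1000$, $|X|_2$ has mean about $\sqrt{n}$ but only constant-order fluctuations, and the empirical tails sit below $2e^{-t^2/2}$.
```python
import numpy as np

rng = np.random.default_rng(5)
n, trials = 1_000, 10_000

norms = np.linalg.norm(rng.normal(size=(trials, n)), axis=1)
print("E[|X|_2] ~", norms.mean(), " vs sqrt(n) =", np.sqrt(n))
for t in [1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(norms - norms.mean()) >= t)
    print(f"t={t}: empirical tail={empirical:.5f}  bound={2*np.exp(-t**2/2):.5f}")
```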
**Example** (Singular values of random matrices) (_Textbook 2.32_)
$X \in \mathbb{R}^{m \times d}$, $X_{ij} \sim N(0, 1)$ i.i.d., $n=md$
$$f(X) = \frac{|||X|||_2}{\sqrt{m}} = \max_{u \in \mathbb{R}^d, |u|_2 = 1} \frac{||Xu||_2} {\sqrt{m}}$$
> max singular value of $\frac{X}{\sqrt{m}}$
What's the Lipschitz constant? $\frac 1 {\sqrt{m}}$
_Proof_:
$f(X) - f(\tilde{X}) = \frac {1}{\sqrt{m}} \left(|||X|||_2 - |||\tilde{X}|||_2\right)$
> Need to study: operator norm, linear norm, nuclear norm.
> Reverse triangle inequality
$\leq \frac {1}{\sqrt{m}} |||X-\tilde{X}|||_2$
We would like to show that $f(X) - f(\tilde{X}) \leq \frac {1}{\sqrt{m}} \sqrt{\sum_{i=1}^m\sum_{j=1}^d (X_{ij} - \tilde{X}_{ij})^2} = \frac{1}{\sqrt{m}} |||X-\tilde{X}|||_F$, where the double sum is the squared Frobenius norm.
> Left as an exercise: $|||X-\tilde{X}|||_2 \leq |||X-\tilde{X}|||_F$
$$P\left(\frac{|||X|||_2}{\sqrt{m}} \geq \frac{\mathbb{E}[|||X|||_2]}{\sqrt{m}} +t\right) \leq e^{-\frac{mt^2}{2}} \ \ \ \forall t \geq 0$$
Why rescaling by $\sqrt{m}$?
Sample covariance $\hat{\Sigma} = \frac{X^TX}{m}$
With this rescaling, its behavior is very close to Gaussian.
> Q: What's the expectation?
> A: $\mathbb{E}\left[\frac{|||X|||_2}{\sqrt{m}}\right] \leq 1 + \sqrt{\frac d m}$, where $\frac d m$ is the aspect ratio.
Strongly recommend trying this out in numpy.
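Following that suggestion, here is a minimal numpy sketch (my own code): draw $X \in \mathbb{R}^{m \times d}$ with i.i.d. $N(0,1)$ entries and check that its largest singular value, rescaled by $\sqrt{m}$, concentrates near $1 + \sqrt{d/m}$.
```python
import numpy as np

rng = np.random.default_rng(6)
m, d, trials = 500, 100, 200

# Largest singular value of X / sqrt(m) for each trial.
sigmas = np.array([
    np.linalg.svd(rng.normal(size=(m, d)), compute_uv=False)[0] / np.sqrt(m)
    for _ in range(trials)
])
print("mean of |||X|||_2 / sqrt(m):", sigmas.mean())
print("1 + sqrt(d/m)              :", 1 + np.sqrt(d / m))
print("std (concentration)        :", sigmas.std())
```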
First develop concentration, then find the expectation.
$$X = \begin{pmatrix} X_1^T \\ \vdots \\ X_m^T \end{pmatrix}, \ \ X_i \sim N(0, I_d) \ \text{i.i.d.}, \ \ \ \ \hat{\Sigma} = \frac 1 m \sum_{i=1}^m X_iX_i^T$$
Next time he wants to do more proofs.