# Lecture 24: Large Deviations: Chebyshev and Chernoff Bound, Wrap up.
Readings: Chapter 20.
A more readable version of these notes is online here:
https://hackmd.io/@vinodnathan/rkdN6-lrn
======================================================
Announcements:
* No recitation tomorrow; instead, optional review session from 10am-4pm in 32-141. You can go at any time.
* No OH for the remainder of the term EXCEPT Thu/Fri (sign up on canvas) and Sun (as normal).
* Final is next Tuesday, May 23, from 1:30-4:30 at Johnson track. 180 mins, 135 pts. Covers all of L1-L23, weighted more heavily toward post-quiz-2 topics.
======================================================
Let's start with the definition of variance from the last lecture.
**Definition**:
1. Variance of $R$ is
$$\mathsf{Var}(R) = \mathsf{Ex}((R-\mathsf{Ex}(R))^2)$$
2. Standard deviation of $R$, denoted $\sigma(R)$, is the (positive) square root of the variance.
A side note: why not just take $\mathsf{Ex}(|R-\mathsf{Ex}(R)|)$ as the variance? Turns out squaring has a whole bunch of nice properties that will be useful as a measure of "spread", e.g. the variance of a sum of two *independent* random variables is the sum of their variances, a property that is not true of the absolute value definition. We will see other reasons later in this lecture as well as the next.
Another way to compute the variance:
Theorem:
$$\mathsf{Var}(R) = \mathsf{Ex}(R^2) - \mathsf{Ex}(R)^2$$
Proof:
\begin{align*}\mathsf{Var}(R) = \mathsf{Ex}((R-\mathsf{Ex}(R))^2) & = \mathsf{Ex}(R^2 - 2\cdot \mathsf{Ex}(R) \cdot R + \mathsf{Ex}(R)^2) \\
& = \mathsf{Ex}(R^2) - 2 \cdot \mathsf{Ex}(R) \cdot \mathsf{Ex}(R) + \mathsf{Ex}(R)^2 \\
& = \mathsf{Ex}(R^2) - \mathsf{Ex}(R)^2
\end{align*}
Example 1: What is the variance of the roll of a die? Let's use the formula above.
$$ \mathsf{Ex}(R^2) = \frac{1+4+9+16+25+36}{6} = \frac{91}{6}$$, so
$$ \mathsf{Var}(R) = \mathsf{Ex}(R^2) - \mathsf{Ex}(R)^2 = 91/6 - (7/2)^2 = \frac{35}{12} = 2 \frac{11}{12}$$
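Here is a quick Python sanity check of this calculation (just a sketch; it computes both formulas exactly using fractions):
```python
from fractions import Fraction

# A fair die: outcomes 1..6, each with probability 1/6.
outcomes = [Fraction(k) for k in range(1, 7)]
p = Fraction(1, 6)

E_R  = sum(p * r for r in outcomes)        # Ex(R)   = 7/2
E_R2 = sum(p * r * r for r in outcomes)    # Ex(R^2) = 91/6

var_from_formula    = E_R2 - E_R**2                            # Ex(R^2) - Ex(R)^2
var_from_definition = sum(p * (r - E_R)**2 for r in outcomes)  # Ex((R - Ex(R))^2)

print(var_from_formula, var_from_definition)   # both are 35/12
```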
Example 2: What is the variance of $R$, the number of heads among $n$ fair coin flips? First, note that $R$ is the sum of $n$ indicator random variables $I_1,\ldots,I_n$ where $I_j = 1$ if and only if the $j$-th coin flip turned up heads.
1. $\mathsf{Ex}(I_j) = 1/2$, and by linearity of expectation, $\mathsf{Ex}(R) = n/2$.
2. $\mathsf{Var}(R) = \sum_{j=1}^n \mathsf{Var}(I_j)$ by independence.
$$\mathsf{Var}(I_j) = \mathsf{Ex}(I_j^2) - \mathsf{Ex}(I_j)^2 = \mathsf{Ex}(I_j) - \mathsf{Ex}(I_j)^2 = \frac{1}{2} - \frac{1}{4} = \frac{1}{4}$$ where the second equality is because for indicator random variables, $I_j^2 = I_j$. So, $\mathsf{Var}(R) = n/4$.
3. $\sigma(R) = \sqrt{\mathsf{Var}(R)} = \frac{\sqrt{n}}{2}$.
We will see later in the lecture that this quantity $\sqrt{n}$ has a special meaning: there is a good chance that the number of heads in $n$ coin tosses will not fall outside the range $[\frac{n}{2} - c\sqrt{n}, \frac{n}{2} + c\sqrt{n}]$ for a large enough constant $c>0$. The number of heads is "concentrated around $n/2$".
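A small simulation sketch of this (the values of $n$, the number of trials, and $c = 2$ are arbitrary choices of mine): it estimates $\mathsf{Var}(R)$ and the chance of landing within $2\sqrt{n}$ of $n/2$.
```python
import random
from statistics import pvariance

n, trials = 400, 10000
counts = [sum(random.randint(0, 1) for _ in range(n)) for _ in range(trials)]

# The empirical variance should be close to n/4 = 100.
print("empirical variance:", pvariance(counts), "  vs  n/4 =", n / 4)

# Most of the time, the number of heads is within c*sqrt(n) of n/2 (here c = 2).
c = 2
inside = sum(abs(x - n / 2) <= c * n**0.5 for x in counts)
print("fraction within 2*sqrt(n) of n/2:", inside / trials)
```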
Theorem: If $R_1,\ldots,R_n$ are **independent** random variables, then
$$\mathsf{Var}(R_1+\ldots+R_n) = \mathsf{Var}(R_1)+\ldots+\mathsf{Var}(R_n)$$
The proof was done in the recitation, but for completeness of the lecture notes, here goes.
Proof:
\begin{align*}
\mathsf{Var}(\sum_{i=1}^n R_i) & = \mathsf{Ex}((\sum_{i=1}^n R_i)^2) - (\mathsf{Ex}(\sum_{i=1}^n R_i))^2 \\
& = \mathsf{Ex}(\sum_{i=1}^n R_i^2 + \sum_{i\neq j} R_iR_j) - \sum_{i=1}^n \mathsf{Ex}(R_i)^2 - \sum_{i\neq j}\mathsf{Ex}(R_i)\mathsf{Ex}(R_j)\\
& = \sum_{i=1}^n (\mathsf{Ex}(R_i^2)-\mathsf{Ex}(R_i)^2) + \sum_{i\neq j} (\mathsf{Ex}(R_iR_j) - \mathsf{Ex}(R_i)\mathsf{Ex}(R_j))
\end{align*}
Here the sums over $i \neq j$ range over ordered pairs $(i,j)$. Since for every $i \neq j$, the r.v.s $R_i$ and $R_j$ are independent, $\mathsf{Ex}(R_iR_j) = \mathsf{Ex}(R_i)\mathsf{Ex}(R_j)$, so the second term above vanishes. The first term is just the sum of the variances of all the $R_i$, so there you go. QED.
*Vinod's note*: Take a close look at this proof and note that we did not use the fact that the $n$ random variables are *mutually independent*, only that they are *pairwise* independent.
What's an example of pairwise independent events? Say everyone's birthday in this class is a uniformly distributed day of the year. Let $I_{i,j}$ be the indicator random variable that is $1$ if and only if $i$ and $j$ have the same birthday. Now, any two *different* events $I_{i,j}$ and $I_{i',j'}$ are independent (think about this and convince yourself this is true), so the set of events $\{I_{i,j}: i \neq j\}$ is pairwise independent. However, the three events $I_{i,j}, I_{j,k}$ and $I_{k,i}$ are *not* mutually independent: knowing that $i$ and $j$ have the same birthday and that $j$ and $k$ have the same birthday tells us for sure that $i$ and $k$ have the same birthday.
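Here is a simulation sketch of this example (I shrink the calendar to $20$ days so that birthday collisions are frequent enough to measure; that choice is mine): knowing $I_{1,2}$ alone doesn't change the chance of $I_{1,3}$, but knowing $I_{1,2}$ and $I_{2,3}$ pins it down completely.
```python
import random

DAYS, trials = 20, 200000   # a toy 20-day calendar so that collisions are common
n13 = n12 = n12_13 = n12_23 = n_all = 0

for _ in range(trials):
    b1, b2, b3 = (random.randrange(DAYS) for _ in range(3))
    i12, i13, i23 = (b1 == b2), (b1 == b3), (b2 == b3)
    n13    += i13
    n12    += i12
    n12_13 += i12 and i13
    n12_23 += i12 and i23
    n_all  += i12 and i23 and i13

print("Pr[I13]               ~", n13 / trials)       # ~ 1/20
print("Pr[I13 | I12]         ~", n12_13 / n12)       # still ~ 1/20: pairwise independent
print("Pr[I13 | I12 and I23] =", n_all / n12_23)     # exactly 1: not mutually independent
```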
*Note*: $\sigma(R_1+R_2)$ is *not* necessarily $\sigma(R_1) + \sigma(R_2)$ even when $R_1$ and $R_2$ *are* independent. But the theorem above tells us that $\sigma(R_1+R_2)^2 = \sigma(R_1)^2 + \sigma(R_2)^2$.
Here is a neat corollary:
Corollary: $\mathsf{Ex}(R^2) \geq \mathsf{Ex}(R)^2$.
Proof: The variance is always non-negative.
When are the two equal? When the variance is zero, which happens exactly when $R$ is a constant.
**Large Deviation Bounds**
**Theorem (Markov)**: Let $R$ be a **non-negative** random variable. Then, for every $x > 0$,
$$\Pr[R \geq x] \leq \frac{\mathsf{Ex}(R)}{x}$$
In the last lecture, we did the proof for **non-negative integer-valued** random variables. Here is something more general.
Proof:
$$\mathsf{Ex}(R) = \mathsf{Ex}(R | R \geq x) \Pr[R\geq x] + \mathsf{Ex}(R | R < x) \Pr[R < x] $$ by the law of total probability applied to expectation. The first expectation term on the RHS is at least $x$, and the second expectation term on the RHS is at least $0$ (this is where we are using the non-negativity of $R$). So,
$$ \mathsf{Ex}(R) \geq x \cdot \Pr[R \geq x] + 0 \cdot \Pr[R < x] \geq x \cdot \Pr[R \geq x]$$
Rearranging the terms, we get
$$ \Pr[R \geq x] \leq \frac{\mathsf{Ex}(R)}{x}$$
QED.
Example: Let $R$ be the weight of a random person. Say $\mathsf{Ex}(R) = 100$. What is the probability that $R \geq 200$?
Answer: We don't have enough information to compute the exact probability, but Markov tells us that this is at most $100/200 = 1/2$.
There is a definite, non-probabilistic, interpretation of this statement: at most half the population weighs at least 200 lbs. There is nothing probabilistic about that.
An alternate form of Markov:
**Theorem (Markov -- alternate form)**: Let $R$ be a **non-negative** random variable. Then, for every $c \geq 1$,
$$\Pr[R \geq c\cdot \mathsf{Ex}(R)] \leq \frac{1}{c}$$
*Why does Markov need non-negativity?*
Here is a counterexample: consider the r.v. $R$ which takes on the value $-1$ if an unbiased coin comes up heads and $+1$ if it comes up tails. $\mathsf{Ex}(R) = 0$. $\Pr[R \geq 1/2] = 1/2$, but (**incorrectly!**) applying Markov would have us conclude this is at most $0$.
Looking at the proof of Markov above tells us where things go wrong if $R$ is not non-negative. We used that $\mathsf{Ex}(R|R < x)$ is at least $0$ appealing to the non-negativity of $R$.
*What if $R$ takes negative values?*
If you *know* that $R$ is lower-bounded by some $z$, you can apply Markov to $R-z$ which is non-negative.
In a similar vein, if you want Markov to give you the probability that $R$ is *at most* something and you know that $R$ is upper-bounded by some $z$, you can apply Markov to $z-R$ which is also non-negative.
$$\Pr[R \leq x] = \Pr[z-R \geq z-x] \leq \frac{\mathsf{Ex}(z-R)}{z-x} = \frac{z-\mathsf{Ex}(R)}{z-x}$$ (for any $x < z$).
Say $R$ is a test score, always at most $100$. Say the average score is $75$. The probability that the score is at most $60$ is at most $\frac{100-75}{100-60} = \frac{5}{8}$.
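The same two computations as a tiny Python sketch (the helper names are just illustrative):
```python
def markov_upper_tail(mean, x):
    """Markov: for non-negative R and x > 0, Pr[R >= x] <= Ex(R) / x."""
    return mean / x

def markov_lower_tail(mean, z, x):
    """If R <= z always, apply Markov to z - R:  Pr[R <= x] <= (z - Ex(R)) / (z - x)."""
    return (z - mean) / (z - x)

print(markov_upper_tail(100, 200))       # weights: at most 0.5
print(markov_lower_tail(75, 100, 60))    # test scores: at most 0.625 = 5/8
```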
*Markov is (often) not tight*
Let's look at the cellphone check problem from L22/23, starting with the lazy Susan version. Recall: $n$ people sit around a table, place their cellphones on a lazy Susan, and give it a spin. If $R$ is the number of people who got their cellphones back, then we saw that $\mathsf{Ex}(R) = 1$. What is the probability that all $n$ get their cellphone back?
Markov has an answer. It is $\leq \mathsf{Ex}(R)/n = 1/n$. What's the true answer? Also $1/n$.
Let's look at the original version of the cellphone check problem, where the $n$ phones are permuted and returned. What is the probability that all $n$ get their cellphone back?
Markov has the same answer! It is $\leq \mathsf{Ex}(R)/n = 1/n$. What's the true answer? It is $1/(n!)$. $n! \gg n$, so Markov is way off in the estimate here. The *upper bound* that Markov gives us is *correct* but is *loose*. The true probability is much smaller.
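A simulation sketch of both versions (the choice $n = 6$ and the trial count are mine, with $n$ kept small so the $1/n!$ event is actually observable): Markov gives $1/n$ for both, but the true probabilities are very different.
```python
import math
import random

n, trials = 6, 200000

def all_back_lazy_susan():
    """A uniformly random spin returns every phone iff the rotation is the identity."""
    return random.randrange(n) == 0

def all_back_permutation():
    """Phones come back according to a uniformly random permutation."""
    perm = list(range(n))
    random.shuffle(perm)
    return all(perm[i] == i for i in range(n))

susan = sum(all_back_lazy_susan()  for _ in range(trials)) / trials
perm  = sum(all_back_permutation() for _ in range(trials)) / trials

print("Markov bound (both versions):", 1 / n)
print("lazy Susan:  true =", 1 / n, "             empirical ~", susan)
print("permutation: true =", 1 / math.factorial(n), "  empirical ~", perm)
```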
*What if we want tighter bounds?*
For that, we need to know more about the probability distribution than just its mean.
**Theorem (Chebyshev):**
For every $x > 0$ and for every r.v. $R$ (**not necessarily non-negative**),
$$ \Pr[|R-\mathsf{Ex}(R)| \geq x] \leq \frac{\mathsf{Var}(R)}{x^2} = \bigg(\frac{\sigma(R)}{x}\bigg)^2$$
where $\sigma(R)$ is the standard deviation of $R$.
Proof: Use Markov! With the (non-negative) random variable $(R-\mathsf{Ex}(R))^2$. First, and make sure you understand this step,
$$ \Pr[|R-\mathsf{Ex}(R)| \geq x] = \Pr[(R-\mathsf{Ex}(R))^2 \geq x^2] $$
Now apply Markov and get
$$ \Pr[|R-\mathsf{Ex}(R)| \geq x] = \Pr[(R-\mathsf{Ex}(R))^2 \geq x^2] \leq \frac{\mathsf{Ex}((R-\mathsf{Ex}(R))^2)}{x^2} = \frac{\mathsf{Var}(R)}{x^2}$$ QED.
**Theorem (Chebyshev -- alternate form):**
For every $c > 0$ and for every r.v. $R$ (**not necessarily non-negative**),
$$ \Pr[|R-\mathsf{Ex}(R)| \geq c\cdot \sigma(R)] \leq \frac{1}{c^2} $$ where $\sigma(R)$ is the standard deviation of $R$.
*Example 1*: Let's go back to the test scores whose variance is, say $25$ (so the standard deviation is $5$).
$$ \Pr[\mathsf{score} \leq 60] \leq \Pr[|\mathsf{score}-75| \geq 15]$$
Why? The latter probability measures the union of two events --- that $\mathsf{score} \leq 60$ and that $\mathsf{score} \geq 90$.
Now apply Chebyshev:
$$\Pr[\mathsf{score} \leq 60] \leq \frac{\mathsf{Var}(\mathsf{score})}{15^2} = \frac{25}{15^2} = \frac{1}{9}$$
This is a much better bound than using Markov alone!
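In the same spirit as the Markov helper above, a one-line sketch for the Chebyshev bound:
```python
def chebyshev_bound(var, x):
    """Chebyshev: for x > 0, Pr[|R - Ex(R)| >= x] <= Var(R) / x^2."""
    return var / x**2

# Test scores: variance 25, and "score <= 60" implies "|score - 75| >= 15".
print(chebyshev_bound(25, 15))   # 0.111... = 1/9
```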
*Example 2*: Let's look at the number of heads in a toss of $n$ coins. Here, $$R = R_1 + \ldots + R_n$$ where $R_i$ is the indicator random variable which is $1$ if and only if the $i$-th coin toss came up heads. $$\mathsf{Ex}(R_i) = 1/2 \mbox{ and } \mathsf{Var}(R_i) = \mathsf{Ex}(R_i^2) - \mathsf{Ex}(R_i)^2 = 1/2 - 1/4 = 1/4$$
Now,
$$\mathsf{Ex}(R) = \sum_{i=1}^n \mathsf{Ex}(R_i) = n/2$$ and
$$ \mathsf{Var}(R) = \sum_{i=1}^n \mathsf{Var}(R_i) = n/4$$ by independence.
Markov tells us that
$$\Pr[R \geq 3n/4] \leq (n/2)/(3n/4) = 2/3$$
OTOH, Chebyshev tells us that
$$\Pr[R \geq 3n/4] \leq \Pr[|R-n/2| \geq n/4] \leq \frac{\mathsf{Var}(R)}{(n/4)^2} = \frac{(n/4)}{(n/4)^2} = \frac{4}{n}$$ which is a far better bound.
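A simulation sketch of this comparison (the values of $n$ and the number of trials are mine): with $n = 100$, Markov gives $2/3$, Chebyshev gives $0.04$, and the empirical tail is far smaller than either, which hints that even Chebyshev is loose here.
```python
import random

n, trials = 100, 50000
tail = sum(sum(random.randint(0, 1) for _ in range(n)) >= 3 * n / 4
           for _ in range(trials)) / trials

print("Markov bound   :", 2 / 3)
print("Chebyshev bound:", 4 / n)
print("empirical      :", tail)   # the true tail is tiny, so this will very likely be 0.0
```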
It turns out there is something even better that one can do. Recall that Chebyshev only uses the *pairwise* independence of the coin tosses. For example, suppose the random variables were generated by tossing only about $\sqrt{2n}$ coins and letting the random variable indexed by the pair $(i,j)$ be $1$ if and only if coins $i$ and $j$ land the same way: this gives roughly $n$ random variables that are unbiased and pairwise independent, but NOT mutually independent. Chebyshev still applies!
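Here is a sketch of that construction (the number of source coins is an arbitrary choice of mine): $m$ independent fair coins yield $\binom{m}{2}$ pair-equality bits that are unbiased and pairwise independent, and the variance of their sum is still (number of bits)$/4$, exactly as Chebyshev needs.
```python
import random
from itertools import combinations
from statistics import pvariance

m = 20                                  # m truly random coins ...
pairs = list(combinations(range(m), 2))
N = len(pairs)                          # ... give N = m(m-1)/2 = 190 derived bits

def derived_sum():
    coins = [random.randint(0, 1) for _ in range(m)]
    # The bit for the pair (i, j) is 1 iff coins i and j land the same way.
    return sum(coins[i] == coins[j] for i, j in pairs)

samples = [derived_sum() for _ in range(20000)]
print("mean of the sum    :", sum(samples) / len(samples), " vs N/2 =", N / 2)
print("variance of the sum:", pvariance(samples), " vs N/4 =", N / 4)
# The variance matches N/4 because the bits are pairwise independent,
# even though they come from only m << N truly independent coins.
```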
Using the *mutual* independence of all the coin tosses gives us a better bound via the Chernoff bound (named after the same Herman Chernoff who beat the MA lottery).
**Theorem (Chernoff)** Let $T_1,\ldots, T_n$ be mutually independent random variables such that $0\leq T_i \leq 1$ for all $i$. Let $T = T_1+T_2+\ldots+T_n$. Then, for all $c \geq 1$,
$$ \Pr[T \geq c\cdot \mathsf{Ex}(T)] \leq e^{-\beta(c)\cdot \mathsf{Ex}(T)}$$
where $\beta(c) = c \ln c -c + 1$.
The proof, like that of Chebyshev, uses Markov on a different random variable, namely $c^{T}$. For the real proof, I will refer you to the book, section 20.5.6.
Let's apply Chernoff to the coin tosses. We get, letting $c = 3/2$,
$$\Pr[R \geq 3n/4] = \Pr[R \geq \tfrac{3}{2} \cdot \tfrac{n}{2}] \leq e^{-\beta(3/2)\cdot n/2} \leq e^{-0.1\cdot n/2} = e^{-n/20}$$ (using $\beta(3/2) = \tfrac{3}{2}\ln\tfrac{3}{2} - \tfrac{1}{2} \approx 0.108 \geq 0.1$), which is an exponentially better bound than Chebyshev!
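A sketch tabulating the three bounds on $\Pr[R \geq 3n/4]$ for a few values of $n$ (the particular values of $n$ are mine):
```python
import math

def beta(c):
    """The Chernoff exponent: beta(c) = c ln c - c + 1."""
    return c * math.log(c) - c + 1

c = 1.5                                       # we are asking for R >= (3/2) * Ex(R)
for n in [100, 400, 1600]:
    markov    = 2 / 3
    chebyshev = 4 / n
    chernoff  = math.exp(-beta(c) * n / 2)    # Ex(R) = n/2
    print(f"n={n:5d}  Markov={markov:.3f}  Chebyshev={chebyshev:.5f}  Chernoff={chernoff:.2e}")
```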
======================================================
For fun:
<<play Gambler's ruin game>>
Gambler has $100 and wants to double it to $200. She plays in a casino where on each trial, she wins with probability $p \leq 1/2$. When she wins, she makes $1; when she loses, she gives up $1. If she ever loses all her money, the game is over.
What is the probability that she will make $200?
Turns out if $p = 1/2$ (a fair casino!!) she will make $200 with probability $100/200 = 1/2$.
But if $p = 0.49$, the chance that she will make $200 is $$ \frac{(0.51/0.49)^{100}-1}{(0.51/0.49)^{200}-1} \approx 0.018$$
Why does this happen, intuitively? Each bet loses her 2 cents in expectation, and over the many bets it takes to gain $100, that tiny per-bet disadvantage swamps the random fluctuations.
In general:
Theorem [Thm 21.1.1 of the book] If the initial capital is $n$ dollars, the target is $T$ dollars, and $p$ is the probability of making $1 in each trial, the probability that the gambler reaches the target is
$$ \begin{cases} \frac{n}{T} & \mbox{if $p=1/2$} \\ \frac{r^n-1}{r^T-1} & \mbox{if $p<1/2$}\end{cases} $$ where $r = (1-p)/p$.
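A simulation sketch of the gambler's ruin, checked against the theorem's formula (the helper names and the number of trials are mine; the parameters are the ones from the example above):
```python
import random

def reaches_target(n=100, T=200, p=0.49):
    """Simulate one game: return True if the gambler hits T dollars before going broke."""
    money = n
    while 0 < money < T:
        money += 1 if random.random() < p else -1
    return money == T

def win_probability(n=100, T=200, p=0.49):
    """The formula from Theorem 21.1.1 (for p <= 1/2)."""
    if p == 0.5:
        return n / T
    r = (1 - p) / p
    return (r**n - 1) / (r**T - 1)

trials = 1000
empirical = sum(reaches_target() for _ in range(trials)) / trials
print("formula  :", win_probability())   # ~ 0.018
print("simulated:", empirical)
```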
======================================================
For fun: You go to Vegas and you are worried that the coin-tosses in Hotel Unbiased are actually biased (but independent). Can you deterministically "process" $n$ coin-tosses into (somewhat fewer) fair ones?
Von Neumann's trick: break the sequence into consecutive pairs; output heads for each "HT" pair, tails for each "TH" pair, and throw away the "HH" and "TT" pairs. The resulting sequence is unbiased. Indeed, if $p$ is the probability of heads in Hotel Unbiased's coins, $\Pr[HT] = \Pr[TH] = p(1-p)$, so von Neumann's coins are unbiased! (We did use the independence of the coin tosses, though.)
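A sketch of the trick in Python (the bias $p = 0.7$ is an arbitrary choice of mine):
```python
import random

def biased_flips(k, p=0.7):
    """k independent flips of a coin that comes up 'H' with probability p."""
    return ['H' if random.random() < p else 'T' for _ in range(k)]

def von_neumann(flips):
    """Pair up consecutive flips: HT -> H, TH -> T, discard HH and TT."""
    out = []
    for a, b in zip(flips[0::2], flips[1::2]):
        if a != b:
            out.append('H' if (a, b) == ('H', 'T') else 'T')
    return out

raw = biased_flips(100000)
clean = von_neumann(raw)
print("raw heads fraction  :", raw.count('H') / len(raw))       # ~ 0.7
print("clean heads fraction:", clean.count('H') / len(clean))   # ~ 0.5
print("bits kept:", len(clean), "out of", len(raw) // 2, "pairs")
```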
This is an example of "randomness extraction" where one turns "dirty sources" of randomness into clean, unbiased and independent coin-tosses. Randomness extraction is super important in theoretical computer science, statistics and cryptography.
======================================================
Alright, that's enough.
That was a fun semester. Good luck with the finals and enjoy the summer!