# Lecture 23: Expectation and Variance

Readings: Sections 19.5, 19.6, 20.3. A more readable version is online here: https://hackmd.io/@vinodnathan/ryTBXV_4n

Theorem: Given a probability space $S$ and events $A_1,\ldots,A_n \subseteq S$, let $T$ be a random variable that counts the number of events that occur. Then,
$$\mathsf{Ex}(T) = \sum_{i=1}^n \Pr(A_i)$$

Proof: Let $I_j$ be the indicator random variable corresponding to event $A_j$ occurring. That is, $I_j(\omega) = 1$ if and only if $\omega \in A_j$. Then, and this is important to internalize,
$$T = I_1 + I_2 + \ldots + I_n$$
So,
$$\mathsf{Ex}(T) = \mathsf{Ex}(I_1 + I_2 + \ldots + I_n) = \sum_{j=1}^n \mathsf{Ex}(I_j) = \sum_{j=1}^n \Pr(A_j)$$
where the second equality is by linearity of expectation, and the last equality holds because the expectation of an indicator random variable corresponding to an event is nothing but the probability that the event occurs. (See Example $2$ from Lecture $22$.) QED.

Example: Flip $n$ coins and let $H$ be the number of heads. Then
$$\mathsf{Ex}(H) = \sum_{i=1}^n \Pr[\mbox{i-th coin toss is heads}] = \frac{n}{2}.$$
Let's calculate this expectation directly using the definition of expectation:
$$\mathsf{Ex}(H) = \sum_{i=0}^n i \cdot \frac{\binom{n}{i}}{2^n}$$
which, by our calculation above, *must be* $n/2$. On the other hand, it takes a bit more effort to compute this summation directly. Linearity of expectation can give us a shortcut to compute (some) complicated sums. Cool, huh?!

Let's now look at a setting where the expectation is small, such as the cellphone check problem from Lecture 22/23. Or suppose you are constructing a highrise building and have to account for a bunch of (hopefully) rather unlikely adverse events such as the probability of an earthquake hitting the area, or the probability of a flood, or the probability that the construction materials are of lower quality than you thought. Yada yada... What you'd like to figure out is the probability that *any* of these adverse events occurs.

Lemma: Let the random variable $T$ be as above (i.e. $T$ counts the number of events out of $n$ that occur). Then,
$$\Pr[T > 0] \leq \mathsf{Ex}(T)$$

Proof: From Lecture 22, we know that
$$\mathsf{Ex}(T) = \sum_{i=0}^\infty \Pr[T > i] = \sum_{i=0}^{n-1} \Pr[T > i] \geq \Pr[T > 0]$$
simply because all terms are non-negative, and we are picking out one of the $n$ terms. QED.

Using the fact that $\mathsf{Ex}(T) = \sum_{i=1}^n \Pr(A_i)$, we get the following lemma, called the **union bound**.

Lemma: For $n$ events $A_1,\ldots,A_n$ (not necessarily independent),
\begin{align*} \Pr[\mbox{at least one of } A_1 & \mbox{ or } A_2 \mbox{ or } \ldots \mbox{ or } A_n \mbox{ occurs}] \\ & := \Pr[A_1 \cup A_2 \cup \ldots \cup A_n] \leq \sum_{i=1}^n \Pr(A_i) \end{align*}

The probability of the union of $n$ events is at most the sum of the probabilities of the events, which is intuitive: when you compute the sum of the probabilities of the $n$ events, you double-count the probabilities of outcomes that lie in the intersection of two or more events. So, can you see when the inequality in the lemma above is an equality?

*Note that the union bound, like linearity of expectation, does not assume any independence.*
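As a quick numerical sanity check, here is a small simulation sketch (not from the lecture; it assumes fair coins and uses Python's `random` module, with an arbitrary choice of $n$) that estimates $\mathsf{Ex}(H)$ for $n$ coin flips by averaging over many trials, and compares the estimated $\Pr[T > 0]$ against the bound $\mathsf{Ex}(T)$ from the lemma above.

```python
import random

# Simulation sketch (assumes fair coins): estimate Ex(H) for n coin flips by
# averaging over many trials, and compare Pr[T > 0] to the bound Ex(T).

n = 10            # number of coins (arbitrary choice for illustration)
trials = 100_000

total_heads = 0
at_least_one = 0
for _ in range(trials):
    flips = [random.random() < 0.5 for _ in range(n)]  # indicator values I_1, ..., I_n
    T = sum(flips)                                     # T = I_1 + ... + I_n
    total_heads += T
    at_least_one += (T > 0)

print("estimated Ex(H):", total_heads / trials, "  exact n/2:", n / 2)
print("estimated Pr[T > 0]:", at_least_one / trials, "  bound Ex(T):", n / 2)
```

For fair coins the bound $\Pr[T > 0] \leq \mathsf{Ex}(T) = n/2$ is very loose; the bound is most useful when the individual events are unlikely, as in the construction example above.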
**Other properties of Expectation**

Lemma: $\mathsf{Ex}(R_1 \cdot R_2) = \mathsf{Ex}(R_1)\cdot \mathsf{Ex}(R_2)$ *if $R_1$ and $R_2$ are independent*. Can you come up with a simple counterexample to the lemma when $R_1$ and $R_2$ are not independent?

Proof:
\begin{align*}
\mathsf{Ex}(R_1) \cdot \mathsf{Ex}(R_2) & = \bigg(\sum_{x\in \mathsf{Range}(R_1)} x\cdot \Pr[R_1=x]\bigg) \cdot \bigg(\sum_{y\in \mathsf{Range}(R_2)} y\cdot \Pr[R_2=y]\bigg) \\
& = \sum_{\substack{x\in \mathsf{Range}(R_1) \\ y\in \mathsf{Range}(R_2)}} xy\cdot \Pr[R_1=x] \Pr[R_2=y] \\
& = \sum_{\substack{x\in \mathsf{Range}(R_1) \\ y\in \mathsf{Range}(R_2)}} xy\cdot \Pr[R_1=x \wedge R_2=y] \\
& = \sum_{z\in \mathsf{Range}(R_1R_2)} z\cdot \bigg(\sum_{x,y \text{ s.t. } xy=z} \Pr[R_1=x \wedge R_2=y]\bigg) \\
& = \sum_{z\in \mathsf{Range}(R_1R_2)} z\cdot \Pr[R_1R_2=z] \\
& = \mathsf{Ex}(R_1R_2)
\end{align*}
where the third equality is exactly where we use the independence of $R_1$ and $R_2$. QED.

Example: $R_1$, $R_2$ are independent rolls of a die. Then
$$\mathsf{Ex}(R_1R_2) = \mathsf{Ex}(R_1)\mathsf{Ex}(R_2) = \frac{7}{2}\cdot \frac{7}{2} = \frac{49}{4}$$

Lemma: Let $a,b$ be real numbers. Then,
$$\mathsf{Ex}(aR+b) = a\cdot \mathsf{Ex}(R)+b$$

Proof: Think of $a$ and $b$ as random variables which are constant functions mapping everything to $a$ (resp. $b$). Then
$$\mathsf{Ex}(aR+b) = \mathsf{Ex}(aR) + \mathsf{Ex}(b) = \mathsf{Ex}(a)\cdot \mathsf{Ex}(R) + \mathsf{Ex}(b) = a\cdot \mathsf{Ex}(R) + b$$
since a constant random variable is independent of $R$, and the expected value of a constant random variable is the constant itself. QED.

**[Did not cover in lecture] A Fallacy**

How about ratios? Is $\mathsf{Ex}(1/R) = 1/\mathsf{Ex}(R)$? This is very much not true. Try to find counterexamples.

Along the same lines, here is a "theorem" (which is not true): If $\mathsf{Ex}(R/T) > 1$, then $\mathsf{Ex}(R) > \mathsf{Ex}(T)$. And a "proof" even!

"Proof":
1. Since $\mathsf{Ex}(R/T) > 1$, we have $\mathsf{Ex}(R/T)\cdot \mathsf{Ex}(T) > \mathsf{Ex}(T)$.
2. The LHS is $\mathsf{Ex}(R/T)\cdot \mathsf{Ex}(T) = \mathsf{Ex}(R/T\cdot T) = \mathsf{Ex}(R)$.
3. So, QED, $\mathsf{Ex}(R) > \mathsf{Ex}(T)$.

What's wrong with this "proof"? (1) The first step assumes that $\mathsf{Ex}(T)$ is positive, which it may not be; and (2) the second step assumes that $R/T$ and $T$ are independent, which in general they are not (can you figure out examples of random variables $R$ and $T$ where they are independent?).

To be sure, the "theorem" is very much NOT a theorem. It is false in many different ways, and yet people fall into this trap all the time. Here is an example: a paper in the olden days compared the length of code one can write for various benchmarks on two architectures, RISC vs. Z8002, thus:

| Benchmark | RISC | Z8002 | Z/R |
| -------- | -------- | -------- | -------- |
| 1 | 150 | 120 | 0.8 |
| 2 | 120 | 180 | 1.5 |
| 3 | 150 | 300 | 2 |
| 4 | 2800 | 1400 | 0.5 |

The average of $Z/R$ is $4.8/4 = 1.2$. Can we conclude that Z8002 is 20% more verbose than RISC? No! This is exactly the type of thing (i.e. the average of ratios) that we are not supposed to do. To make this point, let's look at the average of $R/Z$.

| Benchmark | RISC | Z8002 | R/Z |
| -------- | -------- | -------- | -------- |
| 1 | 150 | 120 | 1.2 |
| 2 | 120 | 180 | 0.67 |
| 3 | 150 | 300 | 0.5 |
| 4 | 2800 | 1400 | 2 |

The average of $R/Z$ is $\approx 1.1$. So, if we concluded before that Z8002 is 20% more verbose than RISC, by the same fallacious argument, we would now have to conclude that RISC is 10% more verbose than Z8002!!

What is the right thing to do? First, one has to determine with what probability each of these benchmarks occurs in the wild. For now, let's assume that all four benchmarks are equally likely. It's worth keeping in mind that whatever we calculate now depends on our assumptions about these probabilities. Then, $\mathsf{Ex}(R) = 805$ and $\mathsf{Ex}(Z) = 500$. From this, we can conclude that $R$ is longer than $Z$ on average. But again, this is only valid subject to the validity of our assumption about the probabilities of the benchmarks!
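To see the numbers concretely, here is a short sketch (using the table entries above, and assuming the uniform distribution over the four benchmarks, as in the text) that computes both averages of ratios as well as $\mathsf{Ex}(R)$ and $\mathsf{Ex}(Z)$.

```python
# A quick check of the benchmark table above, assuming all four benchmarks
# are equally likely (the uniform assumption made in the text).

risc  = [150, 120, 150, 2800]
z8002 = [120, 180, 300, 1400]

avg_z_over_r = sum(z / r for r, z in zip(risc, z8002)) / 4
avg_r_over_z = sum(r / z for r, z in zip(risc, z8002)) / 4
print("average of Z/R:", avg_z_over_r)   # about 1.2
print("average of R/Z:", avg_r_over_z)   # about 1.1 -- the two averages of ratios disagree!

# Comparing expectations directly (under the uniform assumption) is consistent:
print("Ex(R):", sum(risc) / 4)    # 805
print("Ex(Z):", sum(z8002) / 4)   # 500
```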
**Variance**

Consider two games.
1. Toss a coin; if it comes up heads, you give me \$1, and if it's tails, I give you \$1.
2. Toss a coin; if it comes up heads, you give me \$1000, and if it's tails, I give you \$1000.

Which one would you feel comfortable participating in? The random variables ($R_1$, resp. $R_2$) measuring the amount you win in the two games both have expectation $0$. Yet the risk of going bankrupt is much greater in game 2.

Variance and standard deviation are measures of how much we expect a random variable to differ from its expectation. How do we define such a thing? How about $\mathsf{Ex}(R-\mathsf{Ex}(R))$? That sounds like a good idea, as it measures the difference between the random variable $R$ and its expectation $\mathsf{Ex}(R)$. But,
$$\mathsf{Ex}(R-\mathsf{Ex}(R)) = \mathsf{Ex}(R)- \mathsf{Ex}(\mathsf{Ex}(R)) = \mathsf{Ex}(R)-\mathsf{Ex}(R) = 0$$
where the first equality is by linearity of expectation, and the second because $\mathsf{Ex}(R)$ is a constant whose expectation is the constant itself. The problem is that the values of $R$ that are smaller than the expectation cancel out the ones that are larger, in equal measure, because that's exactly what the expectation is supposed to measure!

**Definition**:
1. The variance of $R$ is $$\mathsf{Var}(R) = \mathsf{Ex}((R-\mathsf{Ex}(R))^2)$$
2. The standard deviation of $R$, denoted $\sigma(R)$, is the (positive) square root of the variance.

A side note: why not just take $\mathsf{Ex}(|R-\mathsf{Ex}(R)|)$ as the variance? It turns out that squaring has a whole bunch of nice properties that make it useful as a measure of "spread", e.g. the variance of a sum of two *independent* random variables is the sum of their variances, a property that is not true of the absolute-value definition. We will see other reasons later in this lecture as well as the next.

Example: The variance of $R_1$ in the games above is $1$ and its standard deviation is $1$. The variance of $R_2$ is $(1000)^2 = 10^6$ and its standard deviation is $1000$. As you can see, the standard deviation measures what we intuitively think of as the spread of the distribution.
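Here is a minimal sketch that computes the variance and standard deviation of the two games directly from the definition $\mathsf{Var}(R) = \mathsf{Ex}((R-\mathsf{Ex}(R))^2)$; the helper names are mine, not from the lecture.

```python
# Variance and standard deviation of the two coin-toss games, computed
# directly from the definition Var(R) = Ex((R - Ex(R))^2).

def ex(values, probs):
    """Expectation of a random variable given its values and probabilities."""
    return sum(v * p for v, p in zip(values, probs))

def var(values, probs):
    """Variance via the definition: expected squared deviation from the mean."""
    mu = ex(values, probs)
    return ex([(v - mu) ** 2 for v in values], probs)

game1 = ([+1, -1], [0.5, 0.5])          # win or lose $1
game2 = ([+1000, -1000], [0.5, 0.5])    # win or lose $1000

for name, (vals, probs) in [("game 1", game1), ("game 2", game2)]:
    v = var(vals, probs)
    print(name, " Ex:", ex(vals, probs), " Var:", v, " sigma:", v ** 0.5)
```

Running it gives variance $1$ and standard deviation $1$ for game 1, and variance $10^6$ and standard deviation $1000$ for game 2, matching the example above.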