# Bayes' Theorem: Disease Probabilities ## What We're Given * $D \in \{0, 1\}$ = has disease * $T \in \{0, 1\}$ = test result * (In the real world, however, we cannot observe $D$, only $T$) * 99% accuracy: * $\Pr(T = 1 \mid D = 1) = 0.99$ * $\Pr(T = 0 \mid D = 0) = 0.99$ * Rare disease: 1 in 10000 people has it * $\Pr(D = 1) = \frac{1}{10000}$ ## Basic Bayes: Probability of Disease Given Positive Test We want $$ \Pr(D = 1 \mid T = 1) $$ Which we can compute, using the given information, via Bayes' rule (where the second line is how I like to write it out, a bit more cluttered but makes it clear how to compute the denominator) $$ \begin{align*} \Pr(D = 1 \mid T = 1) &= \frac{\Pr(T = 1 \mid D = 1)\Pr(D = 1)}{\Pr(T = 1)} \\ &= \frac{\Pr(T = 1 \mid D = 1)\Pr(D = 1)}{\Pr(T = 1 \mid D = 1)\Pr(D = 1) + \Pr(T = 1 \mid D = 0)\Pr(D = 0)} \end{align*} $$ Numerically: $$ \Pr(D = 1 \mid T = 1) = \frac{(0.99)(1/10000)}{(0.99)(1/10000) + (0.01)(9999/10000)} \approx 0.0098 $$ (Less than 1%) ## Deeper Dive But now let's think about what's behind this... It's a dangerous, **highly contagious** disease, meaning that (societally) false **negatives** are much, much worse than false **positives**: * A false **negative**, in this case, means that someone is walking around thinking they **don't** have the disease (because they tested negative), when they actually **do**. This means that they are not quarantining, they are going out to parties and events and etc., spreading the disease. * A false **positive**, on the other hand, means someone who panics unnecessarily: maybe it means, they go to the hospital, the hospital performs additional tests, and successfully discovers that the person, despite their positive test result, doesn't have the disease. * So, **consequence-wise**, a **false negative** potentially means a new outbreak of the disease in the society, while a **false positive** means a quick (scary, but hopefully quick) trip to the hospital ### The Catastrophic Case So, let's focus on the disastrous first case: what's the probability of a **false negative**? First, we can compute the **conditional** probability of a negative test, conditional on someone having the disease? Here we just use our **complement rule** of probability: that $\Pr(E^c) = 1 - \Pr(E)$ for any event $E$: $$ \Pr(T = 0 \mid D = 1) = 1 - \Pr(T = 1 \mid D = 1) = 0.01 $$ Now that we know this, let's incorporate the **base rate** information---that is, the information we have about the likelihood of having the disease (the thing we conditioned on above): $$ \Pr(D = 1) = \frac{1}{10000} $$ So, given these two pieces of information, we can compute the probability of a person in the population being a **false negative case**: having the disease, but not being detected by the test. $$ \begin{align*} \Pr(T = 0 \cap D = 1) &= \Pr(T = 0 \mid D = 1)\Pr(D = 1) \\ &= (0.01)(1/10000) = \frac{1}{1000000}, \end{align*} $$ i.e., **one in a million**. ### The Bad (But Not Catastrophic) Case Now let's turn to the second, bad but not catastrophic, case: the probability of a **false positive**. As before, we start by computing the **conditional probability** of a positive test result for someone who in fact does **not** have the disease: $$ \Pr(T = 1 \mid D = 0) = 1 - \Pr(T = 0 \mid D = 0) = 0.01 $$ This time, however, we'll see that the base rate will make a big difference. The base rate in this case---the probability of someone **not** having the disease---is: $$ \Pr(D = 0) = 1 - \Pr(D = 1) = \frac{9999}{10000} $$ So, incorporating these two pieces of information, we can compute the likelihood of a **false positive case**: someone in the population who **doesn't** have the disease but **does** test positive: $$ \begin{align*} \Pr(T = 1 \cap D = 0) &= \Pr(T = 1 \mid D = 0)\Pr(D = 0) \\ &= (0.01)(9999/10000) = \frac{9999}{1000000} \end{align*} $$ In words: for every million people in the population, 9999 of them will have a **false positive** panic: they won't have the disease, but they will **think** they have the disease because of their positive test. ## Putting It Together: At first, this example is depressing: "Oh no, that's terrible! We're forcing thousands of people to panic, thinking that they have the disease, when they really don't!" But, walking through it with this false negative / false positive paradigm, we see the real takeaway: that there is always a **tradeoff** between false positives and false negatives. In this case, from a public health perspective for example, it's actually somewhat of a **good** situation: at the "cost" of having several thousand people panic unnecessarily, we **achieve** the benefit of making it very, very unlikely (one in a million, literally) that someone goes undetected in the population with this dangerous, contagious disease. <table> <thead> <tr style="border-top: 0px solid black;"> <th colspan="2" rowspan="2" style="border-top: 0px solid black; border-left: 0px solid black;"></th> <th colspan="2">True State of the World</th> </tr> <tr> <th>0</th> <th>1</th> </tr> </thead> <tbody> <tr> <td rowspan="2"><b>Prediction</b></td> <td><b>0</b></td> <td>True Negative</td> <td>False Negative</td> </tr> <tr> <td><b>1</b></td> <td>False Positive</td> <td>True Positive</td> </tr> </tbody> </table> ### (Compuation of the remaining two cells:) **True Positive**: $$ \begin{align*} \Pr(T = 1 \cap D = 1) &= \Pr(T = 1 \mid D = 1)\Pr(D = 1) \\ &= (0.99)\frac{1}{10000} = \frac{99}{1000000} \end{align*} $$ **True Negative**: $$ \begin{align*} \Pr(T = 0 \cap D = 0) &= \Pr(T = 0 \mid D = 0)\Pr(D = 0) \\ &= (0.99)\frac{9999}{10000} = \frac{989901}{1000000} \end{align*} $$