# Probability and Statistics

> Condensed notes from basic probability and statistics

## Basic probability

### Sample Space

All possible outcomes of a random experiment: $\wp$

### Event

A subset of a sample space. For any event $A$:

> $P(A^c) = 1 - P(A)$
>
> $P(A) = 1 - P(A^c)$

### Equally Likely Outcomes

Finitely many outcomes, `N`

All outcomes are known

For any event $A$, $P(A) = \frac{|A|}{N}$

#### Scenario I, the easy case

When it is practical to list all outcomes in $\wp$

`ex` roll two dice:

> $P(\text{sum} = 8): \, (2,6)\,,\,(6,2)\,,\,(5,3)\,,\,(3,5)\,,\,(4,4) = \frac{5}{36}$

#### Scenario II, combined experiments

When a random experiment (*RE*) can be expressed as a sequence of `k` smaller *RE*s with $n_1, n_2, \ldots, n_k$ possible outcomes, then $N = \prod_{i=1}^{k} n_i$

`ex` flipping `n` coins, $N = 2^n$

`ex` rolling `n` dice, $N = 6^n$

`ex` small school with three classes; select a group of three students, one from each class, *at random*, where

> c1: 6 boys, 4 girls
>
> c2: 5 boys, 7 girls
>
> c3: 6 boys, 9 girls

$$ P(\text{all girls}) = \frac{\text{product of girls in each class}}{\text{product of students in each class}} = \frac{4 \cdot 7 \cdot 9}{10 \cdot 12 \cdot 15} $$

$$ P(\text{one boy, two girls}) = \frac{6 \cdot 7 \cdot 9 \,+\, 4 \cdot 5 \cdot 9 \,+\, 4 \cdot 7 \cdot 6}{10 \cdot 12 \cdot 15} $$

#### Scenario III, permutations

When a *RE* consists of arranging objects in some order

`ex` the probability that three particular horses out of eight finish in the top three places:

$$ \frac{\text{arrange three horses} \,\times\, \text{arrange remaining five}}{\text{arrange all eight}} = \frac{3! \cdot 5!}{8!} = \frac{6 \cdot 5!}{8!} = \frac{6!}{8!} = \frac{1}{56} $$

`ex` ten children randomly seated in a line:

$$ P(\text{two friends sit together}) = \frac{2! \cdot 8! \cdot 9}{10!} $$

#### Scenario IV, subset selections

When a *RE* consists of randomly selecting `k` distinct objects out of `n` total objects ($k \leq n$), also written `nCk`:

$$ {n \choose k} = \frac{n!}{k!
\, (n-k)!} $$

`ex` a box of thirty items, of which nine are broken, pick six at random:

$$ P(2 \text{ broken}) = \frac{\text{pick 4/21 working} \times \text{pick 2/9 broken}}{\text{pick 6/30}} = \frac{{21 \choose 4} \cdot {9 \choose 2}}{{30 \choose 6}} $$

$$ P(\text{at least }3\text{ broken}) = 1 - P(\text{at most }2\text{ broken}) = 1 - P(\text{none}\cup\text{one}\cup\text{two}) \\= 1 - (P(0) + P(1) + P(2)) = 1 - \frac{{21 \choose 6}{9 \choose 0} + {21 \choose 5}{9 \choose 1} + {21 \choose 4}{9 \choose 2}}{{30 \choose 6}} \approx 0.237 $$

## Conditional probability

Probability of an event when some information is known *a priori*

`ex` $P(\text{two aces})$, given a poker hand already known to contain at least one king

$$ P(A \,|\, B) = \frac{P(A \cap B)}{P(B)} $$

`ex` in some town, `60%` of households own a dog, `30%` own a cat, and `12%` own both. Given a household owns a dog, what is the probability they own a cat as well?

$$ P(C \,|\, D) = \frac{P(C \cap D)}{P(D)} = \frac{0.12}{0.6} = 0.2 $$

`ex` Mr. Smith has two children. What is the probability he has two boys, given that we know he has at least one boy?

$$ \wp: \{BB, BG, GB, GG\} $$

$$ P(2B \,|\, 1B) = \frac{P(2B \cap 1B)}{P(1B)} = \frac{P(2B)}{P(1B)} = \frac{P(2B)}{1 - P(2G)} = \frac{\frac{1}{4}}{1 - \frac{1}{4}} = \frac{1}{3} $$

### Bayes' Theorem

Events $B_1, \ldots, B_n$ are a partition of $\wp$ if:

> $\forall \, i \neq j, \; B_i \cap B_j = \varnothing$
>
> $\bigcup_{i=1}^{n} B_i = \wp$

If $B_1, \ldots, B_n$ are a partition of $\wp$, then for any event $A$:

> $P(A) = \sum_{i=1}^{n} P(A \,|\, B_i) \cdot P(B_i)$ ("law of total probability")
>
> $P(B_j \,|\, A) = \frac{P(A \,|\, B_j) \cdot P(B_j)}{P(A)}$ where $P(A)$ is given by the law of total probability above

### Independent events

Events $A$ and $B$ are **independent** if $P(A \,|\, B) = P(A)$; equivalently, $P(A \cap B) = P(A) \cdot P(B)$

## Discrete random variables

The probability distribution of a DRV $X$ takes the form of a probability mass function, $p(x) = P(X=x) \; \forall \, x \in \mathbb{R}$

1. 
for any $x \in \mathbb{R}$, $0 \leq p(x) \leq 1$
2. if $x \notin R(X)$, $p(x) = 0$, where $R(X)$ is the range of the DRV $X$
3. $\sum_{\text{all } x} p(x) = 1$
4. $P(3 \leq X \leq 5) = p(3) + p(4) + p(5)$

#### Expected value

> $E(X) = \sum_{\text{all } x} x \cdot p(x)$
>
> $E(g(X)) = \sum_{\text{all } x} g(x) \cdot p(x)$
>
> $E(a\,X + b) = a\,E(X) + b$

#### Variance, standard deviation, coefficient of variation

> $\sigma^2 = var(X) = E((X - E(X))^2) = E(X^2) - E(X)^2$
>
> $var(a\,X + b) = a^2\,var(X)$
>
> $\sigma = SD(X) = \sqrt{var(X)}$
>
> $SD(a\,X + b) = |a|\,SD(X)$
>
> $CV(X) = \frac{SD(X)}{E(X)}$

### Binomial RV

For a binomial random variable $X$, denoted $X \sim bin(n,p)$, the probability mass function is given by:

$$ p(x) = {n \choose x} \, p^x \, (1-p)^{n-x}\,, \; x \in \{0, 1, \ldots, n\} $$

$$ E(X) = n\cdot p\,,\; var(X) = np\,(1-p)\,,\; SD(X) = \sqrt{np\,(1-p)} $$

### Poisson RV

The probability mass function of a Poisson random variable, denoted $X \sim Pois(\lambda)$, is obtained from the binomial pmf in the limit $n \to \infty$ with $np = \lambda$ held fixed:

$$ p(x) = \frac{e^{-\lambda} \lambda^x}{x!}\,, \; x \in \{0, 1, 2, \ldots\} $$

$$ E(X) = \lambda\,,\; var(X) = \lambda\,,\; SD(X) = \sqrt{\lambda} $$

### Helpful series sums

$$ \sum_{i=1}^{n} i = \frac{n(n+1)}{2} $$

For $|q| \lt 1$:

$$ \sum_{x=0}^{\infty} q^x = \frac{1}{1-q} $$

$$ \sum_{x=1}^{\infty} q^x = \frac{q}{1-q} $$

$$ \sum_{x=0}^{\infty} x \cdot q^x = \frac{q}{(1-q)^2} $$

## Continuous random variables

The probability density function of a continuous random variable $X$ is a function $f(x)$ such that for any interval $[a,\,b] \subseteq \mathbb{R}$:

$$ P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx $$

1. $f(x) \geq 0$, but $f(x) \leq 1$ need not hold
2. $\int_{-\infty}^{\infty} f(x) \, dx = 1$
3. if $x \notin R(X)$ then $f(x) = 0$
4. 
for **any** CRV, $P(X = x) = 0 \implies P(a \leq X \leq b) = P(a \lt X \lt b)$

#### Expected value

$$ E(X) = \int_{-\infty}^{\infty} x \cdot f(x) \, dx $$

$$ E(g(X)) = \int_{-\infty}^{\infty} g(x) \cdot f(x) \, dx $$

$E(X)$, $E(g(X))$, $var(X)$, and $SD(X)$ behave the same as in the DRV case

### Uniform RV

A continuous random variable, $X$, is called "uniform" on $(A,\,B)$ if $f(x) = \frac{1}{B-A}$ on $(A,\,B)$, denoted $X \sim U(A,\,B)$

$$ P(A \lt c \lt X \lt d \lt B) = \frac{d-c}{B-A} $$

1. $E(X) = \frac{A+B}{2}$
2. $var(X) = \frac{(B-A)^2}{12}$
3. $SD(X) = \frac{B-A}{2\sqrt{3}}$

### Normal RV

A continuous random variable, $Z$, is called a standard normal (*Gaussian*) RV if $f(z) = \frac{1}{\sqrt{2 \pi}} \exp[-z^2/2]$ for $z \in \mathbb{R}$, denoted $Z \sim N(0,\,1)$

1. $P(Z \lt z)$ is given by $\varphi(z)$, found using a Z-table
2. $P(Z \gt z) = 1 - P(Z \lt z) = 1 - \varphi(z)$
3. $E(Z) = 0$
4. $var(Z) = 1$
5. $SD(Z) = 1$

### Non-standard normal RV

Denoted $X \sim N(\mu,\,\sigma)$, non-standard normal RVs are not necessarily centered at 0, and $SD(X) \neq 1$ in general

Let $X = Z\sigma + \mu$, then

$$ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}\,,\; x \in \mathbb{R}\,,\; E(X) = \mu\,,\; var(X) = \sigma^2\,,\; SD(X) = \sigma $$

It follows that $Z = \frac{X - \mu}{\sigma}$, and therefore

$$ P(X \lt x) = P\left(\frac{X - \mu}{\sigma} \lt \frac{x - \mu}{\sigma}\right) = P\left(Z \lt \frac{x - \mu}{\sigma}\right) = \varphi\left(\frac{x - \mu}{\sigma}\right) $$

$$ P(X \gt x) = 1 - \varphi\left(\frac{x - \mu}{\sigma}\right) $$

$$ P(a \lt X \lt b) = \varphi\left(\frac{b - \mu}{\sigma}\right) - \varphi\left(\frac{a - \mu}{\sigma}\right) $$

## Statistics

### Confidence interval for $\mu$

Estimate $\mu$ with the sample mean, $\bar{X}$

Then the $100(1-\alpha)\%$ confidence interval for $\mu$ is given by:

$$ \bar{X} \pm Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} $$

For a margin of error not exceeding some $E^{*}$, 
the sample size is then:

$$ n = \left(\frac{Z_{\alpha/2}\;\sigma}{E^{*}}\right)^2 $$

When $\sigma$ is unknown but $n \gt 30$, the CI is given by:

$$ \bar{X} \pm Z_{\alpha/2}\cdot\frac{s}{\sqrt{n}} $$

When $n \leq 30$, the CI is given by:

$$ \bar{X} \pm t_{n-1,\;\alpha/2}\cdot\frac{s}{\sqrt{n}} $$

### Hypothesis testing about the mean

The test statistic, $T$, is given by:

$$ T = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} $$

The critical value, $T^*$, is given by:

$$ T^* = \begin{cases} \;n \gt 30, \; H_a \text{ is one-sided :} &\, Z_{\alpha}\\ \;n \leq 30, \; H_a \text{ is one-sided :} &\, t_{n-1,\;\alpha}\\ \;n \gt 30, \; H_a \text{ is two-sided :} &\, Z_{\alpha/2}\\ \;n \leq 30, \; H_a \text{ is two-sided :} &\, t_{n-1,\;\alpha/2} \end{cases} $$

#### Conclusions

1. $H_a \,:\, \mu \gt \mu_0$, then we reject $H_0$ if $T \gt T^*$
2. $H_a \,:\, \mu \lt \mu_0$, then we reject $H_0$ if $T \lt -T^*$
3. $H_a \,:\, \mu \neq \mu_0$, then we reject $H_0$ if $|T| \gt T^*$

### p-value

$$ H_a: \begin{cases} \mu \gt \mu_0 &:\; 1 - \varphi(T) \\ \mu \lt \mu_0 &:\; \varphi(T) \\ \mu \neq \mu_0 &:\; 2\,(1 - \varphi(|T|)) \end{cases} $$

#### Conclusions

Reject $H_0$ if $p \lt \alpha$; fail to reject $H_0$ if $p \geq \alpha$. In particular, $p \lt 0.01$ is strong evidence against $H_0$, while $p \gt 0.1$ is little to no evidence against $H_0$.

### Inference about proportions

$p$ = proportion (%) of a population in a certain category

$p$ is unknown and the data is boolean (each observation is either in the category or not)

First, estimate $p$ with the sample proportion: $\hat{p} = \frac{\text{\# True}}{n}$

The $100(1-\alpha)\%$ confidence interval for $p$, assuming $n \gt 30$, is given by:

$$ \hat{p} \pm Z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} $$
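The proportion CI above can be checked numerically with Python's standard library; `NormalDist().inv_cdf` supplies the $Z_{\alpha/2}$ quantile, so no table lookup is needed. The function name `proportion_ci` and the 64-out-of-100 sample are illustrative, not from the notes.

```python
from statistics import NormalDist


def proportion_ci(successes: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Approximate 100*(1 - alpha)% confidence interval for a population
    proportion p (normal approximation, valid for large n)."""
    p_hat = successes / n                       # sample proportion
    z = NormalDist().inv_cdf(1 - alpha / 2)     # Z_{alpha/2}, e.g. ~1.96 for alpha=0.05
    margin = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - margin, p_hat + margin


# hypothetical sample: 64 "True" responses out of n = 100
lo, hi = proportion_ci(64, 100)
```

The interval is symmetric about $\hat{p}$, so averaging the endpoints recovers the point estimate.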