---
tags: Statistics
title: Note - Probability
---

week1
---

## Ch1 - Intro
* What is data?
    * Something we care about for a specific problem.
* What is data science?
    * Solving problems using data.
* Statistical inference: use sample data to make inferences about the population for a specific problem.
* Descriptive statistics: just summarize the data (e.g. mean, median, ...).
* Questions of interest:
    * Questions with randomness (e.g. weather forecast)
    * ex.1: stochastic phenomena in a population
    * ex.2: sampling with uncertainty

## Ch2 - Probability Theory
### Term Definition
* sample space $\Omega$: the set of all possible outcomes of an experiment.
* event: a subset of $\Omega$
* ex. toss a fair coin twice:
    * $\Omega=\{HH, HT, TH, TT\}$
    * $P(\text{exactly one tail}) = P(\{HT, TH\})$
* Statement: either true or false
* Proposition: if $p$ then $q$, where $p,\ q$ are statements.
    * If $\lnot p$, the proposition is vacuously true.
* Long-run relative-frequency interpretation
* Personal (subjective) probability interpretation

Axioms of probability
1. $P(A)\in [0,1]$ for every event $A$
2. $P(\Omega)=1$
3. Finite additivity axiom:
    * If $\{A_k \mid k=1,\dots,K\}$ are mutually disjoint, then $P(\bigcup\limits^{K}_{k=1}A_k)=\sum\limits^{K}_{k=1}P(A_k)$
* Note: [2] & [3] imply $P(\emptyset)=0$
* Laplace: it suffices to define the probability of each single outcome; the probabilities of all events are then determined.
    * Reason: the single outcomes form a partition of $\Omega$, and they are mutually disjoint.

week2
----------

### Countable
When $|\Omega|<\infty$, say $\Omega=\{w_1,\dots,w_K\}$, $P$ can be defined by the following procedure:
1. Define $P(\{w_k\})=p_k$, with $p_k\geq 0$ and $\sum\limits^{K}_{k=1}p_k=1$
2. 
$\forall\text{ event } A$, define $P(A)=\sum\limits^{K}_{k=1}p_k1(w_k\in A)$
* $1(\cdot)$ is an indicator function: it returns 1 if the condition is true and 0 otherwise
* $P:2^\Omega\longrightarrow[0,1]$, i.e. $P:A\longmapsto\sum\limits^{K}_{k=1}p_k1(w_k\in A)$

When $\Omega$ is countably infinite, say $\Omega=\{w_1,w_2,\dots\}$, we can define $P$ in a similar way.
1. Define $P(\{w_k\})=p_k$, with $p_k\geq 0$ and $\sum\limits^{\infty}_{k=1}p_k=1$
    * $\lim\limits_{K\rightarrow\infty}\sum\limits^{K}_{k=1}p_k=1$
2. $\forall\text{ event } A$, define $P(A)=\sum\limits^{\infty}_{k=1}p_k1(w_k\in A)$
* Countable: the elements of the set can be listed in some order, so we can check whether the probabilities sum to 1
* We have implicitly assumed that $P(\bigcup\limits^{\infty}_{k=1}A_k)=\sum\limits^{\infty}_{k=1}P(A_k)$ for disjoint events $A_1, A_2,\dots$

#### Countable version of the Axioms of Probability
1. $P(A)\in [0,1],\ \forall A \subset \Omega$
2. $P(\Omega)=1$
3. Countable additivity axiom:
    * If $A_1,A_2,\dots\subset\Omega$ are mutually disjoint, then $P(\bigcup\limits^{\infty}_{k=1}A_k)=\sum\limits^{\infty}_{k=1}P(A_k)$
* Note: [2] & [3] imply $P(\emptyset)=0$

### Uncountable
When $\Omega$ is uncountable, such a $P$ may (in general does) not exist on $2^\Omega$.

Alternative solution
* Define $P$ for **some**, but sufficiently many, events. At least $\Omega$, $\emptyset$, and the $\{w_k\}$ are included.
* If $A$ has a probability, then $A^c$ has a probability.
* If $A_1,A_2,\dots$ have probabilities, then $\bigcup\limits^{\infty}_{k=1}A_k$ has a probability.

Definition: A collection $\mathcal F$ of events is called a $\sigma$-algebra if:
1. $\emptyset\in\mathcal F$
2. $A^c\in \mathcal F$ if $A\in\mathcal F$
3. $\bigcup\limits^{\infty}_{k=1}A_k\in\mathcal F$ if $A_1,A_2,\dots\in\mathcal F$
* By 2. and 3., $\bigcap\limits^{\infty}_{k=1}A_k$ also has a probability.
    * $(\bigcap\limits^{\infty}_{k=1}A_k)^c=\bigcup\limits^{\infty}_{k=1}A_k^c$

#### Axioms of Probability
1. $P(A)\geq 0$, $\forall A\in\mathcal F$
2. $P(\Omega)=1$
3. $P(\bigcup\limits^\infty_{k=1}A_k)=\sum\limits^{\infty}_{k=1}P(A_k)$ if $A_1,A_2,\dots\in\mathcal F$ are **disjoint**. 
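The finite-sample-space construction above (assign $p_k$ to each outcome, then sum indicators over the event) can be sketched in Python; the coin-toss outcomes and their probabilities are illustrative:

```python
# A minimal sketch of defining P on a finite sample space: each outcome
# w_k carries a probability p_k, and P(A) = sum_k p_k * 1(w_k in A).
# The fair-coin outcomes below are illustrative.
omega = {"HH": 0.25, "HT": 0.25, "TH": 0.25, "TT": 0.25}

def prob(event):
    """P(A) by finite additivity over single outcomes."""
    return sum(p for w, p in omega.items() if w in event)

p_one_tail = prob({"HT", "TH"})  # P(exactly one tail)
p_total = prob(set(omega))       # P(Omega), should be 1
p_empty = prob(set())            # P(empty set), should be 0
```

Note that `prob` automatically satisfies finite additivity: disjoint events share no outcomes, so their sums add.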
* $(\Omega,\mathcal F,\mathbb P)$ is called a **probability space**.

### Conditional Probability
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. Let $A,B\in\mathcal F$ with $P(B)>0$.
Then the conditional probability of $A$ given $B$ is defined as:
$P(A\vert B)=\frac{P(A\cap B)}{P(B)}$

Week3
---

### Dependent/Independent
Let $(\Omega,\mathcal F, \mathbb P)$ be a probability space.
1. $A,B\in\mathcal F$ are **independent** if $P(A\cap B)=P(A)P(B)$
2. A collection $\{A_k\in\mathcal F:k\in \mathbb N\}$ is mutually independent if $P(\bigcap\limits^{k}_{j=1}A_{i_j})=\prod\limits^{k}_{j=1}P(A_{i_j}),\ \forall 1\leq i_1<\dots<i_k$
    * $i_j$: the $j$-th chosen index

> Given $P(B\vert A)$, how to find $P(A\vert B)$?
>
> $P(A\vert B)=\frac{P(A\cap B)}{P(B)}=\frac{P(A\cap B)}{P(B\cap(A\cup A^c))}=\frac{P(A\cap B)}{P((B\cap A)\cup(B\cap A^c))}=\frac{P(A\cap B)}{P(B\cap A)+P(B\cap A^c)}$
>
> Also $P(B\vert A)=\frac{P(A\cap B)}{P(A)}\Rightarrow P(A\cap B)=P(B\vert A)\times P(A)$
>
> $\Rightarrow P(A\vert B)=\frac{P(B\vert A)\times P(A)}{P(B\vert A)\times P(A) + P(B\vert A^c)\times P(A^c)}$
>
> $P(B\vert A^c)$ and $P(A^c)$ must be known

### Partition
Let $\Omega$ be a sample space. A collection of events $\{C_k\subset\Omega:k\in\mathbb N\}$ is called a partition if
1. $\bigcup\limits^{\infty}_{k=1}C_k=\Omega$
2. the $C_k$'s are disjoint

### Bayes Rule
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space and $A\in\mathcal F$ with $P(A)>0$. Let $\{C_k\in\mathcal F:k\in\mathbb N\}$ be a partition. Then
$P(C_k\vert A)=\frac{P(A\vert C_k)P(C_k)}{\sum\limits^{\infty}_{i=1}P(A\vert C_i)P(C_i)}$

### Random Variable
A random variable, usually denoted by $X$, can be viewed as a function mapping $\Omega$ to $\mathbb R$. 
($X:\Omega\rightarrow\mathbb R$ and $X^{-1}(A)\in\mathcal F,\ \forall A\in\mathcal B$)
> $\{X\geq 1\}=\{\omega\in\Omega:X(\omega)\in[1,\infty)\}=X^{-1}([1,\infty))$
> $X^{-1}$: pre-image under $X$

Week4
---

### Cumulative Distribution Function (CDF)
The cumulative distribution function (cdf) of a random variable $X$ is defined as
$F_X(x)=P(X\leq x)$
In particular, $P(X\in B)=\int_BdF_X(x)\text{ (Lebesgue-Stieltjes integral)}$

#### Theory
If $F(x)$ is a cdf, then:
1. $F(x)$ is **non-decreasing** on $\mathbb R$
2. $\lim\limits_{x\rightarrow -\infty}F(x)=0$, $\lim\limits_{x\rightarrow\infty}F(x)=1$
3. $F(x)$ is right-continuous, that is, $\lim\limits_{y\rightarrow x^+}F(y)=F(x)$

Conversely, if $F(x)$ satisfies all three properties above, then $F(x)$ is the cdf of some random variable.

#### Discrete vs. Continuous
1. $X$ is called a discrete random variable if $F_X(x)$ is a step function.
    * In particular, $F_X(x)=\sum\limits_{k=1}^{\infty}P(X=x_k)1(x_k\leq x)$ for some $\{x_k\in\mathbb R:k\in\mathbb N\}$
    * The function $P_X(x)=P(X=x)=\sum\limits_{k=1}^{\infty}P(X=x_k)1(x=x_k)$ is called the **probability mass function (pmf)**
2. $X$ is called a continuous random variable if $F_X(x)=\int_{-\infty}^{x}f_X(u)du$ for some $f_X(x)$
    * The function $f_X(x)$ is called the **probability density function (pdf)**
    * $F_X(x)$ is continuous on $\mathbb R$ (necessary)
    * $f_X(x)\geq 0$ a.e.
        * If $f_X$ is continuous, then $f_X(x)\geq 0$
        * since $F_X(x)$ is non-decreasing and $F_X'(x)=f_X(x)$ (FTC)
    * If $X$ is continuous, then $P(x\leq X < x+\Delta{x})\approx f_X(x)\Delta{x}$
> Review the Fundamental Theorem of Calculus

#### Change of variable formula
Let $Y=g(X)$, where $g$ is a $C^1$ and strictly increasing function, and $X$ is a continuous random variable. Then
$f_Y(y)=f_X\{g^{-1}(y)\}\frac{dg^{-1}(y)}{dy}$
> $C^1$: continuously differentiable (the derivative exists and is continuous)
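The change-of-variable formula can be sanity-checked numerically. This is a sketch under illustrative assumptions: $X \sim Exp(1)$ (so $f_X(x)=e^{-x}$) and $g(x)=x^2$, which is $C^1$ and strictly increasing on the support $x>0$, giving $f_Y(y)=f_X(\sqrt{y})\cdot\frac{1}{2\sqrt{y}}$:

```python
import math

# Check the change-of-variable formula for X ~ Exp(1), Y = g(X) = X^2:
# g^{-1}(y) = sqrt(y), so f_Y(y) = e^{-sqrt(y)} / (2 sqrt(y)).
def f_Y(y):
    return math.exp(-math.sqrt(y)) / (2.0 * math.sqrt(y))

# Integrating f_Y over (a, b] should match
# P(a < Y <= b) = F_X(sqrt(b)) - F_X(sqrt(a)) with F_X(x) = 1 - e^{-x}.
a, b, n = 0.5, 4.0, 100_000
h = (b - a) / n
integral = sum(0.5 * (f_Y(a + i * h) + f_Y(a + (i + 1) * h)) * h
               for i in range(n))  # trapezoid rule
exact = math.exp(-math.sqrt(a)) - math.exp(-math.sqrt(b))
```

The trapezoid sum and the exact probability agree to high accuracy, which is exactly what the formula promises.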
### Expectation
---
#### Observation
The expected number of occurrences of an event $A$ among $N$ repeated experiments is $N\times P(A)$

##### Observation 1
Let $X(\omega)=1(\omega\in A),\ \forall\omega\in\Omega$
> $X$: perform the experiment once and record whether $A$ occurs

Then define $E(X)=E\{1(\omega\in A)\}=P(A)$

##### Observation 2
Let $Y=aX$
Then $E(Y)=a\times P(A)=aE(X)$
> think of $a$ as the payoff of a bet

##### Observation 3
Let $A, B$ be two events.
$\Rightarrow Z(\omega)=a1(\omega\in A)+b1(\omega\in B)$
Then $E(Z)=aP(A)+bP(B)=aE(X)+bE(Y)$
> $X=1(\omega\in A),\ Y=1(\omega\in B),\ \forall \omega\in\Omega$
---
#### Finite
Suppose $\Omega=\{\omega_1,\omega_2,\dots,\omega_K\}$
A general random variable $X$ can be expressed as $X(\omega_k)=a_k,\ \forall\omega_k\in\Omega$, or $X(\omega)=\sum\limits_{k=1}^{K}a_k1(\omega=\omega_k)$, $\forall\omega\in\Omega$
$E(X)=\sum\limits^{K}_{k=1}a_kP(\{\omega_k\})$

#### Integral Representation (Lebesgue Integral)
$E(X)=\sum\limits^{K}_{k=1}X(\omega_k)P(\{\omega_k\})=\int_\Omega XdP$

#### Integral Representation (Stieltjes Integral)
$x_1,x_2,\dots,x_{K'}$: the distinct values among $\{a_1,a_2,\dots,a_K\}$
$E(X)=\sum\limits^{K'}_{k=1}x_kP(X=x_k)=\sum\limits^{K'}_{k=1}x_k\{F_X(x_k)-F_X(x^-_k)\}=\int_\mathbb R xdF_X(x)=\int_\mathbb R xf_X(x)dx$
> $P(X=x_k)=P(X\leq x_k) - P(X<x_k)=F_X(x_k)-\lim\limits_{x\rightarrow x^-_k}F_X(x)$
> completeness of the reals $\Rightarrow$ the left limit exists

### Definition
The expectation of a random variable $X$ is defined as
$E(X)=\int_\Omega XdP=\int_\mathbb R xdF_X(x)$
1. If $X$ is discrete with pmf $p_X(x)=\sum\limits^{\infty}_{k=1}p_k1(x=x_k)$, then $E(X)=\sum\limits^{\infty}_{k=1}x_kp_k$
2. If $X$ is continuous with pdf $f_X(x)$, then $E(X)=\int_\mathbb R xf_X(x)dx$

#### Theory
$E(X)$ minimizes $g(a)=E\{(X-a)^2\}$ over $a\in\mathbb R$, $g:\mathbb R\longrightarrow \{0\}\cup\mathbb R^+$
i.e. $E(X)=\mathop{\arg\min\limits _a}E\{(X-a)^2\}$

### Variance
$Var(X)\triangleq E[\{X-E(X)\}^2]$

#### Theory
1. $Var(aX)=a^2Var(X)$
2. $Var(X)=0\iff P(\{X=E(X)\})=1$

### Moment generating function
1. 
$\forall n\in\mathbb N$, $E(X^n)$ is called the **n-th moment**
2. $\forall n\in\mathbb N$, $E[\{X-E(X)\}^n]$ is called the **n-th central moment**
3. $M_X(t)=E(e^{tX})$ is called the **moment generating function**, provided that $E(e^{tX})<\infty$
> A moment is a functional of the CDF
> Laplace transform

#### Theory
> Need to review
1. If $M_X(t)$ exists, then $E(X^n)=\frac{d^nM_X(t)}{dt^n}\vert_{t=0}$
2. If $X$ and $Y$ are **bounded**, then $F_X(u)=F_Y(u),\ \forall u\in\mathbb R$ iff $E(X^n)=E(Y^n),\ \forall n\in\mathbb N$
3. If $M_X(t)=M_Y(t)$ on $t\in[-h,h]$ for some $h>0$, then $F_X(u)=F_Y(u),\ \forall u\in\mathbb R$

### Characteristic function
The characteristic function of a random variable $X$ is defined as:
$\varphi(t)=E\{e^{itX}\}=E\{\cos(tX)+i\sin(tX)\}$
> Both $\cos(tX)$ and $\sin(tX)$ are bounded.
> Fourier transform

#### Theory (Inversion formula)
> to be completed

Week 6
---

### Random vectors
Let $\mathcal B^p$ be the Borel $\sigma$-algebra on $\mathbb R^p$. Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space.
A map $\vec{X}=(X_1,X_2,\dots,X_p)^T:\ \Omega\longrightarrow\mathbb R^p$ is called a random vector if $\vec{X}^{-1}(B)\in\mathcal F,\ \forall B\in\mathcal B^p$
* Note: $\vec{X}$ is a random vector $\iff$ $X_1,\dots,X_p$ are random variables

### CDF of random vector
Let $\vec{X}=(X_1,X_2,\dots,X_p)^T$ be a random vector. The (joint) cumulative distribution function of $\vec{X}$ is defined as:
$F_{\vec{X}}(\vec{x})=P(X_1\leq x_1,\dots,X_p\leq x_p)$
* $F_{\vec{X}}(\vec{x})=P(\vec{X}\leq \vec{x})$ (component-wise)
* $P(\vec{X}\in B)=\int_B dF_{\vec{X}}(\vec{x}),\ \forall B \in\mathcal B^p$
* $F_{X_k}(x_k)=F_{\vec{X}}(\infty,\dots,x_k,\dots,\infty),\ k=1,\dots,p$

### Discrete & Continuous
1. If there exist $\vec{x}_1,\vec{x}_2,\dots\in\mathbb R^p$ s.t. 
$F_{\vec{X}}(\vec{x})=\sum\limits_{k:\vec{x}_k\leq\vec{x}}P(\vec{X}=\vec{x}_k)$ (component-wise $\leq$), then $\vec{X}$ is called a **discrete random vector**, and $p_{\vec{X}}(\vec{x})=P(\vec{X}=\vec{x})$ is called the (joint) probability mass function.
2. If there exists $f_{\vec{X}}(\vec{x})$ s.t. $F_{\vec{X}}(\vec{x})=\int^{x_1}_{-\infty}\dots\int^{x_p}_{-\infty}f_{\vec{X}}(\vec{u})d\vec{u}$, where $\vec{x}=(x_1,\dots,x_p)^T$, then $\vec{X}$ is called a **continuous random vector** and $f_{\vec{X}}(\vec{x})$ is called the (joint) pdf.

### Expectation
Let $\vec{X}=(X_1,\dots,X_p)^T$ be a random vector. The expectation of $\vec{X}$ is defined as
$\mathbb E(\vec{X})=(\mathbb E(X_1),\dots,\mathbb E(X_p))^T$

### Moment generating function
The moment generating function of $\vec{X}$ is $M_{\vec{X}}(\vec{t})=\mathbb E(e^{\vec{t}^T\vec{X}})$
> uniquely determines the CDF (when it exists)

### Characteristic function
The characteristic function of $\vec{X}$ is $\varphi_{\vec{X}}(\vec{t})=\mathbb E(e^{i\vec{t}^T\vec{X}})$

### Inversion Formula
Let $\vec{X}$ be a random vector. Then,
$P(a_1<X_1<b_1,\dots,a_p<X_p<b_p)+\frac{1}{2}\sum\limits^{p}_{k=1}\{P(X_k=a_k)+P(X_k=b_k)\}\\=\lim\limits_{T\rightarrow\infty}\frac{1}{(2\pi)^p}\int^{T}_{-T}\dots\int^{T}_{-T}\prod\limits^{p}_{k=1}\frac{\exp(-it_ka_k)-\exp(-it_kb_k)}{it_k}\varphi(t_1,\dots,t_p)dt_1\dots dt_p$

### Independence of random variables
$X_1,\dots,X_p$ are (mutually) independent if
$P(X_1\in B_1,\dots,X_p\in B_p)=P(X_1\in B_1)\dots P(X_p\in B_p),\ \forall$ (Borel) sets $B_1,\dots,B_p$

#### Theory
$X_1,\dots,X_p$ are independent iff $F_{\vec{X}}(x_1,\dots,x_p)=F_{X_1}(x_1)F_{X_2}(x_2)\dots F_{X_p}(x_p),\ \forall x_1,\dots,x_p$

#### Corollary
1. If $\vec{X}$ is discrete, then $X_1,\dots,X_p$ are independent iff $p_{\vec{X}}(x_1,\dots,x_p)=p_{X_1}(x_1)\dots p_{X_p}(x_p),\ \forall x_1,\dots,x_p$
2. 
If $\vec{X}$ is continuous, then $X_1,\dots,X_p$ are independent iff $f_{\vec{X}}(\vec{x})=f_{X_1}(x_1)\dots f_{X_p}(x_p),\ \forall x_1,\dots,x_p$

### Independence & Expectation
$X_1,\dots,X_p$ are independent $\Longrightarrow$ $\mathbb E\{g_1(X_1)\dots g_p(X_p)\}=\mathbb E\{g_1(X_1)\}\dots\mathbb E\{g_p(X_p)\}$

### Covariance
The covariance of $X,\ Y$ is defined by $cov(X,Y)\triangleq E(XY)-E(X)E(Y)$.
Note:
* take $g_1(X)=X,\ g_2(Y)=Y$ above: independence implies $cov(X,Y)=0$
* larger covariance -> stronger (linear) dependence
* but covariance = 0 does not imply independence

#### Theory
1. $cov(X,Y)=E[\{X-E(X)\}\{Y-E(Y)\}]$
2. $cov(X,X)=var(X)$
3. $cov(aX,Y)=a\,cov(X,Y)=cov(X,aY)$
4. $cov(aX+bY,cX+dY)=ac\,cov(X,X)+ad\,cov(X,Y)+bc\,cov(Y,X)+bd\,cov(Y,Y)$
5. $var(aX+bY)=a^2var(X)+b^2var(Y)+2ab\,cov(X,Y)$
6. (Cauchy-Schwarz Inequality) $cov(X,Y)^2\leq var(X)var(Y)$
    * Equality holds iff $Y=aX+b$ with probability 1 for some deterministic constants $a,b$.
    * **Correlation** of $X$ and $Y$: $Cor(X,Y)\triangleq\frac{cov(X,Y)}{\sqrt{var(X)var(Y)}}$

### Variance & Covariance Matrix of Random Vectors
For a random vector $\vec{X}$:
$var(\vec{X})\triangleq\mathbb E[\{\vec{X}-\mathbb E(\vec{X})\}\{\vec{X}-\mathbb E(\vec{X})\}^T]=\begin{pmatrix} cov(X_i,X_j)\end{pmatrix}_{p\times p}=\mathbb E(\vec{X}\vec{X}^T)-\mathbb E(\vec{X})\mathbb E(\vec{X})^T$
$cov(\vec{X}, \vec{Y})\triangleq\mathbb E(\vec{X}\vec{Y}^T)-\mathbb E(\vec{X})\mathbb E(\vec{Y})^T=\begin{pmatrix}cov(X_i,Y_j)\end{pmatrix}_{p\times q}$

Week 7
---

### Conditional distribution
Let $X$, $Y$ be two random variables.
1. If $(X, Y)^T$ is discrete, then the conditional probability mass function of $X$ given $Y=y$ is
$P_{X\vert Y}(x\vert y)=P(X=x\vert Y=y)=\frac{P_{X,Y}(x, y)}{P_Y(y)}$
    * $P_{X\vert Y}(x\vert y)=P_X(x),\ \forall\ x,\ y \iff X\perp Y$
    * conditional cdf = $\sum\limits_{x_k\leq x}P(X=x_k\vert Y=y)$
2. 
If $(X, Y)^T$ is continuous, then the conditional probability density function of $X$ given $Y=y$ is
$f_{X\vert Y}(x\vert y)=\frac{f_{X,Y}(x, y)}{f_Y(y)}$
    * $f_{X\vert Y}(x\vert y)=f_X(x),\ \forall\ x,\ y \iff X\perp Y$
    * conditional cdf: $F_{X\vert Y}(x\vert y) = \int_{-\infty}^{x}f_{X\vert Y}(u\vert y)du$
    * This definition can be used even when $P(Y=y) = 0$, i.e. whenever the conditional density function is defined.
> If $\mathbb P(Y=y)=0$, intuitively $\mathbb P(X\leq x|Y=y)=\lim\limits_{\Delta{y}\rightarrow 0}\mathbb P(X\leq x|y\leq Y<y+\Delta{y})$

### Conditional Expectation
1. If $(X,Y)^T$ is discrete, then $\mathbb E(X\vert Y=y)=\sum\limits_{x}x\mathbb P(X=x\vert Y=y)$
2. If $(X,Y)^T$ is continuous, then $\mathbb E(X\vert Y=y)=\int_{-\infty}^{\infty}xf_{X\vert Y}(x\vert y)dx$

We denote $\mathbb E(X\vert Y)=g(Y)$, where $g(y)=\mathbb E(X\vert Y=y)$

### Conditional variance
$var(X\vert Y=y)\triangleq\mathbb E[\{X-\mathbb E(X\vert Y=y)\}^2\vert Y=y]$

#### Proposition
1. $\mathbb E\{g(X,Y)\vert Y=y\}=\mathbb E\{g(X,y)\vert Y=y\}$. In particular, $\mathbb E\{g_1(X)g_2(Y)\vert Y=y\}=\mathbb E\{g_1(X)\vert Y=y\}g_2(y)$
2. $\mathbb E\{\mathbb E(X\vert Y)\}=\mathbb E(X)$
3. 
$var(X)=\mathbb E\{var(X\vert Y)\}+var\{\mathbb E(X\vert Y)\}$

#### Theory
$\mathbb E(X\vert Y=y)$ minimizes $E[\{X-g(Y)\}^2]$ over $g$

#### Theory (Chebyshev's Inequality)
$P(|X-\mathbb E(X)|\geq \epsilon)\leq\frac{var(X)}{\epsilon^2},\ \forall\epsilon > 0$

---

### $\S$ Common families of distribution
#### Discrete Uniform Distribution
$\Omega=\{1,\dots,N\}$, $\mathcal F=2^\Omega$
$P(\{k\})=\frac{1}{N},\ \forall k=1,\dots,N$
Consider $X(\omega)=\omega$
$P(X=k)=\begin{cases} \frac{1}{N} & \text{if } k = 1,\dots,N\\ 0 & \text{otherwise} \end{cases}$
$\mathbb E(X)=?$ $var(X)=?$

#### Hypergeometric Distribution
Scenario: draw $n$ balls from $N$ balls ($K$ black, $N-K$ white)
$|\Omega|=\dbinom{N}{n}$
$P(\{\omega\})=\frac{1}{\binom{N}{n}},\ \forall\omega\in\Omega$
Let $X$ be the number of black balls drawn; then:
$P(X=k)=\dbinom{K}{k}\dbinom{N-K}{n-k}\Big/\dbinom{N}{n},\ \text{for } k = 0, 1, 2, \dots, \min(K, n)$

##### Record with order
$|\Omega|=P^{N}_{n}$, $P(\{\omega\})=1/P^{N}_{n}$
Let $X$ be the number of black balls drawn; then:
$P(X=k)=\dbinom{n}{k}P^{K}_{k}P^{N-K}_{n-k}/P^{N}_{n}=\dbinom{K}{k}\dbinom{N-K}{n-k}\Big/\dbinom{N}{n}$
$E(X)=\sum\limits_{k}kP(X=k) \\=\sum\limits^{n\land K}_{k=1}k\frac{\dbinom{K}{k}\dbinom{N-K}{n-k}}{\dbinom{N}{n}}\\ =\sum\limits^{n\land K}_{k=1}K\frac{\dbinom{K-1}{k-1}\dbinom{N-K}{n-k}}{\dbinom{N}{n}} \\=K\sum\limits^{n\land K}_{k=1}\frac{\dbinom{K-1}{k-1}\dbinom{N-K}{n-k}}{\dbinom{N-1}{n-1}}\times\frac{\dbinom{N-1}{n-1}}{\dbinom{N}{n}}=K\times 1\times\frac{n}{N}=\frac{nK}{N}$
> note: $a\land b=\min(a, b)$

#### Bernoulli Distribution
Definition: $X$ follows a Bernoulli distribution if
$P_X(x)=p^x(1-p)^{1-x}\text{ for } x = 0, 1,\text{ where }p\in [0,1]$
We denote $X\backsim Bern(p)\iff P(X=1)=p,\ P(X=0)=1-p$

#### Binomial Distribution
Suppose that $Y_1,\dots,Y_n$ are independent and identically distributed (i.i.d.) from $Bern(p)$
Let $X=\sum\limits^{n}_{i=1}Y_i$
$\Rightarrow P(X=x)=\binom{n}{x}p^x(1-p)^{n-x}$, for $x=0,1,\dots,n$
We call $X$ binomially distributed and denote 
$X\backsim Binom(n,p)$
---
$X\backsim \text{HyperGeometric}(N,K,n)$ i.e. $P(X=x)=\dbinom{K}{x}\dbinom{N-K}{n-x}\Big/\dbinom{N}{n}$
---
#### Poisson Distribution
> Poisson process:
> Scenario: count the total number of customers from opening to closing.
> Method: split the time interval into $n$ equal pieces; each piece either contains a customer or not, and the probability that a customer arrives in a piece is proportional to its length, so each piece can be viewed as $Bern(\frac{\lambda}{n})$.
> The Poisson distribution can thus be viewed as the limit of $n$ i.i.d. Bernoulli trials.

Let $N$ be a counting process on $[0,1]$ s.t.
1. $\forall 0\leq s_1<t_1\leq s_2<t_2\leq 1$, $N(t_1)-N(s_1)$ and $N(t_2)-N(s_2)$ are independent.
2. $\forall 0<t<1$, $P\{N(t+h)-N(t)=1\}=\lambda h+o(h)$

Consider a partition $0<\frac{1}{n}<\dots<\frac{n-1}{n} < 1$; then:
$N(\frac{i}{n})-N(\frac{i-1}{n}),\ (i=1,\dots,n)$ are i.i.d. from $Bern(\frac{\lambda}{n})$
i.e. $P(N(1)=x)=\binom n x(\frac{\lambda}{n})^x(1-\frac{\lambda}{n})^{n-x}$
As $n\longrightarrow\infty$, $P(N(1)=x)\longrightarrow\frac{\lambda^x}{x!}e^{-\lambda}$ for all $x\in\mathbb N\cup\{0\}$, and this limiting distribution is called the Poisson distribution.
$E(X)=\sum\limits^{\infty}_{x=0}x\frac{\lambda^x}{x!}e^{-\lambda}=\lambda\sum\limits^{\infty}_{x=1}\frac{\lambda^{x-1}}{(x-1)!}e^{-\lambda}=\lambda$

#### Geometric Distribution
Let $\{Y_i:i\in\mathbb N\}$ be a sequence of i.i.d. $Bern(p)$ random variables.
Let $X=\min\{k\in\mathbb Z:k\geq 0,\ Y_{k+1}=1\}$, which is the number of failures before the first success.
$P(X=x)=(1-p)^xp$, $\forall x\in\{0\}\cup\mathbb N$
We say that $X$ follows a geometric distribution.

#### Negative Binomial Distribution
Let $X$ be the number of failures before the $r$-th success. Then $X\backsim\mathop{NegativeBinomial}(r,p)$
$P(X=x)=\binom{r+x-1}{r-1}(1-p)^xp^r$, for $x\in\{0\}\cup\mathbb N$
---

### Continuous Distribution Families
#### Uniform Distribution
**Definition** $X$ follows a uniform distribution if
$f_X(x;a,b)=\frac{1}{b - a}1(a\leq x \leq b)$, where $a<b$ are two parameters. 
We denote $X\backsim\mathop{Unif(a,b)}$

**Theory** Let $X$ be a continuous random variable with density function $f_X(x)$ and cdf $F_X(x)$; then $U=F_X(X)\backsim\mathop{Unif(0,1)}$
> Suppose $F_X$ is strictly increasing on some interval; then its inverse exists on that interval.

$\mathbb P(F_X(X)\leq x)=\mathbb P(X\leq F_X^{-1}(x))=F_X(F_X^{-1}(x))=x$ for $x\in [0,1]$
> Use the uniform distribution to generate any distribution whose CDF is strictly increasing:
> generate $U\backsim\mathop{Unif(0,1)}$ -> compute $F^{-1}$ -> let $X=F^{-1}(U)$ -> $X\backsim F(x)$

#### Exponential Distribution
Following the development of the Poisson distribution, we can extend the time period to $[0,t]$.
Let $N(t)$ be the number of customers in $[0,t]$ ($Binom(n,\frac{\lambda t}{n})$, approximately $Poisson(\lambda t)$)
Let $T_1$ be the time of the first success; then
$P(T_1>t)=P(N(t)=0)=\exp(-\lambda t)$ for $t>0$
$\Rightarrow$ CDF of $T_1$: $F_{T_1}(t)=1-e^{-\lambda t}$ for $t\in [0,\infty)$
$\Rightarrow f_{T_1}(t)=\lambda e^{-\lambda t}$ for $t\in [0,\infty)$
$\Rightarrow$ We say $T_1$ follows an exponential distribution, denoted by $T_1\backsim\mathop{Exp(\lambda)}$
$\beta=\frac{1}{\lambda}$ is called the scale parameter. (When $\beta$ is smaller, the time to the first success tends to be shorter.) 
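The inverse-transform recipe above (generate $U\backsim Unif(0,1)$, set $X=F^{-1}(U)$) can be sketched for the exponential distribution, where $F^{-1}(u)=-\ln(1-u)/\lambda$ in closed form; $\lambda$ and the sample size are illustrative:

```python
import math
import random

# Inverse-transform sampling for Exp(lambda): F(x) = 1 - e^{-lam x},
# so F^{-1}(u) = -ln(1 - u) / lam. The seed makes the sketch reproducible.
random.seed(0)
lam = 2.0
n = 200_000
samples = [-math.log(1.0 - random.random()) / lam for _ in range(n)]

sample_mean = sum(samples) / n  # should be close to E(X) = 1/lam = 0.5
```

By the WLLN (covered below), the sample mean of the generated draws converges to $1/\lambda$, which is a quick check that the transform produced the right distribution.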
##### Combination of two Exponential Distributions
Let $X_1\backsim\mathop{Exp(\lambda)}$, $X_2\backsim\mathop{Exp(\lambda)}$, with $X_1, X_2$ independent; then:
$\mathbb P(X_1+X_2\leq x)=\int_{0}^{x}\int_{0}^{x-x_2}\lambda e^{-\lambda x_1}\lambda e^{-\lambda x_2}dx_1dx_2$
$\Rightarrow f_{X_1+X_2}(x)=\lambda^2xe^{-\lambda x}$ for $x>0$

##### Theory
If $X_i\backsim\mathop{Exp(\lambda)},\ i=1,\dots,n$ and $\{X_1,\dots,X_n\}$ are mutually independent, then $Y=\sum\limits_{i=1}^{n}X_i$ has pdf
$f_Y(x)=\frac{\lambda^n}{(n-1)!}x^{n - 1}e^{-\lambda x}$ for $x>0$
$\Rightarrow$ the normalization follows from $\int_{0}^{\infty}u^{n-1} e^{-u}du\triangleq\Gamma(n)=(n-1)!$
We call this the Gamma distribution, denoted by $Y\backsim\mathop{Gamma(n,\lambda)}$

#### Weibull Distribution
Let $X\backsim\mathop{Exp(\lambda)}$ and consider $Y=X^{\frac{1}{\gamma}}$; then $Y$ follows a Weibull distribution, denoted by $Y\backsim\mathop{Weibull(\lambda,\gamma)}$

#### Normal Distribution
**Definition** $X$ follows a normal distribution with mean $\mu$ and variance $\sigma^2$ if
$f_X(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
> Expectation = ? Variance = ? MGF = ? Characteristic function = ?

##### Combination of Normal Distributions
$X_1\backsim\mathop{N(\mu_1,\sigma^2_1)}$, $X_2\backsim\mathop{N(\mu_2,\sigma^2_2)}$, with $X_1,X_2$ independent.
$\Rightarrow X_1 + X_2\backsim\mathop{N(\mu_1+\mu_2,\sigma^2_1+\sigma^2_2)}$
$M_{X_1+X_2}(t)=E(e^{t(X_1+X_2)})=E(e^{tX_1}e^{tX_2})=E(e^{tX_1})E(e^{tX_2})=e^{\mu_1 t+\frac{\sigma^2_1}{2}t^2}e^{\mu_2 t+\frac{\sigma^2_2}{2}t^2}=e^{(\mu_1+\mu_2)t+\frac{\sigma^2_1 + \sigma^2_2}{2}t^2}$ (using independence)

**Definition** $\vec{X}\in\mathbb R^p$ follows a multivariate normal distribution if $\vec{a}^T\vec{X}$ follows a univariate normal distribution 
$\forall\vec{a}\in\mathbb R^p$

**Theory** If $\vec{X}$ follows a multivariate normal distribution, then:
$f_{\vec{X}}(\vec{x};\mu,\Sigma)=\frac{1}{(2\pi)^{p/2}\det(\Sigma)^{1/2}}e^{-\frac{(\vec{x}-\mu)^T\Sigma^{-1}(\vec{x}-\mu)}{2}}$
> $\Sigma$ must be positive semi-definite (the density above exists when $\Sigma$ is positive definite)

**Theory** If $\vec{X}\backsim N_p(\mu,\Sigma)$ and $A$ is a $q\times p$ deterministic matrix of full rank, then:
$A\vec{X}+\vec{b}\backsim N_q(A\mu+\vec{b},A\Sigma A^T)$

**Theory** Let $(X,Y)^T\backsim N_2(\begin{pmatrix}\mu_X\\\mu_Y\end{pmatrix}, \begin{pmatrix} \sigma^2_X & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma^2_Y\\ \end{pmatrix})$, then:
$X|Y=y\backsim\mathop{N(\mu_X+(y-\mu_Y)\rho\sigma_X/\sigma_Y,\ \sigma_X^2(1-\rho^2))}$
> If the correlation is 0, then jointly normal variables are independent.
> Proving the 2-d case is easier.

#### Chi-Square Distribution
If $\vec{Z}\backsim N_p(0,I_p)$, then $X=\vec{Z}^T\vec{Z}$ follows a $\chi^2_p$ distribution. In particular,
$f_X(x;p)=\frac{x^{p/2-1}e^{-x/2}}{2^{p/2}\Gamma(p/2)}1(x>0)$
i.e. $\chi^2_p\equiv\Gamma(p/2,1/2)$

#### Student's t-distribution
$Z\backsim\mathop{N(0,1)}$, $Y\backsim\chi^2_p$, with $Z,Y$ independent
$\Rightarrow X=\frac{Z}{(Y/p)^{1/2}}$ follows a $t_p$ distribution.

### $\S$ Random Sample
**Definition** A collection of random variables/vectors $\{\vec{X}_i\}^n_{i=1}$ is called a random sample if $\vec{X_1},\dots,\vec{X_n}$ are i.i.d. (independent, identically distributed) from a distribution $F_{\vec{X}}(\vec{x})$

Let $\{X_i\}^n_{i=1}$ be a random sample and $T(x_1,\dots,x_n)$ be a function. $T(X_1,\dots,X_n)$ (another random variable) is called a "statistic" if it contains no unknown parameter (it depends only on $x_1,\dots,x_n$, the realized values).
The distribution of $T(X_1,\dots,X_n)$ is called the sampling distribution.
e.g. 
* $\frac{X_1+X_2+\dots+X_n}{n}\triangleq\bar{X}$: sample mean
* $\frac{1}{n-1}\sum\limits^{n}_{i=1}(X_i-\bar{X})^2\triangleq S^2$: sample variance

$\mathbb E(\bar{X})=\mathbb E(\frac{1}{n}\sum\limits^{n}_{i=1}X_i) = \frac{1}{n}\sum\limits^{n}_{i=1}\mathbb E(X_i)=\frac{1}{n}\sum\limits^{n}_{i=1}\mu=\mu$ **(the sample mean is unbiased for the population mean)**
> $\mathbb E(X_i) = \mu,\ \forall i$ (since i.i.d.)

$var(\bar{X})=var(\frac{1}{n}\sum\limits^{n}_{i=1}X_i)=\frac{1}{n^2}var(\sum\limits^{n}_{i=1}X_i)=\frac{1}{n^2}(\sum\limits^{n}_{i=1}var(X_i)+2\sum\limits_{i<j}cov(X_i, X_j))\\=\frac{1}{n^2}\sum\limits^{n}_{i=1}var(X_i)=\frac{\sigma^2}{n}\rightarrow0$ as $n\rightarrow\infty$
> $cov(X_i, X_j)=0$, since i.i.d.
> As $n\rightarrow\infty$, the sample mean $\bar{X}\rightarrow\mu$ (the population mean)

$\mathbb E(S^2)=\mathbb E\{\frac{1}{n - 1}\sum\limits^{n}_{i=1}(X_i - \bar{X})^2\}=\frac{1}{n - 1}\sum\limits^{n}_{i=1}\mathbb E\{(X_i-\bar{X})^2\}\\=\frac{1}{n - 1}\sum\limits^{n}_{i = 1}\mathbb E\{(X_i - \frac{1}{n} \sum\limits^{n}_{j = 1}X_j)^2\}=?$
---
Let $\vec{X}=\begin{pmatrix}X_1 \\ X_2 \\ \vdots\\ X_n\end{pmatrix} \Rightarrow \bar{X}=n^{-1}1^T_n\vec{X}$
$\Rightarrow\vec{X}-\bar{X}1_{n}=\vec{X}-n^{-1}1_n 1^T_n \vec{X}=(\mathbb I_n - n^{-1}1_n1^T_n)\vec{X}$
$\Rightarrow\sum\limits^{n}_{i=1}(X_i-\bar{X})^2\\ =\{(\mathbb I_n - n^{-1}1_n1^T_n)\vec{X}\}^T\{(\mathbb I_n - n^{-1}1_n1^T_n)\vec{X}\}\\ =\vec{X}^T(\mathbb I_n - n^{-1}1_n1^T_n)^T(\mathbb I_n - n^{-1}1_n1^T_n)\vec{X}\\ =\vec{X}^T(\mathbb I_n - n^{-1}1_n1^T_n)\vec{X}$ (the matrix is symmetric and idempotent)
$\Rightarrow S^2=\frac{1}{n-1}\vec{X}^T(\mathbb I_n - n^{-1}1_n1^T_n)\vec{X}$
$\Rightarrow\mathbb E(S^2)=\mathbb E\{\frac{1}{n-1}\vec{X}^T(\mathbb I_n - n^{-1}1_n1^T_n)\vec{X}\}\\ =\frac{1}{n - 1}\mathbb E\{\vec{X}^T(\mathbb I_n - n^{-1}1_n1^T_n)\vec{X}\}\\ =\frac{1}{n - 1}[\mu^2 1^T_n (\mathbb I_n-n^{-1} 1_n 1 ^T_n)1_n + \sigma^2 \mathop{tr}\{(\mathbb I_n-n^{-1} 1_n 1 ^T_n)\mathbb I_n\}]\\ =\frac{1}{n - 1} (n - 1)\sigma^2 = \sigma^2$
> 
$\mathbb E(\vec{X}^TA\vec{X})=\mathbb E(\vec{X})^T A \mathbb E(\vec{X}) + \mathop{tr}\{A \mathop {var}(\vec{X})\}$
> How to prove this?
> $\mathbb E(\vec{X})=\begin{pmatrix}\mu \\ \mu \\ \vdots \\ \mu\end{pmatrix}=\mu 1_n$
> $var(\vec{X})=\sigma^2 \mathbb I_n$
---
> Note: If $S^2=\frac{1}{n-1}\sum\limits^{n}_{i=1}(X_i - \bar{X})^2$, then $\mathbb E(S^2)=\sigma^2$

#### Theory
Suppose that $X_i \sim \mathcal{N}(\mu, \sigma^2)$ (i.i.d.); then:
> ask whether we need to be able to prove this
1. $\bar{X}$ and $S^2$ are independent.
2. $\bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})$
3. $n^{1/2}(\bar{X} - \mu)/S \sim t_{n-1}$
4. $(n - 1)S^2/\sigma^2 \sim \chi^2_{n - 1}$

#### Recall
Chebyshev's inequality: $\mathbb P(|X - \mathbb E(X)| \geq \epsilon) \leq \frac{var(X)}{\epsilon^2}$
$\Rightarrow\mathbb P(|\bar{X} - \mu| \geq \epsilon) \leq \frac{\sigma^2}{n\epsilon^2}$

#### Definition
A sequence of random variables $\{Y_n:\ n\in\mathbb N\}$ is said to converge in probability to $Y$, denoted by $Y_n\xrightarrow{P}Y$, if $\lim\limits_{n\rightarrow\infty}\mathbb P(|Y_n - Y| \geq \epsilon) = 0$, $\forall\epsilon > 0$

#### Theory (Weak law of large numbers)
$\bar{X}\xrightarrow{P}\mu$, where $\mu=\mathbb E(X_i)$

#### Theory (Central limit theorem)
$n^{1/2}\frac{\bar{X} - \mu}{\sigma}\xrightarrow{d}\mathop{N(0, 1)}$

##### Definition
A sequence of random variables $\{X_n : n \in\mathbb N\}$ is said to converge in distribution to $X$ if $\lim\limits_{n \rightarrow \infty}F_{X_n}(x)=F_X(x)$, $\forall x \text{ s.t. } F_X(x) \text{ is continuous}$
> $\{X_n\}$ is the sequence of random variables. 
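A Monte Carlo illustration of convergence in distribution for the CLT: the empirical CDF of $\sqrt n(\bar X-\mu)/\sigma$ should approach $\Phi$. This is a sketch with illustrative sizes, using i.i.d. $Unif(0,1)$ draws (so $\mu=1/2$, $\sigma^2=1/12$):

```python
import random
from statistics import NormalDist

# Simulate many standardized sample means of i.i.d. Unif(0,1) variables
# and compare their empirical CDF with the standard normal CDF.
random.seed(1)
n, reps = 50, 20_000
mu, sigma = 0.5, (1.0 / 12.0) ** 0.5

zs = []
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    zs.append(n ** 0.5 * (xbar - mu) / sigma)

def ecdf(t):
    """Empirical CDF of the standardized means at t."""
    return sum(z <= t for z in zs) / reps

# Max discrepancy from Phi at a few probe points; should be small.
gap = max(abs(ecdf(t) - NormalDist().cdf(t)) for t in (-1.0, 0.0, 1.0))
```

Even at $n=50$ the discrepancy is tiny, which is the practical content of the CLT.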
> Proof (sketch)
> $\frac{n^{1/2}(\bar{X}-\mu)}{\sigma}=\frac{1}{\sqrt{n}}\sum\limits^{n}_{i=1}\frac{X_i - \mu}{\sigma}=\sum\limits^{n}_{i=1}\frac{X_i - \mu}{\sqrt{n} \sigma}=\sum\limits^{n}_{i=1}Z_i$
> $\Rightarrow \psi_{Z}(t)\\=\mathbb E(e^{it\sum^{n}_{i=1}Z_i})\\=\prod\limits_{i=1}^{n}\mathbb E(e^{itZ_i})=\prod\limits_{i=1}^{n}\psi_{Z_i}(t)\\\approx [1+\frac{it}{\sqrt{n}\sigma}\mathbb E(X_i-\mu) - \frac{t^2}{2n\sigma^2}\mathbb E\{(X_i - \mu) ^2\}]^n \\ = (1 - \frac{t^2}{2n\sigma^2}\sigma^2)^n \rightarrow e^{-\frac{t^2}{2}}$ (the characteristic function of $N(0,1)$)
> Taylor expansion: second-order approximation around 0

#### Theory
Suppose that $X_n \xrightarrow{P} X$, $Y_n \xrightarrow{P} Y$; then:
1. $a X_n + b Y_n \xrightarrow{P} a X + b Y$
2. $X_n Y_n \xrightarrow{P} X Y$
3. If $g$ is continuous, then $g(X_n) \xrightarrow{P} g(X)$
* $X_n \xrightarrow{d} X \iff \mathbb E\{g(X_n)\}\rightarrow \mathbb E\{g(X)\}$ for each bounded continuous $g$
* If $X_n \xrightarrow{d} X$ and $g$ is continuous, then $g(X_n)\xrightarrow{d}g(X)$

#### Theory (Slutsky)
Suppose that $X_n\xrightarrow{d}X$ and $Y_n\xrightarrow{P}c$, where $c$ is a constant. Then:
1. $X_n + Y_n \xrightarrow{d} X + c$
2. $X_n Y_n \xrightarrow{d} cX$

#### Theory (First-order delta method)
Suppose that $n^{1/2}(X_n - \mu) \xrightarrow{d} N(0, \sigma^2)$ (asymptotic normality) and $g$ is a $C^1$ function. Then:
$n^{1/2}\{g(X_n) - g(\mu)\} \xrightarrow{d} N[0, \{g'(\mu)\}^2\sigma^2]$
> Note: $g(X_n) - g(\mu) \approx g'(\mu)(X_n - \mu)$

### $\S$ Point Estimation
#### Definition
1. A statistical model is a collection of distribution functions (cdf, pmf, pdf).
2. Let $\mathcal M$ be a statistical model. If $\mathcal M$ can be indexed by $\theta \in \mathbb R^p$, i.e. $\mathcal M=\{F(x ; \theta):\theta \in \Theta \subset \mathbb R^p \}$, then $\mathcal M$ is called a parametric model, $\theta$ is called the parameter, and $\Theta$ is called the parameter space. 
> The parameter $\theta$ is called identifiable if the map $\theta \mapsto F(x ; \theta)$ is injective (one-to-one).
> Ex. for a point on a circle of radius $r$ parametrized by the angle $\theta$: if $\Theta = \mathbb R$, then multiple values of $\theta$ map to the same point (e.g. $0$ and $2\pi$).
3. A statistical parameter is a map $\theta: \mathcal M \rightarrow \Theta$
> Ex. mean, variance
4. A point estimator is a statistic that specifies a point in the parameter space based on the observed data.

#### Common Point Estimators
By the weak law of large numbers, $\bar{X}$ can serve as a point estimator.
> e.g. $X_1, \dots, X_n \sim Bernoulli(p)$, $p \in [0, 1]$; let $p_0$ be the true parameter.
> By the WLLN, $\bar{X} \xrightarrow{P} \mathbb E(X)=p_0$, so we can use the sample mean $\bar{X}$ to estimate $p_0$

> e.g. $X_1, \dots, X_n \sim Exp(\lambda)$, $\lambda > 0$; let $\lambda_0$ be the true parameter.
> $\bar{X} \xrightarrow{P} \frac{1}{\lambda_0}$
> $\hat{\lambda} = \bar{X}^{-1} \xrightarrow{P} \lambda_0$

> e.g. $X_1, \dots, X_n \sim Unif(0, a)$; let $a_0$ be the true parameter.
> $\bar{X} \xrightarrow{P} \frac{a_0}{2}$
> $\hat{a} = 2 \bar{X} \xrightarrow{P} a_0$

> e.g. $X_1, \dots, X_n \sim Unif(a, b)$; let $a_0, b_0$ be the true parameters.
> $\bar{X} \xrightarrow{P} \frac{a_0 + b_0}{2}$
> $\frac{1}{n}\sum\limits^{n}_{i=1}X_i^2 \xrightarrow{P} \frac{(b_0 - a_0) ^ 2}{12} + (\frac{a_0 + b_0}{2})^2$
> solve the system of equations for $a_0, b_0$

#### Moment Estimators
Let $\mathcal M = \{F(x ; \theta) : \theta \in \Theta \subset \mathbb R^p\}$; for each $\theta \in \Theta$, the $k$-th moment is $\mu_k(\theta)=\int x^k dF(x ; \theta)$.
By the WLLN, $n^{-1}\sum\limits^{n}_{i=1}X_i^k \xrightarrow{P} \mu_k(\theta_0)$
> Proof? 
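The $Unif(0,a)$ moment estimator above ($E(X)=a/2$, so $\hat a = 2\bar X$) can be sketched with a quick simulation; the true parameter $a_0$ and the sample size are illustrative:

```python
import random

# Moment estimator for Unif(0, a): match the first moment a/2 to the
# sample mean, giving a_hat = 2 * xbar. By the WLLN, a_hat -> a0.
random.seed(0)
a0 = 3.0          # illustrative true parameter
n = 100_000
xs = [random.uniform(0.0, a0) for _ in range(n)]

xbar = sum(xs) / n
a_hat = 2.0 * xbar  # should be close to a0 for large n
```

The same matching idea generalizes: with $p$ unknown parameters, equate the first $p$ sample moments to $\mu_1(\theta),\dots,\mu_p(\theta)$ and solve.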
Let $\vec{Y}_i = (X_i, X_i^2, \dots, X_i^p)^T$ and $\vec{\mu}(\theta)=(\mu_1(\theta), \mu_2(\theta), \dots, \mu_p(\theta))^T$
Then $n^{-1}\sum\limits^{n}_{i=1}\vec{Y_i} \xrightarrow{P} \vec{\mu}(\theta_0)$
The moment estimator is defined as $\vec{\mu}^{-1}(n^{-1}\sum\limits^{n}_{i=1}\vec{Y_i})$.

#### Definition
Let $\hat{\theta}$ be an estimator for a parameter $\theta \in \mathbb R$; $\hat{\theta}$ is called a **consistent** estimator if $\hat{\theta} \xrightarrow{P} \theta_0$

Some commonly used criteria:
1. The bias of $\hat{\theta}$ is defined as $\mathbb E(\hat{\theta} ; \theta)-\theta$ (a function of $\theta$); $\hat{\theta}$ is called unbiased if $\mathbb E(\hat{\theta} ; \theta) - \theta = 0,\ \forall \theta \in \Theta$
2. $var(\hat{\theta} ; \theta)$
3. MSE: $\mathbb E\{(\hat{\theta} - \theta) ^2 ; \theta\}$

#### Maximum Likelihood Estimator
---
e.g. $X_1, X_2 \sim Bernoulli(p),\ p \in [0, 1]$. Realizations: $X_1 = 1, X_2 = 0$
Question: $p = 0.1$ or $p=0.2$?
$\Rightarrow P(X_1 = 1, X_2 = 0 ; p = 0.1) = 0.1 \times 0.9 = 0.09$
$\Rightarrow P(X_1 = 1, X_2 = 0 ; p = 0.2) = 0.2 \times 0.8 = 0.16$
(Choose $p = 0.2$ to have a larger probability of observing the realization $(1, 0)$.)

Let the realizations be $X_1 = x_1$ and $X_2 = x_2$
$\Rightarrow P(X_1 = x_1, X_2 = x_2 ; p) = p^{x_1} (1 - p)^{1 - x_1} \times p^{x_2} (1 - p) ^{1 - x_2} = p^{x_1 + x_2} (1 - p)^{2 - (x_1 + x_2)} = f(p)$, $p \in [0, 1]$
$\Rightarrow \log f(p) = (x_1 + x_2)\log p + \{2 - (x_1 + x_2)\}\log{(1 - p)}$
$\Rightarrow \frac{d}{dp} \log{f(p)} = \frac{x_1 + x_2}{p} - \frac{2 - (x_1 + x_2)}{1 - p} = 0 \Rightarrow p=\frac{x_1 + x_2}{2}$
$\Rightarrow f(0) = f(1) = 0$ (when $0 < x_1 + x_2 < 2$)
$\therefore f(p)$ attains its maximum at $p = \frac{x_1 + x_2}{2}$
> Recap: extrema occur at 1) points where the first derivative is 0, 2) points where the first derivative does not exist, 3) boundary points
$\Rightarrow \hat{p}=\frac{x_1 + x_2}{2}$ (MLE)
---
e.g. 
Let $X_1, ..., X_n \sim Bernoulli(p)$ (i.i.d.), $p \in [0, 1]$

$\Rightarrow f(p) = P(X_1 = x_1, ..., X_n = x_n ; p) = \prod\limits^{n}_{i=1} p^{x_i} (1 - p)^{1 - x_i} = p^{\sum\limits_{i} x_i}(1 - p)^{n - \sum\limits_{i} x_i}$
$\Rightarrow \log f(p) = (\sum\limits_{i} x_i) \log p + (n - \sum\limits_{i} x_i) \log (1 - p)$
$\Rightarrow \frac{d}{dp} \log f(p) = \frac{\sum\limits_{i} x_i}{p} - \frac{n - \sum\limits_{i} x_i}{1 - p} = 0 \Rightarrow p = \bar{X}$
$\Rightarrow f(0) = f(1) = 0$
$\therefore f(p)$ attains its maximum at $p=\bar{X}$.
$\Rightarrow$ MLE: $\hat{p}=\bar{X}=\frac{1}{n}\sum\limits^{n}_{i=1} X_i$

---

**To be added**

### $\S$ Hypothesis Test

##### Definition

A hypothesis is a statement about a population parameter, say $\theta$

1. Null hypothesis: $H_0 : \theta \in \Theta_0$
2. Alternative hypothesis: $H_A : \theta \in \Theta_A$

> Note: $\Theta_0 \bigcup \Theta_A$ may not equal $\Theta$

A hypothesis test is a rule based only on a sample $\vec{X}$ to decide whether to reject $H_0$ (and hence accept $H_A$)

A hypothesis test can be formulated as a test function $\varphi(\vec{x}) = \mathbb P(\text{reject } H_0 | \vec{X} = \vec{x})$, written simply as $\varphi(\vec{X})$ or $\varphi$

> Note: $\vec{X} = (X_1, ..., X_n)$

i.e. $\varphi(\vec{X})$ and $\varphi$ are treated as statistics (and hence random variables)

1. If $\varphi(\vec{x}) = 1(\vec{x} \in C)$, it is called a non-randomized test.
2. For a non-randomized test, $C$ is called the rejection region.

---

Ex. $X_1, ..., X_n \sim Bernoulli(p)$ (i.i.d.), $\bar{X} = \frac{1}{n} \sum\limits_{i=1}^{n} X_i$

Fair coin: $p_0 = 0.5$ (null hypothesis)
If $\bar{X} = 0.6$, is it fair? If $\bar{X} = 0.52$, is it fair?

We usually use the rejection region to decide whether to reject the null hypothesis, i.e.
Reject $p_0 = 0.5$ if $|\bar{X} - 0.5| > C$
Rejection region: $\{|\bar{X} - 0.5| > C\}$

As $C$ becomes larger:
* $P(|\bar{X} - 0.5| > C ; p = 0.5) = E(\varphi ; p = 0.5)$ becomes smaller
* $P(|\bar{X} - 0.5| > C ; p = 0.6) = E(\varphi ; p = 0.6)$ also becomes smaller

There is no common $C$ such that the prob. of making the right decision is maximized and the prob. of making the wrong decision is minimized at the same time.

---

|              | $H_0$ is true | $H_0$ is false |
| ------------ | ------------- | -------------- |
| Reject $H_0$ | Type I Error  | Power          |
| Accept $H_0$ |               | Type II Error  |

> Note: Power and Type II error may be a collection of values.

---

Ex. $X_1, X_2, ..., X_n \sim N(\mu, 1)$ (i.i.d.), where $\mu$ is an unknown parameter.

$\begin{cases} H_0: \mu = 0 \\ H_A : \mu \neq 0\end{cases}$

1. Find a test statistic: $\bar{X}$
2. Determine the shape of the rejection region: $\varphi = 1(|\bar{X}| > C)$
3. Set significance level $\alpha = 0.05$
   * Type I error: $P(|\bar{X}| > C ; \mu = 0)$
   * Power: $P(|\bar{X}| > C ; \mu = \mu^* \neq 0)$

Type I Error:
$\because \bar{X} \sim N(0, \frac{1}{n}) \Rightarrow \sqrt{n} \bar{X} \sim N(0, 1)$
$\Rightarrow P(|\bar{X}| > C ; \mu = 0) = P(|\sqrt n \bar X| > \sqrt n C ; \mu = 0) = 2 \Phi(-\sqrt n C) \leq 0.05$
$\Rightarrow$ We can choose $C$ s.t. $2\Phi(-\sqrt n C) = 0.05$, i.e. $C = -\Phi^{-1}(\frac{0.05}{2})/\sqrt n$
$\varphi = 1(|\bar{X}| > -\Phi^{-1}(\frac{0.05}{2})/\sqrt n)$

---

Ex. $X_1, X_2, ..., X_n \sim N(\mu, 1)$ (i.i.d.), where $\mu$ is an unknown parameter.

$\begin{cases} H_0: \mu = 0 \\ H_A : \mu > 0\end{cases}$

1. Find a test statistic: $\bar{X}$
2. Determine the shape of the rejection region: $\varphi = 1(\bar{X} > C)$
3.
Set significance level $\alpha = 0.05$
   * Type I error: $P(\bar{X} > C ; \mu = 0)$
   * Power: $P(\bar{X} > C ; \mu = \mu^* > 0)$

Type I Error:
$\because \bar{X} \sim N(0, \frac{1}{n}) \Rightarrow \sqrt{n} \bar{X} \sim N(0, 1)$
$\Rightarrow P(\bar{X} > C ; \mu = 0) = P(\sqrt n \bar X > \sqrt n C ; \mu = 0) = 1 - \Phi(\sqrt n C) \leq \alpha$
$\Rightarrow$ We can choose $C$ s.t. $1 - \Phi(\sqrt n C) = \alpha$, i.e. $C = \Phi^{-1}(1 - \alpha)/\sqrt n$
$\varphi = 1(\bar{X} > \Phi^{-1}(1 - \alpha)/\sqrt n)$

---

Ex. $X_1, X_2, ..., X_n \sim N(\mu, 1)$ (i.i.d.), where $\mu$ is an unknown parameter.

$\begin{cases} H_0: \mu \leq 0 \\ H_A : \mu > 0\end{cases}$

1. Find a test statistic: $\bar{X}$
2. Determine the shape of the rejection region: $\varphi = 1(\bar{X} > C)$
3. Set significance level $\alpha = 0.05$
   * Type I error: $P(\bar{X} > C ; \mu = \mu_0 \leq 0)$
   * Power: $P(\bar{X} > C ; \mu = \mu^* > 0)$
   * Both decrease in $C$

Type I Error:
Requiring $P(\bar{X} > C ; \mu = \mu_0) \leq \alpha,\ \forall \mu_0 \leq 0$ is equivalent to $\max\limits_{\mu_0 \leq 0} P(\bar{X} > C; \mu=\mu_0) \leq \alpha$, i.e. controlling the right-tail area beyond $C$.
$\Rightarrow$ The maximum occurs at $\mu_0 = 0$
> Under $\mu = \mu_0 \leq 0$, $\bar X \sim N(\mu_0, \frac{1}{n})$

$\therefore \max\limits_{\mu_0 \leq 0} P(\bar{X} > C; \mu=\mu_0) \leq \alpha$ is equivalent to $P(\bar X > C ; \mu = 0) \leq \alpha$
$\Rightarrow \varphi = 1(\bar X > \Phi^{-1}(1 - \alpha)/\sqrt n)$

> Composite hypothesis: the hypothesis specifies more than one parameter value
> Simple hypothesis: the hypothesis specifies exactly one parameter value

---

Ex.
$X_1, X_2, ..., X_n \sim N(\mu, 1)$

$\begin{cases} H_0 : \mu = 0 & \text{(simple)}\\ H_A : \mu = 1 & \text{(simple)}\end{cases}$

Likelihood ratio $(\mathop{L.R.})$ test: $1\{\frac{f_{\vec{X}}(\vec{X} ; \mu = 0)}{f_{\vec{X}}(\vec{X} ; \mu = 1)} \leq C\}$; typically $C$ should be smaller than $1$ to obtain a level-$\alpha$ test

$f_{\vec{X}}(\vec{X} ; \mu = 0) = (2\pi)^{-\frac{n}{2}}e^{-\frac{1}{2}\sum^{n}_{i=1} X_i^2}$
$f_{\vec{X}}(\vec{X} ; \mu = 1) = (2\pi)^{-\frac{n}{2}}e^{-\frac{1}{2}\sum^{n}_{i=1} (X_i - 1)^2}$
$\Rightarrow \mathop{L.R.} = e^{\frac{1}{2}\{\sum^n_{i=1}(X_i - 1)^2 - \sum^n_{i=1} X_i^2\}}$
$\Rightarrow \varphi = 1(\sum^n_{i=1}(X_i - 1)^2 - \sum^{n}_{i=1}X_i^2 \leq C^*) = 1(\bar X \geq C^{**})$

(Generalized ver.)

(i). $\begin{cases} H_0 : \mu = 0 & \text{(simple)}\\ H_A : \mu > 0 & \text{(composite)}\end{cases}$

$\mathop{G.L.R.} = \frac{f_{\vec{X}}(\vec{x} ; \mu = 0)}{\max\limits_{\mu \geq 0} f_{\vec{X}}(\vec{x} ; \mu)} \leq C$
> for the denominator: maximize over $\theta\in\Theta_0\bigcup\Theta_A$

(ii). $\begin{cases} H_0 : \mu \leq 0 & \text{(composite)}\\ H_A : \mu > 0 & \text{(composite)}\end{cases}$

$\mathop{G.L.R.} = \frac{\max\limits_{\mu \leq 0} f_{\vec{X}}(\vec{x} ; \mu)}{\max\limits_{\mu} f_{\vec{X}}(\vec{x} ; \mu)} \leq C$
> for the denominator: maximize over $\theta\in\Theta_0\bigcup\Theta_A$

---

Ex. $X_1, X_2, ..., X_n \sim \text{Geometric}(p)$ (i.i.d.), where $p$ is an unknown parameter.

$\begin{cases} H_0: p = 0.01 \\ H_A: p < 0.01 \end{cases}$

$\hat{L}(p) = \prod\limits^{n}_{i = 1}(1-p)^{X_i}p = (1-p)^{\sum^{n}_{i=1}X_i}p^n$
$\Rightarrow \mathop{GLR} = \frac{\max\limits_{p \in \Theta_0} \hat{L}(p)}{\max\limits_{p \in \Theta_0 \cup \Theta_A} \hat{L}(p)} = \frac{\hat{L}(0.01)}{\max\limits_{p \leq 0.01} \hat{L}(p)}$
$\Rightarrow$ Rejection region: $\{\frac{\hat{L}(0.01)}{\max\limits_{p \leq 0.01} \hat{L}(p)} \leq C\}$

1.
Determine $\max\limits_{p \leq 0.01} \hat L(p)$ (differentiate and check the boundary)
$\Rightarrow \max\limits_{p \leq 0.01} \hat L(p) = \begin{cases} \hat L(0.01) & \text{if } 0.01 \leq (\bar X + 1)^{-1} \\ \hat L((\bar X + 1)^{-1}) & \text{otherwise} \end{cases}$

> See the handwritten notes for the rest.

---

#### Definition

Let $\mathcal T$ be a class of tests for testing $H_0: \theta \in \Theta_0 \mathop{vs.} H_A: \theta \in \Theta_A$

A test $\varphi \in \mathcal T$ is called a uniformly most powerful (**UMP**) class-$\mathcal T$ test if $\mathbb E(\varphi ; \theta_A) \geq \mathbb E(\varphi^* ; \theta_A),\ \forall \theta_A \in \Theta_A \text{ and } \forall \varphi^* \in \mathcal T$

#### Theorem (Neyman–Pearson lemma)

Suppose $\mathcal M=\{f(x;\theta):\theta \in \Theta\}$ is a statistical model. Consider $H_0: \theta=\theta_0\ vs.\ H_A: \theta=\theta_A$

For a fixed $k$, define $C=\{x: f(x ; \theta_A) > k f(x ; \theta_0)\}$ and $C_T=\{x: f(x; \theta_A) = k f(x;\theta_0)\}$

Let $\alpha_l=P(X \in C ; \theta_0)$ and $\alpha_u=P(X \in C \cup C_T; \theta_0)$

Then the NP likelihood ratio test $\varphi(x)=1(x\in C) + \frac{\alpha - \alpha_l}{\alpha_u - \alpha_l}1(x \in C_T)$ is a UMP level-$\alpha$ test.
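The randomization on $C_T$ can be made concrete with a small sketch for $X \sim Binomial(n, p)$, testing $H_0: p = p_0$ vs. $H_A: p = p_1$ with $p_1 > p_0$: the likelihood ratio $f(x; p_1)/f(x; p_0)$ is increasing in $x$, so $C$ is an upper tail $\{x > k\}$ and $C_T$ is the single boundary point $\{x = k\}$. This is our own illustrative Python code (the function name `np_randomized_test` is not from the lecture), assuming this binomial setup.

```python
from math import comb

def np_randomized_test(n, p0, alpha):
    """Neyman-Pearson level-alpha test of H0: p = p0 vs HA: p = p1 (> p0)
    for X ~ Binomial(n, p).

    Since the likelihood ratio is increasing in x for any p1 > p0,
    C = {x > k} and C_T = {x = k}; gamma is the randomization weight
    (alpha - alpha_l)/(alpha_u - alpha_l) from the lemma.
    Returns (k, gamma): reject if x > k, reject w.p. gamma if x == k.
    """
    pmf = [comb(n, x) * p0 ** x * (1 - p0) ** (n - x) for x in range(n + 1)]
    tail = 0.0                     # alpha_l = P(X > k; p0), accumulated top-down
    for k in range(n, -1, -1):
        if tail + pmf[k] > alpha:  # including x = k would overshoot alpha
            gamma = (alpha - tail) / pmf[k]
            return k, gamma
        tail += pmf[k]
    return -1, 0.0                 # alpha >= 1: reject for every x

k, gamma = np_randomized_test(n=10, p0=0.5, alpha=0.05)
print(k, round(gamma, 4))  # k = 8, gamma ≈ 0.8933; size is exactly 0.05
```

Note that the result does not depend on the particular $p_1 > p_0$, which is exactly why this tail test is UMP against every such alternative; the randomization makes the size hit $\alpha$ exactly even though $X$ is discrete.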