---
tags: 統計
title: Note - Probability
---
Week 1
---
## Ch1 - Intro
* What is data?
* Something we care about for a specific problem.
* What is data science?
* Solving problems using data.
* Statistical Inference: use sample data drawn from the population to make inferences for a specific problem.
* Descriptive Statistics: simply summarize the data. (e.g. avg, median, ...)
* Questions of interest:
* Questions w/ randomness (ex. weather forecast)
* ex.1: Stochastic phenomenon of population
* ex.2: Sampling w/ uncertainty
## Ch2 - Probability Theory
### Term Definition
* sample space $\Omega$: the set contains all possible outcomes of an experiment.
* event: subset of $\Omega$
* ex. toss a fair coin twice:
* $\Omega=\{HH, HT, TH, TT\}$
* $P(\text{exactly one tail}) = P(\{HT, TH\})$
* Statement: a sentence that is only true or false
* Proposition: if $p$ then $q$, where $p,\ q$ are statements.
* If $\lnot p$, the proposition is vacuously true.
* Long-run relative-frequency interpretation
* Personal-probability (subjective) interpretation
### Axioms of Probability
1. $P(A)\in [0,1], \forall \text{event A}$
2. $P(\Omega)=1$
3. Finite additive axiom:
* If $\{A_k|k=1,...,K\}$ are mutually disjoint, then $P(\bigcup\limits^{K}_{k=1}A_k)=\sum\limits^{K}_{k=1}P(A_k)$
* Note: [2] & [3] imply $P(\emptyset)=0$
* Laplace: one only needs to define the probability of each single outcome; then the probabilities of all events are determined.
* Reason: the single outcomes form a partition of $\Omega$, and they are mutually disjoint.
Week 2
---
### Countable
When $|\Omega|<\infty$, say $\Omega=\{w_1,...,w_K\}$, $P$ can be defined by the following procedure:
1. Define $P(\{w_k\})=p_k$, with $p_k\geq 0$ and $\sum\limits^{K}_{k=1}p_k=1$
2. $\forall\text{ event A}$, define $P(A)=\sum\limits^{K}_{k=1}p_k1(w_k\in A)$
* $1(\cdot)$ is an indicator function: it returns 1 if the condition holds, and 0 otherwise
* $P:2^\Omega\longrightarrow[0,1]$ or $P:A\longmapsto\sum\limits^{K}_{k=1}p_k1(w_k\in A)$
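The finite construction above can be sketched in a few lines of Python (an illustrative sketch of mine, not from the lecture), using two tosses of a fair coin as the sample space:

```python
# Illustrative sketch (not from the lecture): define P on a finite sample
# space by assigning a weight p_k to each outcome, then sum over events.
from fractions import Fraction

# Two tosses of a fair coin; each outcome has weight 1/4.
omega = {"HH": Fraction(1, 4), "HT": Fraction(1, 4),
         "TH": Fraction(1, 4), "TT": Fraction(1, 4)}

def prob(event):
    """P(A) = sum of p_k over outcomes w_k in A (the indicator-sum formula)."""
    return sum(p for w, p in omega.items() if w in event)

# P(exactly one tail) = P({HT, TH}) = 1/2
p_one_tail = prob({"HT", "TH"})
```

Because the weights are `Fraction`s, `prob` over the whole space returns exactly 1, matching axiom [2].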
When $\Omega$ is countable, say $\Omega=\{w_1,w_2,...\}$, we can define $P$ in a similar way.
1. Define $P(\{w_k\})=p_k$, with $p_k\geq 0$ and $\sum\limits^{\infty}_{k=1}p_k=1$
* $\lim\limits_{K\rightarrow\infty}\sum\limits^{K}_{k=1}p_k=1$
2. $\forall\text{ event A}$, define $P(A)=\sum\limits^{\infty}_{k=1}p_k1(w_k\in A)$
* Countable: the elements of the set can be arranged in some order -> one can check whether the probabilities sum to 1
* We have implicitly assumed that $P(\bigcup\limits^{\infty}_{k=1}A_k)=\sum\limits^{\infty}_{k=1}P(A_k)$ for disjoint events $A_1, A_2,...$
#### Countable version of the Axiom of Probability
1. $P(A)\in [0,1], \forall A \subset \Omega$
2. $P(\Omega)=1$
3. Countable additive axiom:
* If $A_1,A_2,...\subset\Omega$ are mutually disjoint, then $P(\bigcup\limits^{\infty}_{k=1}A_k)=\sum\limits^{\infty}_{k=1}P(A_k)$
* Note: [2] & [3] imply $P(\emptyset)=0$
### Uncountable
When $\Omega$ is uncountable, such a $P$ may (in fact, does) not exist on $2^\Omega$.
Alternative solution:
* Define $P$ for **some** but sufficiently many events. At least $\Omega$, $\emptyset$, and each $\{w_k\}$ should be included.
* If $A$ has a probability, then $A^c$ has a probability.
* If $A_1,A_2,...$ have probabilities, then $\bigcup\limits^{\infty}_{k=1}A_k$ has a probability.
Definition: A collection $\mathcal F$ of events is called a $\sigma$-algebra if:
1. $\emptyset\in\mathcal F$
2. $A^c\in\mathcal F$ if $A\in\mathcal F$
3. $\bigcup\limits^{\infty}_{k=1}A_k\in\mathcal F$ if $A_1,A_2,...\in\mathcal F$
* By 2. 3., $\bigcap\limits^{\infty}_{k=1}A_k$ has prob.
* $(\bigcap\limits^{\infty}_{k=1}A_k)^c=\bigcup\limits^{\infty}_{k=1}A_k^c$
#### Axioms of Probability
1. $P(A)\geq 0$, $\forall A\in\mathcal F$
2. $P(\Omega)=1$
3. $P(\bigcup\limits^\infty_{k=1}A_k)=\sum\limits^{\infty}_{k=1}P(A_k)$ if $A_1,A_2,...\in\mathcal F$ are **disjoint**.
* $(\Omega,\mathcal F,\mathbb P)$ is called a **probability space**.
### Conditional Probability
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space.
Let $A,B\in\mathcal F$ and $P(B)>0$
Then conditional probability of $A$ given $B$ is defined as: $P(A\vert B)=\frac{P(A\cap B)}{P(B)}$
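As a concrete check of this definition (a small sketch of mine under the fair-coin assumption, not part of the notes), $P(A\cap B)/P(B)$ can be computed directly on a finite sample space:

```python
# Conditional probability on the two-coin-toss space:
# P(both heads | at least one head) = P({HH}) / P({HH, HT, TH}) = 1/3.
from fractions import Fraction

omega = ["HH", "HT", "TH", "TT"]
p = {w: Fraction(1, 4) for w in omega}   # fair coin: uniform weights

B = {w for w in omega if "H" in w}       # at least one head
A = {"HH"}                               # both heads
cond = sum(p[w] for w in A & B) / sum(p[w] for w in B)
```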
Week 3
---
### Dependent/Independent
Let $(\Omega,\mathcal F, \mathbb P)$ be a probability space.
1. $A,B\in\mathcal F$ are **independent**, if $P(A\cap B)=P(A)P(B)$
2. A collection $\{A_k\in\mathcal F:k\in \mathbb N\}$ is mutually independent, if $P(\bigcap\limits^{k}_{j=1}A_{i_j})=\prod\limits^{k}_{j=1}P(A_{i_j})$ for every finite set of indices $1\leq i_1<...<i_k$
* $i_j$: the $j$-th chosen index
> Given $P(B\vert A)$, find $P(A\vert B)$?
>
> $P(A\vert B)=\frac{P(A\cap B)}{P(B)}=\frac{P(A\cap B)}{P(B\cap(A\cup A^c))}=\frac{P(A\cap B)}{P((B\cap A)\cup(B\cap A^c))}=\frac{P(A\cap B)}{P(B\cap A)+P(B\cap A^c)}$
>
> 又$P(B\vert A)=\frac{P(A\cap B)}{P(A)}\Rightarrow P(A\cap B)=P(B\vert A)\times P(A)$
>
> $\Rightarrow P(A\vert B)=\frac{P(B\vert A)\times P(A)}{P(B\vert A)\times P(A) + P(B\vert A^c)\times P(A^c)}$
>
> $P(B\vert A^c)$ and $P(A^c)$ must be known
### Partition
Let $\Omega$ be a sample space. A collection of events $\{C_k\subset\Omega:k\in\mathbb N\}$ is called a partition, if
1. $\bigcup\limits^{\infty}_{k=1}C_k=\Omega$
2. $C_k$'s are disjoint
### Bayes Rule
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space and $A\in\mathcal F$ with $P(A)>0$. Let $\{C_k\in\mathcal F:k\in\mathbb N\}$ be a partition.
Then $P(C_k\vert A)=\frac{P(A\vert C_k)P(C_k)}{\sum\limits^{\infty}_{i=1}P(A\vert C_i)P(C_i)}$
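Bayes' rule with a finite partition can be sketched numerically (the priors and likelihoods below are made-up illustrative numbers, not from the notes):

```python
# Bayes' rule over a finite partition {C_1, C_2, C_3}:
# P(C_k | A) = P(A | C_k) P(C_k) / sum_i P(A | C_i) P(C_i).
priors = [0.5, 0.3, 0.2]        # P(C_k); hypothetical values
likelihoods = [0.9, 0.5, 0.1]   # P(A | C_k); hypothetical values

evidence = sum(l * q for l, q in zip(likelihoods, priors))   # P(A), law of total probability
posterior = [l * q / evidence for l, q in zip(likelihoods, priors)]
```

The posterior probabilities sum to 1 by construction, since the denominator is exactly the total probability of $A$.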
### Random Variable
The random variable, usually denoted by $X$, can be viewed as a function mapping $\Omega$ to $\mathbb R$. ($X:\Omega\rightarrow\mathbb R$ and $X^{-1}(A)\in\mathcal F,\ \forall A\in\mathcal B$)
> $\{X\geq 1\}=\{\omega\in\Omega:X(\omega)\in[1,\infty)\}=X^{-1}([1,\infty))$
> $X^{-1}$: pre-image of $X$
Week 4
---
### Cumulative Distribution Function (CDF)
The cumulative distribution function (cdf) of a random variable $X$ is defined as $F_X(x)=P(X\leq x)$
In particular, $P(X\in B)=\int_BdF_X(x)\text{ (Lebesgue-Stieltjes Integral)}$
#### Theory
If $F(x)$ is a cdf, then:
1. $F(x)$ is **non-decreasing** on $\mathbb R$
2. $\lim\limits_{x\rightarrow -\infty}F(x)=0$, $\lim\limits_{x\rightarrow\infty}F(x)=1$
3. $F(x)$ is right-continuous, that is, $\lim\limits_{y\rightarrow x^+}F(y)=F(x)$
Conversely, if $F(x)$ satisfies all three properties above, then $F(x)$ is a cdf for some random variable.
#### Discrete vs. Continuous
1. $X$ is called a discrete random variable if $F_X(x)$ is a step function.
* In particular, $F_X(x)=\sum\limits_{k=1}^{\infty}P(X=x_k)1(x_k\leq x)$ for some $\{x_k\in\mathbb R:k\in\mathbb N\}$
* The function $p_X(x)=P(X=x)=\sum\limits_{k=1}^{\infty}P(X=x_k)1(x=x_k)$ is called the **probability mass function (pmf)**
2. $X$ is called a continuous random variable if $F_X(x)=\int_{-\infty}^{x}f_X(u)du$ for some $f_X(x)$
* The function $f_X(x)$ is called the **probability density function (pdf)**
* $F_X(x)$ is continuous on $\mathbb R$ (necessary condition)
* $f_X(x)\geq 0\ a.e.$
* If $f$ is continuous, then $f_X(x)\geq 0$
* since $F_X(x)$ is non-decreasing, and $F_X'(x)=f_X(x)$ (FTC)
* If $X$ is continuous, then $P(x\leq X < x+\Delta{x})\approx f_X(x)\Delta{x}$
> Review the Fundamental Theorem of Calculus at home
#### Change of variable formula
Let $Y=g(X)$, where $g$ is a $C^1$ and strictly increasing function, and $X$ is a continuous random variable.
Then $f_Y(y)=f_X\{g^{-1}(y)\}\frac{dg^{-1}(y)}{dy}$
> $C^1$: continuously differentiable (the derivative exists and is continuous)
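A quick numeric sanity check of the formula (my own sketch, assuming $X\backsim Unif(0,1)$ and $g(x)=x^2$, which is $C^1$ and strictly increasing on $(0,1)$): the derived density $f_Y(y)=\frac{1}{2\sqrt y}$ should integrate to $F_Y(c)=\sqrt c$.

```python
# Change-of-variable check: X ~ Unif(0,1), Y = X^2, g^{-1}(y) = sqrt(y),
# so f_Y(y) = f_X(sqrt(y)) * d/dy sqrt(y) = 1 / (2 sqrt(y)).
# Integrating f_Y over [0, c] should give F_Y(c) = P(X <= sqrt(c)) = sqrt(c).
from math import sqrt

def f_Y(y):
    return 1 / (2 * sqrt(y))

c, steps = 0.25, 100_000
h = c / steps
# midpoint rule sidesteps the (integrable) singularity at y = 0
integral = sum(f_Y((k + 0.5) * h) * h for k in range(steps))
# integral should be close to sqrt(0.25) = 0.5
```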
### Expectation
---
#### Observation
Expected # of occurence of an event $A$ among repeated experiments of $N$ times is $N\times P(A)$
##### Observation 1
Let $X(\omega)=1(\omega\in A),\ \forall\omega\in\Omega$
> Perform the experiment once; $X$ records whether $A$ occurs
Then define $E(X)=E\{I(\omega\in A)\}=P(A)$
##### Observation 2
Let $Y=aX$
Then $E(Y)=a\times P(A)=aE(X)$
> The idea of a gamble: the payoff is scaled by $a$
##### Observation 3
Let $A, B$ be two events.
$\Rightarrow Z(\omega)=aI(\omega\in A)+bI(\omega\in B)$
Then $E(Z)=aP(A)+bP(B)=aE(X)+bE(Y)$
> $X=I(\omega\in A),\ Y=I(\omega\in B),\ \forall \omega\in\Omega$
---
#### Finite
Suppose $\Omega=\{\omega_1,\omega_2,...,\omega_K\}$
General random variable $X$ can be expressed as ($X(\omega_k)=a_k,\ \forall\omega_k\in\Omega$)
or $X(\omega)=\sum\limits_{k=1}^{K}a_k1(\omega=\omega_k)$, $\forall\omega\in\Omega$
$E(X)=\sum\limits^{K}_{k=1}a_kP(\{\omega_k\})$
#### Integral Representation (Lebesgue Integral)
$E(X)=\sum\limits^{K}_{k=1}X(\omega_k)P(\{\omega_k\})=\int_\Omega XdP$
#### Integral Representation (Stieltjes Integral)
$x_1,x_2,...,x_{K'}$: the distinct values among $\{a_1,a_2,...,a_K\}$
$E(X)=\sum\limits^{K'}_{k=1}x_kP(X=x_k)=\sum\limits^{K'}_{k=1}x_k\{F_X(x_k)-F_X(x^-_k)\}=\int_\mathbb RxdF_X(x)=\int_\mathbb R xf_X(x)dx$
> $P(X=x_k)=P(X\leq x_k) - P(X<x_k)=F_X(x_k)-\lim\limits_{x\rightarrow x^-_k}F_X(x)$
> Completeness of the real numbers -> the left limit exists
### Definition
The expectation of a random variable $X$ is defined as $E(X)=\int_\Omega XdP=\int_\mathbb RxdF_X(x)$
1. If $X$ is discrete with pmf $p_X(x)=\sum\limits^{\infty}_{k=1}p_k1(x=x_k)$, then $E(X)=\sum\limits^{\infty}_{k=1}x_kp_k$
2. If $X$ is continuous with pdf $f_X(x)$, then $E(X)=\int_\mathbb R xf_X(x)dx$
#### Theory
$E(X)$ minimizes $g(a)=E\{(X-a)^2\}$ over $a\in\mathbb R$, $g:\mathbb R\longrightarrow \{0\}\cup\mathbb R^+$
i.e. $E(X)=\mathop{\arg\min\limits _a}E\{(X-a)^2\}$
### Variance
$E[\{X-E(X)\}^2]\triangleq\mathop{Var}(X)$
#### Theory
1. $Var(aX)=a^2Var(X)$
2. $Var(X)=0\iff P(\{X=E(X)\})=1$
### Moment generating function
1. $\forall n\in\mathbb N$, $E(X^n)$ is called **n-th moment**
2. $\forall n\in\mathbb N$, $E[\{X-E(X)\}^n]$ is called the **n-th central moment**
3. $M_X(t)=E(e^{tX})$ is called the **moment generating function**, provided that $E(e^{tX})<\infty$
> A moment is a functional of the CDF
> Laplace Transform
#### Theory
> Need to review
1. If $M_X(t)$ exists, then $E(X^n)=\frac{d^nM_X(t)}{dt^n}\vert_{t=0}$
2. If $X$ and $Y$ are **bounded**, then $F_X(u)=F_Y(u),\ \forall u\in\mathbb R$ iff $E(X^n)=E(Y^n),\ \forall n\in\mathbb N$
3. If $M_X(t)=M_Y(t)$ on $t\in[-h,h]$ for some $h>0$, then $F_X(u)=F_Y(u),\ \forall u\in\mathbb R$
### Characteristic function
The characteristic function of a random variable $X$ is defined as:
$\varphi(t)=E\{e^{itX}\}=E\{\cos(tX)+i\sin(tX)\}$
> Both $\cos(tX)$ and $\sin(tX)$ are bounded.
> 傅立葉變換
#### Theory (Inversion formula)
> To be completed
Week 6
---
### Random vectors
Let $\mathcal B^p$ be the Borel $\sigma$-algebra on $\mathbb R^p$.
Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space. A map $\vec{X}=(X_1,X_2,...,X_p)^T:\ \Omega\longrightarrow\mathbb R^p$ is called a random vector if $\vec{X}^{-1}(B)\in\mathcal F,\ \forall B\in\mathcal B^p$
* Note: $\vec{X}$ is a random vector $\iff$ $X_1,...,X_p$ are random variables
### CDF of random vector
Let $\vec{X}=(X_1,X_2,...,X_p)^T$ be a random vector.
The (joint) cumulative distribution function of $\vec{X}$ is defined as: $F_{\vec{X}}(\vec{x})=P(X_1\leq x_1,...,X_p\leq x_p)$
* $F_{\vec{X}}(\vec{x})=P(\vec{X}\leq \vec{x})$ (component-wise smaller)
* $P(\vec{X}\in B)=\int_B dF_{\vec{X}}(\vec{x}),\ \forall B \in\mathcal B^p$
* $F_{X_k}(x_k)=F_{\vec{X}}(\infty,...,x_k,...,\infty),\ k=1,...,p$
### Discrete & Continuous
1. If there exists $\vec{x_1},\vec{x_2},...\in\mathbb R^p$ s.t. $F_{\vec{X}}(\vec{x})=\sum_{\vec{x}_k}P(\vec{X}=\vec{x}_k)$, where $\vec{x}_k$ is component-wise smaller than $\vec{x}$, then $\vec{X}$ is called a **discrete random vector**, and $p_{\vec{X}}(\vec{x})=P(\vec{X}=\vec{x})$ is called the (joint) probability mass function.
2. If there exists $f_{\vec{X}}(\vec{x})$ s.t. $F_{\vec{X}}(\vec{x})=\int^{x_1}_{-\infty}...\int^{x_p}_{-\infty}f_{\vec{X}}(\vec{u})d\vec{u}$, where $\vec{x}=(x_1,...,x_p)^T$, then $\vec{X}$ is called a **continuous random vector** and $f_{\vec{X}}(\vec{x})$ is called the (joint) pdf.
### Expectation
Let $\vec{X}=(X_1,...,X_p)$ be a random vector. The expectation of $\vec{X}$ is defined as $\mathbb E(\vec{X})=(\mathbb E(X_1),...,\mathbb E(X_p))^T$
### Moment generating function
The moment generating function of $\vec{X}$ is $M_{\vec{X}}(\vec{t})=\mathbb E(e^{\vec{t}^T\vec{X}})$
> Uniquely determines the CDF (when the mgf exists in a neighborhood of $\vec 0$)
### Characteristic function
The characteristic function of $\vec{X}$ is $\varphi_{\vec{X}}(\vec{t})=\mathbb E(e^{i\vec{t}^T\vec{X}})$
### Inversion Formula
Let $\vec{X}$ be a random vector. Then,
$P(a_1<X_1<b_1,...,a_p<X_p<b_p)+\frac{1}{2}\sum\limits^{p}_{k=1}\{P(X_k=a_k)+P(X_k=b_k)\}\\=\lim\limits_{T\rightarrow\infty}\frac{1}{(2\pi)^p}\int^{T}_{-T}...\int^{T}_{-T}\prod\limits^{p}_{k=1}\frac{\exp(-it_ka_k)-\exp(-it_kb_k)}{it_k}\varphi(t_1,...,t_p)dt_1...dt_p$
### Independence of random variables
$X_1,...,X_p$ are (mutually) independent, if $P(X_1\in\mathcal B_1,...,X_p\in\mathcal B_p)=P(X_1\in\mathcal B_1)...P(X_p\in\mathcal B_p),\ \forall$ (Borel) sets $\mathcal B_1,...,\mathcal B_p$
#### Theory
$X_1,...,X_p$ are independent iff $F_{\vec{X}}(\vec{x}=(x_1,...,x_p))=F_{X_1}(x_1)F_{X_2}(x_2)...F_{X_p}(x_p),\ \forall x_1,...,x_p$
#### Corollary
1. If $\vec{X}$ is discrete, then $X_1,...,X_p$ are independent iff $p_{\vec{X}}(x_1,...,x_p)=p_{X_1}(x_1)...p_{X_p}(x_p),\ \forall x_1,...,x_p$
2. If $\vec{X}$ is continuous, then $X_1,...,X_p$ are independent iff $f_{\vec{X}}(\vec{x})=f_{X_1}(x_1)...f_{X_p}(x_p),\ \forall x_1,...,x_p$
### Independence & Expectation
$X_1,...,X_p$ are independent $\Longrightarrow$ $\mathbb E\{g_1(X_1)...g_p(X_p)\}=\mathbb E\{g_1(X_1)\}...\mathbb E\{g_p(X_p)\}$
### Covariance
The covariance of $X,\ Y$ is defined by $E(XY)-E(X)E(Y)\triangleq cov(X,Y)$.
Note:
* Taking $g_1(X)=X,\ g_2(Y)=Y$ in the result above: if $X\perp Y$, then $E(XY)=E(X)E(Y)$, i.e. $cov(X,Y)=0$.
* A larger covariance (in magnitude) suggests stronger (linear) dependence.
* But covariance = 0 does not imply independence.
#### Theory
1. $cov(X,Y)=E[\{X-E(X)\}\{Y-E(Y)\}]$
2. $cov(X,X)=var(X)$
3. $cov(aX,Y)=acov(X,Y)=cov(X,aY)$
4. $cov(aX+bY,cX+dY)=ac\,cov(X,X)+ad\,cov(X,Y)+bc\,cov(Y,X)+bd\,cov(Y,Y)$
5. $var(aX+bY)=a^2var(X)+b^2var(Y)+2abCov(X,Y)$
6. (Cauchy-Schwarz Inequality) $cov(X,Y)^2\leq var(X)var(Y)$
* Equality holds iff $Y=aX+b$ with probability 1, for some deterministic constants $a,\ b$.
* **Correlation** of $X$ and $Y$: $Cor(X,Y)\triangleq\frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}$
### Variance & Covariance Matrix of Random Vectors
For random vector $\vec{X}$:
$var(\vec{X})\triangleq\mathbb E[\{\vec{X}-\mathbb E(\vec{X})\}\{\vec{X}-\mathbb E(\vec{X})\}^T]=\begin{pmatrix} cov(X_i,X_j)\end{pmatrix}_{p\times p}=\mathbb E(\vec{X}\vec{X}^T)-\mathbb E(\vec{X})\mathbb E(\vec{X})^T$
$cov(\vec{X}, \vec{Y})\triangleq\mathbb E(\vec{X}\vec{Y}^T)-\mathbb E(\vec{X})\mathbb E(\vec{Y})^T=\begin{pmatrix}cov(X_i,Y_j)\end{pmatrix}_{p\times q}$
Week 7
---
### Conditional distribution
Let $X,\ Y$ be two random variables.
1. If $(X, Y)^T$ is discrete, then the conditional probability mass function of $X$ given $Y=y$ is $P_{X\vert Y}(x\vert y)=P(X=x\vert Y=y)=\frac{P_{X,Y}(x, y)}{P_Y(y)}$
* $P_{X\vert Y}(x\vert y)=P_X(x),\ \forall\ x,\ y \iff X\perp Y$
* conditional cdf = $\sum\limits_{x_k\leq x}P(X=x_k\vert Y=y)$
2. If $(X, Y)^T$ is continuous, then the conditional probability density function of $X$ given $Y=y$ is $f_{X\vert Y}(x\vert y)=\frac{f_{X,Y}(x, y)}{f_Y(y)}$
* $f_{X\vert Y}(x\vert y)=f_X(x),\ \forall\ x,\ y \iff X\perp Y$
* conditional cdf: $F_{X\vert Y}(x\vert y) = \int_{-\infty}^{x}f_{X\vert Y}(u\vert y)du$
* This definition works even when $P(Y=y) = 0$, as long as the conditional density is defined (i.e. $f_Y(y)>0$).
> If $\mathbb P(Y=y)=0$, intuitively $\mathbb P(X\leq x|Y=y)=\lim\limits_{\Delta{y}\rightarrow 0}\mathbb P(X\leq x|y\leq Y<y+\Delta{y})$
### Conditional Expectation
1. If $(X,Y)^T$ is discrete, then $\mathbb E(X\vert Y=y)=\sum\limits_{x}x\mathbb P(X=x\vert Y=y)$
2. If $(X,Y)^T$ is continuous, then $\mathbb E(X\vert Y=y)=\int_{-\infty}^{\infty}xf_{X\vert Y}(x\vert y)dx$
We denote $\mathbb E(X\vert Y)=g(Y)$, where $g(y)=E(X\vert Y=y)$
### Conditional variance
$var(X\vert Y=y)\triangleq\mathbb E[\{X-\mathbb E(X\vert Y=y)\}^2\vert Y=y]$
#### Proposition
1. $\mathbb E\{g(X,Y)\vert Y=y\}=\mathbb E\{g(X,y)\vert Y=y\}$.
In particular, $\mathbb E\{g_1(X)g_2(Y)\vert Y=y\}=\mathbb E\{g_1(X)\vert Y=y\}g_2(y)$
2. $\mathbb E\{\mathbb E(X\vert Y)\}=\mathbb E(X)$
3. $var(X)=\mathbb E\{var(X\vert Y)\}+var\{\mathbb E(X\vert Y)\}$
#### Theory
$\mathbb E(X\vert Y=y)$ minimizes $E[\{X-g(Y)\}^2]$ over $g$
#### Theory (Chebyshev's Inequality)
$P(|X-\mathbb E(X)|\geq \epsilon)\leq\frac{var(X)}{\epsilon^2},\ \forall\epsilon > 0$
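Chebyshev's bound can be verified exactly on a small pmf (my own example with a fair die, not from the notes):

```python
# Exact check of Chebyshev's inequality for a fair six-sided die.
vals = [1, 2, 3, 4, 5, 6]
p = [1 / 6] * 6

mean = sum(v * pk for v, pk in zip(vals, p))               # 3.5
var = sum((v - mean) ** 2 * pk for v, pk in zip(vals, p))  # 35/12

eps = 2.0
lhs = sum(pk for v, pk in zip(vals, p) if abs(v - mean) >= eps)  # P(|X - EX| >= 2) = 1/3
rhs = var / eps ** 2                                             # 35/48, so the bound holds
```

The bound is loose here ($1/3$ vs. $35/48$), which is typical: Chebyshev only uses the variance, not the full distribution.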
---
### $\S$ Common families of distribution
#### Discrete Uniform Distribution
$\Omega=\{1,...,N\}$, $\mathcal F=2^\Omega$
$P(\{k\})=\frac{1}{N},\ \forall k=1,...,N$
Consider $X(\omega)=\omega$
$P(X=k)=\begin{cases}
\frac{1}{N} & \text{if } k = 1,...,N\\
0 & \text{otherwise}
\end{cases}$
$\mathbb E(X)=\frac{1}{N}\sum\limits^{N}_{k=1}k=\frac{N+1}{2}$
$var(X)=\mathbb E(X^2)-\mathbb E(X)^2=\frac{(N+1)(2N+1)}{6}-\left(\frac{N+1}{2}\right)^2=\frac{N^2-1}{12}$
#### Hypergeometric Distribution
Scenario: draw $n$ balls from $N$ balls ($K$ black, $N-K$ white) without replacement
$|\Omega|=\dbinom{N}{n}$
$P(\{\omega\})=\frac{1}{\binom{N\ }{n}},\ \forall\omega\in\Omega$
Let $X$ be the number of black balls
then: $P(X=k)=\dbinom{K}{k}\dbinom{N-K}{n-k}/\dbinom{N}{n},\ \text{for } \max(0,\ n-N+K)\leq k\leq\min(K, n)$
##### Record with order
$|\Omega|=P^{N}_{n}$, $P(\{\omega\})=1/P^{N}_{n}$
Let $X$ be the number of black balls
then: $P(X=k)=\dbinom{n}{k}P^{K}_{k}P^{N-K}_{n-k}/P^{N}_{n}=\dbinom{K}{k}\dbinom{N-K}{n-k}/\dbinom{N}{n}$
$E(X)=\sum\limits_{k}kP(X=k)
\\=\sum\limits^{n\land K}_{k=1}k\frac{\dbinom{K}{k}\dbinom{N-K}{n-k}}{\dbinom{N}{n}}\\
=\sum\limits^{n\land K}_{k=1}K\frac{\dbinom{K-1}{k-1}\dbinom{N-K}{n-k}}{\dbinom{N}{n}}
\\=K\sum\limits^{n\land K}_{k=1}\frac{\dbinom{K-1}{k-1}\dbinom{N-K}{n-k}}{\dbinom{N-1}{n-1}}\times\frac{\dbinom{N-1}{n-1}}{\dbinom{N}{n}}=\frac{K}{N}\times n\ (K\times 1\times\frac{1}{N}\times n)$
> note: $a\land b=\min(a, b)$
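The identity $E(X)=\frac{K}{N}\times n$ derived above can be confirmed with exact arithmetic (a sketch of mine with small illustrative values $N=10$, $K=4$, $n=3$):

```python
# Exact check of E(X) = nK/N for the hypergeometric pmf
# P(X = k) = C(K,k) C(N-K,n-k) / C(N,n).
from math import comb
from fractions import Fraction

N, K, n = 10, 4, 3   # illustrative values
mean = sum(Fraction(k * comb(K, k) * comb(N - K, n - k), comb(N, n))
           for k in range(0, min(K, n) + 1))
# mean is exactly n*K/N = 6/5
```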
#### Bernoulli Distribution
Definition:
$X$ follows a Bernoulli Distribution if $P_X(x)=p^x(1-p)^{1-x}\text{ for x = 0, 1, where }p\in [0,1]$
We denote $X\backsim Bern(p)\iff P(X=1)=p,\ P(X=0)=(1-p)$
#### Binomial Distribution
Suppose that $Y_1,...,Y_n$ are independent and identically distributed (i.i.d) from $Bern(p)$
Let $X=\sum\limits^{n}_{i=1}Y_i$
$\Rightarrow P(X=x)=\binom{n}{x}p^x(1-p)^{n-x}$, for $x=0,1,...,n$
We call $X$ being binomial distributed and denote $X\backsim Binom(n,p)$
---
$X\backsim \text{Hypergeometric}(N,K,n)$: $P(X=x)=\binom{K}{x}\binom{N-K}{n-x}/\binom{N}{n}$ (sampling without replacement)
If instead the balls are drawn **with replacement**, then $X\backsim Binom(n,\frac{K}{N})$, i.e. $P(X=x)=\binom{n}{x}\frac{K^x(N-K)^{n-x}}{N^n}$
---
#### Poisson Distribution
> Poisson process:
> Scenario: count the total number of customers from opening to closing time.
> Method: split the time interval into $n$ equal pieces; each piece either contains a customer or not, the probability of a customer appearing in a piece is proportional to its length, and each piece can be viewed as $Bern(\frac{\lambda}{n})$.
> So the Poisson distribution can be viewed as the limit of the sum of $n$ i.i.d. Bernoulli trials.
Let $N$ be the counting process on $[0,1]$ s.t.
1. $\forall 0\leq s_1<t_1\leq s_2<t_2\leq 1$, $N(t_1)-N(s_1)$ and $N(t_2)-N(s_2)$ are independent.
2. $\forall 0<t<1$, $P\{N(t+h)-N(t)=1\}=\lambda h+o(h)$
Consider the partition $0<\frac{1}{n}<...<\frac{n-1}{n} < 1$
then: $N(\frac{i}{n})-N(\frac{i-1}{n}),\ (i=1,...,n)$ are (approximately) i.i.d. $Bern(\frac{\lambda}{n})$
i.e. $P(N(1)=x)=\binom n x(\frac{\lambda}{n})^x(1-\frac{\lambda}{n})^{n-x}$
As $n\longrightarrow\infty$, $P(N(1)=x)\longrightarrow\frac{\lambda^x}{x!}e^{-\lambda}$ for all $x\in\mathbb N\cup\{0\}$,
and this limiting distribution is called the Poisson distribution.
$E(X)=\sum\limits^{\infty}_{x=0}x\frac{\lambda^x}{x!}e^{-\lambda}=\lambda\sum\limits^{\infty}_{x=1}\frac{\lambda^{x-1}}{(x-1)!}e^{-\lambda}=\lambda$
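The limit $Binom(n,\frac{\lambda}{n})\rightarrow Poisson(\lambda)$ can be watched numerically (my own sketch; $\lambda=2$ and $x=3$ are arbitrary illustrative choices):

```python
# Compare the Binom(n, lambda/n) pmf at x with its Poisson(lambda) limit
# lambda^x e^{-lambda} / x! as n grows.
from math import comb, exp, factorial

lam, x = 2.0, 3

def binom_pmf(n):
    p = lam / n
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

poisson_pmf = lam ** x * exp(-lam) / factorial(x)
gap = abs(binom_pmf(10_000) - poisson_pmf)   # shrinks toward 0 as n grows
```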
#### Geometric Distribution
Let $\{Y_i:i\in\mathbb N\}$ be a sequence of i.i.d. $Bern(p)$ random variables.
Let $X=\min\{k\in\mathbb Z:k\geq 0,\ Y_{k+1}=1\}$, which is the number of failures before the first success.
$P(X=x)=(1-p)^xp$, $\forall x\in\{0\}\cup\mathbb N$
We say that $X$ follows a geometric distribution.
#### Negative Binomial Distribution
Let $X$ be the number of failures before the $r$-th success.
Then $X\backsim\mathop{NegativeBinomial}(r,p)$
$P(X=x)=\binom{r+x-1}{r-1}(1-p)^xp^r$, for $x\in\{0\}\cup\mathbb N$
---
### Continuous Distribution Families
#### Uniform Distribution
**Definition**
$X$ follows a uniform distribution if $f_X(x;a,b)=\frac{1}{b - a}1(a\leq x \leq b)$, where $a<b$ are two parameters. We denote $X\backsim\mathop{Unif(a,b)}$
**Theory**
Let $X$ be a continuous random variable with density function $f_X(x)$ and cdf $F_X(x)$, then $U=F_X(X)\backsim\mathop{Unif(0,1)}$
> Suppose $F_X$ is strictly increasing on some interval; then its inverse exists on that interval.
$\mathbb P(F_X(X)\leq x)=\mathbb P(X\leq F_X^{-1}(x))=F_X(F_X^{-1}(x))=x$ for $x\in [0,1]$
> Use the uniform distribution to generate any distribution whose CDF is strictly increasing:
> Generate $U\backsim\mathop{Unif(0,1)}$ -> compute $F^{-1}$ -> let $X=F^{-1}(U)$ -> then $X\backsim F(x)$
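The inverse-CDF recipe can be sketched for a concrete case (my own illustration using the exponential CDF $F(x)=1-e^{-\lambda x}$, so $F^{-1}(u)=-\ln(1-u)/\lambda$):

```python
# Inverse-transform sampling: X = F^{-1}(U) with U ~ Unif(0,1).
# For Exp(lambda), F^{-1}(u) = -ln(1 - u) / lambda.
import random
from math import log

random.seed(0)                       # fixed seed so the sketch is reproducible
lam = 2.0
sample = [-log(1 - random.random()) / lam for _ in range(100_000)]
sample_mean = sum(sample) / len(sample)   # should be near 1/lambda = 0.5
```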
#### Exponential Distribution
Extending the Poisson setup to the time period $[0,t]$: let $N(t)$ be the number of customers in $[0,t]$ (approximately $Binom(n,\frac{\lambda t}{n})$, which converges to $Poisson(\lambda t)$).
Let $T_1$ be the time of the first arrival, then $P(T_1>t)=P(N(t)=0)=\exp(-\lambda t)$ for $t>0$
$\Rightarrow$ CDF of $T_1$: $F_{T_1}(t)=1-e^{-\lambda t}$ for $t\in [0,\infty)$
$\Rightarrow f_{T_1}(t)=\lambda e^{-\lambda t}$ for $t\in [0,\infty)$
$\Rightarrow$ We say $T_1$ follows an exponential distribution, denoted by $T_1\backsim\mathop{Exp(\lambda)}$
$\beta=\frac{1}{\lambda}$ is called the scale parameter. (Smaller $\beta$ means the first arrival tends to come sooner.)
##### Sum of Two Independent Exponentials
Let $X_1\backsim\mathop{Exp(\lambda)}$ and $X_2\backsim\mathop{Exp(\lambda)}$ be independent, and let $X=X_1+X_2$. Then:
$\mathbb P(X_1+X_2\leq x)=\int_{0}^{x}\int_{0}^{x-x_2}\lambda e^{-\lambda x_1}\lambda e^{-\lambda x_2}dx_1dx_2$
$\Rightarrow f_X(x)=\lambda^2xe^{-\lambda x}$ for $x>0$
##### Theory
If $X_i\backsim\mathop{Exp(\lambda)},\ i=1,...,n$ and $\{X_1,...,X_n\}$ are mutually independent, then $Y=\sum\limits_{i=1}^{n}X_i$ has pdf $f_Y(x)=\frac{\lambda^n}{(n-1)!}x^{n - 1}e^{-\lambda x}$ for $x>0$
$\Rightarrow$ Normalization check: $\int_{0}^{\infty}\lambda^n x^{n-1} e^{-\lambda x}dx=\Gamma(n)=(n-1)!$
We say that $Y$ follows a Gamma distribution, denoted by $Y\backsim\mathop{Gamma(n,\lambda)}$
#### Weibull Distribution
Let $X\backsim\mathop{Exp(\lambda)}$, consider $Y=X^{\frac{1}{\gamma}}$, then
$Y$ follows a Weibull Distribution, denoted by $Y\backsim\mathop{Weibull(\lambda,\gamma)}$
#### Normal Distribution
**Definition**
$X$ follows a normal distribution with mean $\mu$ and variance $\sigma^2$ if $f_X(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
> Expectation = ? Variance = ? MGF = ? Characteristic function = ?
##### Combination of Normal Distributions
$X_1\backsim\mathop{N(\mu_1,\sigma^2_1)}$, $X_2\backsim\mathop{N(\mu_2,\sigma^2_2)}$, $X_1,X_2$ are independent.
$\Rightarrow X_1 + X_2\backsim\mathop{N(\mu_1+\mu_2,\sigma^2_1+\sigma^2_2)}$
$M_{X_1+X_2}(t)=E(e^{t(X_1+X_2)})=E(e^{tX_1}e^{tX_2})=E(e^{tX_1})E(e^{tX_2})=e^{\mu_1 t+\frac{\sigma^2_1}{2}t^2}e^{\mu_2 t+\frac{\sigma^2_2}{2}t^2}=e^{(\mu_1+\mu_2)t+\frac{\sigma^2_1 + \sigma^2_2}{2}t^2}$ (When indep.)
**Definition**
$\vec{X}\in\mathbb R^p$ follows a multivariate normal distribution if $\vec{a}^T\vec{X}$ follows a univariate normal distribution, $\forall\vec{a}\in\mathbb R^p$
**Theory**
If $\vec{X}$ follows a multivariate normal distribution, then:
$f_{\vec{X}}(\vec{x};\mu,\Sigma)=\frac{1}{(2\pi)^{p/2}det(\Sigma)^{1/2}}e^{-\frac{(\vec{x}-\mu)^T\Sigma^{-1}(\vec{x}-\mu)}{2}}$
> $\Sigma$ must be a positive semi-definite matrix (positive definite for the density to exist)
**Theory**
If $\vec{X}\backsim N_p(\mu,\Sigma)$ and $A$ is a $q\times p$ deterministic matrix of full rank, then:
$A\vec{X}+\vec{b}\backsim N_q(A\mu+\vec{b},A\Sigma A^T)$
**Theory**
Let $(X,Y)^T\backsim N_2(\begin{pmatrix}\mu_X\\\mu_Y\end{pmatrix}, \begin{pmatrix}
\sigma^2_X & \rho\sigma_X\sigma_Y \\
\rho\sigma_X\sigma_Y & \sigma^2_Y\\
\end{pmatrix})$ , then:
$X|Y=y\backsim\mathop{N(\mu_X+(y-\mu_Y)\rho\sigma_X/\sigma_Y,\ \sigma_X^2(1-\rho^2))}$
> If the correlation is 0, then the components of a multivariate normal distribution are independent.
> Proving the 2-dimensional case is easier.
#### Chi-Square Distribution
If $\vec{Z}\backsim N_p(0,I_p)$, then $X=\vec{Z}^T\vec{Z}$ follows a $\chi^2_p$ distribution.
In particular, $f_X(x;p)=\frac{x^{p/2-1}e^{-x/2}}{2^{p/2}\Gamma(p/2)}1(x>0)$
i.e. $\chi^2_p\equiv\mathop{Gamma}(p/2,1/2)$
#### Student's t-distribution
$Z\backsim\mathop{N(0,1)}$, $Y\backsim\chi^2_p$, and $Z,Y$ are indep.
$\Rightarrow X=\frac{Z}{(Y/p)^{1/2}}$ follows a $t_p$ distribution.
### $\S$ Random Sample
**Definition**
A collection of random variables/vectors $\{\vec{X}_i\}^n_{i=1}$ is called a random sample if $\vec{X_1},...,\vec{X_n}$ are i.i.d (independent, identically distributed) from a distribution $F_{\vec{X}}(\vec{x})$
Let $\{X_i\}^n_{i=1}$ be a random sample and $T(x_1,...,x_n)$ be a function. $T(X_1,...,X_n)$ (another random variable) is called a "statistic" if it involves no unknown parameter (it depends only on the data $x_1,...,x_n$, the realized values).
The distribution of $T(X_1,...,X_n)$ is called the sampling distribution.
e.g.
* $\frac{X_1+X_2+...+X_n}{n}\triangleq\bar{X}$: Sample Mean
* $\frac{1}{n-1}\sum\limits^{n}_{i=1}(X_i-\bar{X})^2\triangleq S^2$: Sample Variance
$\mathbb E(\bar{X})=\mathbb E(\frac{1}{n}\sum\limits^{n}_{i=1}X_i) = \frac{1}{n}\sum\limits^{n}_{i=1}\mathbb E(X_i)=\frac{1}{n}\sum\limits^{n}_{i=1}\mu=\mu$ **(Sample Mean = Population Mean)**
> $\mathbb E(X_i) = \mu,\ \forall X_i$ (since i.i.d)
$var(\bar{X})=var(\frac{1}{n}\sum\limits^{n}_{i=1}X_i)=\frac{1}{n^2}var(\sum\limits^{n}_{i=1}X_i)=\frac{1}{n^2}(\sum\limits^{n}_{i=1}var(X_i)+2\sum\limits_{i<j}cov(X_i, X_j))\\=\frac{1}{n^2}\sum\limits^{n}_{i=1}var(X_i)=\frac{\sigma^2}{n}\rightarrow0$ as $n\rightarrow\infty$
> $cov(X_i, X_j)=0$, since i.i.d
> When $n\rightarrow\infty$, sample mean $\bar{X}\rightarrow\mu$ (population mean)
$\mathbb E(S^2)=\mathbb E\{\frac{1}{n - 1}\sum\limits^{n}_{i=1}(X_i - \bar{X})^2\}=\frac{1}{n - 1}\sum\limits^{n}_{i=1}\mathbb E\{(X_i-\bar{X})^2\}\\=\frac{1}{n - 1}\sum\limits^{n}_{i = 1}\mathbb E\{(X_i - \frac{1}{n} \sum\limits^{n}_{i = 1}X_i)^2\}=?$
---
Let $\vec{X}=\begin{pmatrix}X_1 \\ X_2 \\ ...\\ X_n\end{pmatrix} \Rightarrow \bar{X}=n^{-1}1^T_n\vec{X}$
$\Rightarrow\vec{X}-\bar{X}1_{n}=\vec{X}-n^{-1}1_n 1^T_n \vec{X}=(\mathbb I_n - n^{-1}1_n1^T_n)\vec{X}$
$\Rightarrow\sum\limits^{n}_{i=1}(X_i-\bar{X})^2\\
=\{(\mathbb I_n - n^{-1}1_n1^T_n)\vec{X}\}^T\{(\mathbb I_n - n^{-1}1_n1^T_n)\vec{X}\}\\
=\vec{X}^T\{(\mathbb I_n - n^{-1}1_n1^T_n)\}^T\{(\mathbb I_n - n^{-1}1_n1^T_n)\}\vec{X}\\
=\vec{X}^T\{(\mathbb I_n - n^{-1}1_n1^T_n)\}\vec{X}$
$\Rightarrow S^2=\frac{1}{n-1}\vec{X}^T\{(\mathbb I_n - n^{-1}1_n1^T_n)\}\vec{X}$
$\Rightarrow\mathbb E(S^2)=\mathbb E\{\frac{1}{n-1}\vec{X}^T\{(\mathbb I_n - n^{-1}1_n1^T_n)\}\vec{X}\}\\
=\frac{1}{n - 1}\mathbb E\{\vec{X}^T\{(\mathbb I_n - n^{-1}1_n1^T_n)\}\vec{X}\}\\
=\frac{1}{n - 1}[\mu^2 1^T_n (\mathbb I_n-n^{-1} 1_n 1 ^T_n)1_n + \sigma^2 \mathop{tr}\{(\mathbb I_n-n^{-1} 1_n 1 ^T_n)\mathbb I_n\}]\\
=\frac{1}{n - 1} (n - 1)\sigma^2 = \sigma^2$
> $\mathbb E(\vec{X}^TA\vec{X})=\mathbb E(\vec{X})^T A \mathbb E(\vec{X}) + \mathop{tr}\{A \mathop {var}(\vec{X})\}$
> How can this be proved?
> $\mathbb E(\vec{X})=\begin{pmatrix}\mu \\ \mu \\ ... \\ \mu\end{pmatrix}=\mu 1_n$
> $var(\vec{X})=\sigma^2 \mathbb I_n$
---
> Note: If $S^2=\frac{1}{n-1}\sum\limits^{n}_{i=1}(X_i - \bar{X})^2$, then $\mathbb E(S^2)=\sigma^2$
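The unbiasedness of $S^2$ shown above can be checked by simulation (my own sketch; the choice $N(0,\sigma^2=4)$ with $n=5$ is arbitrary):

```python
# Monte Carlo check that S^2 with the 1/(n-1) factor is unbiased:
# average S^2 over many samples and compare with the true variance.
import random

random.seed(2)
n, reps, true_var = 5, 50_000, 4.0
total = 0.0
for _ in range(reps):
    x = [random.gauss(0.0, 2.0) for _ in range(n)]   # sigma = 2, so var = 4
    xbar = sum(x) / n
    total += sum((xi - xbar) ** 2 for xi in x) / (n - 1)
avg_s2 = total / reps   # should be near true_var = 4
```

Dividing by $n$ instead of $n-1$ in the same simulation would give an average near $\frac{n-1}{n}\sigma^2=3.2$, which is exactly the bias the $\frac{1}{n-1}$ factor removes.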
#### Theory
Suppose that $X_i \sim \mathcal{N}(\mu, \sigma^2)$, then:
> Ask whether we need to be able to prove this
1. $\bar{X}$ and $S^2$ are independent.
2. $\bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})$
3. $n^{1/2}(\bar{X} - \mu)/S \sim t_{n-1}$
4. $(n - 1)S^2/\sigma^2 \sim \chi^2_{n - 1}$
#### Recall
Chebyshev's Inequality:
$\mathbb P(|X - \mathbb E(X)| \geq \epsilon) \leq \frac{var(X)}{\epsilon^2}$
$\Rightarrow\mathbb P(|\bar{X} - \mu| \geq \epsilon) \leq \frac{\sigma^2}{n\epsilon^2}$
#### Definition
A sequence of random variables $\{Y_n:\ n\in\mathbb N\}$ is said to converge in probability to $Y$, denoted by $Y_n\xrightarrow{P}Y$, if $\lim\limits_{n\rightarrow\infty}\mathbb P(|Y_n - Y| \geq \epsilon) = 0$, $\forall\epsilon > 0$
#### Theory (Weak law of large number)
$\bar{X}\xrightarrow{P}\mu=\mathbb E(X_i)$ (for i.i.d. $X_i$ with finite mean)
#### Theory (Central limit theory)
$n^{1/2}\frac{\bar{X} - \mu}{\sigma}\xrightarrow{d}\mathop{N(0, 1)}$
##### Definition
A sequence of random variables $\{X_n : n \in\mathbb N\}$ is said to converge in distribution to $X$, denoted by $X_n\xrightarrow{d}X$, if $\lim\limits_{n \rightarrow \infty}F_{X_n}(x)=F_X(x)$, $\forall x \text{ s.t. } F_X \text{ is continuous at } x$
> $X_n$ is the sequence of the random variables.
> Proof
> $\frac{n^{1/2}(\bar{X}-\mu)}{\sigma}=\frac{1}{\sqrt{n}}\sum\limits^{n}_{i=1}(\frac{X_i - \mu}{\sigma})=\sum\limits^{n}_{i=1}(\frac{X_i - \mu}{\sqrt{n} \sigma})=\sum^{n}_{i=1}Z_i$
> $\Rightarrow \psi_{Z}(t)\\=\mathbb E(e^{it\sum^{n}_{i=1}Z_i})\\=\prod\limits_{i=1}^{n}\mathbb E(e^{itZ_i})=\prod\limits_{i=1}^{n}\psi_{Z_i}(t)\\\approx [1+\frac{it}{\sqrt{n}\sigma}\mathbb E(X_i-\mu) - \frac{t^2}{2n\sigma^2}\mathbb E\{(X_i - \mu) ^2\}]^n \\ = (1 - \frac{t^2}{2n\sigma^2}\sigma^2)^n \rightarrow e^{-\frac{t^2}{2}}$ (the characteristic function of $N(0,1)$)
> Taylor expansion: 2nd-order approximation of $e^{itZ_i}$, expanded at $t = 0$
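The CLT can be illustrated by simulation (a sketch of mine, using $Unif(0,1)$ summands with mean $1/2$ and variance $1/12$):

```python
# CLT illustration: standardized sample means of Unif(0,1) draws should
# behave like N(0,1); here we check that about half the z-values are <= 0.
import random
from math import sqrt

random.seed(1)
n, reps = 50, 20_000
mu, sigma = 0.5, sqrt(1 / 12)       # mean and sd of Unif(0,1)

zs = []
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    zs.append(sqrt(n) * (xbar - mu) / sigma)

frac_below_zero = sum(z <= 0 for z in zs) / reps   # should be near 0.5
```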
#### Theory
Suppose that $X_n \xrightarrow{P} X$, $Y_n \xrightarrow{P} Y$, then:
1. $a X_n + b Y_n \xrightarrow{P} a X + b Y$
2. $X_n Y_n \xrightarrow{P} X Y$
3. If $g$ is continuous, then $g(X_n) \xrightarrow{P} g(X)$
* $X_n \xrightarrow{d} X \iff \mathbb E\{g(X_n)\}\rightarrow \mathbb E\{g(X)\}$, for each bounded continuous $g$
* If $X_n \xrightarrow{d} X$ and $g$ is continuous, then $g(X_n)\xrightarrow{d}g(X)$
#### Theory (Slutsky)
Suppose that $X_n\xrightarrow{d}X$ and $Y_n\xrightarrow{P}c$, where $c$ is a constant.
1. $X_n + Y_n \xrightarrow{d} X + c$
2. $X_n Y_n \xrightarrow{d} cX$
#### Theory (First order delta method)
Suppose that $n^{1/2}(X_n - \mu) \xrightarrow{d} N(0, \sigma^2)$ (asymptotic normality) and $g$ is a $C^1$ function. Then:
$n^{1/2}\{g(X_n) - g(\mu)\} \xrightarrow{d} N[0,
\{g'(\mu)\}^2\sigma^2]$
> Note: $g(X_n) - g(\mu) \approx g'(\mu)(X_n - \mu)$
### $\S$ Point Estimation
#### Definition
1. A statistical model is a collection of distribution functions. (cdf, pmf, pdf)
2. Let $\mathcal M$ be a statistical model. If $\mathcal M$ can be indexed by $\theta \in \mathbb R^p$, i.e. $\mathcal M=\{F(x ; \theta):\theta \in \Theta \subset \mathbb R^p \}$, then $\mathcal M$ is called a parametric model, $\theta$ is called the parameter, and $\Theta$ is called the parameter space.
> The parameter $\theta$ is called identifiable if the map $\theta \mapsto F(x ; \theta)$ is injective (one-to-one)
> Ex. parameterize a point on a circle by the angle $\theta$: if $\Theta = \mathbb R$, then multiple values of $\theta$ map to the same point (e.g. $0$ and $2\pi$), so $\theta$ is not identifiable.
3. A statistical parameter is a function $\theta: \mathcal M \rightarrow \Theta$
> Ex. Mean, Variance
4. A point estimator is a statistic that specifies a point in a parameter space based on the observed data.
#### Common Point Estimator
By Weak Law of Large Number, $\bar{X}$ can serve as a point estimator.
> e.g. $X_1, ..., X_n \sim Bernoulli(p)$, $p \in [0, 1]$, let $p_0$ be the true parameter.
> According to WLLN, $\bar{X} \xrightarrow{P} \mathbb E(X)=p_0$, so we can use sample mean $\bar{X}$ to estimate $p_0$
> e.g. $X_1, ..., X_n \sim Exp(\lambda)$, $\lambda > 0$. Let $\lambda_0$ be the true parameter.
> $\bar{X} \xrightarrow{P} \frac{1}{\lambda_0}$
> $\hat{\lambda} = \bar{X}^{-1} \xrightarrow{P} \lambda_0$
> e.g. $X_1, ..., X_n \sim Unif(0, a)$. Let $a_0$ be the true parameter.
> $\bar{X} \xrightarrow{P} \frac{a_0}{2}$
> $\hat{a} = 2 \bar{X} \xrightarrow{P} a_0$
> e.g. $X_1, ..., X_n \sim Unif(a, b)$. Let $a_0, b_0$ be the true parameter.
> $\bar{X} \xrightarrow{P} \frac{a_0 + b_0}{2}$
> $\frac{1}{n}\sum\limits^{n}_{i=1}X_i^2 \xrightarrow{P} \frac{(b_0 - a_0) ^ 2}{12} + (\frac{a_0 + b_0}{2})^2$
> Solve this system of equations for $\hat a,\ \hat b$
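That pair of moment equations for $Unif(a,b)$ can be solved in closed form: from $m_1=\frac{a+b}{2}$ and $m_2-m_1^2=\frac{(b-a)^2}{12}$ we get $\frac{b-a}{2}=\sqrt{3(m_2-m_1^2)}$. A simulation sketch of mine (the true parameters $a=2$, $b=6$ are chosen arbitrarily):

```python
# Method-of-moments estimator for Unif(a, b): invert
#   m1 = (a + b) / 2,   m2 - m1^2 = (b - a)^2 / 12.
import random
from math import sqrt

def unif_moment_estimator(xs):
    n = len(xs)
    m1 = sum(xs) / n
    m2 = sum(x * x for x in xs) / n
    half_width = sqrt(3 * (m2 - m1 ** 2))   # estimates (b - a) / 2
    return m1 - half_width, m1 + half_width

random.seed(3)
xs = [random.uniform(2, 6) for _ in range(100_000)]   # true (a, b) = (2, 6)
a_hat, b_hat = unif_moment_estimator(xs)              # should be near (2, 6)
```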
#### Moment Estimators
Let $\mathcal M = \{F(x ; \theta) : \theta \in \Theta \subset \mathbb R^p\}$. For each $\theta \in \Theta$, the $k$-th moment is $\mu_k(\theta)=\int x^k dF(x ; \theta)$. By WLLN, $n^{-1}\sum\limits^{n}_{i=1}X_i^k \xrightarrow{P} \mu_k(\theta_0)$
> Proof?
Let $\vec{Y}_i = (X_i, X_i^2, ..., X_i^p)$ and $\vec{\mu}(\theta)=(\mu_1(\theta), \mu_2(\theta), ..., \mu_p(\theta))$
Then $n^{-1}\sum\limits^{n}_{i=1}\vec{Y_i} \xrightarrow{P} \vec{\mu}(\theta_0)$
The moment estimator is defined as $\vec{\mu}^{-1}(n^{-1}\sum\limits^{n}_{i=1}\vec{Y_i})$.
#### Definition
Let $\hat{\theta}$ be an estimator for parameter $\theta \in \mathbb R$; $\hat{\theta}$ is called a **consistent** estimator if $\hat{\theta} \xrightarrow{P} \theta_0$, where $\theta_0$ is the true parameter.
Some commonly used criteria:
1. The bias of $\hat{\theta}$ is defined as $\mathbb E(\hat{\theta} ; \theta)-\theta$ (a function of $\theta$); $\hat{\theta}$ is called unbiased if $\mathbb E(\hat{\theta} ; \theta) - \theta = 0,\ \forall \theta \in \Theta$
2. $var(\hat{\theta} ; \theta)$
3. MSE: $\mathbb E\{(\hat{\theta} - \theta) ^2 ; \theta\}$
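These criteria can be approximated by Monte Carlo. A sketch (the function `simulate_mse` and the choice of $\hat a = 2\bar X$ for $Unif(0, a_0)$ are illustrative) that also checks the identity $MSE = var + bias^2$:

```python
import random

def simulate_mse(a0=3.0, n=50, reps=20_000, seed=1):
    """Monte Carlo approximation of bias, variance, and MSE of a_hat = 2 * X_bar
    for Unif(0, a0) samples; MSE should equal variance + bias^2."""
    rng = random.Random(seed)
    ests = []
    for _ in range(reps):
        xs = [rng.uniform(0.0, a0) for _ in range(n)]
        ests.append(2.0 * sum(xs) / n)
    mean = sum(ests) / reps
    bias = mean - a0                                   # ~ 0: 2 * X_bar is unbiased
    var = sum((e - mean) ** 2 for e in ests) / reps    # ~ 4 * a0^2 / (12 n)
    mse = sum((e - a0) ** 2 for e in ests) / reps
    return bias, var, mse

bias, var, mse = simulate_mse()
print(bias, var, mse)
```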
#### Maximum Likelihood Estimator
---
e.g. $X_1, X_2 \sim Bernoulli(p),\ p \in [0, 1]$. Realizations: $X_1 = 1, X_2 = 0$
Question: which is more plausible, $p = 0.1$ or $p=0.2$?
$\Rightarrow P(X_1 = 1, X_2 = 0 ; p = 0.1) = 0.1 \times 0.9 = 0.09$
$\Rightarrow P(X_1 = 1, X_2 = 0 ; p = 0.2) = 0.2 \times 0.8 = 0.16$
(Choose $p = 0.2$, which gives a larger probability of observing the realization $(1, 0)$)
Let realization of $X_1 = x_1$, and realization of $X_2 = x_2$
$\Rightarrow P(X_1 = x_1, X_2 = x_2 ; p) = p^{x_1} (1 - p)^{1 - x_1} \times p^{x_2} (1 - p) ^{1 - x_2} = p^{x_1 + x_2} (1 - p)^{2 - (x_1 + x_2)} = f(p)$, $p \in [0, 1]$
$\Rightarrow \log f(p) = (x_1 + x_2)\log p + \{2 - (x_1 + x_2)\}\log{(1 - p)}$
$\Rightarrow \frac{d}{dp} \log{f(p)} = \frac{x_1 + x_2}{p} - \frac{2 - (x_1 + x_2)}{1 - p} = 0,\ p=\frac{x_1 + x_2}{2}$
$\Rightarrow f(0) = 0,\ f(1) = 0$ (boundary check, assuming $0 < x_1 + x_2 < 2$)
$\therefore f(p)$ has maximum when $p = \frac{x_1 + x_2}{2}$
> Recap: extrema can occur at 1) points where the first derivative is 0, 2) points where the first derivative does not exist, 3) boundary points
$\Rightarrow \hat{p}=\frac{x_1 + x_2}{2}$ (MLE)
---
e.g. Let $X_1, ..., X_n \sim Bernoulli(p)$ (i.i.d), $p \in [0, 1]$
$\Rightarrow f(p) = P(X_1 = x_1, ..., X_n = x_n ; p) = \prod\limits^{n}_{i=1} p^{x_i} (1 - p)^{1 - x_i} = p^{\sum\limits_{i} x_i}(1 - p)^{n - \sum\limits_{i} x_i}$
$\Rightarrow \log f(p) = (\sum\limits_{i} x_i) \log p + (n - \sum\limits_{i} x_i) \log (1 - p)$
$\Rightarrow \frac{d}{dp} \log f(p) = \frac{\sum\limits_{i} x_i}{p} - \frac{n - \sum\limits_{i} x_i}{1 - p} = 0,\ p = \bar{X}$
$\Rightarrow f(0) = f(1) = 0$
$\therefore$ when $p=\bar{X}$, $f(p)$ has maximum.
$\Rightarrow$ MLE: $\hat{p}=\bar{X}=\frac{1}{n}\sum\limits^{n}_{i=1} X_i$
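A quick sanity check of this result (a sketch; the grid search and the data below are illustrative, not from the notes) that the Bernoulli log-likelihood is maximized at $\bar{x}$:

```python
import math

def log_lik(p, xs):
    """Bernoulli log-likelihood: (sum x) log p + (n - sum x) log(1 - p)."""
    s, n = sum(xs), len(xs)
    return s * math.log(p) + (n - s) * math.log(1.0 - p)

xs = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]      # made-up data: sum = 6, n = 10, x_bar = 0.6
grid = [i / 1000 for i in range(1, 1000)]  # interior of [0, 1]
p_hat = max(grid, key=lambda p: log_lik(p, xs))
print(p_hat)  # 0.6, matching x_bar
```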
---
**To be completed**
### $\S$ Hypothesis Test
##### Definition
A hypothesis is a statement about a population parameter, say $\theta$
1. Null hypothesis: $H_0 : \theta \in \Theta_0$
2. Alternative hypothesis: $H_A : \theta \in \Theta_A$
> Note: $\Theta_0 \bigcup \Theta_A$ may not equal to $\Theta$
A hypothesis test is a rule based only on a sample $\vec{X}$ to decide whether to reject $H_0$ (hence accept $H_A$)
A hypothesis test can be formulated as a test function $\varphi(\vec{x}) = \mathbb P(\text{reject } H_0 | \vec{X} = \vec{x})$, written as $\varphi(\vec{X})$ or simply $\varphi$
> Note: $\vec{X} = (X_1, ..., X_n)$
i.e. $\varphi(\vec{X})$ and $\varphi$ are treated as statistics (and hence random variables)
1. If $\varphi(\vec{x}) = 1(\vec{x} \in C)$, it's called a non-randomized test.
2. For a non-randomized test, $C$ is called the rejection region.
---
Ex. $X_1, ..., X_n \sim Bernoulli(p)$ (i.i.d)
$\bar{X} = \frac{1}{n} \sum\limits_{i=1}^{n} X_i$
fair : $p_0 = 0.5$ (Null hypothesis)
If $\bar{X} = 0.6$, is it fair?
If $\bar{X} = 0.52$, is it fair?
Usually a rejection region is used to decide whether to reject the assumption.
i.e. Reject $p_0 = 0.5$ if $|\bar{X} - 0.5| > C$
Rejection region: $\{|\bar{X} - 0.5| > C\}$
As $C$ becomes larger:
$P(|\bar{X} - 0.5| > C ; p = 0.5) = E(\varphi ; p = 0.5)$ becomes smaller
$P(|\bar{X} - 0.5| > C ; p = 0.6) = E(\varphi ; p = 0.6)$ also becomes smaller
There is no common $C$ such that the probability of a correct decision is maximized and that of a wrong decision is minimized at the same time.
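A small simulation illustrating this trade-off (the function `rejection_prob` and the chosen values of $n$ and $C$ are illustrative): as $C$ grows, both $E(\varphi ; p = 0.5)$ (Type I error) and $E(\varphi ; p = 0.6)$ (power) shrink together:

```python
import random

def rejection_prob(p, C, n=100, reps=5_000, seed=7):
    """Monte Carlo estimate of P(|X_bar - 0.5| > C) under Bernoulli(p) samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.random() < p for _ in range(n)) / n
        hits += abs(xbar - 0.5) > C
    return hits / reps

for C in (0.02, 0.05, 0.10):
    print(C, rejection_prob(0.5, C), rejection_prob(0.6, C))
# Both columns shrink as C grows: larger C lowers Type I error AND power.
```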
---
| | $H_0$ is true | $H_0$ is false |
| ------------ | ------------- | -------------- |
| Reject $H_0$ | Type I Error | Power |
| Accept $H_0$ | | Type II Error |
> Note: Power and Type II error may be a collection of values (one for each parameter value in a composite hypothesis).
---
Ex. $X_1, X_2, ..., X_n \sim N(\mu, 1)$ (i.i.d), where $\mu$ is an unknown parameter.
$\begin{cases} H_0: \mu = 0 \\ H_A : \mu \neq 0\end{cases}$
1. Find a test statistic $\bar{X}$
2. Determine the shape of rejection region: $\varphi = 1(|\bar{X}| > C)$
3. Set significance level $\alpha = 0.05$
* Type I error: $P(|\bar{X}| > C ; \mu = 0)$
* Power: $P(|\bar{X}| > C ; \mu = \mu^* \neq 0)$
Type I Error:
$\because \bar{X} \sim N(0, \frac{1}{n})$
$\Rightarrow \sqrt{n} \bar{X} \sim N(0, 1)$
$\Rightarrow P(|\bar{X}| > C ; \mu = 0) = P(|\sqrt n \bar X| > \sqrt n C ; \mu = 0) = 2 \Phi(-\sqrt n C) \leq 0.05$
$\Rightarrow$ We can choose $C$ s.t. $2\Phi(-\sqrt n C) = 0.05$
i.e. $C = -\Phi^{-1}(\frac{0.05}{2})/\sqrt n$
$\varphi = 1(|\bar{X}| > -\Phi^{-1}(\frac{0.05}{2})/\sqrt n)$
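The cutoff $C$ can be computed with the standard normal quantile function. A sketch using Python's stdlib `statistics.NormalDist` (the helper name is illustrative):

```python
import math
from statistics import NormalDist

def two_sided_cutoff(n, alpha=0.05):
    """C = -Phi^{-1}(alpha/2) / sqrt(n): reject H0: mu = 0 when |X_bar| > C."""
    return -NormalDist().inv_cdf(alpha / 2) / math.sqrt(n)

C = two_sided_cutoff(n=100)
print(C)  # -Phi^{-1}(0.025)/10 = 1.9599.../10 ≈ 0.196
```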
---
Ex. $X_1, X_2, ..., X_n \sim N(\mu, 1)$ (i.i.d), where $\mu$ is an unknown parameter.
$\begin{cases} H_0: \mu = 0 \\ H_A : \mu > 0\end{cases}$
1. Find a test statistic $\bar{X}$
2. Determine the shape of rejection region: $\varphi = 1(\bar{X} > C)$
3. Set significance level $\alpha = 0.05$
* Type I error: $P(\bar{X} > C ; \mu = 0)$
* Power: $P(\bar{X} > C ; \mu = \mu^* > 0)$
Type I Error:
$\because \bar{X} \sim N(0, \frac{1}{n})$
$\Rightarrow \sqrt{n} \bar{X} \sim N(0, 1)$
$\Rightarrow P(\bar{X} > C ; \mu = 0) = P(\sqrt n \bar X > \sqrt n C ; \mu = 0) = 1 - \Phi(\sqrt n C) \leq \alpha$
$\Rightarrow$ We can choose $C$ s.t. $1 - \Phi(\sqrt n C) = \alpha$
i.e. $C = \Phi^{-1}(1 - \alpha)/\sqrt n$
$\varphi = 1(\bar{X} > \Phi^{-1}(1 - \alpha)/\sqrt n)$
---
Ex. $X_1, X_2, ..., X_n \sim N(\mu, 1)$ (i.i.d), where $\mu$ is an unknown parameter.
$\begin{cases} H_0: \mu \leq 0 \\ H_A : \mu > 0\end{cases}$
1. Find a test statistic $\bar{X}$
2. Determine the shape of rejection region: $\varphi = 1(\bar{X} > C)$
3. Set significance level $\alpha = 0.05$
* Type I error: $P(\bar{X} > C ; \mu = \mu_0 \leq 0)$
* Power: $P(\bar{X} > C ; \mu = \mu^* > 0)$
both decrease in $C$
Type I Error: $P(\bar{X} > C ; \mu = \mu_0 \leq 0) \leq \alpha,\ \forall \mu_0 \leq 0 \equiv \max\limits_{\mu_0 \leq 0} P(\bar{X} > C; \mu=\mu_0) \leq \alpha$
i.e. the Type I error is the area of the right tail of $N(\mu_0, \frac{1}{n})$ beyond $C$, which grows as $\mu_0$ increases toward $0$
$\Rightarrow$ maximum occurs when $\mu_0 = 0$
> Under $\mu = \mu_0 \leq 0,\ \bar X \sim N(\mu_0, \frac{1}{n})$
$\therefore \max\limits_{\mu_0 \leq 0} P(\bar{X} > C; \mu=\mu_0) \leq \alpha \equiv P(\bar X > C ; \mu = 0) \leq \alpha$
$\Rightarrow \varphi = 1(\bar X > \Phi^{-1}(1 - \alpha)/\sqrt n)$
> Composite Hypothesis: the hypothesis contains more than one parameter value
> Simple Hypothesis: the hypothesis specifies exactly one parameter value
---
Ex. $X_1, X_2, ..., X_n \sim N(\mu, 1)$
$\begin{cases} H_0 : \mu = 0 & \text{(simple)}\\ H_A : \mu = 1 & \text{(simple)}\end{cases}$
likelihood ratio $\mathop{(L.R.)}$ test:
$1\{\frac{f_{\vec{X}}(\vec{X} ; \mu = 0)}{f_{\vec{X}}(\vec{X} ; \mu = 1)} \leq C\}$, where typically $C < 1$ is needed for a level-$\alpha$ test
$f_{\vec{X}}(\vec{X} ; \mu = 0) = (2\pi)^{-\frac{n}{2}}e^{-\frac{1}{2}\sum^{n}_{i=1} X_i^2}$
$f_{\vec{X}}(\vec{X} ; \mu = 1) = (2\pi)^{-\frac{n}{2}}e^{-\frac{1}{2}\sum^{n}_{i=1} (X_i - 1)^2}$
$\Rightarrow \mathop{L.R.} = e^{\frac{1}{2}\{\sum^n_{i=1}(X_i - 1)^2 - \sum^n_{i=1} X_i^2\}}$
$\Rightarrow \varphi = 1(\sum^n_{i=1}(X_i - 1)^2 - \sum^{n}_{i=1}X_i^2 \leq C^*) = 1(\bar X \geq C^{**})$
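A quick expansion showing why the likelihood-ratio inequality reduces to a condition on $\bar X$ alone:

$\sum\limits^{n}_{i=1}(X_i - 1)^2 - \sum\limits^{n}_{i=1} X_i^2 = \sum\limits^{n}_{i=1}(-2X_i + 1) = n - 2\sum\limits^{n}_{i=1} X_i$

so $\mathop{L.R.} \leq C \iff n - 2\sum\limits^{n}_{i=1} X_i \leq C^* \iff \bar X \geq \frac{n - C^*}{2n} = C^{**}$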
(Generalized ver.)
(i).
$\begin{cases} H_0 : \mu = 0 & \text{(simple)}\\ H_A : \mu > 0 & \text{(composite)}\end{cases}$
$\mathop{G.L.R} = \frac{f_{\vec{X}}(\vec{x} ; \mu = 0)}{\max\limits_{\mu \geq 0} f_{\vec{X}}(\vec{x} ; \mu)} \leq C$
> for denominator: $\theta\in\Theta_0\bigcup\Theta_A$
(ii).
$\begin{cases} H_0 : \mu \leq 0 & \text{(composite)}\\ H_A : \mu > 0 & \text{(composite)}\end{cases}$
$\mathop{G.L.R} = \frac{\max\limits_{\mu \leq 0} f_{\vec{X}}(\vec{x} ; \mu)}{\max\limits_{\mu} f_{\vec{X}}(\vec{x} ; \mu)} \leq C$
> for denominator: $\theta\in\Theta_0\bigcup\Theta_A$
---
Ex. $X_1, X_2, ..., X_n \sim \text{Geometric}(p)$ (i.i.d), where $p$ is an unknown parameter.
$\begin{cases} H_0: p = 0.01 \\ H_A: p < 0.01 \end{cases}$
$\hat{L}(p) = \prod\limits^{n}_{i = 1}(1-p)^{X_i}p = (1-p)^{\sum^{n}_{i=1}X_i}p^n$
$\Rightarrow \mathop{GLR} = \frac{\max\limits_{p \in \Theta_0} \hat{L}(p)}{\max\limits_{p \in \Theta_0 \cup \Theta_A} \hat{L}(p)} = \frac{\hat{L}(0.01)}{\max\limits_{p \leq 0.01} \hat{L}(p)}$
$\Rightarrow$ Rejection Region: $\{\frac{\hat{L}(0.01)}{\max\limits_{p \leq 0.01} \hat{L}(p)} \leq C\}$
1. Determine $\max\limits_{p \leq 0.01} \hat L(p)$
(differentiate and check the boundary)
$\Rightarrow \max\limits_{p \leq 0.01} \hat L(p) = \begin{cases} \hat L(0.01) & \text{if } 0.01 \leq (\bar X + 1)^{-1} \\ \hat L((\bar X + 1)^{-1}) & \text{otherwise} \end{cases}$
> See the handwritten notes for the rest
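A sketch of the GLR statistic on the log scale (the function `log_glr` is hypothetical; the case split mirrors the boundary discussion above, and the data are made up):

```python
import math

def log_lik_geom(p, xs):
    """log L(p) = (sum x_i) log(1-p) + n log p, Geometric(p) counting failures."""
    return sum(xs) * math.log(1.0 - p) + len(xs) * math.log(p)

def log_glr(xs, p0=0.01):
    """log GLR for H0: p = p0 vs HA: p < p0; unrestricted max at p = 1/(x_bar+1)."""
    xbar = sum(xs) / len(xs)
    p_hat = 1.0 / (xbar + 1.0)
    # If the unconstrained maximizer already satisfies p_hat >= p0,
    # the constrained max over p <= p0 is attained at the boundary p0.
    p_star = p0 if p_hat >= p0 else p_hat
    return log_lik_geom(p0, xs) - log_lik_geom(p_star, xs)

xs = [150, 80, 200, 120]   # made-up observed failure counts
print(log_glr(xs))         # <= 0; very negative values favor rejecting H0
```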
---
#### Definition
Let $\mathcal T$ be a class of tests for testing $H_0: \theta \in \Theta_0$ vs. $H_A: \theta \in \Theta_A$
A test $\varphi \in \mathcal T$ is called a uniformly most powerful (**UMP**) test in class $\mathcal T$ if $\mathbb E(\varphi ; \theta_A) \geq \mathbb E(\varphi^* ; \theta_A),\ \forall \theta_A \in \Theta_A \text{ and } \forall \varphi^* \in \mathcal T$
#### Theorem (Neyman-Pearson lemma)
Suppose $\mathcal M=\{f(x;\theta):\theta \in \Theta\}$ is a statistical model.
Consider $H_0: \theta=\theta_0\ vs. \ H_A: \theta=\theta_A$
For a fixed $k$, define $C=\{x: f(x ; \theta_A) > k f(x ; \theta_0)\}$ and $C_T=\{x: f(x; \theta_A) = k f(x;\theta_0)\}$
Let $\alpha_l=P(X \in C ; \theta_0)$ and $\alpha_u=P(X \in C \cup C_T; \theta_0)$
Then the NP likelihood ratio test $\varphi(x)=1(x\in C) + \frac{\alpha - \alpha_l}{\alpha_u - \alpha_l}1(x \in C_T)$ is a UMP level $\alpha$ test.
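A sketch of the NP construction for a discrete model (illustrative choice, not from the notes: $X \sim Binomial(n, \theta)$, $H_0: \theta = 0.5$ vs. $H_A: \theta = 0.7$; here the likelihood ratio is increasing in $x$, so $C = \{x > c\}$ and $C_T = \{x = c\}$ for some cutoff $c$):

```python
from math import comb

def binom_pmf(x, n, th):
    """Binomial(n, th) probability mass at x."""
    return comb(n, x) * th**x * (1 - th)**(n - x)

def np_test(n=10, th0=0.5, alpha=0.05):
    """Randomized NP test for H0: th = th0 in a Binomial(n, th) model.
    Returns the cutoff c, alpha_l, alpha_u, and the randomization weight."""
    c = 0
    # smallest c with P(X > c ; th0) <= alpha
    while sum(binom_pmf(x, n, th0) for x in range(c + 1, n + 1)) > alpha:
        c += 1
    a_l = sum(binom_pmf(x, n, th0) for x in range(c + 1, n + 1))  # P(X in C)
    a_u = a_l + binom_pmf(c, n, th0)                              # P(X in C u C_T)
    gamma = (alpha - a_l) / (a_u - a_l)   # randomize on the tie set {x = c}
    return c, a_l, a_u, gamma

print(np_test())  # the exact level alpha is attained via gamma in (0, 1)
```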