
# [Statistics] 04 Discrete Distributions

[TOC]

---

### Definition

The probability distribution of a discrete random variable $X$ describes the behaviour of $X$ by specifying the $\text{possible values of } X$ and the $\text{corresponding probability for each possible value}$.

### Abstract

This article covers the following five discrete distributions:

- **$\text{Binomial}$**
  A sequence of trials drawn with replacement ($\implies$ independent), each with only two mutually exclusive outcomes. Each trial is called a Bernoulli trial.
  $\to$ Find the probability of getting the designated outcome $X$ times in $n$ trials.
- **$\text{Multinomial}$**
  Extends the two outcome categories of the Binomial distribution to more than two categories.
- **$\text{Negative-binomial}$**
  The number of independent Bernoulli trials (with replacement) needed to reach a specified number of successes.
  $\to$ Find the probability that the $k$-th designated outcome occurs exactly on trial $W$.
- **$\text{Hypergeometric}$**
  A sequence of draws without replacement ($\implies$ not independent).
  $\to$ Among $N$ objects of which $k$ are designated, find the probability of getting $X$ designated objects in $n$ draws.
- **$\text{Poisson}$**
  A limiting form of the Binomial distribution. Borrowing the idea of infinitesimals from calculus, divide a time span $T$ into $n$ equal segments and let $n \to \infty$, so that each tiny segment contains at most one event.
  $\to$ Find the probability that the event occurs $X$ times within the time span.

<img src='https://theacetutors.com/blogImages/Blog25.png'>

---

## Part 1. Binomial

### 1.1 Bernoulli process

- **Every trial is independent**
- **Every trial is a Bernoulli trial** $\to$ the outcomes are mutually exclusive and there are only two of them (either "success" or "failure")
- **$p = P(\text{success})$ is a constant in each trial** $\to$ every trial has the same probability of success

### 1.2 Probability distribution function

The random variable $X$ is the total number of successes in a Bernoulli process of $n$ trials with $p = P(\text{success})$.

Probability distribution function of Binomial

$$
P(X = i) = \binom{n}{i} p^i (1 - p)^{n-i},\ i = 0, 1, \dots, n
$$

### 1.3 Expected value and variance

==If $X \sim \text{Binomial}(n, p)$, then $E(X) = np$ and $Var(X) = np(1-p)$==

To see why, define an indicator variable for each trial:

$$
I_k =
\begin{cases}
1, & \text{if trial $k$ results in a success} \\
0, & \text{otherwise}
\end{cases}
$$

Then $E(I_k) = p(1) + (1-p)(0) = p$. Since $I_k^2 = I_k$, we also have $E(I_k^2) = p$, so $Var(I_k) = E(I_k^2) - [E(I_k)]^2 = p - p^2 = p(1 - p)$.

$\because \ X = I_1 + \cdots + I_n \text{ where } I_1, \dots, I_n \text{ are independent}$

$\therefore \ E(X) = E(I_1) + \cdots + E(I_n) = np$ and $Var(X) = Var(I_1) + \cdots + Var(I_n) = np(1 - p)$

---

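
The Binomial formulas from Part 1 can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `binomial_pmf` and the values `n = 10`, `p = 0.3` are illustrative assumptions, not taken from the text. It evaluates the probability function for every $i$ and compares the resulting mean and variance with $np$ and $np(1-p)$.

```python
from math import comb


def binomial_pmf(i: int, n: int, p: float) -> float:
    """P(X = i) for X ~ Binomial(n, p)."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)


# Illustrative parameters (assumed for this example only).
n, p = 10, 0.3
pmf = [binomial_pmf(i, n, p) for i in range(n + 1)]

# The probabilities over i = 0..n should add up to 1.
total = sum(pmf)

# Mean and variance computed directly from the definition.
mean = sum(i * q for i, q in enumerate(pmf))
var = sum((i - mean) ** 2 * q for i, q in enumerate(pmf))

print(f"sum of pmf = {total:.6f}")
print(f"E(X)   = {mean:.6f}  vs  np      = {n * p:.6f}")
print(f"Var(X) = {var:.6f}  vs  np(1-p) = {n * p * (1 - p):.6f}")
```

The directly computed moments should agree with the closed forms up to floating-point rounding, which is what the indicator-variable argument in 1.3 predicts.
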
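## Part 2. Multinomial

### 2.1 Example

In the binomial setting, suppose $n$ balls are thrown independently, each into one of two boxes. Let $p$ be the probability that a ball enters box 1 and let $X$ be the number of balls in box 1 $\implies X \sim \text{Binomial}(n,p)$.

If there are instead $k > 2$ boxes, with $p_i = P(\text{each ball enters box } i),\ i = 1, \dots, k \implies \sum^k_{i=1}p_i = 1$, and $X_i$ is the number of balls in box $i$, then:

==$(X_1, X_2, \dots, X_k) \sim \text{Multinomial}(n, p_1, p_2, \dots, p_k)$==

### 2.2 Probability distribution function

Probability distribution function of Multinomial

$$
P(X_1 = x_1, \dots, X_k = x_k) = \dfrac{n!}{x_1!\cdots x_k!} p_1^{x_1} \cdots p_k^{x_k}, \ 0 \le x_i \le n, \ \sum^k_{i=1}x_i = n
$$

### 2.3 Conditional distribution, expected value, variance, and covariance

#### Given $X_1 = m$

Once $X_1 = m$ is given, the remaining $n - m$ balls cannot be in box 1, so each of them lands in box $j \ge 2$ with probability $\frac{p_j}{1-p_1}$:

$$
(X_2, X_3, \dots, X_k \mid X_1 = m) \sim \text{Multinomial}\left(n-m, \frac{p_2}{1-p_1}, \frac{p_3}{1-p_1}, \cdots, \frac{p_k}{1-p_1}\right)
$$

#### Expected value and variance

As in the binomial case, each count is marginally binomial: $X_j \sim \text{Binomial}(n, p_j) \text{ for each } j$. The expected value is therefore $E(X_j) = np_j$ and the variance is $Var(X_j) = np_j(1-p_j)$.

Merging two boxes gives the same structure, so for $i \ne j$: $(X_i + X_j) \sim \text{Binomial}(n, p_i + p_j)$

#### Covariance Matrix

For $i \ne j$, the counts are constrained by $\sum^k_{l=1}X_l = n$, and $Cov(X_i, X_j) = -np_ip_j$. Proof:

$\underbrace{Var(X_i)}_{np_i(1-p_i)} + \underbrace{Var(X_j)}_{np_j(1-p_j)} + 2Cov(X_i, X_j) = Var(X_i + X_j) = n(p_i+p_j)(1-p_i-p_j)$

$\implies 2Cov(X_i, X_j) = n\left[(p_i+p_j)(1-p_i-p_j) - p_i(1-p_i) - p_j(1-p_j)\right] = -2np_ip_j$

$\implies Cov(X_i, X_j) = -np_ip_j$

---

## Part 3. Negative-binomial

Let $W$ be the trial on which the $k$-th designated event occurs, where each trial produces the designated event with probability $p \implies$ ==$W \sim \text{Negative-binomial}(k, p)$==

#### Range of $W$ : $W \in \{k, k+1, \dots\}$

### 3.1 Probability distribution function

Probability distribution function of Negative-binomial

$$
P(W = i) = \left[ \binom{i-1}{k-1} p^{k-1} (1-p)^{i-k} \right] \cdot p = \binom{i-1}{k-1} p^{k} (1-p)^{i-k} \text{, } \ i=k, k+1, \dots
$$

The bracket is the probability of exactly $k-1$ successes in the first $i-1$ trials, and the trailing factor $p$ is the success on trial $i$ itself.

### 3.2 Expected value and variance

$E(W) = \dfrac{k}{p}$

> - $W = G_1 + \cdots + G_k$, where $G_j$ is the number of trials after the $(j-1)$-th success up to and including the $j$-th success. $G_1, \dots, G_k$ are independent $\text{Geometric}(p)$ with $E(G_j) = \frac{1}{p}$ and $Var(G_j) = \frac{1-p}{p^2}$
> - $E(W) = E(G_1 + \cdots + G_k) = E(G_1) + \cdots + E(G_k) = \frac{k}{p}$

$Var(W) = \dfrac{k(1-p)}{p^2}$

> - $Var(W) = Var(G_1 + \cdots + G_k) = Var(G_1) + \cdots + Var(G_k) = \frac{k(1-p)}{p^2}$

### 3.3 Geometric($p$) distribution

If $k = 1$, $W$ is the number of trials needed to obtain the first designated outcome, which is the geometric distribution.

---

## Part 4. Hypergeometric

From a total of $N$ objects that include $k$ designated objects, draw $n$ times without replacement and let $Y$ be the number of designated objects drawn $\implies$ ==$Y \sim \text{Hypergeometric}(N,k,n)$==

**Range of $Y$ (possible values)** : $\max(0, n-(N-k)) \le Y \le \min(n,k)$

### 4.1 Probability distribution function

Probability distribution function of Hypergeometric

$$
P(Y=y) = \dfrac{\binom{k}{y} \cdot \binom{N-k}{n-y}}{\binom{N}{n}} \text{ , } \max(0, n-(N-k)) \le y \le \min(k,n)
$$

### 4.2 Expected value and variance

By analogy with the Binomial, taking $p = \frac{k}{N}$: $E(Y) = \overbrace{n\left(\frac{k}{N}\right)}^{np}$ and $Var(Y) = \left(\dfrac{N-n}{N-1}\right) \ \overbrace{n\left(\frac{k}{N}\right)\left(1-\frac{k}{N}\right)}^{np(1-p)}$

The extra factor $\frac{N-n}{N-1}$ is the finite population correction that accounts for sampling without replacement.

### 4.3 Multivariate hypergeometric distribution

When the population contains more than two kinds of objects, the counts of each kind in the sample follow a Multivariate hypergeometric distribution.

---

## Part 5. Poisson

The Poisson distribution is typically used for problems in **time or space**, where the outcomes occur during a time interval or within a specific region.

### 5.1 Poisson process

- **Events occurring in non-overlapping time intervals are independent random variables**
  $\implies$ what happens in different intervals is unrelated
- **The probability that one event occurs in a short interval is proportional to the length of that interval**
  $\implies P(\text{one event happens during } (s, s + dt]) = \frac{e^{-\lambda dt} \cdot (\lambda dt)^1}{1!} \approx [1-(\lambda dt)] \cdot (\lambda dt) = (\lambda dt) - (\lambda dt)^2 \approx (\lambda dt)$
- **The probability that more than one event occurs in the same short interval is negligible**
  $\implies P(\text{no events happen during } (s, s + dt]) = \frac{e^{-\lambda dt} \cdot (\lambda dt)^0}{0!} \approx [1-(\lambda dt)]$ and $P(\text{more than one event happens during } (s, s + dt]) \approx 0$

### 5.2 Probability distribution function

Over a time interval (or region) of length (or area, volume) $t$, let $X$ be the number of outcomes of the Poisson process.

==$X \sim \text{Poisson}(\lambda t)$==

> $\lambda$ is the rate of occurrence of outcomes (average number of outcomes per unit time, area, or volume).

#### Range of $X$ : $X \in \{0, 1, 2, \dots\}$

Probability distribution function of Poisson

$$
P(X = x) = \dfrac{e^{-\lambda t} (\lambda t)^x}{x!} \text{ , } x = 0, 1, 2, \dots
$$

### 5.3 Expected value and variance

$E(X) = \lambda t$ and $Var(X) = \lambda t$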
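
As a numerical sanity check on the formulas in Parts 3 to 5, the sketch below evaluates the Negative-binomial, Hypergeometric, and Poisson probability functions with the Python standard library and compares the computed means and variances with the closed forms. It also illustrates the limiting idea from the Abstract by comparing $\text{Binomial}(m, \lambda t / m)$ with $\text{Poisson}(\lambda t)$ as the number of segments $m$ grows. All function names, the helper `mean_var`, and the parameter values are assumptions chosen for illustration, and the infinite ranges are truncated where the tail is negligible.

```python
from math import comb, exp, factorial


def neg_binomial_pmf(i: int, k: int, p: float) -> float:
    """P(W = i): the k-th success lands exactly on trial i (i >= k)."""
    return comb(i - 1, k - 1) * p**k * (1 - p) ** (i - k)


def hypergeometric_pmf(y: int, N: int, k: int, n: int) -> float:
    """P(Y = y): y designated objects in n draws without replacement."""
    return comb(k, y) * comb(N - k, n - y) / comb(N, n)


def poisson_pmf(x: int, mu: float) -> float:
    """P(X = x) for X ~ Poisson(mu), where mu = lambda * t."""
    return exp(-mu) * mu**x / factorial(x)


def binomial_pmf(i: int, m: int, q: float) -> float:
    """P(X = i) for X ~ Binomial(m, q)."""
    return comb(m, i) * q**i * (1 - q) ** (m - i)


def mean_var(pairs):
    """Mean and variance from (value, probability) pairs."""
    pairs = list(pairs)
    mean = sum(v * q for v, q in pairs)
    var = sum((v - mean) ** 2 * q for v, q in pairs)
    return mean, var


# Negative-binomial(k, p): truncate the infinite range at 400 trials.
k, p = 3, 0.4
nb_mean, nb_var = mean_var((i, neg_binomial_pmf(i, k, p)) for i in range(k, 400))

# Hypergeometric(N, K, n): y runs from max(0, n-(N-K)) to min(n, K).
N, K, n = 20, 7, 5
lo, hi = max(0, n - (N - K)), min(n, K)
hg_mean, hg_var = mean_var(
    (y, hypergeometric_pmf(y, N, K, n)) for y in range(lo, hi + 1)
)

# Poisson(lambda * t) with an assumed rate of 0.5 per unit time over 4 units.
lam, t = 0.5, 4.0
mu = lam * t
po_mean, po_var = mean_var((x, poisson_pmf(x, mu)) for x in range(60))


def max_gap(m: int) -> float:
    """Largest pmf difference between Binomial(m, mu/m) and Poisson(mu) on 0..10."""
    return max(
        abs(binomial_pmf(x, m, mu / m) - poisson_pmf(x, mu)) for x in range(11)
    )


gaps = {m: max_gap(m) for m in (10, 100, 10_000)}

print(f"Negative-binomial: mean {nb_mean:.4f} vs k/p {k / p:.4f}, "
      f"var {nb_var:.4f} vs k(1-p)/p^2 {k * (1 - p) / p**2:.4f}")
print(f"Hypergeometric:    mean {hg_mean:.4f} vs n(k/N) {n * K / N:.4f}, "
      f"var {hg_var:.4f}")
print(f"Poisson:           mean {po_mean:.4f}, var {po_var:.4f}, lambda*t {mu:.4f}")
for m, gap in gaps.items():
    print(f"segments m = {m:>6}: max |Binomial - Poisson| = {gap:.2e}")
```

The gap in the last loop should shrink as `m` increases, which matches the description of the Poisson distribution as the limit obtained by cutting the interval into ever smaller segments.
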