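\(\hat p\) = sample proportion
\(\mu\) = population mean = measure of central tendency
\(\sigma\) = population standard deviation = measure of dispersion (\(\sigma^2\) is the population variance)
\(p\) = population proportion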
https://www.myclass-lin.org/wordpress/archives/615
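Random variable
Given a measurable space \((S,{\mathbb {F}})\), if a real-valued function \(X:S \to {\mathbb {R}}\) defined on it is \(\mathbb{F}\)-measurable, then \(X\) is called a (real-valued) random variable.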
A random variable is a measurable function \({\displaystyle X\colon \Omega \to E}\) from a set of possible outcomes \(\Omega\) to a measurable space \(E\).
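Algebraic properties
\(\sigma^2={1 \over N}\sum_{i=1}^N(X_i-\mu)^2\)
Expanding the square and rearranging gives
\(\sum_{i=1}^N X_i^2=N\sigma^2+N\mu^2\)
Equivalently, "\(\sigma^2=\) the expectation of the square minus the square of the expectation":
\(\sigma^2=\sum_x x^2 f(x)-\mu^2\)
The sample variance satisfies the analogous identity
\(\sum_{i=1}^n x_i^2=(n-1)s^2+n \bar x^2\)
Translation invariance: \(Var(X+c)=Var(X)\)
Quadratic scaling: \(Var(aX)=a^2Var(X)\)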
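A quick numerical check of the sum-of-squares identity \(\sum X_i^2 = N\sigma^2 + N\mu^2\) and of the translation and scaling properties of the variance. The population values below are made up for illustration; this is a sketch, not part of the original notes.

```python
# Hypothetical finite population (illustrative values only).
population = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
N = len(population)


def pop_mean(xs):
    return sum(xs) / len(xs)


def pop_var(xs):
    """Population variance: average squared deviation from the mean."""
    m = pop_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


mu = pop_mean(population)
sigma2 = pop_var(population)

# Sum-of-squares identity: sum X_i^2 = N*sigma^2 + N*mu^2
sum_sq = sum(x * x for x in population)
identity_rhs = N * sigma2 + N * mu ** 2

# Translation invariance: adding a constant leaves the variance unchanged.
var_shifted = pop_var([x + 10.0 for x in population])
# Quadratic scaling: multiplying by a multiplies the variance by a^2.
var_scaled = pop_var([3.0 * x for x in population])
```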
\(\sigma_{X,Y}={Cov}(X,Y)\)
\(=\Sigma_y\Sigma_x(x-\mu_X)(y-\mu_Y)f(x,y)\)
\(={E}((X-{\mu}_X)(Y-{\mu}_Y))\) (definition)
\(={E}(XY-{\mu}_X \cdot Y-{\mu}_Y \cdot X+{\mu_X}{\mu}_Y)\)
\(=E(XY)-\mu_X \cdot E(Y) - \mu_Y \cdot E(X)+\mu_X \mu_Y\)
\(=E(XY)-\mu_X \mu_Y\)
\(=E(XY)-E(X)E(Y)\) (computational formula)
Left as an exercise to prove:
\(Var(aX+bY)=a^2Var(X)+b^2Var(Y)+2ab \cdot Cov(X,Y)\)
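The two covariance formulas and the variance of a linear combination can be checked numerically. The sketch below treats a small, made-up list of \((x, y)\) pairs as an equally weighted population; the data and the constants `a`, `b` are assumptions for illustration.

```python
# Hypothetical paired population values (illustrative only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0]


def mean(v):
    return sum(v) / len(v)


def cov(u, v):
    """Population covariance by the definition E[(U - mu_U)(V - mu_V)]."""
    mu_u, mu_v = mean(u), mean(v)
    return sum((p - mu_u) * (q - mu_v) for p, q in zip(u, v)) / len(u)


# Definition versus computational formula E(XY) - E(X)E(Y).
cov_def = cov(xs, ys)
cov_calc = mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

# Var(aX + bY) computed directly and via the expansion with the covariance term.
a, b = 2.0, -3.0
combo = [a * x + b * y for x, y in zip(xs, ys)]
var_direct = cov(combo, combo)  # Var(W) = Cov(W, W)
var_formula = a ** 2 * cov(xs, xs) + b ** 2 * cov(ys, ys) + 2 * a * b * cov_def
```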
https://zh.wikipedia.org/wiki/皮尔逊积矩相关系数
Correlation Coefficient
\(\rho_{X,Y}={\mathrm{cov}(X,Y) \over \sigma_X \sigma_Y} ={E[(X-\mu_X)(Y-\mu_Y)] \over \sigma_X\sigma_Y}\)
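Perfect positive correlation: \(\rho_{X,Y}=1\)
Positive correlation: covariance > 0
Negative correlation, for reference: the Phillips curve
Population correlation coefficient: \(\rho_{X,Y}=Corr(X,Y)\)
Population covariance: \(\sigma_{X,Y}=Cov(X,Y)\)
Sample covariance: \(\hat{S}_{x,y}={1 \over {n-1}}\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)\)
Sample correlation coefficient: \(\hat{r}_{x,y}\)
We want to infer the population from the sample.
\(S_{xy}=\sum_{i=1}^n x_iy_i- n\bar x\bar y\), i.e. \(\Sigma(x_i- \bar x)(y_i- \bar y)\)
\(S_{xx}=\sum_{i=1}^n x_i^2- n(\bar x)^2\), i.e. \(\Sigma(x_i- \bar x)^2\)
\(S_{yy}=\sum_{i=1}^n y_i^2- n(\bar y)^2\), i.e. \(\Sigma(y_i- \bar y)^2\)
\(\hat{r}_{x,y}={S_{xy} \over \sqrt{S_{xx} S_{yy}}}\)
Sample standard deviation: \(s_x=\sqrt{S_{xx} \over {n-1}}\)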
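A minimal sketch of the sample correlation coefficient computed from the sums \(S_{xy}\), \(S_{xx}\), \(S_{yy}\), using \(r = S_{xy} / \sqrt{S_{xx} S_{yy}}\). The function name and the data are hypothetical.

```python
import math


def sample_correlation(xs, ys):
    """Sample correlation r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    s_yy = sum(y * y for y in ys) - n * y_bar ** 2
    return s_xy / math.sqrt(s_xx * s_yy)


# y = 2x + 1 exactly, so the correlation is perfect and positive.
r_perfect = sample_correlation([1, 2, 3, 4], [3, 5, 7, 9])
# y falls as x rises, so the correlation is negative.
r_negative = sample_correlation([1, 2, 3, 4], [8, 6, 5, 1])
```

The factor \(n-1\) cancels between numerator and denominator, which is why the raw sums are enough.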
https://zh.wikipedia.org/wiki/切比雪夫不等式
\(P( |X- \mu| \lt z \sigma) \ge 1 - {1 \over z^2}\)
By Markov's inequality
For a nonnegative random variable \(Y\) and \(a \gt 0\), we have \(P(Y \ge a) \le {E(Y) \over a}\). Take \(Y = (X-\mu)^2\) and replace \(a\) by \(a^2\):
\(\Rightarrow P((X-\mu)^2 \ge a^2) \le {E((X-\mu)^2) \over a^2}\)
\(\Rightarrow P(|X-\mu| \ge a) \le {Var(X) \over a^2}\)
\(\Rightarrow P(|X-\mu| \ge a) \le {\sigma^2 \over a^2}\)
Replacing \(a\) by \(a\sigma\):
\(\Rightarrow P( |X- \mu| \ge a \sigma) \le {1 \over a^2}\)
That is Chebyshev's Theorem!
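Chebyshev's bound holds for any distribution with finite variance, including the empirical distribution of a data set. The sketch below checks \(P(|X-\mu| \ge k\sigma) \le 1/k^2\) on a deliberately skewed, made-up data set.

```python
# Hypothetical skewed data set (illustrative only).
data = [1, 1, 2, 2, 2, 3, 3, 4, 5, 9, 15, 30]
n = len(data)
mu = sum(data) / n
sigma = (sum((x - mu) ** 2 for x in data) / n) ** 0.5


def tail_fraction(k):
    """Fraction of observations at least k standard deviations from the mean."""
    return sum(1 for x in data if abs(x - mu) >= k * sigma) / n


# For each k, pair the observed tail fraction with the Chebyshev bound 1/k^2.
checks = {k: (tail_fraction(k), 1 / k ** 2) for k in (1.5, 2, 3)}
```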
e.g.:
Sex | NTU | NSYSU | NCCU | Total (count) |
---|---|---|---|---|
Male | 30 | 66 | 234 | 330 |
Female | 18 | 42 | 210 | 270 |
Total | 48 | 108 | 444 | 600 |
Contingency table
Sex | NTU | NSYSU | NCCU | Probability |
---|---|---|---|---|
Male | 0.05 | 0.11 | 0.39 | 0.55 |
Female | 0.03 | 0.07 | 0.35 | 0.45 |
Probability | 0.08 | 0.18 | 0.74 | 1 |
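Marginal probability: in a sample space with two or more events, the probability that one event occurs on its own, regardless of the others, is called a marginal probability.
In the table above, these are the rightmost column and the bottom row.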
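The joint and marginal probabilities in the table can be reproduced from the raw counts. This is a small sketch; the dictionary keys are English labels chosen here for the table's rows (male, female) and columns (the three universities).

```python
# Counts from the first table; keys are English labels for its rows and columns.
counts = {
    "male": {"NTU": 30, "NSYSU": 66, "NCCU": 234},
    "female": {"NTU": 18, "NSYSU": 42, "NCCU": 210},
}
schools = ["NTU", "NSYSU", "NCCU"]

total = sum(sum(row.values()) for row in counts.values())

# Joint probability of each (sex, school) cell.
joint = {
    sex: {school: c / total for school, c in row.items()}
    for sex, row in counts.items()
}

# Marginals: sum the joint probabilities across a row or down a column.
row_marginal = {sex: sum(row.values()) for sex, row in joint.items()}
col_marginal = {s: sum(joint[sex][s] for sex in joint) for s in schools}
```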
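Independent events: left for self-study.
\(P(A|M)\): read as "the probability of \(A\) given \(M\)"
When computing probabilities for a discrete distribution, pay attention to whether the inequality includes the equals sign (\(\lt\) versus \(\le\)).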
axiom:
Bayes' theorem:
Let \(A_1,A_2,\dots,A_n\) be a partition of \(\Omega\) and let \(B\) be any event in \(\Omega\) with \(P(B) \gt 0\). Then \(P(A_i|B)=\frac{P(B|A_i)P(A_i)}{\sum_{j=1}^{n}P(B|A_j)P(A_j)}\)
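As a worked illustration of Bayes' theorem, take the three universities in the table as the partition \(A_i\) and "the student is male" as the event \(B\). The priors and likelihoods below are read off the counts in the table; the function name `posterior` is hypothetical.

```python
# Priors P(A_i): share of all 600 students at each school.
prior = {"NTU": 48 / 600, "NSYSU": 108 / 600, "NCCU": 444 / 600}
# Likelihoods P(B | A_i): share of male students within each school.
likelihood = {"NTU": 30 / 48, "NSYSU": 66 / 108, "NCCU": 234 / 444}


def posterior(prior, likelihood):
    """Return P(A_i | B) for every A_i in the partition."""
    evidence = sum(likelihood[a] * prior[a] for a in prior)  # P(B), total probability
    return {a: likelihood[a] * prior[a] / evidence for a in prior}


post = posterior(prior, likelihood)
# post["NTU"] is P(NTU | male), which should match 30 / 330 from the table.
```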
Expectation
\(E(X)=\mu\)
\(Var(X)=\sigma^2 = E[(x-\mu)^2]\)
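r.v. \(X\), \(X \sim B(n,p)\), where \(\sim\) reads "follows" (is distributed as)
\(f_X(x)=\begin{cases} C^{n}_{x}p^x(1-p)^{n-x}, & x \in \{0,1,\dots,n\} \\ 0, & \text{otherwise} \end{cases}\)
\(p\): probability of success
Binomial distribution: when \(n = 1\) it is the Bernoulli distribution
Let \(X\) be a discrete r.v.; then \(f_X(x)=\begin{cases} P(X=x), & x\in R_X \\ 0, & x \notin R_X \end{cases}\), where \(R_X\) is the range of \(X\)
\(f_{XY}(x,y)=\begin{cases} P(X=x,Y=y), & (x,y)\in R_{XY} \\ 0, & (x,y)\notin R_{XY} \end{cases}\)
The instructor prefers this convention: whenever you write P(), describe the complete event inside the parentheses, so write things like P(Z<z) or f(x)…
* class P(Event);
* class f(var);
\(f(z) \ne P(Z<z)\)
\(f(z)\) is the probability density at a single point
\(P(Z<z)\) is the probability of an event
Closure under addition (a sum of independent variables stays in the same family): the Poisson and normal distributions have it for any parameters; the binomial has it only when the summands share the same \(p\)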
\[P(x) = p^xq^{1-x}\]
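iid: independent and identically distributed
Definition
The discrete probability distribution of the number of successes in \(n\) independent yes/no trials, each with success probability \(p\), is the binomial distribution.
\[P(x) = {n\choose x} p^x q^{n-x}\]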
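A short sketch of the binomial probability function \(P(x) = {n \choose x} p^x q^{n-x}\). The values of `n` and `p` are arbitrary; the mean \(np\) and variance \(npq\) used in the checks are standard facts that these notes do not derive.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p); zero outside 0..n."""
    if not 0 <= x <= n:
        return 0.0
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 10, 0.3  # hypothetical parameters
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance computed directly from the probability function.
mean = sum(x * f for x, f in enumerate(pmf))
var = sum((x - mean) ** 2 * f for x, f in enumerate(pmf))

# With n = 1 the binomial reduces to the Bernoulli p^x q^(1-x).
bernoulli = [binomial_pmf(0, 1, p), binomial_pmf(1, 1, p)]
```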
Closed under addition: a sum of independent Poisson variables is again Poisson.
\[P(x) ={e^{- \lambda } \lambda^x \over x!}\]
Definition
A discrete random variable X is said to have a Poisson distribution with parameter λ > 0, if, for x = 0, 1, 2, …, the probability mass function of X is given by:
\[P(x) ={e^{- \lambda } \lambda^x \over x!}\]
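Definition
The hypergeometric probability of \(x\) successes in \(n\) draws without replacement from a population of \(N\) items that contains \(k\) successes:
\[P(x) = {{k \choose x}{N-k \choose n-x} \over {N \choose n}}\]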
Closed under addition: a sum of independent normal variables is again normal.
\[f(x) = {1 \over \sqrt{2 \pi} \sigma} e^{{-1 \over 2}({x- \mu \over \sigma})^2}\]
Definition
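The distribution of a continuous variable, drawn as a curve of probability density over its observed values, with the following properties:
It is a unimodal, bell-shaped curve that is symmetric about the mean.
The range of observed values runs from negative infinity to positive infinity.
\(X \sim N(\mu, \sigma^2)\)
The density has no closed-form antiderivative, so probabilities are looked up in a table.
Computing Probabilities for Any Normal Probability Distribution
A linear transformation of a normally distributed variable is still normally distributed.
De-standardization: if \(Z = {X-\mu \over \sigma}\), then \(X = \mu + \sigma Z\)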
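Standardizing and de-standardizing can be sketched with `statistics.NormalDist` from the Python standard library, which plays the role of the z table here. The parameters `mu`, `sigma` and the value `x` are made up for illustration.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0  # hypothetical population parameters
x = 130.0

# Standardize: Z = (X - mu) / sigma, then use the standard normal.
z = (x - mu) / sigma
p_via_z = NormalDist().cdf(z)            # P(Z < z), what a z table gives
p_direct = NormalDist(mu, sigma).cdf(x)  # P(X < x) computed directly

# De-standardize: find the 95th percentile of X from the z value.
z_95 = NormalDist().inv_cdf(0.95)
x_95 = mu + sigma * z_95
```

Both routes give the same probability, which is the reason a single standard normal table is enough for any normal distribution.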
\[f(x) = {1 \over \mu} e^{-x \over \mu}\]
https://zh.wikipedia.org/zh-tw/指数分布
Let \(τ\) be a random variable whose probability density satisfies
\(f_τ(t):=λ e^{-λt}\) if \(t \ge 0\);
\(f_τ(t):=0\) if \(t \lt 0\),
where \(λ>0\) is a constant. Then we say \(τ\) follows an exponential distribution, or that \(τ\) is an exponential random variable.
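\(E(X) = {\int}^{\infty}_0 x{1 \over \mu} e^{-x \over \mu} dx =\mu\)
\(Var(X) = \mu^2\), so \(\sigma = \mu\)
Both follow from integration by parts.
Formula: \(P(X>x_0)=e^{-x_0 \over \mu}\)
Proof: \(P(X>x_0)={\int}^{\infty}_{x_0}{1 \over \mu} e^{-x \over \mu} dx = \left[-e^{-x \over \mu}\right]^{\infty}_{x_0}=e^{-x_0 \over \mu}\)
A counting process is a Poisson process \(\iff\) its inter-arrival times follow an exponential distribution
The exponential mean \(\mu\) and the Poisson rate \(\lambda\) (mean count per unit time) are reciprocals of each other: \(\mu = 1/\lambda\)
Watch the units; sticking to one standard unit makes mistakes less likely
e.g.:
Poisson: \({e^{- \lambda } \lambda^x \over x!}\)
\(\iff\)
Exponential: \({\lambda} e^{-y \lambda}\)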
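The link between the two distributions can be checked numerically: "no event in an interval of length \(t\)" (Poisson with mean \(\lambda t\)) is the same event as "the waiting time exceeds \(t\)" (exponential with mean \(\mu = 1/\lambda\)). The rate and interval below are hypothetical.

```python
import math

lam = 2.0  # hypothetical rate: 2 events per unit time
t = 1.5    # interval length, in the same time unit


def poisson_pmf(x, mean):
    """P(X = x) for a Poisson count with the given mean."""
    return math.exp(-mean) * mean ** x / math.factorial(x)


# Probability of zero events in an interval of length t.
p_no_event = poisson_pmf(0, lam * t)

# Probability that the waiting time exceeds t, with exponential mean mu = 1/lam.
mu = 1 / lam
p_wait_longer = math.exp(-t / mu)
```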
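The distribution of a sample statistic is called a sampling distribution.
We mainly want to estimate three things:
the mean, the standard deviation, and the proportion.
These are called the statistical (population) parameters.
e.g.: \(X_1, X_2, \dots, X_n\) iid with variance \(\sigma^2\)
\(\bar{x} = {1 \over n} \sum_{i=1}^n X_i\)
\(Var(\bar x) = Var( {1 \over n} \sum X_i) = {1 \over n^2} \sum Var(X_i) = {\sigma^2 \over n}\)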
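The result \(Var(\bar x) = \sigma^2 / n\) can be verified exactly on a tiny example by enumerating every equally likely iid sample (drawn with replacement) from a small population. The population values are made up for illustration.

```python
from itertools import product

population = [1.0, 2.0, 6.0]  # hypothetical population, each value equally likely
n = 2                         # sample size


def mean(v):
    return sum(v) / len(v)


def pvar(v):
    """Variance of an equally weighted list of values."""
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)


sigma2 = pvar(population)

# All 3^2 ordered samples of size n; each is equally likely under iid sampling.
sample_means = [mean(s) for s in product(population, repeat=n)]
var_of_mean = pvar(sample_means)  # exact sampling variance of the sample mean
```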
Key point: \(\bar x\) is a convenient estimator.
\(x_1, x_2, \dots, x_n \sim^{iid} f_{X}(x; \theta)\)
Use \(\hat \theta\) to infer the population parameter \(\theta\).
An estimate and an estimator are different things; there are infinitely many possible estimators.
A hat marks an estimator.
Unbiasedness: \(Bias({\hat \theta}) = E({\hat \theta}) - \theta = 0\)
Prove \(E(s^2)-\sigma^2 =0\):
\(E(s^2) = E\left({1 \over n-1} \left(\Sigma x^2_i - n{\bar x}^2\right)\right)\)
\(= {1 \over n-1} (\Sigma E(x^2_i)-nE({\bar x}^2))\)
\(= {1 \over n-1} (\Sigma(Var(x_i)+E^2(x_i))-nE({\bar x}^2))\)
\(= {1 \over n-1} (\Sigma(\sigma^2+\mu^2)-nE({\bar x}^2))\)
\(= {1 \over n-1} (\Sigma(\sigma^2+\mu^2)-n({\sigma^2 \over n}+\mu^2))\)
\(= {1 \over n-1} (n\sigma^2+n\mu^2-\sigma^2-n\mu^2)\)
\(= {1 \over n-1} (n\sigma^2-\sigma^2)\)
\(=\sigma^2\)
For the proof, write these steps in reverse order.
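The unbiasedness of \(s^2\) can also be confirmed by exact enumeration: average the sample variance over every equally likely iid sample from a small population. The sketch uses `variance` (divisor \(n-1\)) and `pvariance` (divisor \(n\)) from the standard `statistics` module; the population is hypothetical.

```python
from itertools import product
from statistics import pvariance, variance

population = [1.0, 2.0, 6.0]  # hypothetical population, each value equally likely
n = 3                         # sample size

# Every ordered sample of size n drawn with replacement (all equally likely).
samples = list(product(population, repeat=n))

sigma2 = pvariance(population)

# Expected value of s^2 (divisor n - 1): should equal sigma^2.
expected_s2 = sum(variance(s) for s in samples) / len(samples)

# Expected value of the divisor-n version: biased, equals (n - 1)/n * sigma^2.
expected_biased = sum(pvariance(s) for s in samples) / len(samples)
```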
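Efficiency is measured by the estimator's mean squared error; the smaller it is, the more efficient the estimator.
sum of least squares
As the sample size grows, the estimate converges to the true value of the population parameter.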
A consistent estimator is one for which, when the estimate is considered as a random variable indexed by the number n of items in the data set, as n increases the estimates converge in probability to the value that the estimator is designed to estimate.
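Confidence interval (C.I.)
\([L,U]\) estimates \(\theta\) at the \((1-\alpha)100\%\) confidence level
A higher confidence level \((1-\alpha)100\%\) means a wider interval \([L, U]\), which is more likely to contain the true population \(\theta\)
\((1-\alpha)\) is the area in the middle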
\(1-\alpha = P(L \lt \theta \lt U)\)
Pivotal Quantity
Pivotal quantity:
https://en.wikipedia.org/wiki/Pivotal_quantity
A pivotal quantity or pivot is a function of observations and unobservable parameters such that the function's probability distribution does not depend on the unknown parameters.
Usually it is a point estimator standardized to follow the t or z distribution
A function of \(x_1, x_2, \dots, x_n\) and \(\theta\),
written \(Q({\hat \theta}; \theta)\), whose probability distribution does not depend on any unknown parameter
(that is, it is completely known)
\(g(\hat \theta ,\theta) = \sqrt{n}\frac{\hat \theta - \theta}{s}\)
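Why does the t distribution have \(n-1\) degrees of freedom?
Because only one unknown parameter (\(\mu\)) is estimated in this t statistic,
so the degrees of freedom are not always \(n-1\).
When \(\sigma\) is known, the pivotal quantity is z
Use the t table; when the degrees of freedom are large, the z table is a good approximation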
http://mail.tku.edu.tw/yinghaur/lee/stat-new/第十章補充–%E7%B5%B1%E8%A8%88%E4%BC%B0%E8%A8%88(%E6%AF%8D%E9%AB%94%E8%AE%8A%E7%95%B0%E6%95%B8%E4%B9%8B%E5%8D%80%E9%96%93%E4%BC%B0%E8%A8%88).pdf
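If the procedure is repeated \(k\) times, on average a fraction \(1-\alpha\) of the resulting intervals will contain the unknown parameter.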
\(X_1, X_2, ... X_n \sim^{iid} Ber(p)\)
Margin of error = \(z_{\alpha \over 2}\sqrt{\hat p(1- \hat p) \over n}\)
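A minimal sketch of a confidence interval for a proportion, \(\hat p \pm z_{\alpha/2}\sqrt{\hat p(1-\hat p)/n}\). The function name `proportion_ci` is hypothetical; the example reuses the 330 male students out of 600 from the earlier table, treated here as if it were a random sample.

```python
import math
from statistics import NormalDist


def proportion_ci(successes, n, alpha=0.05):
    """Return (lower, upper, margin) of the large-sample interval for p."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}, replaces the z table
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin


low, high, margin = proportion_ci(330, 600)

# A higher confidence level (smaller alpha) gives a wider interval.
_, _, margin_99 = proportion_ci(330, 600, alpha=0.01)
```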
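Let the sample data speak.
Power: the power of a test measures how effective the test is:
Hypothesis | Presumption of guilt | Presumption of innocence |
---|---|---|
H0 | Guilty | Innocent |
Ha | Innocent (bears the burden of proof) | Guilty |
Decision | H0 true | H0 false |
---|---|---|
Reject | \(\alpha\) type one error | 1-\(\beta\) |
Do not reject | 1-\(\alpha\) | \(\beta\) type two error |
If the problem does not specify \(\alpha\), it is usually set to 0.05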
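p-value: the tail probability of the observed sample value
A p-value is a probability that provides a measure of the evidence against the null hypothesis provided by the sample.
Smaller p-values indicate more evidence against \(H_0\).
魏丞偉's approach: drop the absolute value on the test statistic. If the test statistic is \(x\), then \(|x| > a \Rightarrow x > a\) or \(x < -a\); look up the tail probabilities above \(a\) and below \(-a\) in the table, and their sum is the p-value.
The p-value approach and the critical-value approach must reach the same conclusion.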
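The two-tailed p-value described above (the tail above \(|z|\) plus the tail below \(-|z|\)) can be sketched for a z statistic as follows. The observed value `z_obs` is made up, and `NormalDist` stands in for the z table.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Sum of the two tail probabilities beyond |z| under the standard normal."""
    upper_tail = 1 - NormalDist().cdf(abs(z))
    return 2 * upper_tail  # the two tails are equal by symmetry


alpha = 0.05
z_obs = 2.1  # hypothetical observed test statistic

p_value = two_sided_p_value(z_obs)

# Decision by p-value versus decision by critical value.
reject_by_p = p_value < alpha
reject_by_cv = abs(z_obs) > NormalDist().inv_cdf(1 - alpha / 2)
```

Both decision rules compare the same tail area, so they agree.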
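Because the sample variance is computed from the data, use the t distribution.
\(T_\nu = {Z \over \sqrt{\chi^2 \over \nu}} \sim t(\nu)\)
\(Z\) is a standard normal variable
\(\nu\) is the degrees of freedom
\(\chi^2\) is a chi-square variable with \(\nu\) degrees of freedom, independent of \(Z\)
One-tailed test
\(\mu_0-{\sigma \over \sqrt{n}}\mathcal{z}_\alpha = \mu_a+{\sigma \over \sqrt{n}}\mathcal{z}_\beta\)
The left-tailed and right-tailed cases are interchangeable, so the left-tailed test is shown here; the calculation is the same.
Hence, \(n={\sigma^2(\mathcal{z}_\alpha+\mathcal{z}_\beta)^2 \over (\mu_0 - \mu_a)^2}\)
Note that \(\alpha\) here may need to be divided by 2 for a two-tailed test
Intuition: the cutoff computed from \(\alpha\) and the cutoff computed from \(\beta\) must be the same point, and that cutoff in turn determines what \(\alpha\) is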
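A sketch of the sample-size formula \(n = \sigma^2 (z_\alpha + z_\beta)^2 / (\mu_0 - \mu_a)^2\) for a left-tailed test, followed by a check that the resulting power reaches \(1-\beta\). The function name and all numeric inputs are hypothetical.

```python
import math
from statistics import NormalDist


def required_n(mu0, mua, sigma, alpha, beta):
    """Smallest integer n meeting the one-tailed alpha and beta targets."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(1 - beta)
    n = sigma ** 2 * (z_alpha + z_beta) ** 2 / (mu0 - mua) ** 2
    return math.ceil(n)  # round up so both error targets are met


mu0, mua, sigma = 100.0, 95.0, 10.0  # hypothetical values, with mua < mu0
alpha, beta = 0.05, 0.10
n = required_n(mu0, mua, sigma, alpha, beta)

# Left-tailed cutoff from alpha, then the power evaluated at mu_a.
se = sigma / math.sqrt(n)
cutoff = mu0 - NormalDist().inv_cdf(1 - alpha) * se
power = NormalDist(mua, se).cdf(cutoff)  # P(reject H0 | mu = mu_a)
```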
Recall: \(\bar x_1 - \bar x_2 \to \mu_1 - \mu_2\)
\(a\bar x_1 - b\bar x_2 \sim N(a\mu_1 - b\mu_2, {(a\sigma_1)^2\over n_1} + {(b\sigma_2)^2\over n_2})\)
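Likewise, \(Var(aX+bY)=a^2Var(X)+b^2Var(Y)+2ab \cdot Cov(X,Y)\) (with independent samples the covariance term is 0)
Then substitute the variables following the same pattern:
\(\sigma = \sqrt{{(a\sigma_1)^2\over n_1} + {(b\sigma_2)^2\over n_2}}\), which I personally call coSigma
In hypothesis testing a constant is needed on the right-hand side (wording to be improved), so move the variables to the left-hand side before testing.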
\(H_0: \mu_0 > \mu_1\)
\(\to \mu_0 - \mu_1 > 0\)
\(power = 1- \beta\)
Use the t distribution
Homogeneous variance assumption: \(\sigma_1 = \sigma_2\)
\(S_p^2 = \hat\sigma^2 = {{(n_1 - 1)S_1^2+(n_2 - 1)S_2^2} \over {n_1 + n_2 - 2}}\)
Substituting:
Test statistic \(TS = {{(\bar x_1 - \bar x_2)-(\mu_1 - \mu_2)} \over \sqrt{S^2_p({1 \over n_1}+{1 \over n_2})}}\)
Degrees of freedom: \(n_1 + n_2 - 2\)
Without the equal-variance assumption, the test statistic is \(TS = {{(\bar x_1 - \bar x_2)-(\mu_1 - \mu_2)} \over \sqrt{{{s_1^2}\over n_1}+{{s_2^2}\over n_2}}}\)
Degrees of freedom (rounded down to an integer):
\(df = {({{s_1^2 \over n_1}+{s_2^2 \over n_2}})^2 \over {1 \over n_1-1}({s_1^2 \over n_1})^2+{1 \over n_2-1}({s_2^2 \over n_2})^2}\)
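The two test statistics can be compared side by side. This sketch computes the pooled statistic with \(n_1+n_2-2\) degrees of freedom and the unequal-variance statistic with the Welch-Satterthwaite degrees of freedom, \(df = (v_1+v_2)^2 / (v_1^2/(n_1-1) + v_2^2/(n_2-1))\) where \(v_i = s_i^2/n_i\). Function names and summary statistics are hypothetical; the t table lookup itself is not included.

```python
import math


def pooled_t(mean1, s1, n1, mean2, s2, n2, delta0=0.0):
    """Equal-variance test statistic and its degrees of freedom."""
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    t = ((mean1 - mean2) - delta0) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def welch_t(mean1, s1, n1, mean2, s2, n2, delta0=0.0):
    """Unequal-variance test statistic and (unrounded) degrees of freedom."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = ((mean1 - mean2) - delta0) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical summary statistics: (mean, standard deviation, size) per sample.
t_pooled, df_pooled = pooled_t(52.0, 4.0, 12, 48.0, 9.0, 15)
t_welch, df_welch = welch_t(52.0, 4.0, 12, 48.0, 9.0, 15)
df_welch_floor = math.floor(df_welch)  # round down before using the t table
```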
(Paired samples) dependent populations
Matched samples, in pairs!
e.g.: treatment group and control group
Sample size: \(n\)
\(d_k = {x_1}_k - {x_2}_k\)
\({\Sigma d_k \over n}= \bar D\)
\(S_D^2 = {\Sigma(d_k- \bar D)^2 \over n-1}\)
\(H_0: \mu_D = C\)
\(T = {{\bar D - \mu_D} \over {S_D \over \sqrt{n}}} \sim T(n-1)\)
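A worked sketch of the paired statistic, using \(S_D^2 = \Sigma(d_k - \bar D)^2/(n-1)\) and testing \(H_0: \mu_D = 0\). The five matched pairs are made-up data.

```python
import math

# Hypothetical matched pairs, e.g. treatment and control measurements.
x1 = [12.0, 15.0, 11.0, 14.0, 13.0]
x2 = [10.0, 14.0, 10.0, 11.0, 12.0]

d = [a - b for a, b in zip(x1, x2)]  # pairwise differences d_k
n = len(d)
d_bar = sum(d) / n
s_d = math.sqrt(sum((dk - d_bar) ** 2 for dk in d) / (n - 1))

c = 0.0                                 # hypothesized mean difference under H0
t = (d_bar - c) / (s_d / math.sqrt(n))  # compare with the t table at df = n - 1
df = n - 1
```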
\(\bar p_1 - \bar p_2 \sim N(p_1-p_2, {p_1q_1 \over n_1}+{p_2q_2 \over n_2})\) (approximately, for large samples)
Since \(p_1\) and \(p_2\) are unknown, the variance is computed with \({\bar p_1}\) and \({\bar p_2}\) instead
If \(H_0: p_1 = p_2 = p\):
\(\bar p = {{n_1 \bar p_1 + n_2 \bar p_2} \over {n_1 + n_2}}\)
\(\sigma = \sqrt{\bar p \bar q({1 \over n_1}+{1 \over n_2})}\), where \(\bar q = 1 - \bar p\)
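A sketch of the pooled two-proportion z test under \(H_0: p_1 = p_2\), with standard error \(\sqrt{\bar p \bar q (1/n_1 + 1/n_2)}\). As an illustration it compares the share of men (234 of 330) and women (210 of 270) at NCCU from the earlier table, treating them as two independent samples; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic and two-sided p-value for H0: p1 = p2."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # equals (n1*p1 + n2*p2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p_value = two_proportion_z(234, 330, 210, 270)
reject_at_5pct = p_value < 0.05
```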
Chi-Square symbol: \({\chi}^2\)
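Derivation:
\(s^2 = {1 \over n-1}\Sigma(x_i- \bar x)^2\)
\(\Rightarrow (n-1) s^2 = \Sigma(x_i- \bar x)^2\)
\(\Rightarrow {(n-1) s^2 \over \sigma^2} = {\Sigma(x_i- \bar x)^2 \over \sigma^2} \sim {\chi}^2_{(n-1)}\), distributed as a sum \(Z^2_1+Z^2_2+ ... +Z^2_{n-1}\) of \(n-1\) independent squared standard normals (for a normal population)
The chi-square family is not closed under scaling!
\(c \cdot {\chi}^2 \notin {\chi}^2, \forall c \gt 0, c \ne 1\)
\(E(\chi^2) = df\)
The expected value of a chi-square variable equals its degrees of freedom
\(Var(\chi^2) = 2df\)
The variance of a chi-square variable equals twice its degrees of freedom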
Test statistic:
\(TS = {(n-1)s^2 \over \sigma^2_0} \sim {\chi}^2_{(n-1)}\)
Because, with probability \(1-\alpha\),
\(\chi^2_{1-{\alpha \over 2}} \le TS \le \chi^2_{\alpha \over 2}\)
\(\Rightarrow {(n-1)s^2 \over \chi^2_{\alpha \over 2}} \le \sigma^2 \le {(n-1)s^2 \over \chi^2_{1-{\alpha \over 2}}}\)
This is just rearranging.
Then we can say \(\sigma^2\) lies in this interval with \((1-\alpha)100\%\) confidence!
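A minimal sketch of the interval \({(n-1)s^2 / \chi^2_{\alpha/2}} \le \sigma^2 \le {(n-1)s^2 / \chi^2_{1-\alpha/2}}\). The Python standard library has no chi-square quantile function, so the two critical values are passed in as read from a chi-square table (here for 9 degrees of freedom and \(\alpha = 0.05\)). The sample variance and size are made up.

```python
def variance_ci(s2, n, chi2_upper, chi2_lower):
    """Confidence interval for sigma^2 given table critical values.

    chi2_upper is the critical value with alpha/2 in the right tail,
    chi2_lower the one with 1 - alpha/2 in the right tail.
    """
    return (n - 1) * s2 / chi2_upper, (n - 1) * s2 / chi2_lower


# Hypothetical sample: n = 10, s^2 = 4.0; table values for df = 9, alpha = 0.05.
low, high = variance_ci(s2=4.0, n=10, chi2_upper=19.023, chi2_lower=2.700)

# An interval for sigma itself follows by taking square roots of both ends.
sigma_low, sigma_high = low ** 0.5, high ** 0.5
```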
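F-distribution
Required conditions:
\(X \sim F({df}_1, {df}_2)\)
\({df}_1 = n_1 - 1\)
\({df}_2 = n_2 - 1\)
An F-distributed random variable is the ratio of two chi-square variables, each divided by its degrees of freedom:
\({U_1/d_1 \over U_2/d_2} = {U_1/U_2 \over d_1/d_2}\)
where \(U_1 \sim \chi^2_{d_1}, U_2 \sim \chi^2_{d_2}\) are independent, with degrees of freedom \(d_1, d_2\)
Test statistic:
\(TS = {s^2_1 \over s^2_2}\)
Put the sample with the larger standard deviation in the numerator;
this guarantees that the test statistic falls in the right tail.
Chi-square distribution
Test statistic:
\(\chi^2 = \Sigma_i\Sigma_j{(f_{ij}-e_{ij})^2 \over e_{ij}} \sim \chi^2_{(r-1)(c-1)}\)
\(f_{ij}\) = observed frequency
\(e_{ij}\) = expected frequency under \(H_0\), requiring \(\forall e_{ij} \ge 5\)
\(r\) = number of rows
\(c\) = number of columns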
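The test statistic can be worked through on the sex-by-university counts from the earlier table. Expected counts under independence are (row total × column total) / grand total; the 5% critical value 5.991 for 2 degrees of freedom is taken from a chi-square table. This is an illustrative sketch, not a calculation from the original notes.

```python
# Observed counts f_ij: rows are male, female; columns are the three schools.
observed = [
    [30, 66, 234],
    [18, 42, 210],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected counts e_ij under H0 (no association between sex and school).
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi2 = sum(
    (f - e) ** 2 / e
    for f_row, e_row in zip(observed, expected)
    for f, e in zip(f_row, e_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

critical_5pct = 5.991  # chi-square table value for df = 2, alpha = 0.05
reject = chi2 > critical_5pct
```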
\(CV_{ij} = \sqrt{\chi^2_{\alpha}}\sqrt{{\bar p_i \bar q_i \over n_i}+{\bar p_j \bar q_j \over n_j}}\)
where
\(\chi^2_\alpha\) is the chi-square critical value at significance level \(\alpha\) with \(k - 1\) degrees of freedom
\(\bar p_i\) and \(\bar p_j\) are the sample proportions for populations \(i\), \(j\) (with \(\bar q = 1 - \bar p\))
\(n_i\) and \(n_j\) are the sample sizes of populations \(i\), \(j\)
Reject (the difference is significant) if:
\(|{\bar p_i - \bar p_j}| \gt CV_{ij}\)
Use the previous formula to judge whether the \(\chi^2\) statistic is significant.
\(H_0\): Assumes that there is no association between the two variables.
\(H_a\): Assumes that there is an association between the two variables.
Goodness of fit
Test statistic:
\(\chi^2_{(k-1)} = \Sigma^k_{i=1}{(f_i - e_i)^2 \over e_i}\)
\(f_i\) is the observed frequency
\(e_i\) is the expected frequency, \(\forall e_i \ge 5\)
\(k\) is the number of categories
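A small sketch of the goodness-of-fit statistic, testing whether a die is fair from 60 hypothetical rolls. No parameters are estimated from the data, so the degrees of freedom are \(k-1 = 5\); the critical value 11.070 is the chi-square table value for 5 degrees of freedom at \(\alpha = 0.05\).

```python
# Hypothetical observed counts for faces 1..6 over 60 rolls.
observed = [8, 12, 9, 11, 10, 10]
n = sum(observed)
k = len(observed)

# Under H0 (fair die) every face has expected count n / k.
expected = [n / k] * k

chi2 = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
df = k - 1

critical_5pct = 11.070  # chi-square table value for df = 5, alpha = 0.05
reject = chi2 > critical_5pct
```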
Use the goodness-of-fit test to check whether the data follow a normal distribution.
Divide the \(n\) observations into \(k = {\lfloor}{n \over 5}{\rfloor}\) equal-probability intervals,
so that each interval has expected count \(e_i = n/k \ge 5\).
Then test against \(\chi^2_{({\lfloor}{n \over 5}{\rfloor} -3)}\),
because the degrees of freedom are \(k - p -1\), where
\(p\) is the number of parameters of the distribution estimated from the sample.
The normal distribution has 2 parameters.
Hence \(k-p-1 = k-3\)