基礎概念

做假設的技巧 :
幾乎做出的假設都是
$\Rightarrow$

$H_{0}$ : …<表格-列>… 與 …<表格-欄>… 相互獨立

$H_{1}$ : …<表格-列>… 與 …<表格-欄>… 相關
compute risk different :
在 sas 中使用 proc freq 裡面的 riskdiff
compute RR :
在 sas 中使用 proc freq 裡面的 relrisk
看表格最下面跑出來的 相對風險(欄n)
(n) 是代表我們想觀察的那一項在哪一欄
compute OR :
在 sas 中使用 proc freq 裡面的 relrisk
看表格最下面跑出來的 勝算比
使用甚麼 test 去看是否符合假設 :
通常都是使用
$χ^{2}$ 或是
$G^{2}$
這 2 個都是去測資料之間是否獨立
但如果資料的表格是比 2
$\times$ 2 還要大的就要改成使用 Pearson
$χ^{2}$
如果是使用 Pearson
$χ^{2}$ 時，程式碼還是會使用 chisq 來檢測
不同 test 的意義 :

${\begin{cases} chisq : 是否相關 \\ trend : 是否線性相關 \\ monotone : 是否具有單調性(在 sas 中要打 measures) \\ BD(Breslow-Day) : 3 個變數時，其中 1 個是否會影響另外 2 個 \end{cases}$

需要注意跑出來的報表我們想解釋的是哪一欄
如果不是我們要的需要使用 proc sort
把我們想解釋的放在左上角
不然跑出來的 risk diff / RR / OR 會出問題

解釋跑出來的值以及信賴區間 :
首先需要區分是 (risk diff) 還是 (RR 或 OR)
如果是 risk diff : 包含

0

的話代表沒有顯著差異 / 不包含的話就是有顯著差異
如果是 RR 或 OR : 包含

1

的話代表 2 者之間沒有顯著差異 / 不包含的話就是有顯著差異

RR 跟 OR 的解釋技巧 :

假設
$R R = \frac{π_{1} \Rightarrow P}{π_{2} \Rightarrow A} = 1.82$ : P 會 …<題目>… 的機率是 A 的 1.82 倍
(
$∵ π_{1} = R R \times π_{2}$ )
假設
$O R = θ = \frac{Ω_{1} \Rightarrow P}{Ω_{2} \Rightarrow A} = 1.83$ : 使用 P 得到 …<題目>… 的勝算是使用 A 的 1.83 倍
(
$∵ Ω_{1} = θ \times Ω_{2}$ )

如果取樣不平均的話 :
total 得出來的結論跟分開討論會有差別

程式碼實作

我們使用 SAS 來執行

風險

設定資料檔








data q1_1;
input alc status count;
cards;
0 0 17066 
0 1 48 
1 0 15415 
1 1 38 
run;

把資料做成交叉列表(contingency table)(不常用到)




proc freq data = q1_1;
weight count; /*weight : count 做加權(會把每列都當一個人)*/
table alc*status;   /*table 後面接上的"*"是把列表做成交叉列表(列*行)*/
run;

把資料排序成我們好解釋的





proc sort data = q1_1;
by decending status; 
/*因為位置是錯誤的，要修正(通常會是 上面:1 0 左邊 : 0 1)*/
/*小補充 (如果位置是錯誤的，可以先做排序)*/
/*(位置通常會是 上方:1 0，左方:0 1)--這是我們想看的解釋變數為 1 的時候*/

計算風險差值




proc freq data = q1_1 order = data;  /*按照給定的資料順序呈現*/
weight count; /*加權*/
table alc*status/ riskdiff;   /*列*行*/ /*riskdiff : 計算風險差值*/
run;

計算相對風險跟勝算比




proc freq data = q1_1 order = data;
weight count;
table alc*status/ relrisk;   /*列*行*/ /*relrisk : 計算相對風險*/
run;

獨立性檢定

設定資料檔












data q2_1;
input snoring hd count;
cards;
0 1 24
0 0 1355
2 1 35
2 0 603
4 1 21
4 0 192
5 1 30
5 0 224
run;

測 2 者是否獨立(做卡方獨立檢定)




proc freq data = q2_1 order = data;
weight count; /*加權*/
tables snoring*hd/chisq; /*做卡方檢定(測是否相關)*/
run;

看 Fisher Exact test (也是做獨立檢定，但樣本數比較少的時候)




proc freq data = q2_1 order = data;
weight count; 
table snoring*hd/ exact;   /*列*行 exact:Fisher Exact test*/
run;

測 2 者是否線性獨立




proc freq data = q2_1 order = data;
weight count;
tables snoring*hd/trend; /*做 trend test (測是否線性相關)*/
run;

測 2 者之間的關係是否為單調






proc freq data = q2_1 order = data;
weight count;
tables snoring*hd/measures; 
/*做 monotone test (測是否單調)*/
/*(要看 gamma)*/
run;

看

μ_{i j}

test 的值






proc freq data = q2_1 order = data;
weight count;
table snoring*hd/ expected deviation;   /*列*行*/
/*expected : expected cell count*/
/*deviation : n_{ij} - \mu_{ij}*/
run;

合併與否

設定資料檔(important)

























/*哪些項目可以合併*/
data q3_1;
input color satellite count;
/* 設定合併類別的範例*/
/*先比 1 2*/
if color = 1 then color12 = 1;
else if color = 2 then color12 = 2;
/*再比 3 4*/
if color = 3 then color34 = 1;
else if color = 4 then color34 = 2;
/*最後綜合 1 2 跟 3 4*/
if color in ( 1 2 ) then color1234 = 1;
else if color in ( 3 4 ) then color1234 = 2;
Cards;
1 1 9 
1 0 3
2 1 69
2 0 26
3 1 26
3 0 18
4 1 7
4 0 15
run;
/*之所以這樣分是因為要做有意義的分類*/
/*(題目的前 2 個都有 medium，後 2 個都有 dark)*/






/*分完之後要做檢定*/
proc freq data = q3_1;
weight count;
tables (color12 color34 color1234)*satellite/chisq; /*括號可以同時做很多個(分配律)*/
run;
/*檢定結果如果不顯著才可以合併 (也就是 p > 0.05)*/

3 個變數時，是否互相影響

設定資料檔
















data q4_1;
input race age sat count;
cards;
0 1 0 109 
0 1 1 57 
0 2 0 108 
0 2 1 47 
0 3 0 141 
0 3 1 41 
1 1 0 31 
1 1 1 13 
1 2 0 7 
1 2 1 5 
1 3 0 10 
1 3 1 3 
run;

測是否會有交互作用














proc freq data = q4_1 order = data; 
weight count; 
table age * race*sat/ cmh;  
/*z 變數 * x 變數 * y 變數*/ /*順序很重要*/
/*測 z 變數是否會影響到 x 跟 y 之間的關係*/
run;
/*看到最下面的均齊性是 0.5979 在 0.05 底下不拒絕 H0 --> 會有共同勝算*/
/*往上會看到 CMH 的統計量(上方是給定條件--對照 age)*/
/*非零 : 檢測線性相關*/
/*列 : 假設 X 是類別，Y 是連續(跟 T / Wilcoxon 一樣)*/
/*一般關聯 : X Y 的關聯，沒有特別設限，把 X Y 假設是 nominal 的情況*/
/*都小於0.05 --> 拒絕H0 --> 條件不獨立*/
/*接著看下方的勝算比 : 就可以給出說明 -->*/ 
/*給定被害人的種族是白人的情況下，被告的種族是白人，被判死刑的勝算是黑人的 0.9730 倍*/

建立模型(model)

設定資料檔












data q5_1;
input snoring hd count;
cards;
0 1 24
0 0 1355
2 1 35
2 0 603
4 1 21
4 0 192
5 1 30
5 0 224
run;

做 linear probability model
最後跑出的結果須寫成 :

\hat{π} (x) = Intercept + 第二項 * x












/*linear probability model*/
proc genmod data = q5_1 desc; /*decending 的縮寫*/
/*因為有興趣的是有心臟病的機率*/
/*差別的部分會出現在回應概況--hd的排序值第一個會是 hd = 1*/
/*原始寫法是要先 sort 過才能執行*/
weight count; /*需分清楚是否做加權*/
model hd = snoring/ link = identity dist = bin;
/*link 選的是機率線性模型 identity*/
/*第二個選的是分配 binary*/
/*左邊是反應變數 右邊是解釋變數*/
run;
/*\hat{pi}(x) = 0.0172 + 0.0198 x*/

做 Logistic regression
最後跑出的結果須寫成 :

l o g i t (\hat{π} (x)) = l o g (\frac{\hat{π} (x)}{1 - \hat{π} (x)}) = Intercept + 第二項 * x







/*Logistic regression*/
proc genmod data = q5_1 desc;
weight count;
model hd = snoring/ link = logit dist = bin;
run;
/*log(\hat{pi}(x) ) / (1- \hat{pi}(x) )= -3.8662 + 0.3973 x*/
/*x 前面係數的正負在 Logistic 跟 linear 應該要是相同的*/

做 Probit model
最後跑出的結果須寫成 :

Φ (\hat{π} (x)) = Intercept + 第二項 * x






/*Probit model*/
proc genmod data = q5_1 desc;
weight count;
model hd = snoring/ link = probit dist = bin;
run;
/*\Phi(\hat{pi}(x) )= -2.06 + 0.19 x*/

做 Log linear model 或 Poisson model
最後跑出的結果須寫成 :

l o g (\hat{μ} (x)) = - 3.3 + 0.164 x

或

e^{\hat{β}} = e x p (0.164)







/*Log linear model or Poisson model*/
proc genmod data = q5_1 desc;
/*不設權重，因為一筆就是一隻*/
model hd = snoring/ link = log dist = pois;
run;
/*log(\hat{\mu}(x)) = -3.3 + 0.164 x*/
/*e^ \hat{\beta} = exp(0.164)*/

檢測過度離散

設定資料檔





data q6_1;
infile "C:\Users\ACER\Desktop\類別 sas 檔\crab.dat";
input color spine width sat weight;
y = (sat>0);
run;

從簡單的方法檢測是否過度離散(看平均值跟變異數)





/*看平均值跟變異數是否相等 --> 從這個表可以看出有過度離散*/
proc means data = q6_1 n sum mean var;
class width;
var sat;
run;

用 genmod 看是否過度離散
從配適度的 值/DF，我們希望接近 1，如果不是的話就是過度離散

\Rightarrow

需要做調整








proc genmod data = q6_1;
model sat = width / link = log dist = pois
	aggregate;
run;
/*aggregate 整合 likelihood (因為這個資料是連續的所以不太需要用到)*/
/*檢視是否過度離散(Examine overdispersion)*/
/*看配適度的 值/DF ，預期要接近 1，超過的話代表就是過度離散*/
/*這樣就要做調整*/

做調整
使用 scale 的 pearson

χ^{2}

test 做調整
可以從最下方的縮放看出確實有調整過







/*做調整*/
proc genmod data = q6_1;
model sat = width / link = log dist = pois
	aggregate scale = pearson;
/*scale 是做卡方獨立檢定的時候會用到的規模*/
/*因為樣本比較小所以要選擇 pearson test*/
run;

同質性分析、過度離散調整

設定資料檔






data q7_1;
infile "C:\Users\ACER\Desktop\類別 sas 檔\crab.dat";
input color spine width sat weight;
color = color -1;
y = (sat>0);
run;

最簡單看是不是 monotone(單調) 的方法



proc sgplot data = q7_1;
vline width/ response = y stat = mean markers;
run;

跟 genmod 差不多的程序




proc logistic data = q7_1 desc; /*因為有興趣的事件是 1 所以使用 desc*/
model y = width;
/*因為是連續的所以不用調其他東西*/
run;

把資料檔從連續型的改成二元的





data q7_2;
set q7_1;
if color in ( 1 2 ) then light = 1;
else if color in ( 3 4 ) then light = 0;
run;

然後看他的分析(預設 param = effect)
參考係數是(1, -1)






/*Default : param = effect*/
proc logistic data = q7_2 desc; /*因為有興趣的事件是 1*/
class light;
/*因為想看的是 light 所以要設 class*/
model y = light;
run;

把預設的 param = effect 改成 ref(通常都會使用這個)
預設的參考組別是 (last)
參考係數是(1, 0)





/*Reference : param = ref default : last*/
proc logistic data = q7_2 desc;
class light/ param = ref;
model y = light;
run;

把預設的參考係數 param = effect 改成 ref
把預設的參考組別 ref = last 改成 first
算出的 勝算比點估計值 會跟參考組別為 last 的互為倒數






/*Reference : param = ref default : first*/
proc logistic data = q7_2 desc;
class light/ param = ref ref = first;
model y = light;
run;
/*淺色的 horsehoe crab 有 satellite 的勝算是深色的 2.689 倍*/

做同質性分析(看主成分)





/*Homogeneous association  -- main effect*/
proc logistic data = q7_2 desc;
class light spine/ param = ref ref = first;
model y = light spine;
run;

做異質性分析(看主成分)(然後看交互作用)





/*Hetergeneous association  -- add interaction*/
proc logistic data = q7_2 desc;
class light spine/ param = ref ref = first;
model y = light spine light * spine;
run;

因為從上面的連結性檢定跟全域檢定看出 2 個得出的結果不一致
因此跑出表格來看是不是係格的問題
我們期望係格都很多或都很少，而不是有多有少

接著我們看過度離散
如果偏差跟 pearson 都大於 0.05 代表不拒絕

H_{0}

，沒有過度離散





/*Overdispersion*/
proc logistic data = q7_2 desc;
class color spine/ param = ref ref = first;
model y = color spine/ aggregate scale = none;
run;

假設有過度離散需要做調整
把 scale = none 改成 p 就是做 pearson 調整





proc logistic data = class9a desc;
class color spine/ param = ref ref = first;
model y = color spine/ aggregate scale = p;
/*Pearson chi-square adjustment*/
run;

程式碼的小筆記專區

weight 的使用時機:
如果是原始資料的話就不需要加權，反之，其他情況都會需要
要記得加上 order = data 因為大部分的情況都會希望用我們給定的排序
記得判別我們想要解釋的變數在哪裏
如果位置是錯誤的記得要 proc sort...by decending 一下
至於怎麼判別位置
$\Rightarrow$ 通常會是 上面:1 0 左邊 : 0 1 ，這樣的情況是我們想看的解釋變數為 1 的時候
左上方的那個就是我們想看的
計算風險

${\begin{cases} riskdiff \Rightarrow 風險差值 \\ relrisk \Rightarrow RR--相對風險或 OR--勝算比 \end{cases}$
RR 就看報表中 相對風險 那一格，但因為還會分(欄1)(欄2)
通常已經整理好的話就會看(欄1)，也就是我們想知道的那一項
OR 就看報表中 勝算比 那一格
風險的解釋
首先看他們的信賴區間的值
風險差值 : 包含
$0$ 的話代表沒有顯著差異 / 不包含的話就是有顯著差異
RR 或 OR : 包含
$1$ 的話代表 2 者之間沒有顯著差異 / 不包含的話就是有顯著差異
然後下結論的時候
$\Rightarrow$
RR : … 會是 … 的
$p$ 倍
OR : … 會 … 的勝算會是 … 的
$p$ 倍
測相關性
首先會先假設 :

$H_{0}$ : …<表格-列>… 與 …<表格-欄>… 相互獨立 / 無線性關聯

$H_{1}$ : …<表格-列>… 與 …<表格-欄>… 相關 / 有線性關聯
接著就能跑報表

${\begin{cases} chisq \Rightarrow 是否獨立, p < 0.05 代表 reject H_{0}, 2 變數有關聯 / 不獨立 \\ trend \Rightarrow 是否線性相關, p < 0.05 代表 reject H_{0}, 2 變數有線性相關, 列百分比會隨 x 上升/下降而上升/下降 \\ measures \Rightarrow 是否單調, gamma/ASE = z 值，如果 |z| > 1.96 代表 reject H_{0}, 2 變數是單調關係 \end{cases}$
看哪些類別可以做合併
假設現在有 1, 2, 3, 4 這四個類別
先分成 1 2 跟 3 4
接著分成 1+2 3+4
然後把這些都拿去做卡方檢定
當
$p < 0.05$ 就代表他們顯著，類別之間有差異，不能做合併
因此只能合併
$p > 0.05$ 的分類
3 個變數時，測是否有同質關係
使用到 cmh
並且看到最下方均齊性的卡方值 & p 值
如果
$p > 0.05$ 代表
$not reject H_{0}$
也就代表第 3 項不會干擾到另外 2 者的關係
這樣才能說明均齊性符合，可以看表說明內容
反之，如果
$reject H_{0}$ 就不能看表
因為代表第 3 項與其他 2 者有關係
做 model 時
注意題目是想做哪種 model 以及注意最後寫出來的等式左方
並且
$x$ 前面係數的正負在 Logistic 跟 linear 應該要是相同的
除此之外，注意題目是否有需要加入 desc 把我們想看得放在第一項

齊一性 : 會有共同勝率
$θ$ ，因此可以共同討論
異質性 : 需要分開解釋
會使用 cmh 去測
更改參數設定不會影響勝算比
差別只是把二元的參數設成(1, -1)還是(0, 1)
為甚麼要看是否單調(monotone)
如果資料的呈現形式是單調的代表我們可以把資料視為連續型
cmh 跑出來的假設定義
非零 :
檢測 : 檢測線性相關
假設 : 假設 2 者皆為連續變數
列 :
檢測 : 檢測 X 是類別，Y 是連續(跟 T / Wilcoxon 一樣)
假設 : 假設 X 為類別變數、Y 是連續變數
一般關聯 :
檢測 : 檢測 X Y 的關聯，沒有特別設限，把 X Y 假設是 nominal 的情況
假設 : 沒有額外的假設
cmh 比較模型好壞
通常會比較有交互作用跟沒有交互作用的
看 2 者模型配適統計值中的 AIC，值越小代表該模型越好
cmh 解釋

像這種的解釋就是 :
light = 1 的勝算會是 light = 0 勝算的
$e^{- 0.2657}$ 倍
light = 1 且 spine = 2 的勝算會是 light = 0 且 spine = 1 勝算的
$e^{0.6712}$ 倍

基礎概念

程式碼實作

風險

獨立性檢定

合併與否

3 個變數時，是否互相影響

建立模型(model)

檢測過度離散

同質性分析、過度離散調整

程式碼的小筆記專區

Read more

cryptography 圖文解析

高公局比賽問題

常用的 latex

mathstat_midterm_review2