# Linear tendency
###### tags: `Categorical data analysis`
## Amended odds ratio
When any cell count $n_{ij}$ is zero or close to zero, the ordinary sample odds ratio becomes $0$, infinite, or highly unstable. In that case we use the amended odds ratio, which adds $0.5$ to every cell and is defined as
<font size = 5.5>
$\theta = \frac{(n_{11} + 0.5)(n_{22} + 0.5)}{(n_{21} + 0.5)(n_{12} + 0.5)}$
</font>
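
As a quick illustration of the formula above, here is a minimal Python sketch. The function name `amended_odds_ratio` and the counts are hypothetical and chosen only for this example; the `0.5` correction is the one from the definition.

```python
def amended_odds_ratio(n11, n12, n21, n22, correction=0.5):
    """Odds ratio of a 2x2 table after adding `correction` to every cell."""
    numerator = (n11 + correction) * (n22 + correction)
    denominator = (n12 + correction) * (n21 + correction)
    return numerator / denominator


# Hypothetical counts with an empty cell (n12 = 0): the ordinary odds ratio
# n11 * n22 / (n12 * n21) would divide by zero here.
theta = amended_odds_ratio(10, 0, 5, 8)
print(theta)
```

The amended value stays finite even though one cell is empty, which is the point of the correction.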
## Case-control study

In a case-control study the column totals (the numbers of cases and controls) are fixed by design, so the relative risk can be determined only if we also know $P(L)$, i.e. the prevalence of lung cancer.
## Pearson $\chi^2$ test
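This test checks whether $X$ and $Y$ are independent, so the hypotheses are $H_0$: $X$ and $Y$ are independent vs. $H_a$: $X$ and $Y$ are not independent.

As stated before, $X$ and $Y$ are independent when
<font size = 5.5>
$\pi_{ij} = \pi_{i+}\pi_{+j}$
</font>
so under $H_0$ the expected count is $\mu_{ij} = n\pi_{i+}\pi_{+j}$, which is estimated by $\hat{\mu}_{ij} = \frac{n_{i+}n_{+j}}{n}$.

The test statistic is then defined as
<font size = 5.5>
$\chi^2 = \sum_{i}\sum_{j} \frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}}$
</font>
Under $H_0$ this statistic approximately follows a chi-square distribution with $df = (I-1)(J-1)$, where $I$ and $J$ are the numbers of rows and columns.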
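So where does the df come from?
Define $IJ - 1$ as the effective sample size, where $I$ and $J$ are the numbers of rows and columns. In categorical data analysis all observations are classified into $IJ$ cells, so the information really comes from the cells rather than from the raw observations; one is subtracted because the cell probabilities must sum to 1.
Under independence we have to estimate $(I-1)$ row parameters and $(J-1)$ column parameters (the marginal probabilities that determine $\mu_{ij}$), so we subtract $(I-1) + (J-1)$ and obtain
$(IJ - 1) - [(I-1) + (J-1)] = (I-1)(J-1)$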
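
The Pearson statistic and its degrees of freedom can be sketched in a few lines of Python. This is only an illustration, assuming the usual form of the statistic, $\sum (n_{ij} - \hat{\mu}_{ij})^2 / \hat{\mu}_{ij}$ with $\hat{\mu}_{ij} = n_{i+}n_{+j}/n$; the function name `pearson_chi2` and the counts are hypothetical.

```python
def pearson_chi2(table):
    """Return (chi2, df, expected) for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected counts under independence: (row total * column total) / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    # df = (number of rows - 1) * (number of columns - 1)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, expected


table = [[20, 30], [30, 20]]  # hypothetical 2x2 counts
chi2, df, expected = pearson_chi2(table)
print(chi2, df)  # 4.0 1
```

Every expected count is 25 here, so each cell contributes $(\pm 5)^2 / 25 = 1$ to the statistic.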
## SAS implementation

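The numbered parts of the output are:
1. The ordinary Pearson chi-square result
2. The likelihood ratio test ($G^2$)
3. The chi-square computed after amending, i.e. after the 0.5 adjustment introduced above

Note:
1. When the sample size is large, $\chi^2 \approx G^2$.
2. Let $c = IJ - 1$, where $I$ and $J$ are the numbers of rows and columns. When $\frac{n}{c} < 5$, the chi-square approximation for $G^2$ is poor.
3. Even when $\frac{n}{c} \approx 1$, the chi-square approximation for $\chi^2$ can still be adequate, so $\chi^2$ is the safer choice for sparse tables.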
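
To make the notes above concrete, the sketch below computes $G^2$ and the ratio $n/c$ for a small table. It assumes the standard likelihood ratio form $G^2 = 2\sum n_{ij}\log(n_{ij}/\hat{\mu}_{ij})$; the function name and the counts are hypothetical.

```python
import math


def likelihood_ratio_g2(table):
    """Likelihood ratio statistic G^2 for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            if obs > 0:  # an empty cell contributes 0 to the sum
                expected = row_totals[i] * col_totals[j] / n
                g2 += 2 * obs * math.log(obs / expected)
    return g2


table = [[20, 30], [30, 20]]  # hypothetical 2x2 counts
g2 = likelihood_ratio_g2(table)
n = sum(sum(row) for row in table)
c = len(table) * len(table[0]) - 1  # c = (rows * columns) - 1
ratio = n / c
print(round(g2, 4), round(ratio, 1))  # 4.0271 33.3
```

Here $n/c$ is far above 5, so the chi-square approximation for $G^2$ is not a concern for this table.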
## Pearson residual
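The chi-square test above only tells us whether or not to reject $H_0$; it cannot say anything about individual cells. Pearson residuals allow a cell-by-cell interpretation:
<font size = 5.5>
$r_{ij} = \frac{n_{ij} - \hat{\mu}_{ij}}{\sqrt{\hat{\mu}_{ij}}}$
</font>
If $r_{ij} > 0$, the expected count under independence underestimates the observed count; if $r_{ij} < 0$, it overestimates it.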
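
A minimal Python sketch of the residual formula, assuming the expected counts are the independence estimates $\hat{\mu}_{ij} = n_{i+}n_{+j}/n$. The function name and the counts are hypothetical.

```python
import math


def pearson_residuals(table):
    """Pearson residual (observed - expected) / sqrt(expected) for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            res_row.append((obs - expected) / math.sqrt(expected))
        residuals.append(res_row)
    return residuals


table = [[20, 30], [30, 20]]  # hypothetical 2x2 counts
residuals = pearson_residuals(table)
print(residuals)  # [[-1.0, 1.0], [1.0, -1.0]]
```

The negative residuals mark cells with fewer observations than independence predicts, and the squared residuals add up to the Pearson chi-square of the table.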
## Partition chi-square
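For a full contingency table, if some of the categories behave alike (or differently), we can partition the table into subtables to see which categories can actually be merged. Partitioning is also useful when some cell counts are too small and would cause problems for the chi-square test. Before partitioning, look at the row percentages to get a rough idea of which categories can be combined. The partition must follow these rules:
1. The df of the subtables must sum to the df of the original table.
2. Every cell count of the original table must appear in exactly one subtable.
3. Every marginal total of the original table must appear in exactly one subtable.

Note: summing the chi-square values of the subtables gives the chi-square value of the original table. The sum is exact for $G^2$ and only approximate for Pearson $\chi^2$.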
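
The additivity can be checked numerically. The sketch below partitions a hypothetical $2 \times 3$ table into two $2 \times 2$ subtables (column 1 vs. column 2, then columns 1 and 2 combined vs. column 3) and uses $G^2$, for which the decomposition is exact. The helper name and all counts are assumptions made for this example.

```python
import math


def g2(table):
    """Likelihood ratio statistic G^2 for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        2 * obs * math.log(obs * n / (row_totals[i] * col_totals[j]))
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
        if obs > 0
    )


full = [[10, 20, 30], [20, 20, 20]]               # original table, df = 2
first = [[row[0], row[1]] for row in full]        # column 1 vs. column 2, df = 1
second = [[row[0] + row[1], row[2]] for row in full]  # columns 1+2 vs. column 3, df = 1

g2_full = g2(full)
g2_parts = g2(first) + g2(second)
print(g2_full, g2_parts)
```

The two printed values agree up to floating-point error, and the subtable df (1 + 1) add up to the df of the original table (2).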
## Proportional reduction
To measure how well $X$ predicts $Y$ in categorical data, we define the proportional reduction in error as
<font size = 5.5>
$PRE = \frac{V(Y) - V(Y|X)}{V(Y)}$
</font>
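where
$V(Y) = V(\{\pi_{+j}\})$: the variation of the marginal distribution of $Y$
$V(Y|X)$: the variation of the conditional distribution $\{\pi_{1|i}, \pi_{2|i}, \dots, \pi_{J|i}\}$ of $Y$ given $X = i$, averaged over the distribution of $X$ ($J$ is the number of categories of $Y$)
This quantity therefore tells us how much of the original variation in $Y$ is explained once $X$ is given. The larger the value, the better $X$ explains $Y$; the idea is very similar to R-square.
For categorical data we cannot define a variance, so we redefine the variation as **entropy**: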
<font size = 5.5>
$V(Y) = -\sum_{j} \pi_{+j}log\pi_{+j}$
</font>
Then the proportional reduction in error can be expressed as the **uncertainty coefficient**:
<font size = 5.5>
$U = -\frac{\sum_{i}\sum_{j} \pi_{ij} log\frac{\pi_{ij}}{\pi_{i+}\pi_{+j}}}{\sum_{j} \pi_{+j} log \pi_{+j}}$
</font>
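
The formula for $U$ can be sketched in Python as follows, with $X$ on the rows and $Y$ on the columns. This is an illustration under the assumption that entropy is taken as $-\sum_j \pi_{+j}\log\pi_{+j}$ and that the $\pi$ values are estimated by sample proportions; the function name and the tables are hypothetical.

```python
import math


def uncertainty_coefficient(table):
    """U for predicting the column variable Y from the row variable X."""
    n = sum(sum(row) for row in table)
    pi = [[count / n for count in row] for row in table]
    row_p = [sum(row) for row in pi]        # pi_{i+}
    col_p = [sum(col) for col in zip(*pi)]  # pi_{+j}
    # Entropy of the marginal distribution of Y
    entropy_y = -sum(p * math.log(p) for p in col_p if p > 0)
    # Reduction in entropy of Y once X is known
    reduction = sum(
        p * math.log(p / (row_p[i] * col_p[j]))
        for i, row in enumerate(pi)
        for j, p in enumerate(row)
        if p > 0
    )
    return reduction / entropy_y


u_independent = uncertainty_coefficient([[10, 10], [10, 10]])
u_perfect = uncertainty_coefficient([[10, 0], [0, 10]])
print(u_independent, u_perfect)
```

The two hypothetical tables show the extremes: $U$ is 0 when $X$ carries no information about $Y$ and 1 when $X$ determines $Y$ completely.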
### code
``` sas=
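proc freq data = <data>;
/* count holds the frequency of each row * column combination */
weight count;
/* measures requests the measures of association */
table <explanatory> * <response> / measures;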
run;
```
### SAS implementation

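The numbered parts of the output are:
1. The column variable is treated as the response variable and the row variable as the explanatory variable.
2. The reverse of 1.
3. The average of the two.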
## Linear tendency of ordinal data
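The likelihood ratio test and the test of independence above only treat the categories as nominal, so when the data are ordinal these two tests give a distorted picture. In that case we can assign scores to the categories, treat the variables as continuous, and then test.
The test statistic is the standard linear trend statistic
<font size = 5.5>
$M^2 = (n-1)r^2$
</font>
where $r$ is the sample correlation coefficient between the row scores and the column scores, and $M^2$ approximately follows a chi-square distribution with $df = 1$ under $H_0$. The hypotheses are $H_0: \rho = 0, H_a : \rho \neq 0$, where $\rho$ is the population correlation.
If the scores are assigned as midranks, the correlation coefficient is called Spearman's rho.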
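
The score-based test can be sketched in Python. This is a minimal illustration assuming the statistic is the usual linear trend statistic $M^2 = (n-1)r^2$, with $r$ the correlation between row and column scores weighted by the cell counts; the function name, scores and counts are hypothetical.

```python
import math


def linear_trend_statistic(table, row_scores, col_scores):
    """Return (r, M2): the score correlation and M2 = (n - 1) * r**2."""
    n = sum(sum(row) for row in table)
    mean_u = sum(u * sum(row) for u, row in zip(row_scores, table)) / n
    mean_v = sum(v * sum(col) for v, col in zip(col_scores, zip(*table))) / n
    cov = var_u = var_v = 0.0
    for u, row in zip(row_scores, table):
        for v, count in zip(col_scores, row):
            cov += count * (u - mean_u) * (v - mean_v)
            var_u += count * (u - mean_u) ** 2
            var_v += count * (v - mean_v) ** 2
    r = cov / math.sqrt(var_u * var_v)
    return r, (n - 1) * r ** 2


# Hypothetical 3x3 ordinal table with equally spaced scores 1, 2, 3
table = [[20, 10, 5], [10, 20, 10], [5, 10, 20]]
r, m2 = linear_trend_statistic(table, [1, 2, 3], [1, 2, 3])
print(r, m2)  # compare m2 with a chi-square distribution with df = 1
```

Passing midranks instead of `1, 2, 3` as scores would turn `r` into Spearman's rho.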

### code
``` sas=
proc freq data = <data>;
weight count;
/* scores = rank uses midranks; use scores = ridit for ridit scores */
table <explanatory> * <response> / measures scores = rank;
run;
```
`/ measures scores = rank` computes the correlation from midranks, so the Pearson correlation and the Spearman correlation take the same value.

`/ measures scores = ridit` uses the midranks divided by $n$.
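
To show what these two score types look like, here is a small Python sketch that derives midranks and ridits from the marginal counts of an ordinal variable. It follows the description above (ridit = midrank divided by $n$); the function names and counts are hypothetical, and the code is not meant to reproduce SAS internals.

```python
def midranks(marginal_counts):
    """Midrank of each ordered category: the average rank of its observations."""
    scores = []
    below = 0  # number of observations in lower categories
    for count in marginal_counts:
        scores.append(below + (count + 1) / 2)
        below += count
    return scores


def ridits(marginal_counts):
    """Ridit scores as described above: midranks divided by the sample size n."""
    n = sum(marginal_counts)
    return [rank / n for rank in midranks(marginal_counts)]


counts = [10, 20, 30]  # hypothetical marginal counts of three ordered categories
rank_scores = midranks(counts)
ridit_scores = ridits(counts)
print(rank_scores)  # [5.5, 20.5, 45.5]
print(ridit_scores)
```

The first category holds ranks 1 to 10, so its midrank is 5.5; the second holds ranks 11 to 30, giving 20.5, and so on.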
