Linear tendency

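# Linear tendency
###### tags: `Categorical data analysis`

## Amended odds ratio
When any $n_{ij} \approx 0$, the original definition of the odds ratio becomes problematic, because it can be zero, infinite, or undefined. We therefore use the amended odds ratio, defined as

$\theta = \frac{(n_{11} + 0.5)(n_{22} + 0.5)}{(n_{21} + 0.5)(n_{12} + 0.5)}$

## Case-control study
![](https://i.imgur.com/Lj9fWoV.jpg)

When the column totals are fixed by design, the relative risk can be determined only if we also know $P(L)$, i.e. the prevalence of lung cancer.

## Pearson $\chi^2$ test
This test checks whether $X$ and $Y$ are independent:

![](https://i.imgur.com/Vk9XpC7.png)

Recall that $X$ and $Y$ are independent when $\pi_{ij} = \pi_{i+}\pi_{+j}$, so we get

![](https://i.imgur.com/8e71mNe.png)

The test statistic is therefore defined as

![](https://i.imgur.com/5YsExKG.png)

where $\chi^2$ follows a chi-square distribution with $df = (I-1)(J-1)$, $I$ and $J$ being the numbers of rows and columns.

So where does the df come from? Define $IJ-1$ as the effective sample size. In categorical data analysis all observations are classified into $IJ$ categories, so the information is really carried by the categories rather than by the raw sample size, hence the name. The row and column margins require estimating $(I-1)$ and $(J-1)$ parameters respectively (to obtain the $\mu_{ij}$), so we subtract a further $(I-1) + (J-1)$ and get

$(IJ - 1) - [(I-1) + (J-1)] = (I-1)(J-1)$

## SAS implementation
![](https://i.imgur.com/mmGMxX3.jpg)

1. The ordinary chi-square result
2. Likelihood ratio test ($G^2$)
3. The chi-square computed after amending the table first, i.e. after adding the 0.5 described above

Note:
1. When the sample size is large, $\chi^2 \approx G^2$.
2. Let $c = IJ - 1$. When $\frac{n}{c} < 5$, the chi-square approximation for $G^2$ is poor.
3. Even when $\frac{n}{c} \approx 1$, the chi-square approximation for $\chi^2$ can still be acceptable.

## Pearson residual
The chi-square test above only tells us whether to reject or not; it says nothing about individual cells. Pearson residuals allow a cell-by-cell interpretation:

$r_{ij} = \frac{n_{ij} - \hat{\mu}_{ij}}{\sqrt{\hat{\mu}_{ij}}}$

If $r_{ij} > 0$, the fitted value under independence underestimates the observed count; if $r_{ij} < 0$, it overestimates it.

## Partition chi-square
For a full contingency table, if some categories behave alike (or differently), we can partition the table into subtables to see which categories can actually be combined. This also helps when several cells have counts too small for the chi-square test. Before partitioning, look at the row percentages to get a rough idea of which categories can be merged. The partition must obey the following rules:

1. The df of the subtables must sum to the df of the original table.
2. Each cell count in the original table must appear in exactly one subtable.
3. Each marginal total of the original table must appear in exactly one subtable.

Note: adding up the chi-square values of the subtables gives the chi-square value of the original table (exactly for $G^2$, only approximately for Pearson $\chi^2$).

## Proportional reduction
To measure how well $X$ can predict $Y$ in categorical data, we define the proportional reduction in error as

$PRE = \frac{V(Y) - V(Y|X)}{V(Y)}$

where

$V(Y) = V(\{\pi_{+j}\})$: the variation of the marginal distribution of $Y$

$V(Y|X) = V(\{\pi_{1|i}, \pi_{2|i}, \ldots, \pi_{J|i}\})$: the variation of the conditional distribution of $Y$ given that $X = i$, averaged over $i$ with weights $\pi_{i+}$

This quantity therefore tells us how much of the original variation in $Y$ is explained once $X$ is given. The larger the value, the better $X$ explains $Y$; the idea is very similar to R-square.

For categorical data we cannot define a variance, so we redefine the variation as **entropy**:

$V(Y) = -\sum_{j} \pi_{+j}\log\pi_{+j}$

The proportional reduction can then be expressed as the **uncertainty coefficient**:

$U = -\frac{\sum_{i}\sum_{j} \pi_{ij} \log\frac{\pi_{ij}}{\pi_{i+}\pi_{+j}}}{\sum_{j} \pi_{+j} \log \pi_{+j}}$

### code
```sas=
/* <data>, <explanatory> and <response> are placeholders */
proc freq data = <data>;
    weight count;                                  /* cell counts are stored in count */
    table <explanatory> * <response> / measures;   /* request measures of association */
run;
```

### SAS implementation
![](https://i.imgur.com/1a92jyT.jpg)

1. Treats the column variable as the response and the row variable as the explanatory variable
2. The reverse of 1
3. The average of the two

## Linear tendency of ordinal data
The likelihood ratio test and the test of independence above treat the variables as purely categorical. If the data are ordinal, these two tests ignore the ordering and can be misleading. We can instead assign scores to the categories, treat them as continuous variables, and then run the test. The test statistic is

![](https://i.imgur.com/pZHKBej.jpg)

where $r$ is the sample correlation coefficient. The hypotheses are $H_0: \rho = 0$ versus $H_a: \rho \neq 0$, with $\rho$ the population correlation. If the scores are assigned by midrank, the correlation coefficient is called Spearman's rho.

![](https://i.imgur.com/1SEuEWg.jpg)

### code
```sas=
proc freq data = <data>;
    weight count;
    /* scores = rank uses midranks; use scores = ridit for ridit scores */
    table <explanatory> * <response> / measures scores = rank;
run;
```

`/ measures scores = rank` computes the correlation from midranks, so the Pearson correlation and the Spearman correlation take the same value.

![](https://i.imgur.com/SaAPAAD.jpg)

`/ measures scores = ridit` divides the midranks by $n$.

![](https://i.imgur.com/IhHLt0f.jpg)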
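
To make the score-based trend test in this section more concrete, here is a minimal Python sketch. It assumes the statistic in the figure is the usual $M^2 = (n-1)r^2$, referred to a chi-square distribution with 1 df. The table counts, the function names (`midranks`, `weighted_corr`, `trend_test`) and the equally spaced scores are illustrative assumptions, not taken from the notes or from SAS.

```python
import math


def midranks(totals):
    """Midrank score of each ordered category, given its marginal total."""
    scores, below = [], 0
    for t in totals:
        # Ranks below+1 .. below+t share the average rank below + (t + 1) / 2.
        scores.append(below + (t + 1) / 2)
        below += t
    return scores


def weighted_corr(table, u, v):
    """Pearson correlation of row scores u and column scores v, weighted by cell counts."""
    n = sum(sum(row) for row in table)
    cells = [(u[i], v[j], c) for i, row in enumerate(table) for j, c in enumerate(row)]
    ubar = sum(c * x for x, _, c in cells) / n
    vbar = sum(c * y for _, y, c in cells) / n
    suv = sum(c * (x - ubar) * (y - vbar) for x, y, c in cells)
    suu = sum(c * (x - ubar) ** 2 for x, _, c in cells)
    svv = sum(c * (y - vbar) ** 2 for _, y, c in cells)
    return suv / math.sqrt(suu * svv)


def trend_test(table, u, v):
    """Return (r, M2, p), assuming M2 = (n - 1) * r**2 with 1 df."""
    n = sum(sum(row) for row in table)
    r = weighted_corr(table, u, v)
    m2 = (n - 1) * r ** 2
    p = math.erfc(math.sqrt(m2 / 2))  # upper tail of chi-square with 1 df
    return r, m2, p


# Hypothetical 3x3 table with ordered rows and columns (not from the notes).
table = [[20, 10, 5],
         [10, 15, 10],
         [5, 10, 20]]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Equally spaced scores 1, 2, 3 (an arbitrary choice).
print(trend_test(table, [1, 2, 3], [1, 2, 3]))
# Midrank scores: the resulting r is Spearman's rho.
print(trend_test(table, midranks(row_totals), midranks(col_totals)))
```

Weighting each cell by its count is equivalent to expanding the table into one record per observation, and dividing the midranks by $n$ (ridit scores) leaves $r$ unchanged, because correlation is invariant to rescaling the scores.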