:dart: W12 - Logistic Regression
===

## Name: 伊瑩
### logistic regression
* Regression analysis can be used to examine how one variable relates to another. Logistic regression is mainly used when the dependent variable is binary (0, 1); in other words, it handles two-outcome problems.
* In linear regression the dependent variable is usually continuous, whereas logistic regression deals mainly with a categorical dependent variable.
* The independent variables in logistic regression, however, can be either categorical or continuous.
    * e.g. the effect of age, sex, and smoking on whether someone develops a certain disease

![](https://i.imgur.com/rNrcEi0.png)
* Under the logistic distribution, the effect of the independent variables on the dependent variable changes exponentially, so the normality assumption is not needed.

![](https://i.imgur.com/WNJItgO.png)
### odds ratio
* The odds of an event are defined as the likelihood that the event will occur, expressed as a proportion of the likelihood that the event will not occur.

![](https://i.imgur.com/3fN3jPH.png)
* e.g. if the odds of developing cancer are 2.33 for people who take drug A and 0.67 for people who do not, then the odds for drug-A users are 3.48 times those of non-users (2.33/0.67), so the odds ratio is 3.48.
* Why the odds ratio, rather than probability or relative risk, is used to assess variables in a logistic regression model:
    * It can be used in prospective studies, retrospective studies, and case-control studies, so it applies more widely than relative risk.
    * In logistic regression the odds ratio represents the "constant effect" of a predictor X on the likelihood that one outcome will occur. In regression models, we often want a measure of the unique effect of each X on Y. If we try to express the effect of X on the likelihood of a categorical Y having a specific value through probability, the effect is not constant.
    * In other words, there is no way to express in one number how X affects Y in terms of probability.
* In logistic regression the odds ratio is always handled on the log scale.
### variationist approach
* proposed by William Labov
* The approach is concerned with the study of variation and change in language.
The theory proceeds from the assumptions that linguistic variation is patterned both socially and linguistically, and that such patterns can be discovered only through systematic investigation of a speech community.
* The aim is to discover patterns in the distribution of alternative ways of saying the same thing, that is, the social and linguistic factors that are responsible for variation.
* Often used in discourse analysis

[Reference1](https://psychscenehub.com/psychpedia/odds-ratio-2/)
[Reference2](http://sub.chimei.org.tw/57300/images/05_research/1090518-Logistical_regression.pdf)
[Reference3](https://www.theanalysisfactor.com/why-use-odds-ratios/)
[Reference4](https://www.yongxi-stat.com/logistic-regression/)
[Reference5](https://www.youtube.com/watch?v=3tq4t41MsPc)
[Reference6](https://www.researchgate.net/profile/Ayo-Osisanwo/publication/343214812_DISCOURSE_ANALYSIS/links/5f1c9b5445851515ef4a929b/DISCOURSE-ANALYSIS.pdf)

---

## Name: 標云
#### Logistic Regression:
Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
(Regression analysis can be used to examine the relationship between two variables: the independent variable is used to predict the dependent variable, or the effect of an experimental manipulation (cause) on the observed variable (effect) is studied.)
E.g., predicting children's height from their parents' height
* Binary Logistic Regression Major Assumptions:
    1. The dependent variable should be dichotomous in nature (e.g., present vs. absent).
    2. There should be no outliers in the data, which can be assessed by converting the continuous predictors to standardized scores and removing values below -3.29 or greater than 3.29.
    3. There should be no high correlations (multicollinearity) among the predictors. This can be assessed by a correlation matrix among the predictors.
Tabachnick and Fidell (2013) suggest that as long as correlation coefficients among independent variables are less than 0.90, the assumption is met.

![](https://i.imgur.com/uDIuxv7.png)
* A regression model whose outcome / dependent variable (Y) is a binary variable

![](https://i.imgur.com/KJHyqHh.png)
* Overfitting
  When selecting the model for the logistic regression analysis, another important consideration is the model fit. Adding independent variables to a logistic regression model will always increase the amount of variance explained in the log odds (typically expressed as R²). However, adding more and more variables to the model can result in overfitting, which reduces the generalizability of the model beyond the data on which the model is fit.
* Reporting the R^2^
  Numerous pseudo-R^2^ values have been developed for binary logistic regression. These should be interpreted with extreme caution as they have many computational issues which cause them to be artificially high or low. A better approach is to present any of the goodness of fit tests available; Hosmer-Lemeshow is a commonly used measure of goodness of fit based on the Chi-square test.
* Hosmer-Lemeshow
  The Hosmer-Lemeshow test (HL test) is a goodness of fit test for logistic regression, especially for risk prediction models. A goodness of fit test tells you how well your data fits the model. Specifically, the HL test calculates whether the observed event rates match the expected event rates in population subgroups. The test is only used for binary response variables (a variable with two outcomes like alive or dead, yes or no).
  Data is first regrouped by ordering the predicted probabilities and forming the number of groups, g (usually g = 10). The Hosmer-Lemeshow test statistic is calculated with the following formula (which is for the 10-group case; modify for your specific number of groups):
  ![](https://i.imgur.com/DtSssJd.png)
  ![](https://i.imgur.com/nCKhkNe.png)
  This test is usually run using technology.
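
To make the grouping step concrete, here is a minimal sketch of the HL calculation in plain Python. It assumes the standard form of the statistic, the sum over groups of (observed - expected)² / (n · p̄ · (1 - p̄)), compared against a chi-square distribution with g - 2 degrees of freedom. The function names and the toy data are hypothetical, and the p-value helper only covers even degrees of freedom, so real analyses would normally rely on a statistics package instead.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form holds for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this sketch only handles positive even degrees of freedom")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def hosmer_lemeshow(y_true, y_prob, groups=10):
    """Return (statistic, p_value) for binary outcomes and predicted probabilities."""
    # Order the observations by predicted probability, then cut them into groups.
    pairs = sorted(zip(y_prob, y_true))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        observed = sum(y for _, y in chunk)   # observed number of events
        expected = sum(p for p, _ in chunk)   # expected number of events
        mean_p = expected / len(chunk)
        denom = len(chunk) * mean_p * (1 - mean_p)
        if denom > 0:
            stat += (observed - expected) ** 2 / denom
    return stat, chi2_sf_even_df(stat, groups - 2)

# Made-up, perfectly calibrated data: 20%, 40%, 60%, 80% predicted risk.
probs = [0.2] * 10 + [0.4] * 10 + [0.6] * 10 + [0.8] * 10
outcomes = ([1] * 2 + [0] * 8 + [1] * 4 + [0] * 6
            + [1] * 6 + [0] * 4 + [1] * 8 + [0] * 2)
statistic, p_value = hosmer_lemeshow(outcomes, probs, groups=4)
print(statistic, p_value)
```

Because the observed and expected counts agree in every group here, the statistic is essentially zero and the p-value is close to 1, so there is no evidence of poor fit.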
The output returns a chi-square value (a Hosmer-Lemeshow chi-squared) and a p-value (e.g. Pr > ChiSq). Small p-values (usually under 5%) mean that the model is a poor fit. But large p-values don't necessarily mean that your model is a good fit, just that there isn't enough evidence to say it's a poor fit. Many situations can cause large p-values, including poor test power. Low power is one of the reasons this test has been highly criticized.

[Resource 1](https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/what-is-logistic-regression/)
[Resource 2](http://sub.chimei.org.tw/57300/images/05_research/1071019.pdf)
[Resource 3](https://www.statisticshowto.com/hosmer-lemeshow-test/)

---

## Name: 孟青
### Cross tabulation (contingency table)
* is used to quantitatively analyze the relationship between multiple variables.
* is most often used to analyse categorical (nominal measurement scale) data.
* allows for the identification of patterns, trends, and probabilities within datasets.

![](https://i.imgur.com/RxWIU7m.png)
> A typical cross-tabulation table compares two hypothetical variables, in this case "City of Residence" with "Favorite Baseball Team"

* helps uncover one or more variables that affect a specific result or can aid in improving a specific outcome.
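
The cross tabulation described above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `cross_tabulate`, the field names, and the survey responses are all made up, and a data-analysis library would normally do this for you.

```python
from collections import Counter

def cross_tabulate(rows, row_key, col_key):
    """Count how often each (row value, column value) pair occurs."""
    counts = Counter((r[row_key], r[col_key]) for r in rows)
    row_values = sorted({r for r, _ in counts})
    col_values = sorted({c for _, c in counts})
    return {r: {c: counts[(r, c)] for c in col_values} for r in row_values}

# Hypothetical survey responses (made up for illustration).
responses = [
    {"city": "Boston", "team": "Red Sox"},
    {"city": "Boston", "team": "Red Sox"},
    {"city": "Boston", "team": "Yankees"},
    {"city": "New York", "team": "Yankees"},
    {"city": "New York", "team": "Yankees"},
    {"city": "New York", "team": "Mets"},
]

table = cross_tabulate(responses, "city", "team")
for city, row in table.items():
    total = sum(row.values())
    # Row percentages: the share of each team within one city.
    shares = {team: f"{100 * n / total:.0f}%" for team, n in row.items()}
    print(city, row, shares)
```

Each cell holds the count for one subgroup, and the row percentages show how the answers are distributed within each city, which is the subgroup view that a plain frequency table does not give.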
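[Source](https://www.qualtrics.com/au/experience-management/research/cross-tabulation/)

### Logistic regression
* Derived from linear regression; the goal is to find a line that cleanly separates the data into classes.
* In linear regression the dependent variable (Y) is usually continuous, whereas in logistic regression the dependent variable (Y) is mainly categorical (the independent variables x can be either categorical or continuous).
* ![](https://i.imgur.com/PdQXf2p.png) ![](https://i.imgur.com/kePTnPl.png)
* The regression equation is built on the logarithm of the odds (the logit).
<img src="https://i.imgur.com/ul4njkB.png" width=80%>
> The logistic regression parameters (w) are estimated with Maximum Likelihood Estimation (MLE).
* The sigmoid function transforms the result so that it lies between 0 and 1 --> a probability
<img src="https://i.imgur.com/GwOarDD.png" width=60%>
> The sigmoid is steepest in the middle, so the transformed probabilities are pushed towards 0% or 100%, which makes the binary classification effective.
#### Multivariable logistic regression
* multivariate analysis refers to statistical models with two or more dependent (outcome) variables.
* Multivariate data usually come from a longitudinal study, in which the same individual is measured at several time points (repeated measures). Alternatively, multivariate can refer to clustered/nested data, in which each cluster contains several individuals.
* multivariable analysis refers to statistical models with multiple independent (predictor or explanatory) variables.
* multivariable model: one dependent outcome and multiple independent (predictor or explanatory) variables.
* multivariate logistic regression:
![](https://i.imgur.com/LWEc7Ig.png)
> $X_1$, $X_2$, ...,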
$X_n$

[source](https://ithelp.ithome.com.tw/m/articles/10269006)
[source](https://chih-sheng-huang821.medium.com/機器-統計學習-羅吉斯回歸-logistic-regression-aff7a830fb5d)
[source](https://pyecontech.com/2020/01/04/logistic_regression/)

---

## Name: 雞丁
### Logistic regression
* Used for binary classification (1/0)
* Tries to solve the problems of the linear probability model
    * Unbounded predicted probabilities:
        * because the prediction is linear, predicted probabilities can exceed 1 or fall below 0
        ![](https://i.imgur.com/klNPJvD.png)
    * Non-normality of the error term
    * Heteroskedasticity
* The dependent variable of linear regression is continuous; the dependent variable of logistic regression is discrete.
* The cumulative distribution function (CDF) of the logistic distribution gives an S-shaped curve, which confines the predicted value of the dependent variable to the range 0~1.
* The logit is the logarithm of the odds; the sigmoid function then converts the logit back into a value between 0 and 1:
$logit(p_i)=\ln\left({{p_i} \over {1-p_i}}\right)$
    * Sigmoid Function
    ![](https://i.imgur.com/H9d9tBQ.png)
* Compare the curves of linear regression and logistic regression
![](https://i.imgur.com/T9yTge8.png)
#### Multivariable logistic regression
* An extension of logistic regression in which each dependent variable (y) is paired with several independent variables (x).
#### Multivariate logistic regression
* An extension of logistic regression in which more than one dependent variable (y) is modelled from the independent variables (x).
### Cross-tabulation
* Cross tabulation is a commonly used table for summarising data by category.
* It can be used for the quantitative analysis of data with multiple variables.
* It lets you look at the relevant details more closely.
* Ordinary statistical tables mostly show only totals for the whole group, whereas a cross tabulation also shows information about subgroups.
* A cross tabulation therefore gives a more fine-grained view and lets the analysis dig into information that is often overlooked.
* Take the cross tabulation below as an example. When we look at Male against Very happy, an ordinary table gives only the Frequency, but the cross tabulation also shows what percentage of the Very happy respondents are Male and Female (Row %), and what percentage of Males are Very happy, Pretty happy, and Not too happy (Column %).
![](https://i.imgur.com/k113tMP.png)
### References
* [3 Main Linear Probability Model (LPM) Problems](https://www.dummies.com/article/business-careers-money/business/economics/3-main-linear-probability-model-lpm-problems-156466/)
* [How simple logistic regression differs from simple linear
regression](https://www.graphpad.com/guides/prism/latest/curve-fitting/reg_simple_logistic_and_linear_difference.htm)
* [YouTube: Linear Probability Model (LPM) and Logistic Regression](https://youtu.be/KDGdIrLvALk)
* [YouTube: Machine Learning Part One: Introduction to Logistic Regression](https://youtu.be/vtMrtzYrPDI)
* [Linear Regression vs. Logistic Regression](https://www.dummies.com/article/technology/information-technology/data-science/general-data-science/linear-regression-vs-logistic-regression-268328/)
* [Baidu Baike: Cross-tabulation](https://www.easyatm.com.tw/wiki/%E4%BA%A4%E5%8F%89%E8%A1%A8)
* [How Cross Tabulation Makes Your Data More Actionable](https://www.alchemer.com/resources/blog/cross-tabulation/)

---

## Name: 譚欣
### multivariable logistic regression
#### multivariable model
- 1 dependent variable vs more than 1 independent variable
- one outcome with several covariates
![](https://i.imgur.com/QrJeaRJ.png)
- multivariate model
    - more than one outcome; the outcome is a vector
- some multivariable models
![](https://i.imgur.com/PN7Wfzy.png)
    - linear regression: continuous
    - logistic regression: binary
    - cox regression: time to event (with t0 as baseline)
#### logistic regression model
- Its purpose is to find a line that classifies all the data
- It is a classification model: the outcome is 1 or 0
- The model's output is a probability
- It evaluates the effect size of a variable (expressed as an OR)
- After the function above (all the X values multiplied by their weights), a sigmoid function turns the line into a smooth S-shaped curve bounded between 0 and 1.
![](https://i.imgur.com/rQi7XEK.png)
- Whenever z > 0, the probability of A is > 0.5 and the case is assigned to that class; otherwise it is assigned to the other class.
- After that, cross entropy is used to measure how close the model's output is to the target; when there is error, the weights are adjusted and the process is repeated
- Comparison of LR and linear regression
![](https://i.imgur.com/hKfhmwU.png)

[source](https://medium.com/jameslearningnote/資料分析-機器學習-第3-3講-線性分類-邏輯斯回歸-logistic-regression-介紹-a1a5f47017e5)
[source](https://ithelp.ithome.com.tw/m/articles/10269006)

Reference: Stuart W Grant, Graeme L Hickey, Stuart J Head, Statistical primer: multivariable regression considerations and pitfalls, European Journal of Cardio-Thoracic Surgery, Volume 55, Issue 2, February 2019, Pages 179–185, https://doi.org/10.1093/ejcts/ezy403

### subjunctive mood
- indicative mood: real
    - to state a fact and to make a declaration
- subjunctive mood: probability, subjectivity, doubt
    - how the speaker feels, the speaker's attitude
    - expresses a wish, intent, or command
    - usually used in subordinate clauses, though in some sentences (with probability expressed by an adverb) it can be the main verb

[source](https://www.inmsol.com/spanish-grammar/subjunctive-mood/)

---

## Name: 宜君
### Logistic regression
* linear regression vs. logistic regression (LR):
    * linear regression: how many units the dependent variable increases when the independent variable increases by one unit
    * LR: by what factor the odds of the dependent variable being 1 rather than 0 change when the independent variable increases by one unit
    * Odds ratio: the factor by which the odds of the event (event versus non-event) change when the independent variable increases by one unit
* A kind of classification model, commonly used for binary classification; e.g. when classifying into A and B, an observation is assigned to A when its estimated probability of being A is higher
* Goal: find a line that splits the data into two classes, as sketched below:
<img src="https://i.imgur.com/c3e8dqk.png" width='85%'>
* The LR model is shown below:
![](https://i.imgur.com/0fTXLgo.png)
* The sigmoid function in LR squeezes the output value, i.e. the y value, into the range 0-1, which matches the range of a probability
<img src="https://i.imgur.com/dgsjLnQ.png" width='80%'>
* LR calculation steps:
    * Step 1: take the log of the odds to obtain the y value
    * Step 2: use the sigmoid function to squeeze the value into the range 0-1
    * Step 3: a final result above 0.5 (50%) is predicted as 1, and a result below 0.5 is predicted as 0

[Reference 1](https://medium.com/jameslearningnote/資料分析-機器學習-第3-3講-線性分類-邏輯斯回歸-logistic-regression-介紹-a1a5f47017e5)
[Reference 2](https://ryanisagoodguy.blogspot.com/2015/08/logistic-regression.html)
[Reference 3](https://matters.news/@CHWang/95921-machine-learning-給自己的機器學習筆-logistic-regression邏輯迴歸-二元分類問題-原理詳細介紹-bafyreiettlsnp4azq5dqwyubb5w76j5t4x4pxxhnuteofrqmqszg4jbuve)
### Multivariable logistic regression
* multivariable analysis vs.
multivariate analysis
    * multivariable analysis: one dependent variable (outcome) and multiple independent variables
    * multivariate analysis: more than one dependent variable and multiple independent variables
* multivariable analysis
<img src="https://i.imgur.com/dphuFbI.png" width='75%'>
* Three types of multivariable regression model
    * linear regression (continuous variables)
    * logistic regression (binary variables)
    * cox regression (time-to-event outcomes)

[Reference 1](https://academic.oup.com/ntr/article/23/8/1446/5812038)
Ebrahimi Kalan, M., Jebai, R., Zarafshan, E., & Bursac, Z. (2021). Distinction between two statistical terms: multivariable and multivariate logistic regression. *Nicotine and Tobacco Research, 23*(8), 1446-1447.
[Reference 2](https://academic.oup.com/ejcts/article/55/2/179/5265263)
Grant, S. W., Hickey, G. L., & Head, S. J. (2019). Statistical primer: multivariable regression considerations and pitfalls. *European Journal of Cardio-Thoracic Surgery, 55*(2), 179-185.

---

## Name: 盈蓓
### Logistic Regression
* A classification model.
* The aim is to find the line that best separates the data into classes.
* The computed value is a probability between 0 and 1.
* It differs from linear regression in that linear regression tries to find the straight line that best fits the distribution of the data, whereas logistic regression performs binary classification of the data.
![](https://i.imgur.com/GQIi46r.png)
* **Sigmoid function (Logistic Function)**: a function whose y value lies between 0 and 1 (matching the requirement that a probability lies between 0 and 1)
![](https://i.imgur.com/n1CVtQV.png)

[Reference 1](https://medium.com/jameslearningnote/%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90-%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E7%AC%AC3-3%E8%AC%9B-%E7%B7%9A%E6%80%A7%E5%88%86%E9%A1%9E-%E9%82%8F%E8%BC%AF%E6%96%AF%E5%9B%9E%E6%AD%B8-logistic-regression-%E4%BB%8B%E7%B4%B9-a1a5f47017e5)
[Reference 2](https://ithelp.ithome.com.tw/m/articles/10269006)
### Multivariable Logistic Regression
* **Multivariable vs.
Multivariate**
    * only one dependent variable vs. several dependent variables
    * both have multiple independent variables
* Several independent variables are used to find the best-fitting line.
* For ordinary logistic regression the formula is:
$$\operatorname{logit}[\pi(x)] = \log\left(\frac{\pi(x)}{1-\pi(x)}\right) = \beta_0+\beta_1x$$
* For multivariable logistic regression:
$$\log\left(\frac{\pi(x_i)}{1-\pi(x_i)}\right) = \beta_0+\beta_1X_1+\beta_2X_2+...+\beta_nX_n$$

[Reference](https://academic.oup.com/ntr/article/23/8/1446/5812038)
## Name: 柏瑄
### Odds ratio vs Pearson's correlation
- Both measure effect size
- The odds ratio compares the odds of a specific outcome in two groups
- Pearson's correlation measures the strength of the linear association between two factors, i.e. how much one factor can be explained by the other
### multivariate vs. multivariable logistic regression
- multivariable: one dependent outcome & multiple independent variables
- multivariate: **multiple** dependent outcomes & multiple independent variables
### Subjunctive mood
The subjunctive is no longer very visible in English, but it is still common in the Romance languages. It is often used to express the speaker's uncertainty about an event, or a mismatch with the facts. The English present subjunctive has the same form as the bare verb (for most verbs identical to the first-person singular):

He suggests that the man **go** home right now.

In French the subjunctive appears only in subordinate clauses:

Je souhaite que tu sois/\*es là.
"I wish you were here."
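(*es* indicative; *sois* subjunctive)

Spanish has no such restriction, so using the subjunctive in Spanish can change the meaning.

---

## 宜庭
### Regression analysis
* Regression analysis examines the relationship between one or more predictor variables and a response variable.
* Response variable: the most important variable in any regression analysis; it is the variable the researcher is interested in, also called the dependent variable or response.
* Predictor variable: a variable assumed to possibly affect the response variable, also called an independent variable or predictor.
* Multivariate vs Multivariable
    * Multivariate Regression: a statistical model with two or more response variables (dependent variables).
    ![](https://i.imgur.com/uGgVQGz.png)
    * Multivariable Regression: a statistical model with multiple predictor variables (independent variables)
    ![](https://i.imgur.com/5KufJgJ.png)
    > X: independent variable
    > Y: dependent variable

[Reference 1](http://sub.chimei.org.tw/57300/images/05_research/1071019.pdf)
### Logistic Regression
* Derived from linear regression; it is a classification model whose goal is to find a line that divides the data.
* How Logistic Regression works:
![](https://i.imgur.com/hb63Ad3.png)
> *w*: weight
> *b*: bias
> Each input is multiplied by its *w*, the products are summed, and *b* is added to give *z*. Passing *z* through the sigmoid function then yields the posterior probability as output.

![](https://i.imgur.com/83OXjDz.png)
The probability has already been transformed by the sigmoid function when it is computed. After taking the log, the result is linear.
* Logistic Function (Sigmoid Function): keeps the output value (y value) between 0 and 1 → ==a probability==.
![](https://i.imgur.com/myPVquB.png)

[Reference 2](https://books.google.com/books?hl=zh-TW&lr=&id=Qf8XBQAAQBAJ&oi=fnd&pg=PA487&dq=speelman+2014+logistic+regression&ots=dw9oi9YTTb&sig=N3bVAQS3MXxCbXLxfTSIdrcWlcI)
[Reference 3](https://medium.com/jameslearningnote/%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90-%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E7%AC%AC3-3%E8%AC%9B-%E7%B7%9A%E6%80%A7%E5%88%86%E9%A1%9E-%E9%82%8F%E8%BC%AF%E6%96%AF%E5%9B%9E%E6%AD%B8-logistic-regression-%E4%BB%8B%E7%B4%B9-a1a5f47017e5)
[Reference 4](https://ithelp.ithome.com.tw/m/articles/10269006)
## Name: Milan
### If you need headings below, use three or more # signs
### Logistic regression
* **def**: analyses and explains the relationship between a nominal-scale response variable and one or more explanatory variables; the basic assumptions are similar to those of linear regression
* **purpose**: to find the relationship between a categorical response variable and the explanatory variables; the biggest difference from simple linear regression is therefore the type of the response variable (it is used when a categorical variable is to be predicted)
* **characteristic**: collinearity among the explanatory variables should be avoided, and the model's basic assumptions should be met (normality is not one of them, as the presumption item notes)
    * the variable being analysed must not be continuous
* **presumption**
    * The effect of the independent variables on the dependent variable changes exponentially, so the normality assumption is not needed.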
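![](https://i.imgur.com/CAUNVgt.png)
* Let the dependent variable Y be a binary response (success or failure), and let p be its probability of success, which depends on the independent variable x
$$P(Y = \text{success}) = p = \frac{e^{f(x)}}{1+e^{f(x)}}$$
$$P(Y = \text{failure}) = 1-p$$
* Odds: $\frac{p}{1-p} = e^{f(x)}$
### Odds ratio
* odds: probability that the event occurs / probability that the event does not occur
* odds ratio (OR): the odds of the outcome under a specific condition, relative to the odds of the outcome without that condition.
<div align="center"> <img src="https://i.imgur.com/Rzx4GBk.png" width=60% > </div>

* $$ OR = \frac{a/c}{b/d} = \frac{ad}{bc} $$
    * OR = 1: the condition does not affect the odds of the outcome
    * OR > 1: the condition is associated with higher odds of the outcome
    * OR < 1: the condition is associated with lower odds of the outcome
* The odds ratio can be used to determine whether a specific condition affects the outcome of an event, and to compare the size of the different influencing factors

ref: 宜庭 in week2
###### tags: `QL`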
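
As a worked illustration of the odds and odds ratio formulas in the notes above, here is a minimal Python sketch. The cell layout is an assumption inferred from OR = (a/c)/(b/d): a and c are the counts with and without the outcome under the condition, and b and d are the same counts without the condition. The counts themselves are made up.

```python
import math

def probability(fx):
    """p = e^f(x) / (1 + e^f(x)), the logistic probability of success."""
    return math.exp(fx) / (1 + math.exp(fx))

def odds(events, non_events):
    """Odds = number of events / number of non-events."""
    return events / non_events

def odds_ratio(a, b, c, d):
    """OR = (a/c) / (b/d) = ad / bc for a 2x2 table (layout assumed, see above)."""
    return (a * d) / (b * c)

# Hypothetical counts: 70 of 100 people with the condition have the outcome,
# 40 of 100 people without the condition have the outcome.
a, b, c, d = 70, 40, 30, 60
print(odds(a, c), odds(b, d), odds_ratio(a, b, c, d))

# The odds p / (1 - p) recover e^f(x), matching the formula for the odds.
p = probability(0.8)
print(p / (1 - p), math.exp(0.8))
```

With these counts the odds are about 2.33 and 0.67, so the odds ratio is 3.5; an OR above 1 means the condition goes with higher odds of the outcome.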