# Logistic regression
What is the difference between linear regression and logistic regression?
* Logistic regression deals with classification problems, where $Y_i \in \{0,1\}$ and we model $P_i = P(Y_i = 1)$.
* Unlike linear regression, where $Y_i \overset{\text{iid}}{\sim} N(\mu, \sigma^2)$, logistic regression is built on the logistic distribution.
> Logistic distribution
$$\begin{equation}
\begin{split}
\text{cdf: } F(x|\mu, s) &= \frac{1}{1 + e^{-\frac{x-\mu}{s}}}\\
\text{pdf: } f(x|\mu, s) &= \frac{e^{-\frac{x-\mu}{s}}}{s(1 + e^{-\frac{x-\mu}{s}})^2}
\end{split}
\end{equation}$$
[reference](https://www.sciencedirect.com/topics/mathematics/logistic-distribution)
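To check these two formulas numerically, here is a minimal sketch (assuming `numpy` and `scipy` are available; `mu`, `s`, and the grid `x` are arbitrary illustrative values) comparing them against `scipy.stats.logistic`:
```python
import numpy as np
from scipy.stats import logistic

mu, s = 0.5, 2.0                 # hypothetical location and scale parameters
x = np.linspace(-5, 5, 11)

z = (x - mu) / s
cdf = 1 / (1 + np.exp(-z))                     # F(x | mu, s) from the formula above
pdf = np.exp(-z) / (s * (1 + np.exp(-z))**2)   # f(x | mu, s) from the formula above

assert np.allclose(cdf, logistic.cdf(x, loc=mu, scale=s))
assert np.allclose(pdf, logistic.pdf(x, loc=mu, scale=s))
```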
## Logistic regression $P_i$
Logistic regression defines the success probability $P_i = P(Y_i = 1)$; the failure probability is $1-P_i$, and the odds are $\frac{P_i}{1-P_i}$.
$$\begin{equation}
\begin{split}
P_i &= \frac{1}{1+e^{-(\beta_0+\beta_1x)}} = \frac{e^{(\beta_0+\beta_1x)}}{1+e^{(\beta_0+\beta_1x)}}\\
odds&=\frac{P_i}{1-P_i}=e^{\beta_0+\beta_1x} \\
ln\space odds &= ln\frac{P_i}{1-P_i}=\beta_0+\beta_1x
\end{split}
\end{equation}$$
Therefore, logistic regression is formulated as
$$\text{logit}(P_i)=ln\frac{P_i}{1-P_i}=\beta_0+\beta_1x$$
Note that an observed target of $y_i = 1$ or $y_i = 0$ gives $ln\frac{1}{0}$ or $ln\frac{0}{1}$, i.e. $\pm\infty$. Therefore, we can't use OLS to estimate $\beta$. For instance,
$$\begin{equation}
\begin{split}
RSS=\sum^{N}_{i=1} (y_i - \hat{y_i})^2 &\Rightarrow \sum^{N}_{i=1} (ln \frac{p}{1-p} - ln\frac{\hat{p}}{1-\hat{p}})^2\\
&=\sum^{N}_{i=1} (ln \frac{1}{0} - ln\frac{\hat{p}}{1-\hat{p}})^2\Rightarrow \text{undefined, so OLS can't proceed}
\end{split}
\end{equation}$$
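As a quick numeric illustration, here is a minimal sketch (assuming `numpy`; `beta0`, `beta1`, and `x` are arbitrary) of the sigmoid/odds/log-odds identities, and of why the log-odds of an observed target that is exactly 0 or 1 blows up:
```python
import numpy as np

beta0, beta1 = -1.0, 2.0                     # hypothetical coefficients
x = 0.7
p = 1 / (1 + np.exp(-(beta0 + beta1 * x)))   # P_i from the sigmoid form
odds = p / (1 - p)
print(np.log(odds), beta0 + beta1 * x)       # equal: ln(odds) = beta0 + beta1*x

# Observed targets are exactly 0 or 1, so their log-odds are +/- infinity:
y = np.array([0.0, 1.0])
with np.errstate(divide="ignore"):
    print(np.log(y / (1 - y)))               # [-inf  inf] -> the RSS above is undefined
```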
## MLE for logistic regression
### Likelihood Function for Logistic Regression
$$
\begin{equation}
\begin{split}
L(\beta_0,\beta_1)&= \prod^N_{i=1}p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\\
l(\beta_0,\beta_1)&= \sum^N_{i=1}y_i\space ln(p(x_i)) + (1-y_i)\space ln(1-p(x_i))\\\\
&\text{where $l(\beta_0,\beta_1)$ is the log-likelihood}
\end{split}
\end{equation}$$
Let’s substitute $p(x_i)$ with its exponential form:
$$
\begin{equation}
\begin{split}
l(\beta_0,\beta_1)&= \sum^N_{i=1}y_i\space ln(p(x_i)) + (1-y_i)\space ln(1-p(x_i))\\
l(\beta_0,\beta_1)&=\sum^N_{i=1}y_i\space ln(p(x_i)) + ln(1-p(x_i))-y_i ln(1-p(x_i))\\
&=\sum^N_{i=1}y_i\left[ ln(p(x_i))-ln(1-p(x_i))\right]+ln(1-p(x_i))\\
&= \sum^N_{i=1}y_i\space \left[ln(\frac{p(x_i)}{1-p(x_i)})\right]+ln(1-p(x_i))\\
&=\sum^N_{i=1}y_i\space \left[ \beta_0+\beta_1x_i\right]+ln(\frac{1}{1+e^{\beta_0+\beta_1x_i}})\\
&=\sum^N_{i=1}y_i\space \left[ \beta_0+\beta_1x_i\right]-ln(1+e^{\beta_0+\beta_1x_i})
\end{split}
\end{equation}$$
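The algebra above can be verified numerically; a minimal sketch (assuming `numpy`; the data and the trial coefficients are made up) showing that the simplified form equals the original Bernoulli log-likelihood:
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)                   # made-up covariate
y = rng.integers(0, 2, size=20)           # made-up 0/1 targets
b0, b1 = 0.3, -1.2                        # arbitrary trial coefficients

eta = b0 + b1 * x                         # linear predictor beta0 + beta1*x_i
p = 1 / (1 + np.exp(-eta))

ll_original   = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
ll_simplified = np.sum(y * eta - np.log(1 + np.exp(eta)))
assert np.isclose(ll_original, ll_simplified)
```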
### Maximizing Log-likelihood function
Solve the system of equations
$$
\begin{equation}
\begin{cases}
\frac{\partial l(\beta_0,\beta_1)}{\partial \beta_1}=0\\[4pt]
\frac{\partial l(\beta_0,\beta_1)}{\partial \beta_0}=0
\end{cases}
\end{equation}
$$
$$
\begin{equation}
\begin{cases}
\frac{\partial l(\beta_0,\beta_1)}{\partial \beta_1}=\sum^N_{i=1} y_ix_i-x_i\frac{e^{\beta_0+\beta_1x_i}}{1+e^{\beta_0+\beta_1x_i}}=\sum^N_{i=1} x_i(y_i-p(x_i))=0\\[4pt]
\frac{\partial l(\beta_0,\beta_1)}{\partial \beta_0}=\sum^N_{i=1} y_i-\frac{e^{\beta_0+\beta_1x_i}}{1+e^{\beta_0+\beta_1x_i}}=\sum^N_{i=1}(y_i-p(x_i))=0
\end{cases}
\end{equation}
$$
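These score equations say that, at the MLE, the residuals $y_i - p(x_i)$ are orthogonal to the intercept and to $x$. A minimal sketch (assuming `numpy`; the data are simulated for illustration) of evaluating the two partial derivatives at an arbitrary point:
```python
import numpy as np

def gradient(b0, b1, x, y):
    """Partial derivatives of the log-likelihood with respect to beta_0 and beta_1."""
    p = 1 / (1 + np.exp(-(b0 + b1 * x)))
    return np.array([np.sum(y - p),          # d l / d beta_0
                     np.sum(x * (y - p))])   # d l / d beta_1

# At the MLE both components are (numerically) zero; at an arbitrary point they are not.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = rng.binomial(1, 1 / (1 + np.exp(-x)))
print(gradient(0.0, 0.0, x, y))
```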
This system has no closed-form solution. However, Newton's method, a second-order optimization algorithm, can find the best $\beta$ for our logistic function in fewer iterations than batch gradient descent.
The generalization of Newton’s method to a multidimensional setting (also called the Newton-Raphson method) is given by:
$$\beta := \beta - H^{-1}\nabla_{\beta}\,l(\beta)$$
where $H$ is the Hessian matrix
$$H_{ij} = \frac{\partial^2 l(\beta)}{\partial \beta_i\partial \beta_j}$$
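Below is a minimal sketch (assuming `numpy`; the synthetic data and the "true" coefficients $(0.5, 2.0)$ are made up for illustration) of this Newton-Raphson update applied to simple logistic regression:
```python
import numpy as np

def fit_logistic_newton(x, y, n_iter=25):
    """Newton-Raphson for simple logistic regression: beta := beta - H^{-1} grad l(beta)."""
    X = np.column_stack([np.ones_like(x), x])        # design matrix [1, x_i]
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))              # p(x_i) at the current beta
        grad = X.T @ (y - p)                         # gradient of the log-likelihood
        H = -X.T @ (X * (p * (1 - p))[:, None])      # Hessian of the log-likelihood
        beta = beta - np.linalg.solve(H, grad)       # Newton-Raphson step
    return beta

# Usage on synthetic data with made-up "true" coefficients (0.5, 2.0):
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 2.0 * x))))
print(fit_logistic_newton(x, y))                     # roughly [0.5, 2.0]
```
Because the log-likelihood is concave, the Hessian is negative definite and these iterations converge quickly from the zero start.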
# Another way to consider the logistic problem
Create a Linear Probability Model (LPM):
$$\hat{y_i}=\beta_0+\beta_1x_i$$
we hope that
$$E(y_i) = \beta_0+\beta_1x_i = P(Y_i=1)$$
| Probability | $y_i$ | $\varepsilon_i = y_i-E(y_i)$ |
| ------------------- | ----- | ---------------------------- |
| $P_i=E(Y_i)$ | 1 | $1-P_i$ |
| $1- P_i =1- E(Y_i)$ | 0 | $-P_i$ |

where $y_i \sim \text{Bernoulli}(P_i)$.
The LPM suffers from heteroscedasticity:
$$
\begin{equation}
\begin{split}
E(\varepsilon_i) &= (1-P_i)P_i+(-P_i)(1-P_i) =0\\
E(\varepsilon_i^2) &= (1-P_i)^2P_i+(-P_i)^2(1-P_i) = P_i(1-P_i)\\
Var(\varepsilon_i)&= P_i(1-P_i)\\
\end{split}
\end{equation}$$
where $Var(\varepsilon_i)= P_i(1-P_i)$ depends on $x_i$ through $P_i$, so the error variance is not constant across observations.
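A minimal simulation sketch (assuming `numpy`; the chosen $P_i$ values are arbitrary) illustrating that the error variance $P_i(1-P_i)$ changes with $P_i$, i.e. with $x_i$:
```python
import numpy as np

rng = np.random.default_rng(2)
for P in (0.1, 0.5, 0.9):                       # hypothetical values of P_i = beta0 + beta1*x_i
    y = rng.binomial(1, P, size=100_000)
    eps = y - P                                 # epsilon_i = y_i - E(y_i)
    print(P, eps.var().round(4), P * (1 - P))   # empirical Var(eps) ~= P(1-P), not constant
```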