# Logistic regression
What is the difference between linear regression and logistic regression?
* Logistic regression deals with classification problems, where $Y_i \in \{0,1\}$ and we model $P_i = P(Y_i = 1)$.
* Unlike linear regression, where $Y_i \overset{\text{iid}}{\sim} N(\mu, \sigma^2)$, logistic regression is built on the logistic distribution.
> Logistic distribution
$$\begin{equation}
\begin{split}
\text{cdf: } F(x|\mu, s) &= \frac{1}{1 + e^{-\frac{x-\mu}{s}}}\\
\text{pdf: } f(x|\mu, s) &= \frac{e^{-\frac{x-\mu}{s}}}{s(1 + e^{-\frac{x-\mu}{s}})^2}
\end{split}
\end{equation}$$
[reference](https://www.sciencedirect.com/topics/mathematics/logistic-distribution)
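To check these two formulas numerically, here is a minimal sketch (assuming `numpy` and `scipy` are available; `mu`, `s`, and the grid `x` are arbitrary illustrative values) comparing them against `scipy.stats.logistic`:
```python
import numpy as np
from scipy.stats import logistic

mu, s = 0.5, 2.0                 # hypothetical location and scale parameters
x = np.linspace(-5, 5, 11)

z = (x - mu) / s
cdf = 1 / (1 + np.exp(-z))                     # F(x | mu, s) from the formula above
pdf = np.exp(-z) / (s * (1 + np.exp(-z))**2)   # f(x | mu, s) from the formula above

assert np.allclose(cdf, logistic.cdf(x, loc=mu, scale=s))
assert np.allclose(pdf, logistic.pdf(x, loc=mu, scale=s))
```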
## Logistic regression $P_i$
Logistic regression defines the success probability $P_i = P(Y_i = 1)$; the failure probability is $1-P_i$, and the odds are $\frac{P_i}{1-P_i}$.
$$\begin{equation}
\begin{split}
P_i &= \frac{1}{1+e^{-(\beta_0+\beta_1x)}} = \frac{e^{(\beta_0+\beta_1x)}}{1+e^{(\beta_0+\beta_1x)}}\\
odds&=\frac{P_i}{1-P_i}=e^{\beta_0+\beta_1x} \\
ln\space odds &= ln\frac{P_i}{1-P_i}=\beta_0+\beta_1x
\end{split}
\end{equation}$$
Therefore, logistic regression is formulated as
$$\text{logit}(P_i)=ln\frac{P_i}{1-P_i}=\beta_0+\beta_1x$$
Note that an observed target of $y_i = 1$ or $y_i = 0$ gives $ln\frac{1}{0}$ or $ln\frac{0}{1}$, i.e. $\pm\infty$. Therefore, we can't use OLS to estimate $\beta$. For instance,
$$\begin{equation}
\begin{split}
RSS=\sum^{N}_{i=1} (y_i - \hat{y_i})^2 &\Rightarrow \sum^{N}_{i=1} (ln \frac{p}{1-p} - ln\frac{\hat{p}}{1-\hat{p}})^2\\
&=\sum^{N}_{i=1} (ln \frac{1}{0} - ln\frac{\hat{p}}{1-\hat{p}})^2\Rightarrow \text{undefined, so OLS can't proceed}
\end{split}
\end{equation}$$
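As a quick numeric illustration, here is a minimal sketch (assuming `numpy`; `beta0`, `beta1`, and `x` are arbitrary) of the sigmoid/odds/log-odds identities, and of why the log-odds of an observed target that is exactly 0 or 1 blows up:
```python
import numpy as np

beta0, beta1 = -1.0, 2.0                     # hypothetical coefficients
x = 0.7
p = 1 / (1 + np.exp(-(beta0 + beta1 * x)))   # P_i from the sigmoid form
odds = p / (1 - p)
print(np.log(odds), beta0 + beta1 * x)       # equal: ln(odds) = beta0 + beta1*x

# Observed targets are exactly 0 or 1, so their log-odds are +/- infinity:
y = np.array([0.0, 1.0])
with np.errstate(divide="ignore"):
    print(np.log(y / (1 - y)))               # [-inf  inf] -> the RSS above is undefined
```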
## MLE for logistic regression
### Likelihood Function for Logistic Regression
$$
\begin{equation}
\begin{split}
L(\beta_0,\beta_1)&= \prod^N_{i=1}p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\\
l(\beta_0,\beta_1)&= \sum^N_{i=1}y_i\space ln(p(x_i)) + (1-y_i)\space ln(1-p(x_i))\\\\
&\text{where $l(\beta_0,\beta_1)$ is the log-likelihood}
\end{split}
\end{equation}$$
Let’s substitute $p(x_i)$ with its exponential form:
$$
\begin{equation}
\begin{split}
l(\beta_0,\beta_1)&= \sum^N_{i=1}y_i\space ln(p(x_i)) + (1-y_i)\space ln(1-p(x_i))\\
l(\beta_0,\beta_1)&=\sum^N_{i=1}y_i\space ln(p(x_i)) + ln(1-p(x_i))-y_i ln(1-p(x_i))\\
&=\sum^N_{i=1}y_i\left[ ln(p(x_i))-ln(1-p(x_i))\right]+ln(1-p(x_i))\\
&= \sum^N_{i=1}y_i\space \left[ln(\frac{p(x_i)}{1-p(x_i)})\right]+ln(1-p(x_i))\\
&=\sum^N_{i=1}y_i\space \left[ \beta_0+\beta_1x_i\right]+ln(\frac{1}{1+e^{\beta_0+\beta_1x_i}})\\
&=\sum^N_{i=1}y_i\space \left[ \beta_0+\beta_1x_i\right]-ln(1+e^{\beta_0+\beta_1x_i})
\end{split}
\end{equation}$$
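The algebra above can be verified numerically; a minimal sketch (assuming `numpy`; the data and the trial coefficients are made up) showing that the simplified form equals the original Bernoulli log-likelihood:
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)                   # made-up covariate
y = rng.integers(0, 2, size=20)           # made-up 0/1 targets
b0, b1 = 0.3, -1.2                        # arbitrary trial coefficients

eta = b0 + b1 * x                         # linear predictor beta0 + beta1*x_i
p = 1 / (1 + np.exp(-eta))

ll_original   = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
ll_simplified = np.sum(y * eta - np.log(1 + np.exp(eta)))
assert np.isclose(ll_original, ll_simplified)
```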
### Maximizing Log-likelihood function
Solve the system of equations
$$
\begin{equation}
\begin{cases}
\frac{\partial l(\beta_0,\beta_1)}{\partial \beta_1}=0\\[4pt]
\frac{\partial l(\beta_0,\beta_1)}{\partial \beta_0}=0
\end{cases}
\end{equation}
$$
$$
\begin{equation}
\begin{cases}
\frac{\partial l(\beta_0,\beta_1)}{\partial \beta_1}=\sum^N_{i=1} y_ix_i-x_i\frac{e^{\beta_0+\beta_1x_i}}{1+e^{\beta_0+\beta_1x_i}}=\sum^N_{i=1} x_i(y_i-p(x_i))=0\\[4pt]
\frac{\partial l(\beta_0,\beta_1)}{\partial \beta_0}=\sum^N_{i=1} y_i-\frac{e^{\beta_0+\beta_1x_i}}{1+e^{\beta_0+\beta_1x_i}}=\sum^N_{i=1}(y_i-p(x_i))=0
\end{cases}
\end{equation}
$$
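These score equations say that, at the MLE, the residuals $y_i - p(x_i)$ are orthogonal to the intercept and to $x$. A minimal sketch (assuming `numpy`; the data are simulated for illustration) of evaluating the two partial derivatives at an arbitrary point:
```python
import numpy as np

def gradient(b0, b1, x, y):
    """Partial derivatives of the log-likelihood with respect to beta_0 and beta_1."""
    p = 1 / (1 + np.exp(-(b0 + b1 * x)))
    return np.array([np.sum(y - p),          # d l / d beta_0
                     np.sum(x * (y - p))])   # d l / d beta_1

# At the MLE both components are (numerically) zero; at an arbitrary point they are not.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = rng.binomial(1, 1 / (1 + np.exp(-x)))
print(gradient(0.0, 0.0, x, y))
```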
This system has no closed-form solution. However, Newton's method, a second-order optimization algorithm, can find the best $\beta$ for our logistic function in fewer iterations than batch gradient descent.
The generalization of Newton’s method to a multidimensional setting (also called the Newton-Raphson method) is given by:
$$\beta := \beta - H^{-1}\nabla_{\beta}\,l(\beta)$$
where $H$ is the Hessian matrix
$$H_{ij} = \frac{\partial^2 l(\beta)}{\partial \beta_i\partial \beta_j}$$
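Below is a minimal sketch (assuming `numpy`; the synthetic data and the "true" coefficients $(0.5, 2.0)$ are made up for illustration) of this Newton-Raphson update applied to simple logistic regression:
```python
import numpy as np

def fit_logistic_newton(x, y, n_iter=25):
    """Newton-Raphson for simple logistic regression: beta := beta - H^{-1} grad l(beta)."""
    X = np.column_stack([np.ones_like(x), x])        # design matrix [1, x_i]
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))              # p(x_i) at the current beta
        grad = X.T @ (y - p)                         # gradient of the log-likelihood
        H = -X.T @ (X * (p * (1 - p))[:, None])      # Hessian of the log-likelihood
        beta = beta - np.linalg.solve(H, grad)       # Newton-Raphson step
    return beta

# Usage on synthetic data with made-up "true" coefficients (0.5, 2.0):
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 2.0 * x))))
print(fit_logistic_newton(x, y))                     # roughly [0.5, 2.0]
```
Because the log-likelihood is concave, the Hessian is negative definite and these iterations converge quickly from the zero start.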
# Another way to consider the logistic problem
Create a Linear Probability Model (LPM):
$$\hat{y_i}=\beta_0+\beta_1x_i$$
we hope that
$$E(y_i) = \beta_0+\beta_1x_i = P(Y_i=1)$$
| Probability | $y_i$ | $\varepsilon_i = y_i-E(y_i)$ |
| ------------------- | ----- | ---------------------------- |
| $P_i=E(Y_i)$ | 1 | $1-P_i$ |
| $1- P_i =1- E(Y_i)$ | 0 | $-P_i$ |

where $y_i \sim \text{Bernoulli}(P_i)$.
The LPM suffers from heteroscedasticity:
$$
\begin{equation}
\begin{split}
E(\varepsilon_i) &= (1-P_i)P_i+(-P_i)(1-P_i) =0\\
E(\varepsilon_i^2) &= (1-P_i)^2P_i+(-P_i)^2(1-P_i) = P_i(1-P_i)\\
Var(\varepsilon_i)&= P_i(1-P_i)\\
\end{split}
\end{equation}$$
where $Var(\varepsilon_i)= P_i(1-P_i)$ depends on $x_i$ through $P_i$, so the error variance is not constant across observations.
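A minimal simulation sketch (assuming `numpy`; the chosen $P_i$ values are arbitrary) illustrating that the error variance $P_i(1-P_i)$ changes with $P_i$, i.e. with $x_i$:
```python
import numpy as np

rng = np.random.default_rng(2)
for P in (0.1, 0.5, 0.9):                       # hypothetical values of P_i = beta0 + beta1*x_i
    y = rng.binomial(1, P, size=100_000)
    eps = y - P                                 # epsilon_i = y_i - E(y_i)
    print(P, eps.var().round(4), P * (1 - P))   # empirical Var(eps) ~= P(1-P), not constant
```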