# Gradient descent for Logistic regression (cont'd)

<!-- Put the link to this slide here so people can follow -->
slide: https://hackmd.io/@ccornwell/grad-desc-logistic-regression-2

---

<h3>Gradient of log-loss function</h3>

- <font size=+2>The partial derivative of the loss function w.r.t. $w_j$ is: $\frac{1}{m}\sum_ix_{i,j}(h({\bf x_i})-\overset{\circ}{y_i})$</font>
- <font size=+2>The parameters at which you evaluate the gradient are buried in $h({\bf x_i}) = \sigma({\bf w}\cdot{\bf x_i}+b)$.</font>

<font size=+2 style="color:#181818;">Observations:</font>

- <font size=+2 style="color:#181818;">$(h({\bf x_i})-\overset{\circ}{y_i})$ is positive when $y_i=-1$ and negative when $y_i=1$;</font>
- <font size=+2 style="color:#181818;">$-1 < (h({\bf x_i})-\overset{\circ}{y_i}) < 1$</font>
- <font size=+2 style="color:#181818;">The size of the $j^{th}$ coordinates of the data can affect the size of the gradient step.</font>

----

<h3>Gradient of log-loss function</h3>

- <font size=+2>The partial derivative of the loss function w.r.t. $w_j$ is: $\frac{1}{m}\sum_ix_{i,j}(h({\bf x_i})-\overset{\circ}{y_i})$</font>
- <font size=+2>The parameters at which you evaluate the gradient are buried in $h({\bf x_i}) = \sigma({\bf w}\cdot{\bf x_i}+b)$.</font>

<font size=+2>Observations:</font>

- <font size=+2>$(h({\bf x_i})-\overset{\circ}{y_i})$ is positive when $y_i=-1$ and negative when $y_i=1$;</font>
- <font size=+2>$-1 < (h({\bf x_i})-\overset{\circ}{y_i}) < 1$</font>
- <font size=+2>The size of the $j^{th}$ coordinates of the data can affect the size of the gradient step.</font>

---

<h3>Viewing the loss function</h3>

- <font size=+2>The log-loss averages $\log(1+e^{-y_iz_i})$, where $z_i = {\bf w}\cdot{\bf x_i}+b$.</font>
- <font size=+2>In other words, each ${\bf x_i}$ contributes to the average a value from either the left or the right side of the graph $y=\log(1+e^t)$:</font>

![](https://i.imgur.com/O0yw4y4.png =500x)
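
----

<h3>In code: gradient of the log-loss</h3>

A minimal NumPy sketch of the formulas above (not from the slides: the names `sigmoid` and `log_loss_and_grad`, and the layout where rows of `X` are the ${\bf x_i}$ and labels are in $\{-1,+1\}$, are my assumptions). It reads $\overset{\circ}{y_i}$ as the $0/1$ version of the $\pm1$ label $y_i$, which is what the observations on the gradient slide require.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_and_grad(w, b, X, y):
    """Average log-loss and its gradient in (w, b).

    X : (m, n) array whose rows are the data points x_i
    y : (m,)  array of labels in {-1, +1}
    """
    z = X @ w + b                              # z_i = w . x_i + b
    loss = np.mean(np.logaddexp(0.0, -y * z))  # (1/m) sum_i log(1 + e^{-y_i z_i})
    h = sigmoid(z)                             # h(x_i) = sigma(w . x_i + b)
    y01 = (y + 1) / 2                          # 0/1 labels, standing in for the "circled" y_i
    grad_w = X.T @ (h - y01) / len(y)          # j-th entry: (1/m) sum_i x_{i,j} (h(x_i) - y01_i)
    grad_b = np.mean(h - y01)                  # same derivation, with x_{i,j} replaced by 1
    return loss, grad_w, grad_b

# Tiny usage example on made-up data:
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = np.array([1, 1, 1, -1, -1, -1])
print(log_loss_and_grad(np.zeros(2), 0.0, X, y))
```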

---

<h3>Alternative loss function: ReLU</h3>

- <font size=+2>Define a function, the **Rectified Linear Unit**:</font>

> $\scriptsize\operatorname{ReLU}(z) = \begin{cases}z\ \text{if }z\ge0;\\ 0\ \text{otherwise.}\end{cases}$

- <font size=+2>For $|z_i|\gg 0$, $\quad\log(1+e^{-y_iz_i}) \approx \operatorname{ReLU}(-y_iz_i)$.</font>
- <font size=+2 style="color:#181818;">The derivative of $\operatorname{ReLU}$ is defined *except at* $z=0$.</font>

----

<h3>Alternative loss function: ReLU</h3>

- <font size=+2>Define a function, the **Rectified Linear Unit**:</font>

> $\scriptsize\operatorname{ReLU}(z) = \begin{cases}z\ \text{if }z\ge0;\\ 0\ \text{otherwise.}\end{cases}$

- <font size=+2>For $|z_i|\gg 0$, $\quad\log(1+e^{-y_iz_i}) \approx \operatorname{ReLU}(-y_iz_i)$.</font>
- <font size=+2>The derivative of $\operatorname{ReLU}$ is defined *except at* $z=0$.</font>

---

<h3>Alternative loss function: ReLU</h3>

- <font size=+2>When $y_i=1$, substitute $\operatorname{ReLU}(-z)$ for $\log(1+e^{-z})$ in the partial derivative computation, to get</font>

> <font style="color:#242424;">$\scriptsize\begin{cases}-x_{i,j}\ \text{if }{\bf w}\cdot{\bf x_i}+b<0\\ 0\qquad \text{otherwise}.\end{cases}$</font>

- <font size=+2 style="color:#181818;">When $y_i=-1$, the derivative is $x_{i,j}$ when ${\bf w}\cdot{\bf x_i}+b > 0$, and $0$ otherwise. So, overall, we get</font>

> <font style="color:#242424;">$\scriptsize\frac{\partial}{\partial w_j}\left(\frac{1}{m}\sum_i\operatorname{ReLU}(-y_iz_i)\right) = -\frac{1}{m}{\displaystyle\sum_{i:\ y_iz_i<0}(y_ix_{i,j})}$</font>

----

<h3>Alternative loss function: ReLU</h3>

- <font size=+2>When $y_i=1$, substitute $\operatorname{ReLU}(-z)$ for $\log(1+e^{-z})$ in the partial derivative computation, to get</font>

> $\scriptsize\begin{cases}-x_{i,j}\ \text{if }{\bf w}\cdot{\bf x_i}+b<0\\ 0\qquad \text{otherwise}.\end{cases}$

- <font size=+2>When $y_i=-1$, the derivative is $x_{i,j}$ when ${\bf w}\cdot{\bf x_i}+b > 0$, and $0$ otherwise. So, overall, we get</font>

> <font style="color:#242424;">$\scriptsize\frac{\partial}{\partial w_j}\left(\frac{1}{m}\sum_i\operatorname{ReLU}(-y_iz_i)\right) = -\frac{1}{m}{\displaystyle\sum_{i:\ y_iz_i<0}(y_ix_{i,j})}$</font>

----

<h3>Alternative loss function: ReLU</h3>

- <font size=+2>When $y_i=1$, substitute $\operatorname{ReLU}(-z)$ for $\log(1+e^{-z})$ in the partial derivative computation, to get</font>

> $\scriptsize\begin{cases}-x_{i,j}\ \text{if }{\bf w}\cdot{\bf x_i}+b<0\\ 0\qquad \text{otherwise}.\end{cases}$

- <font size=+2>When $y_i=-1$, the derivative is $x_{i,j}$ when ${\bf w}\cdot{\bf x_i}+b > 0$, and $0$ otherwise. So, overall, we get</font>

> $\scriptsize\frac{\partial}{\partial w_j}\left(\frac{1}{m}\sum_i\operatorname{ReLU}(-y_iz_i)\right) = -\frac{1}{m}{\displaystyle\sum_{i:\ y_iz_i<0}(y_ix_{i,j})}$
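
----

<h3>In code: gradient of the ReLU loss</h3>

A minimal NumPy sketch of the boxed formula, under the same assumed data layout as before (rows of `X` are the ${\bf x_i}$, labels in $\{-1,+1\}$); the name `relu_loss_grad` is mine. At the non-differentiable point $y_iz_i=0$ it contributes $0$, matching the strict inequality in the sum.

```python
import numpy as np

def relu_loss_grad(w, b, X, y):
    """Gradient in w of (1/m) * sum_i ReLU(-y_i z_i), where z_i = w . x_i + b."""
    z = X @ w + b
    wrong_side = (y * z) < 0              # the indices i with y_i z_i < 0
    # -(1/m) * sum over those i of y_i * x_{i,j}, for every j at once
    return -(X[wrong_side].T @ y[wrong_side]) / len(y)

# Tiny usage example on made-up data:
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = np.array([1, 1, 1, -1, -1, -1])
print(relu_loss_grad(np.array([0.5, -0.25]), 0.1, X, y))
```

Points with $y_iz_i>0$ contribute nothing, so only the points on the wrong side of the decision boundary move the parameters.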