# Neural nets: more considerations, training
<!-- Put the link to this slide here so people can follow -->
slide: https://hackmd.io/@ccornwell/neural-networks2
---
<h3>Neural network, binary classification</h3>

- <font size=+2 style="color:#181818;">$\sigma(z) = 1/(1+e^{-z})$, so $0<\sigma(z) < 1$.</font>
- <font size=+2 style="color:#181818;">The closer $\sigma(z)$ is to $1$, the more confident the prediction of the $+1$ class.</font>
----
<h3>Neural network, binary classification</h3>

- <font size=+2>$\sigma(z) = 1/(1+e^{-z})$, so $0<\sigma(z) < 1$.</font>
- <font size=+2>The closer $\sigma(z)$ is to $1$, the more confident the prediction of the $+1$ class (see the sketch below).</font>
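
A minimal NumPy sketch of the sigmoid output and the resulting prediction rule; the function and variable names here are illustrative, not taken from any particular library:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The larger z is, the closer sigmoid(z) is to 1,
# i.e. the more confident the prediction of the +1 class.
for z in [-3.0, 0.0, 3.0]:
    p = sigmoid(z)
    label = +1 if p >= 0.5 else -1   # threshold the probability at 0.5
    print(f"z = {z:+.1f}   sigmoid(z) = {p:.3f}   predicted class: {label:+d}")
```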
---
<h3>Multi-classification / vector output</h3>
- <font size=+2>If there are more than $2$ possible predicted classes, say $k$ classes.</font>

- <font size=+2 style="color:#181818;">The weights on edges going into the final layer $\leadsto$ some vector ${\bf z}\in\mathbb R^k$. Then $\operatorname{softmax}({\bf z})$ is defined to have $i^{th}$ component equal to $$\left(\operatorname{softmax}({\bf z})\right)_{i} = \dfrac{e^{z_i}}{\sum_{j=1}^ke^{z_j}}.$$</font>
----
<h3>Multi-classification / vector output</h3>
- <font size=+2>If there are more than $2$ possible predicted classes, say $k$ classes.</font>

- <font size=+2>The weights on edges going into the final layer $\leadsto$ some vector ${\bf z}\in\mathbb R^k$. Then $\operatorname{softmax}({\bf z})$ is defined to have $i^{th}$ component equal to $$\left(\operatorname{softmax}({\bf z})\right)_{i} = \dfrac{e^{z_i}}{\sum_{j=1}^ke^{z_j}}.$$ (A code sketch of this definition follows.)</font>
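
A short NumPy sketch of this definition; subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the value:

```python
import numpy as np

def softmax(z):
    """Softmax of a vector z in R^k: the i-th component is exp(z_i) / sum_j exp(z_j)."""
    shifted = z - np.max(z)       # subtracting the max leaves the result unchanged, but avoids overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps)

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p)          # approximately [0.659 0.242 0.099]
print(p.sum())    # 1.0 (up to floating point) -- softmax(z) is a probability vector
```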
----
<h3>Multi-classification / vector output</h3>
- <font size=+2>This makes $\operatorname{softmax}({\bf z})$ a probability vector.</font>
- <font size=+2 style="color:#181818;">The predicted class (among $1,\ldots,k$) is the $i$ such that $\left(\operatorname{softmax}({\bf z})\right)_{i}$ is largest.</font>
- <font size=+2 style="color:#181818;">There is a _multi-class_ cross-entropy that works much like log-loss; it is used as the loss function.</font>
- <font size=+2 style="color:#181818;">When $i^{*}$ is the correct class _and_ the predicted class, the closer $\left(\operatorname{softmax}({\bf z})\right)_{i^{*}}$ is to 1, the smaller the contribution to the loss.</font>
----
<h3>Multi-classification / vector output</h3>
- <font size=+2>This makes $\operatorname{softmax}({\bf z})$ a probability vector.</font>
- <font size=+2>The predicted class (among $1,\ldots,k$) is the $i$ such that $\left(\operatorname{softmax}({\bf z})\right)_{i}$ is largest.</font>
- <font size=+2 style="color:#181818;">There is a _multi-class_ cross-entropy that works much like log-loss; it is used as the loss function.</font>
- <font size=+2 style="color:#181818;">When $i^{*}$ is the correct class _and_ the predicted class, the closer $\left(\operatorname{softmax}({\bf z})\right)_{i^{*}}$ is to 1, the smaller the contribution to the loss.</font>
----
<h3>Multi-classification / vector output</h3>
- <font size=+2>This makes $\operatorname{softmax}({\bf z})$ a probability vector.</font>
- <font size=+2>The predicted class (among $1,\ldots,k$) is the $i$ such that $\left(\operatorname{softmax}({\bf z})\right)_{i}$ is largest.</font>
- <font size=+2>There is a _multi-class_ cross-entropy that works much like log-loss; it is used as the loss function.</font>
- <font size=+2>When $i^{*}$ is the correct class _and_ the predicted class, the closer $\left(\operatorname{softmax}({\bf z})\right)_{i^{*}}$ is to 1, the smaller the contribution to the loss (see the sketch below).</font>
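
A sketch of the last two bullets. For a single example with correct class $i^{*}$, the multi-class cross-entropy is $-\log$ of the probability assigned to that class, so it is small exactly when $\left(\operatorname{softmax}({\bf z})\right)_{i^{*}}$ is near 1 (note Python indexes classes from 0 rather than 1):

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

def predict_class(z):
    """Predicted class: the index i at which softmax(z)_i is largest."""
    return int(np.argmax(softmax(z)))

def cross_entropy(z, true_class):
    """Multi-class cross-entropy for one example: -log of the probability
    assigned to the correct class; small when that probability is near 1."""
    p = softmax(z)
    return -np.log(p[true_class])

z = np.array([0.5, 3.0, -1.0])
print(predict_class(z))        # 1
print(cross_entropy(z, 1))     # small loss: the correct class gets probability near 1
print(cross_entropy(z, 2))     # much larger loss: the correct class gets low probability
```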
---
<h3>Example: classifying handwritten digits</h3>
- <font size=+2>Data: images of handwritten numerical digits (28x28, greyscale)</font>
- <font size=+2 style="color:#181818;">Basic neural net: one middle layer, 128 nodes. Output w/ $\operatorname{softmax}$, 10 nodes.</font>
- <font size=+2 style="color:#181818;">Data $\to$ input: "flatten" images to vectors in $\mathbb R^N$ for $N=28^2$.</font>

<font size=+2>*Training "epoch" vs. accuracy*</font>
----
<h3>Example: classifying handwritten digits</h3>
- <font size=+2>Data: images of handwritten numerical digits (28x28, greyscale)</font>
- <font size=+2>Basic neural net: one middle layer, 128 nodes. Output w/ $\operatorname{softmax}$, 10 nodes.</font>
- <font size=+2>Data $\to$ input: "flatten" images to vectors in $\mathbb R^N$ for $N=28^2$.</font>

<font size=+2>*Training "epoch" vs. accuracy* (a code sketch of this setup follows)</font>
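
One way to build and train this network is with Keras. A minimal sketch, assuming TensorFlow/Keras is available, that the data are the usual MNIST-style arrays from `tf.keras.datasets.mnist` (the slides describe 28x28 greyscale digit images but do not name the dataset), and choosing ReLU for the middle layer and five epochs (also not specified on the slides):

```python
import tensorflow as tf

# Sketch only: the hidden activation (ReLU), optimizer, and number of epochs
# are assumptions, not taken from the slides.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),                   # one 28x28 greyscale image
    tf.keras.layers.Flatten(),                        # "flatten" to a vector in R^784
    tf.keras.layers.Dense(128, activation="relu"),    # one middle layer, 128 nodes
    tf.keras.layers.Dense(10, activation="softmax"),  # output layer: 10 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # multi-class cross-entropy, integer labels
              metrics=["accuracy"])
model.summary()   # reports 101,770 trainable parameters

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, epochs=5,
          validation_data=(x_test / 255.0, y_test))    # prints accuracy per epoch
```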
----
<h3>Example: classifying handwritten digits</h3>
- <font size=+2>Number of trainable weights in the above setup: 101,770.</font>
- <font size=+2>A modeling problem, **overfitting**: more parameters than needed (for the data & task); the model does well on training data but poorly on as-yet-unseen data.</font>
- <font size=+2 style="color:#181818;">There are many ways to mitigate overfitting. In this example, a _different network structure_ can help (e.g. one with fewer than $10,000$ weights, together with other measures to combat overfitting).</font>
----
<h3>Example: classifying handwritten digits</h3>
- <font size=+2>Number of trainable weights in the above setup: 101,770 (a quick check follows below).</font>
- <font size=+2>A modeling problem, **overfitting**: more parameters than needed (for the data & task); the model does well on training data but poorly on as-yet-unseen data.</font>
- <font size=+2>There are many ways to mitigate overfitting. In this example, a _different network structure_ can help (e.g. one with fewer than $10,000$ weights, together with other measures to combat overfitting).</font>
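
A quick check of that count, assuming a bias term at each non-input node (the total 101,770 is consistent with this):

```python
middle = 28 * 28 * 128 + 128   # weights into the 128 middle nodes, plus their biases
output = 128 * 10 + 10         # weights into the 10 output nodes, plus their biases
print(middle + output)         # 101770
```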
---
<h3>How to train a neural network?</h3>
- <font size=+2>**Math of it**. Use *stochastic* gradient descent on the loss function (e.g. *cross-entropy*: log-loss in the binary case, or the multi-class version if there are more than two classes).</font>
- <font size=+2 style="color:#181818;">**Implementation** (*how to do it*).</font>
- <font size=+2 style="color:#181818;">For each ${\bf x_i}$, determine the node values in the network (the intermediate layers' "feed forward" values).</font>
- <font size=+2 style="color:#181818;">First, treat the last layer as the model and compute $D$ of the loss (w.r.t. the weights into the output layer), using the "feed forward" values just computed.</font>
- <font size=+2 style="color:#181818;">Go back one layer. Do the same, but note that $D(\ell_{last}\circ\ell_{last-1}) = D\ell_{last}\,D\ell_{last-1}$. Having just computed $D\ell_{last}$, get the partials w.r.t. this layer's weights by reusing that computation.</font>
----
<h3>How to train a neural network?</h3>
- <font size=+2>**Math of it**. Use *stochastic* gradient descent on the loss function (e.g. *cross-entropy*: log-loss in the binary case, or the multi-class version if there are more than two classes).</font>
- <font size=+2>**Implementation** (*how to do it*).</font>
- <font size=+2>For each ${\bf x_i}$, determine the node values in the network (the intermediate layers' "feed forward" values).</font>
- <font size=+2 style="color:#181818;">First, treat the last layer as the model and compute $D$ of the loss (w.r.t. the weights into the output layer), using the "feed forward" values just computed.</font>
- <font size=+2 style="color:#181818;">Go back one layer. Do the same, but note that $D(\ell_{last}\circ\ell_{last-1}) = D\ell_{last}\,D\ell_{last-1}$. Having just computed $D\ell_{last}$, get the partials w.r.t. this layer's weights by reusing that computation.</font>
----
<h3>How to train a neural network?</h3>
- <font size=+2>**Math of it**. Use *stochastic* gradient descent on the loss function (e.g. *cross-entropy*: log-loss in the binary case, or the multi-class version if there are more than two classes).</font>
- <font size=+2>**Implementation** (*how to do it*).</font>
- <font size=+2>For each ${\bf x_i}$, determine the node values in the network (the intermediate layers' "feed forward" values).</font>
- <font size=+2>First, treat the last layer as the model and compute $D$ of the loss (w.r.t. the weights into the output layer), using the "feed forward" values just computed.</font>
- <font size=+2 style="color:#181818;">Go back one layer. Do the same, but note that $D(\ell_{last}\circ\ell_{last-1}) = D\ell_{last}\,D\ell_{last-1}$. Having just computed $D\ell_{last}$, get the partials w.r.t. this layer's weights by reusing that computation.</font>
----
<h3>How to train a neural network?</h3>
- <font size=+2>**Math of it**. Use *stochastic* gradient descent on the loss function (e.g. *cross-entropy*: log-loss in the binary case, or the multi-class version if there are more than two classes).</font>
- <font size=+2>**Implementation** (*how to do it*).</font>
- <font size=+2>For each ${\bf x_i}$, determine the node values in the network (the intermediate layers' "feed forward" values).</font>
- <font size=+2>First, treat the last layer as the model and compute $D$ of the loss (w.r.t. the weights into the output layer), using the "feed forward" values just computed.</font>
- <font size=+2>Go back one layer. Do the same, but note that $D(\ell_{last}\circ\ell_{last-1}) = D\ell_{last}\,D\ell_{last-1}$. Having just computed $D\ell_{last}$, get the partials w.r.t. this layer's weights by reusing that computation (a NumPy sketch of this procedure follows).</font>
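
A compact NumPy sketch of this procedure for the two-layer network above: feed forward, differentiate the loss at the output layer, then go back one layer via the chain rule. The sigmoid middle-layer activation, the initialization, and the variable names are illustrative choices, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

# Illustrative sizes matching the digit example: 784 inputs, 128 middle nodes, 10 classes.
n_in, n_mid, n_out = 784, 128, 10
W1 = rng.normal(scale=0.01, size=(n_in, n_mid))
b1 = np.zeros(n_mid)
W2 = rng.normal(scale=0.01, size=(n_mid, n_out))
b2 = np.zeros(n_out)

def sgd_step(X, y, lr=0.1):
    """One stochastic-gradient step on a mini-batch X (batch x 784) with integer labels y."""
    m = X.shape[0]

    # Feed forward: record each layer's node values for use below.
    A1 = sigmoid(X @ W1 + b1)          # middle-layer values
    P = softmax_rows(A1 @ W2 + b2)     # output probabilities

    # Last layer first: gradient of the (mean) cross-entropy w.r.t. the output
    # pre-activations; for softmax + cross-entropy this is P - one_hot(y).
    dZ2 = P.copy()
    dZ2[np.arange(m), y] -= 1.0
    dZ2 /= m
    dW2 = A1.T @ dZ2                   # partials w.r.t. weights into the output layer
    db2 = dZ2.sum(axis=0)

    # Go back one layer: by the chain rule, reuse dZ2 (just computed)
    # to get the partials w.r.t. this layer's weights.
    dA1 = dZ2 @ W2.T
    dZ1 = dA1 * A1 * (1.0 - A1)        # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0)

    # Gradient-descent update (in place).
    for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        param -= lr * grad

# Example: one step on a random mini-batch of 32 "images" with random labels.
sgd_step(rng.random((32, n_in)), rng.integers(0, n_out, size=32))
```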
{"metaMigratedAt":"2023-06-15T22:17:57.896Z","metaMigratedFrom":"YAML","title":"Neural network considerations; training","breaks":true,"description":"View the slide with \"Slide Mode\".","contributors":"[{\"id\":\"da8891d8-b47c-4b6d-adeb-858379287e60\",\"add\":9277,\"del\":213}]"}