# Neural nets: more considerations, training
<!-- Put the link to this slide here so people can follow -->
slide: https://hackmd.io/@ccornwell/neural-networks2
---
<h3>Neural network, binary classification</h3>

- <font size=+2 style="color:#181818;">$\sigma(z) = 1/(1+e^{-z})$, so $0<\sigma(z) < 1$.</font>
- <font size=+2 style="color:#181818;">The closer $\sigma(z)$ is to $1$, the more confident the prediction of the $+1$ class.</font>
----
<h3>Neural network, binary classification</h3>

- <font size=+2>$\sigma(z) = 1/(1+e^{-z})$, so $0<\sigma(z) < 1$.</font>
- <font size=+2>The closer $\sigma(z)$ is to $1$, the more confident the prediction of the $+1$ class (see the sketch below).</font>
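
A minimal NumPy sketch of the sigmoid output and the resulting prediction rule; the function and variable names here are illustrative, not taken from any particular library:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The larger z is, the closer sigmoid(z) is to 1,
# i.e. the more confident the prediction of the +1 class.
for z in [-3.0, 0.0, 3.0]:
    p = sigmoid(z)
    label = +1 if p >= 0.5 else -1   # threshold the probability at 0.5
    print(f"z = {z:+.1f}   sigmoid(z) = {p:.3f}   predicted class: {label:+d}")
```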
---
<h3>Multi-classification / vector output</h3>
- <font size=+2>If there are more than $2$ possible predicted classes, say $k$ classes.</font>

- <font size=+2 style="color:#181818;">The weights on edges going into the final layer $\leadsto$ some vector ${\bf z}\in\mathbb R^k$. Then $\operatorname{softmax}({\bf z})$ is defined to have $i^{th}$ component equal to $$\left(\operatorname{softmax}({\bf z})\right)_{i} = \dfrac{e^{z_i}}{\sum_{j=1}^ke^{z_j}}.$$</font>
----
<h3>Multi-classification / vector output</h3>
- <font size=+2>If there are more than $2$ possible predicted classes, say $k$ classes.</font>

- <font size=+2>The weights on edges going into the final layer $\leadsto$ some vector ${\bf z}\in\mathbb R^k$. Then $\operatorname{softmax}({\bf z})$ is defined to have $i^{th}$ component equal to $$\left(\operatorname{softmax}({\bf z})\right)_{i} = \dfrac{e^{z_i}}{\sum_{j=1}^ke^{z_j}}.$$ (A code sketch of this definition follows.)</font>
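
A short NumPy sketch of this definition; subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the value:

```python
import numpy as np

def softmax(z):
    """Softmax of a vector z in R^k: the i-th component is exp(z_i) / sum_j exp(z_j)."""
    shifted = z - np.max(z)       # subtracting the max leaves the result unchanged, but avoids overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps)

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p)          # approximately [0.659 0.242 0.099]
print(p.sum())    # 1.0 (up to floating point) -- softmax(z) is a probability vector
```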
----
<h3>Multi-classification / vector output</h3>
- <font size=+2>This makes $\operatorname{softmax}({\bf z})$ a probability vector.</font>
- <font size=+2 style="color:#181818;">The predicted class (among $1,\ldots,k$) is the $i$ such that $\left(\operatorname{softmax}({\bf z})\right)_{i}$ is largest.</font>
- <font size=+2 style="color:#181818;">There is a _multi-class_ cross-entropy that works much like log-loss; it is used as the loss function.</font>
- <font size=+2 style="color:#181818;">When $i^{*}$ is the correct class _and_ the predicted class, the closer $\left(\operatorname{softmax}({\bf z})\right)_{i^{*}}$ is to 1, the smaller the contribution to the loss.</font>
----
<h3>Multi-classification / vector output</h3>
- <font size=+2>This makes $\operatorname{softmax}({\bf z})$ a probability vector.</font>
- <font size=+2>The predicted class (among $1,\ldots,k$) is the $i$ such that $\left(\operatorname{softmax}({\bf z})\right)_{i}$ is largest.</font>
- <font size=+2 style="color:#181818;">There is a _multi-class_ cross-entropy that works much like log-loss; it is used as the loss function.</font>
- <font size=+2 style="color:#181818;">When $i^{*}$ is the correct class _and_ the predicted class, the closer $\left(\operatorname{softmax}({\bf z})\right)_{i^{*}}$ is to 1, the smaller the contribution to the loss.</font>
----
<h3>Multi-classification / vector output</h3>
- <font size=+2>This makes $\operatorname{softmax}({\bf z})$ a probability vector.</font>
- <font size=+2>The predicted class (among $1,\ldots,k$) is the $i$ such that $\left(\operatorname{softmax}({\bf z})\right)_{i}$ is largest.</font>
- <font size=+2>There is a _multi-class_ cross-entropy that works much like log-loss; it is used as the loss function.</font>
- <font size=+2>When $i^{*}$ is the correct class _and_ the predicted class, the closer $\left(\operatorname{softmax}({\bf z})\right)_{i^{*}}$ is to 1, the smaller the contribution to the loss (see the sketch below).</font>
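
A sketch of the last two bullets. For a single example with correct class $i^{*}$, the multi-class cross-entropy is $-\log$ of the probability assigned to that class, so it is small exactly when $\left(\operatorname{softmax}({\bf z})\right)_{i^{*}}$ is near 1 (note Python indexes classes from 0 rather than 1):

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

def predict_class(z):
    """Predicted class: the index i at which softmax(z)_i is largest."""
    return int(np.argmax(softmax(z)))

def cross_entropy(z, true_class):
    """Multi-class cross-entropy for one example: -log of the probability
    assigned to the correct class; small when that probability is near 1."""
    p = softmax(z)
    return -np.log(p[true_class])

z = np.array([0.5, 3.0, -1.0])
print(predict_class(z))        # 1
print(cross_entropy(z, 1))     # small loss: the correct class gets probability near 1
print(cross_entropy(z, 2))     # much larger loss: the correct class gets low probability
```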
---
<h3>Example: classifying handwritten digits</h3>
- <font size=+2>Data: images of handwritten numerical digits (28x28, greyscale)</font>
- <font size=+2 style="color:#181818;">Basic neural net: one middle layer, 128 nodes. Output w/ $\operatorname{softmax}$, 10 nodes.</font>
- <font size=+2 style="color:#181818;">Data $\to$ input: "flatten" images to vectors in $\mathbb R^N$ for $N=28^2$.</font>

<font size=+2>*Training "epoch" vs. accuracy*</font>
----
<h3>Example: classifying handwritten digits</h3>
- <font size=+2>Data: images of handwritten numerical digits (28x28, greyscale)</font>
- <font size=+2>Basic neural net: one middle layer, 128 nodes. Output w/ $\operatorname{softmax}$, 10 nodes.</font>
- <font size=+2>Data $\to$ input: "flatten" images to vectors in $\mathbb R^N$ for $N=28^2$.</font>

<font size=+2>*Training "epoch" vs. accuracy* (a code sketch of this setup follows)</font>
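
One way to build and train this network is with Keras. A minimal sketch, assuming TensorFlow/Keras is available, that the data are the usual MNIST-style arrays from `tf.keras.datasets.mnist` (the slides describe 28x28 greyscale digit images but do not name the dataset), and choosing ReLU for the middle layer and five epochs (also not specified on the slides):

```python
import tensorflow as tf

# Sketch only: the hidden activation (ReLU), optimizer, and number of epochs
# are assumptions, not taken from the slides.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),                   # one 28x28 greyscale image
    tf.keras.layers.Flatten(),                        # "flatten" to a vector in R^784
    tf.keras.layers.Dense(128, activation="relu"),    # one middle layer, 128 nodes
    tf.keras.layers.Dense(10, activation="softmax"),  # output layer: 10 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # multi-class cross-entropy, integer labels
              metrics=["accuracy"])
model.summary()   # reports 101,770 trainable parameters

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, epochs=5,
          validation_data=(x_test / 255.0, y_test))    # prints accuracy per epoch
```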
----
<h3>Example: classifying handwritten digits</h3>
- <font size=+2>Number of trainable weights in the above setup: 101,770.</font>
- <font size=+2>A modeling problem, **overfitting**: more parameters than needed (for the data & task); the model does well on training data but poorly on as-yet-unseen data.</font>
- <font size=+2 style="color:#181818;">There are many ways to mitigate overfitting. In this example, a _different network structure_ can help (e.g. one with fewer than $10,000$ weights, together with other measures to combat overfitting).</font>
----
<h3>Example: classifying handwritten digits</h3>
- <font size=+2>Number of trainable weights in the above setup: 101,770 (a quick check follows below).</font>
- <font size=+2>A modeling problem, **overfitting**: more parameters than needed (for the data & task); the model does well on training data but poorly on as-yet-unseen data.</font>
- <font size=+2>There are many ways to mitigate overfitting. In this example, a _different network structure_ can help (e.g. one with fewer than $10,000$ weights, together with other measures to combat overfitting).</font>
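
A quick check of that count, assuming a bias term at each non-input node (the total 101,770 is consistent with this):

```python
middle = 28 * 28 * 128 + 128   # weights into the 128 middle nodes, plus their biases
output = 128 * 10 + 10         # weights into the 10 output nodes, plus their biases
print(middle + output)         # 101770
```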
---
<h3>How to train a neural network?</h3>
- <font size=+2>**Math of it**. Use *stochastic* gradient descent on the loss function (e.g. *cross-entropy*: log-loss in the binary case, or the multi-class version if there are more than two classes).</font>
- <font size=+2 style="color:#181818;">**Implementation** (*how to do it*).</font>
- <font size=+2 style="color:#181818;">For each ${\bf x_i}$, determine the node values in the network (the intermediate layers' "feed forward" values).</font>
- <font size=+2 style="color:#181818;">First, treat the last layer as the model and compute $D$ of the loss (w.r.t. the weights into the output layer), using the "feed forward" values just computed.</font>
- <font size=+2 style="color:#181818;">Go back one layer. Do the same, but note that $D(\ell_{last}\circ\ell_{last-1}) = D\ell_{last}\,D\ell_{last-1}$. Having just computed $D\ell_{last}$, get the partials w.r.t. this layer's weights by reusing that computation.</font>
----
<h3>How to train a neural network?</h3>
- <font size=+2>**Math of it**. Use *stochastic* gradient descent on the loss function (e.g. *cross-entropy*: log-loss in the binary case, or the multi-class version if there are more than two classes).</font>
- <font size=+2>**Implementation** (*how to do it*).</font>
- <font size=+2>For each ${\bf x_i}$, determine the node values in the network (the intermediate layers' "feed forward" values).</font>
- <font size=+2 style="color:#181818;">First, treat the last layer as the model and compute $D$ of the loss (w.r.t. the weights into the output layer), using the "feed forward" values just computed.</font>
- <font size=+2 style="color:#181818;">Go back one layer. Do the same, but note that $D(\ell_{last}\circ\ell_{last-1}) = D\ell_{last}\,D\ell_{last-1}$. Having just computed $D\ell_{last}$, get the partials w.r.t. this layer's weights by reusing that computation.</font>
----
<h3>How to train a neural network?</h3>
- <font size=+2>**Math of it**. Use *stochastic* gradient descent on the loss function (e.g. *cross-entropy*: log-loss in the binary case, or the multi-class version if there are more than two classes).</font>
- <font size=+2>**Implementation** (*how to do it*).</font>
- <font size=+2>For each ${\bf x_i}$, determine the node values in the network (the intermediate layers' "feed forward" values).</font>
- <font size=+2>First, treat the last layer as the model and compute $D$ of the loss (w.r.t. the weights into the output layer), using the "feed forward" values just computed.</font>
- <font size=+2 style="color:#181818;">Go back one layer. Do the same, but note that $D(\ell_{last}\circ\ell_{last-1}) = D\ell_{last}\,D\ell_{last-1}$. Having just computed $D\ell_{last}$, get the partials w.r.t. this layer's weights by reusing that computation.</font>
----
<h3>How to train a neural network?</h3>
- <font size=+2>**Math of it**. Use *stochastic* gradient descent on the loss function (e.g. *cross-entropy*: log-loss in the binary case, or the multi-class version if there are more than two classes).</font>
- <font size=+2>**Implementation** (*how to do it*).</font>
- <font size=+2>For each ${\bf x_i}$, determine the node values in the network (the intermediate layers' "feed forward" values).</font>
- <font size=+2>First, treat the last layer as the model and compute $D$ of the loss (w.r.t. the weights into the output layer), using the "feed forward" values just computed.</font>
- <font size=+2>Go back one layer. Do the same, but note that $D(\ell_{last}\circ\ell_{last-1}) = D\ell_{last}\,D\ell_{last-1}$. Having just computed $D\ell_{last}$, get the partials w.r.t. this layer's weights by reusing that computation (a NumPy sketch of this procedure follows).</font>
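
A compact NumPy sketch of this procedure for the two-layer network above: feed forward, differentiate the loss at the output layer, then go back one layer via the chain rule. The sigmoid middle-layer activation, the initialization, and the variable names are illustrative choices, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

# Illustrative sizes matching the digit example: 784 inputs, 128 middle nodes, 10 classes.
n_in, n_mid, n_out = 784, 128, 10
W1 = rng.normal(scale=0.01, size=(n_in, n_mid))
b1 = np.zeros(n_mid)
W2 = rng.normal(scale=0.01, size=(n_mid, n_out))
b2 = np.zeros(n_out)

def sgd_step(X, y, lr=0.1):
    """One stochastic-gradient step on a mini-batch X (batch x 784) with integer labels y."""
    m = X.shape[0]

    # Feed forward: record each layer's node values for use below.
    A1 = sigmoid(X @ W1 + b1)          # middle-layer values
    P = softmax_rows(A1 @ W2 + b2)     # output probabilities

    # Last layer first: gradient of the (mean) cross-entropy w.r.t. the output
    # pre-activations; for softmax + cross-entropy this is P - one_hot(y).
    dZ2 = P.copy()
    dZ2[np.arange(m), y] -= 1.0
    dZ2 /= m
    dW2 = A1.T @ dZ2                   # partials w.r.t. weights into the output layer
    db2 = dZ2.sum(axis=0)

    # Go back one layer: by the chain rule, reuse dZ2 (just computed)
    # to get the partials w.r.t. this layer's weights.
    dA1 = dZ2 @ W2.T
    dZ1 = dA1 * A1 * (1.0 - A1)        # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0)

    # Gradient-descent update (in place).
    for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        param -= lr * grad

# Example: one step on a random mini-batch of 32 "images" with random labels.
sgd_step(rng.random((32, n_in)), rng.integers(0, n_out, size=32))
```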
{"metaMigratedAt":"2023-06-15T22:17:57.896Z","metaMigratedFrom":"YAML","title":"Neural network considerations; training","breaks":true,"description":"View the slide with \"Slide Mode\".","contributors":"[{\"id\":\"da8891d8-b47c-4b6d-adeb-858379287e60\",\"add\":9277,\"del\":213}]"}