## DeepNN lecture 3
### Ferenc Huszár
[https://www.reddit.com/r/CST_DeepNN/](https://www.reddit.com/r/CST_DeepNN/)
---
## Approximation
---
### Basic Multilayer Perceptron
\begin{align}
f_l(x) &= \phi(W_l f_{l-1}(x) + b_l)\\
f_0(x) &= x
\end{align}
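A minimal sketch of this recursion in JAX (the layer widths, initialisation, and the ReLU choice of $\phi$ below are illustrative, not part of the slide):

```python
import jax.numpy as jnp
from jax import random

def relu(x):
    return jnp.maximum(x, 0.0)

def init_mlp(key, sizes):
    """Random weights and biases for layer widths [d_0, d_1, ..., d_L]."""
    params = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        key, sub = random.split(key)
        params.append((random.normal(sub, (d_out, d_in)) / jnp.sqrt(d_in),
                       jnp.zeros(d_out)))
    return params

def mlp(params, x):
    f = x                      # f_0(x) = x
    for W, b in params:
        f = relu(W @ f + b)    # f_l(x) = phi(W_l f_{l-1}(x) + b_l)
    return f

params = init_mlp(random.PRNGKey(0), [2, 16, 16, 1])
print(mlp(params, jnp.ones(2)))
```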
---
### Basic Multilayer Perceptron
$$
\small
f_L(x) = \phi\left(b_L + W_L \phi\left(b_{L-1} + W_{L-1} \phi\left( \cdots \phi\left(b_1 + W_1 x\right) \cdots \right)\right)\right)
$$
---
### Rectified Linear Unit
$$
\phi(x) = \begin{cases}0&\text{when }x\leq 0\\x&\text{when }x>0\end{cases}
$$
![](https://i.imgur.com/SxKdrzb.png)
---
### What can these networks represent?
$$
\operatorname{ReLU}(\mathbf{w}_1x - \mathbf{b}_1)
$$
![](https://i.imgur.com/rN5wRVJ.png)
---
### What can these networks represent?
$$
f(x) = \mathbf{w}^T_2 \operatorname{ReLU}(\mathbf{w}_1x - \mathbf{b}_1)
$$
![](https://i.imgur.com/kX3nuYg.png)
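A quick numerical sketch of this function (the hidden width of 8 and the random parameters are arbitrary choices): each hidden unit contributes one kink, at $x = b_i / w_i$.

```python
import jax.numpy as jnp
from jax import random, vmap

k1, k2, k3 = random.split(random.PRNGKey(1), 3)
w1, b1, w2 = (random.normal(k, (8,)) for k in (k1, k2, k3))

def f(x):
    # w_2^T ReLU(w_1 x - b_1): piecewise linear, one kink per hidden unit
    return w2 @ jnp.maximum(w1 * x - b1, 0.0)

print(vmap(f)(jnp.linspace(-3.0, 3.0, 7)))
```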
---
### Single hidden layer
number of kinks $\approx O(\text{width of network})$
---
### Example: "sawtooth" network
\begin{align}
f_l(x) &= 2\vert f_{l-1}(x)\vert - 2 \\
f_0(x) &= x
\end{align}
---
### Sawtooth network
\begin{align}
f_l(x) &= 2 \operatorname{ReLU}(f_{l-1}(x)) + 2 \operatorname{ReLU}(-f_{l-1}(x)) - 2\\
f_0(x) &= x
\end{align}
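A small numerical sketch (the random evaluation grid and the slope-sign-change counter are my own illustration): evaluating the recursion for increasing depth shows the number of kinks roughly doubling with every layer.

```python
import jax.numpy as jnp
from jax import random

def sawtooth(x, depth):
    f = x                            # f_0(x) = x
    for _ in range(depth):
        f = 2.0 * jnp.abs(f) - 2.0   # f_l(x) = 2|f_{l-1}(x)| - 2
    return f

# count kinks as sign changes of the slope on a fine random grid
xs = jnp.sort(random.uniform(random.PRNGKey(0), (20_000,), minval=-2.0, maxval=2.0))
for depth in range(1, 6):
    s = jnp.sign(jnp.diff(sawtooth(xs, depth)))
    print(depth, int(jnp.sum(s[1:] != s[:-1])))   # 1, 3, 7, 15, 31 kinks
```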
---
### $0$-layer network
![](https://i.imgur.com/pucqIVN.png)
---
### $1$-layer network
![](https://i.imgur.com/YOTtTY7.png)
---
### $2$-layer network
![](https://i.imgur.com/reii7O5.png)
---
### $3$-layer network
![](https://i.imgur.com/J6KiUHI.png)
---
### $4$-layer network
![](https://i.imgur.com/fHTZhU0.png)
---
### $5$-layer network
![](https://i.imgur.com/ni4QV2b.png)
---
### Deep ReLU networks
number of kinks $\approx O(2^\text{depth of network})$
---
### In higher dimensions
![](https://i.imgur.com/0NVHFEN.png)
---
### In higher dimensions
![](https://i.imgur.com/DJtv5Yj.jpg)
---
### Approximation: summary
* depth increases model complexity more than width
* the model class defined by deep networks is VERY LARGE
* this is both an advantage and a cause for concern
* "complex models don't generalize"
---
## Generalization
---
## Generalization
![](https://i.imgur.com/Tu5SHpr.png)
---
## Generalization
![](https://i.imgur.com/8bkhxAv.png)
---
## Generalization
![](https://i.imgur.com/YHedAr6.png)
---
## Generalization: deep nets
![](https://i.imgur.com/bfyRBsx.png)
---
## Generalization: deep nets
![](https://i.imgur.com/fzLYvHe.png)
---
## Generalization: summary
* **classical view:** generalization is a property of the model class and loss function
* **new view:** it is also a property of the optimization algorithm
---
## Generalization
* optimization is core to deep learning
* new tools and insights:
    * infinite width neural networks
    * neural tangent kernel [(Jacot et al, 2018)](https://arxiv.org/abs/1806.07572)
    * deep linear models ([e.g. Arora et al, 2019](https://arxiv.org/abs/1905.13655))
    * importance of initialization
    * effect of gradient noise
---
## Gradient-based optimization
$$
\mathcal{L}(\theta) = \sum_{n=1}^N \ell(y_n, f(x_n, \theta))
$$
$$
\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)
$$
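A minimal gradient-descent loop in JAX; the least-squares loss, the toy data, and the step size below are placeholders:

```python
import jax.numpy as jnp
from jax import grad

# toy data: least squares stands in for sum_n l(y_n, f(x_n, theta))
X = jnp.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = jnp.array([1.0, 2.0, 3.0])

def loss(theta):
    return jnp.sum((X @ theta - y) ** 2)

theta = jnp.zeros(2)
eta = 0.005
for _ in range(1000):
    theta = theta - eta * grad(loss)(theta)   # theta_{t+1} = theta_t - eta grad L(theta_t)
print(theta, loss(theta))                     # loss decreases towards zero
```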
---
### Basic Multilayer Perceptron
$$
\small
f_L(x) = \phi\left(b_L + W_L \phi\left(b_{L-1} + W_{L-1} \phi\left( \cdots \phi\left(b_1 + W_1 x\right) \cdots \right)\right)\right)
$$
---
## General deep function composition
$$
f_L(f_{L-1}(\cdots f_1(\mathbf{w})))
$$
How do I calculate the derivative of $f_L(\mathbf{w})$ with respect to $\mathbf{w}$?
---
## Chain rule
$$
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \cdots \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \frac{\partial \mathbf{f}_1}{\partial \mathbf{w}}
$$
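One can sanity-check this numerically: the Jacobian of a composition equals the product of the layer Jacobians. The two little maps below are arbitrary stand-ins for $\mathbf{f}_1$ and $\mathbf{f}_2$:

```python
import jax.numpy as jnp
from jax import jacobian, random

k1, k2 = random.split(random.PRNGKey(0))
A = random.normal(k1, (4, 3))
B = random.normal(k2, (2, 4))

f1 = lambda w: jnp.tanh(A @ w)    # f_1: R^3 -> R^4
f2 = lambda h: jnp.sin(B @ h)     # f_2: R^4 -> R^2

w = jnp.arange(3.0)
lhs = jacobian(lambda w: f2(f1(w)))(w)           # d f_2 / d w
rhs = jacobian(f2)(f1(w)) @ jacobian(f1)(w)      # (d f_2 / d f_1)(d f_1 / d w)
print(jnp.allclose(lhs, rhs, atol=1e-5))         # True
```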
---
## How to evaluate this?
$$
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \cdots \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \frac{\partial \mathbf{f}_1}{\partial \mathbf{w}}
$$
$$
\small
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \left( \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \left( \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \cdots \left( \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \left( \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \frac{\partial \mathbf{f}_1}{\partial \mathbf{w}} \right) \right) \cdots \right) \right)
$$
---
## Or like this?
$$
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \cdots \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \frac{\partial \mathbf{f}_1}{\partial \mathbf{w}}
$$
$$
\small
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \left( \left( \cdots \left( \left( \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \right) \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \right) \cdots \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \right) \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \right) \frac{\partial \mathbf{f}_1}{\partial \mathbf{w}}
$$
---
## Or in a funky way?
$$
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \cdots \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \frac{\partial \mathbf{f}_1}{\partial \mathbf{w}}
$$
$$
\small
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \left( \left( \left( \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \right) \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \right) \left( \left( \cdots \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \right) \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \right) \right)\frac{\partial \mathbf{f}_1}{\partial \mathbf{w}}
$$
---
## Automatic differentiation
### Forward-mode
$$
\small
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \left( \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \left( \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \cdots \left( \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \left( \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \frac{\partial \mathbf{f}_1}{\partial \mathbf{w}} \right) \right) \cdots \right) \right)
$$
Cost:
$$
\small
d_0d_1d_2 + d_0d_2d_3 + \ldots + d_0d_{L-1}d_L = d_0 \sum_{l=2}^{L}d_ld_{l-1}
$$
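In JAX, forward mode is exposed as `jax.jvp`: it pushes one tangent vector through the computation, i.e. it builds one column of the Jacobian per pass, so the full gradient of the toy scalar network below (my own example) needs $d_0$ passes:

```python
import jax.numpy as jnp
from jax import jvp, random

k1, k2, k3 = random.split(random.PRNGKey(0), 3)
W1 = random.normal(k1, (64, 3))
W2 = random.normal(k2, (64, 64))
w3 = random.normal(k3, (64,))

def f(w):
    # a small ReLU network, R^3 -> R
    h = jnp.maximum(W1 @ w, 0.0)
    h = jnp.maximum(W2 @ h, 0.0)
    return w3 @ h

w = jnp.ones(3)
for i in range(3):                        # one forward pass per input dimension
    e = jnp.zeros(3).at[i].set(1.0)
    _, dfdw_i = jvp(f, (w,), (e,))        # directional derivative along e_i
    print(dfdw_i)
```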
---
## Automatic differentiation
### Reverse-mode
$$
\small
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \left( \left( \cdots \left( \left( \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \right) \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \right) \cdots \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \right) \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \right) \frac{\partial \mathbf{f}_1}{\partial \mathbf{w}}
$$
Cost:
$$
\small
d_Ld_{L-1}d_{L-2} + d_{L}d_{L-2}d_{L-3} + \ldots + d_Ld_{1}d_0 = d_L \sum_{l=0}^{L-2}d_ld_{l+1}
$$
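Reverse mode is `jax.vjp` (and `jax.grad` is built on it): one backward pass with the scalar cotangent yields the entire gradient, which is why it wins when $d_L = 1$. The same kind of toy network as before:

```python
import jax.numpy as jnp
from jax import grad, vjp, random

k1, k2 = random.split(random.PRNGKey(0))
W = random.normal(k1, (64, 3))
v = random.normal(k2, (64,))

def f(w):
    return v @ jnp.maximum(W @ w, 0.0)    # scalar objective: d_L = 1

w = jnp.ones(3)
y, f_vjp = vjp(f, w)
(g,) = f_vjp(jnp.ones_like(y))            # one backward pass -> full gradient
print(jnp.allclose(g, grad(f)(w)))        # True: grad is reverse mode under the hood
```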
---
## Automatic differentiation
* in deep learning we're most interested in scalar objectives
* $d_L=1$; consequently, reverse mode is always optimal
* in the context of neural networks: **backpropagation**
* backprop has a higher memory cost than forward-mode differentiation
---
## Example: calculating a Hessian
$$
H(\mathbf{w}) = \frac{\partial^2}{\partial\mathbf{w}\partial\mathbf{w}^T} L(\mathbf{w})
= \frac{\partial}{\partial\mathbf{w}} \mathbf{g}(\mathbf{w})
$$
where $\mathbf{g}(\mathbf{w}) = \nabla_{\mathbf{w}} L(\mathbf{w})$ is the gradient.
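A sketch in JAX, with a toy quadratic loss so the answer is known: the Hessian is the Jacobian of the gradient, e.g. forward mode over reverse mode (`jax.hessian` performs the same composition).

```python
import jax.numpy as jnp
from jax import grad, jacfwd, hessian

A = jnp.array([[3.0, 1.0], [1.0, 2.0]])

def L(w):
    return 0.5 * w @ A @ w          # toy quadratic: the Hessian should equal A

w = jnp.array([1.0, -1.0])
print(jacfwd(grad(L))(w))           # Jacobian of the gradient: forward over reverse
print(hessian(L)(w))                # built-in, same composition
```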
---
## Example: Hessian-vector product
$$
\mathbf{v}^TH(\mathbf{w}) = \frac{\partial}{\partial\mathbf{w}} \left( \mathbf{v}^T \mathbf{g}(\mathbf{w}) \right)
$$
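The Hessian never has to be materialised for this: differentiate the scalar $\mathbf{v}^T\mathbf{g}(\mathbf{w})$ (reverse over reverse), or push $\mathbf{v}$ through the gradient with a single `jax.jvp` (forward over reverse). A sketch with the same toy quadratic as above:

```python
import jax.numpy as jnp
from jax import grad, jvp

A = jnp.array([[3.0, 1.0], [1.0, 2.0]])
L = lambda w: 0.5 * w @ A @ w
g = grad(L)

w = jnp.array([1.0, -1.0])
v = jnp.array([0.5, 2.0])

print(grad(lambda w: v @ g(w))(w))    # d/dw (v^T g(w)) = v^T H(w)
print(jvp(g, (w,), (v,))[1])          # H(w) v; equal here because H is symmetric
```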
{"metaMigratedAt":"2023-06-15T19:07:19.943Z","metaMigratedFrom":"YAML","title":"DeepNN Lecture 3 Slides","breaks":true,"description":"Lecture notes on how classical learning theory and generalisation theory was unable to explain the good performance of deep neural netwoks, and new insights that were needed","contributors":"[{\"id\":\"e558be3b-4a2d-4524-8a66-38ec9fea8715\",\"add\":10460,\"del\":1644}]"}