$$
\newcommand{\t}{\text}
\newcommand{\f}{\frac}
\newcommand{\Ra}{\Rightarrow}
\newcommand{\La}{\Leftarrow}
\newcommand{\ra}{\rightarrow}
\newcommand{\la}{\leftarrow}
\newcommand{\ua}{\uparrow}
\newcommand{\da}{\downarrow}
\newcommand{\mbb}{\mathbb}
\newcommand{\mbf}{\mathbf}
\newcommand{\mc}{\mathcal}
\newcommand{\pt}{\partial}
\newcommand{\S}[2]{\sum_{#1}^{#2}}
\newcommand{\P}[2]{\prod_{#1}^{#2}}
\newcommand{\bmat}[4]{\begin{bmatrix} #1 & #2 \\ #3 & #4\end{bmatrix}}
$$
## Basics of machine learning
### Simple linear supervised learning model
+ Model
+ $\hat{y}=\theta^T\hat{\mathbf{x}}$
+ Cost function
+ $\mathcal{L}(\theta)=\frac{1}{2}\sum_{i=1}^N(y^{(i)}-\hat{y}^{(i)})^2$
+ best $\theta$
+ solve for $\theta$ s.t. $\frac{\partial{\mathcal{L}}}{\partial\theta}=0$
+ $\theta^*=(X^TX)^{-1}X^TY$
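A minimal NumPy sketch of this closed-form solution (the toy data and the use of `np.linalg.solve` instead of an explicit inverse are illustrative choices, not from the notes):

```python
import numpy as np

# Toy data: m examples, n features (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([2.0, -1.0, 0.5])
Y = X @ true_theta + 0.1 * rng.normal(size=100)

# theta* = (X^T X)^{-1} X^T Y, computed with a linear solve for numerical stability.
theta_star = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta_star)  # close to [2.0, -1.0, 0.5]
```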
### Overfitting and underfitting
+ Datasets
+ training data: used to learn the parameters of your model.
+ testing data: used to score your model.
+ validation data: used to optimize the hyperparameters of your model.
+ avoid potential of overfitting.
+ Overfitting
+ training error $\downarrow$ but testing error $\uparrow$.
+ solutions
+ more data
+ regularization
+ Underfitting
+ model is overly simple. (more data won't help)
+ training error $\uparrow$ and testing error $\uparrow$.
+ K-fold cross validation
+ procedures
1. let the training dataset contain $N$ examples.
2. split the data into $k$ equal sets, each of $N/k$ examples. Each of these sets is called a "fold."
3. $k-1$ of the folds are datasets that are used to train the model parameters.
4. the remaining fold is a validation dataset used to evaluate the model.
5. training may be repeated, choosing different folds to serve as the training folds and the validation fold.
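A minimal sketch of the fold bookkeeping above (NumPy assumed; the actual model training and evaluation are omitted):

```python
import numpy as np

def kfold_indices(N, k, seed=0):
    """Shuffle indices 0..N-1 and split them into k roughly equal folds."""
    idx = np.random.default_rng(seed).permutation(N)
    return np.array_split(idx, k)

folds = kfold_indices(N=20, k=5)
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # train the model on train_idx, evaluate it on val_idx (omitted)
    print(f"fold {i}: {len(train_idx)} train / {len(val_idx)} validation examples")
```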
### Maximum-likelihood estimation
+ Maximize the probability of having observed the data.
+ Data have some distribution with parameters.
## Supervised classification
### Image classification
+ Challenges
+ viewpoint variation
+ illumination
+ deformation
+ occlusion
+ background clutter
+ intraclass variation
+ Classification approach
+ a machine learning algorithm sees a lot of data and learns a model.
+ train: data $\rightarrow$ model parameters
+ test: model $\rightarrow$ class (new image)
### K-nearest neighbors
+ Algorithms
1. choose an appropriate distance metric for $d(\mathbb{x}^{(i)}, \mathbb{x}^{(j)})$.
2. choose $k$.
3. take desired test data point, $\mathbb{x^{new}}$, and calculate $d(\mathbb{x^{new}}, \mathbb{x}^{(i)})$ for $i=1, ..., m$.
4. with $\{c_1, ..., c_k\}$ denoting the $k$ indices corresponding to the $k$ smallest $d(\mathbb{x^{new}}, \mathbb{x}^{(i)})$, classify $\mathbb{x^{new}}$ as the class that occurs most frequently amongst $\{y^{c_1}, ..., y^{c_k}\}$.
5. if there is a tie, any of the tying classes may be selected.
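A minimal NumPy sketch of the algorithm above, using Euclidean distance as the metric (the toy data are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training examples."""
    d = np.linalg.norm(X_train - x_new, axis=1)        # distances to every training point
    nearest = np.argsort(d)[:k]                        # indices of the k smallest distances
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                   # ties resolved arbitrarily

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))  # -> 0
```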
+ Pros
+ simple
+ fast (no training process)
+ Cons
+ storage (store all training data)
+ computational expensive
+ not good for image classification
+ distance is less intuitive in higher dimensions
+ distances in some dimensions matter more than others.
+ volume $\uparrow$ exponentially in higher-dim space.
+ leave lots of empty space.
+ Choose optimal k
+ Plot error vs. $k$ (elbow point)
+ Cross validation
### Linear classification
+ Model
+ $\mathbb{y}=W\mathbb{x}+b; W\in\mathbb{R}^{c\times{n}}, \mathbb{y}\in\mathbb{R}^c$
+ the chosen class corresponds to the index of the highest score in $\mathbb{y}$.
### Softmax function
+ Turn the scores into a probability.
+ Formula
+ $\text{softmax}_i(\mathbb{x})=\frac{e^{a_i(\mathbb{x})}}{\sum_{j=1}^ce^{a_j(\mathbb{x})}}; a_i(\mathbb{x})=\mathbb{w}_i^T\mathbb{x}+b_i$, where $c$ is the number of classes.
+ $P(y^{(j)}=i|\mathbb{x}^{(j)},\theta)=\text{softmax}_i(\mathbb{x}^{(j)})$
+ Softmax loss function
+ maximize the likelihood of having seen the data.
+ $\begin{aligned}
\arg\max_\theta\prod_{i=1}^mp(y^{(i)}|\mathbb{x}^{(i)}, \theta)&=\arg\max_\theta\sum_{i=1}^m\log\text{softmax}_{y^{(i)}}(\mathbb{x}^{(i)}) \\
&=\arg\min_\theta\sum_{i=1}^m(\log\sum_{j=1}^ce^{a_j(\mathbb{x}^{(i)})}-a_{y^{(i)}}(\mathbb{x}^{(i)}))
\end{aligned}$
+ $\begin{aligned}
\log{P}(y=y^{(i)}|\mathbb{x})&=\log\text{softmax}_{y^{(i)}}(\mathbb{x}) \\
&=a_{y^{(i)}}(\mathbb{x})-\log\sum_{j=1}^c\exp(a_j(\mathbb{x}))
\end{aligned}$
+ $a_{y^{(i)}}(\mathbb{x})\uparrow$ and $\log\sum_{j=1}^c\exp(a_j(\mathbb{x}))\downarrow$ when maximizing.
+ $\log\sum_{j=1}^c\exp(a_j(\mathbb{x}))$ can be approximated by $\max_ja_j(\mathbb{x})$.
+ Overflow of softmax
+ $e^{a_i(\mathbb{x})}\uparrow$ if $a_i(\mathbb{x})\gg0$, and may numerically overflow.
+ solution
+ $\text{softmax}_i(\mathbb{x})=\frac{e^{a_i(\mathbb{x})+\log{k}}}{\sum_{j=1}^ce^{a_j(\mathbb{x})+\log{k}}}=\frac{e^{a_i(\mathbb{x})-\max_ja_j(\mathbb{x})}}{\sum_{j=1}^ce^{a_j(\mathbb{x})-\max_ja_j(\mathbb{x})}}$
+ choose $k$ s.t. $\log{k}=-\max_ia_i(\mathbb{x})$
+ preserves the relative differences
+ softmax output remains the same
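A minimal sketch of the max-subtraction trick above (NumPy assumed):

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax: subtracting max(a) before exponentiating
    leaves the output unchanged but prevents overflow."""
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

scores = np.array([1000.0, 1001.0, 1002.0])  # naive exp() would overflow here
print(softmax(scores))                       # ~[0.090, 0.245, 0.665]
```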
### Optimization and gradient descent
+ Goal: optimize an objective function $f(x)$.
+ Terminology
+ global minimum: $\mathbb{x}_g$ such that $f(\mathbb{x})\geq{f}(\mathbb{x}_g)\;\forall{\mathbb{x}}$.
+ local minimum: $\mathbb{x}_l$, a critical point of $f(\mathbb{x})$ that is lower than its neighboring points, but $f(\mathbb{x}_l)>f(\mathbb{x}_g)$.
+ saddle point: a critical point of $f(\mathbb{x})$ that is neither a local maximum nor a local minimum.
+ Gradient
+ how a small change in $\Delta\mathbb{x}$ affects $f(\mathbb{x})$ through $f(\mathbb{x}+\Delta\mathbb{x})\approx{f}(\mathbb{x})+\Delta\mathbb{x}^T\nabla_{\mathbb{x}}f(\mathbb{x})$.
+ to minimize $f(\mathbb{x})$, find the direction in which $f(\mathbb{x})$ decreases the fastest.
+ update $\mathbb{x}$ by $\mathbb{x}-\epsilon\nabla_xf(\mathbb{x})$.
+ $\epsilon$: learning rate.
+ batch v.s. minibatch
+ expensive to calculate the gradient by using every example in the training set.
+ batch algorithm: uses all $m$ examples in the training set to calculate the gradient.
+ minibatch algorithm (mostly used): approximates the gradient by calculating it using $k$ training examples, where $m>k>1$.
+ stochastic algorithm: approximates the gradient by calculating it over one example.
## Neural networks
### Neural network architecture
+ 3-layer network example
+ first layer: $\mathbb{h}_1=f(\mathbf{W}_1\mathbb{x}+\mathbb{b}_1)$
+ second layer: $\mathbb{h}_2=f(\mathbf{W}_2\mathbb{h}_1+\mathbb{b}_2)$
+ output layer: $\mathbb{z}=\mathbf{W}_3\mathbb{h}_2+\mathbb{b}_3$
+ Linear activation function
+ any composition of linear functions can be reduced to a single linear function.
+ $\mathbb{z}=\mathbf{W}\mathbb{x}+\mathbb{b}; \mathbf{W}=\mathbf{W}_N\cdots\mathbf{W}_2\mathbf{W}_1, \mathbb{b}=\mathbb{b}_N+\mathbf{W}_N\mathbb{b}_{N-1}+\cdots+\mathbf{W}_N\cdots\mathbf{W}_2\mathbb{b}_1$.
+ useful in some contexts, e.g. autoencoder.
+ Introducing nonlinearity
+ increase the network capacity.
+ $\mathbb{h}_i=f(\mathbf{W}_i\mathbb{h}_{i-1}+\mathbb{b}_i)$
+ $f(\cdot)$: activation function
+ does not typically act on the output layer.
+ Hidden layers
+ features that the later layers then use for decoding.
+ don't have to be handcrafted.
+ Activation functions
+ sigmoid

+ $\sigma(x)=\frac{1}{1+\exp(-x)}$
+ $\frac{d\sigma(x)}{dx}=\sigma(x)(1-\sigma(x))$
+ pros
+ behaves linearly around $x=0$.
+ differentiable everywhere.
+ cons
+ saturates and has zero gradient at extremes. (no learning)
+ not zero-centered. (centered at 0.5)
+ always non-negative
+ zig-zagging (all the terms in $\partial{\mathcal{L}}/\partial{w}\leq0$ or $\geq0$)
+ hyperbolic tangent

+ $\tanh(x)=2\sigma(2x)-1$
+ $\frac{d\tanh(x)}{dx}=1-\tanh^2(x)$
+ pros
+ behaves linearly around $x=0$.
+ differentiable everywhere.
+ zero-centered. (faster convergence)
+ cons
+ saturates at extremes, no additional learning occurs.
+ ReLU

+ $\text{ReLU}(x)=\max(0, x)$
+ $\frac{d\text{ReLU}(x)}{dx}=\mathbf{I}(x>0)$
+ pros
+ in practice, converges faster than sigmoid and tanh.
+ behaves linearly when a unit is active.
+ differentiable everywhere, except $x=0$.
+ gradients are large and not scaled by second order effects when $x>0$.
+ no saturation if $x>0$.
+ cons
+ non-negative, zig-zagging.
+ not differentiable at $x=0$, but not a large issue practically.
+ learning does not happen for examples that have zero activation. (addressed by leaky ReLU)
+ Softplus

+ $\zeta(x)=\log(1+\exp(x))$
+ $\frac{d\zeta(x)}{dx}=\sigma(x)$
+ in place of ReLU, but performs worse than ReLU empirically.
+ differentiable everywhere.
+ saturates less completely. (dying ReLU)
+ computationally expensive.
+ leaky ReLU / PReLU

+ $f(x)=\max(\alpha{x}, x)$
+ pros
+ avoids the stopping of learning when $x<0$.
+ $\alpha$ can be treated as a hyperparameter. (default: 0.01)
+ PReLU: $\alpha$ is a parameter to be optimized in learning.
+ ELU

+ $f(x)=\begin{cases}x & x>0 \\ \alpha(\exp(x)-1) & x\leq0\end{cases}$
+ pros
+ exponential linear unit avoids the stopping of learning when $x<0$.
+ cons
+ requires computation of the exponential. (more expensive)
+ activation functions in practice
+ ReLU is very popular.
+ sigmoid is almost never used, tanh is preferred.
+ may be worth trying out leaky ReLU / PReLU / ELU / maxout.
+ Cost functions
+ the choice of output units interacts with what cost function to use.
+ $\text{MSE}=\frac{1}{2}\sum_{i=1}^n(y^{(i)}-\sigma(z^{(i)}))^2; \sigma(z^{(i)})=\hat{y}^{(i)}$.

+ when $y^{(i)}=1$ and $z^{(i)}\rightarrow-\infty$, $\sigma(z^{(i)})\rightarrow0\Rightarrow\text{MSE}\uparrow$, but the gradient saturates to zero and no learning occurs.
+ $\text{CE}=-\sum_{i=1}^n[y^{(i)}\log\sigma(z^{(i)})+(1-y^{(i)})\log(1-\sigma(z^{(i)}))]; \sigma(z^{(i)})=\hat{y}^{(i)}$.

+ when $y^{(i)}=1$ and $z^{(i)}\rightarrow-\infty$, the gradient does not vanish, so learning still occurs.
+ learning only stalls when $\sigma(z)$ gets close to the right answer.
### Backpropagation
+ The application of the chain rule for derivatives.
+ Calculate the gradient of the objective with respect to the parameter, $\theta$.
+ Forward / backward propagation
+ forward propagation
+ taking input $\mathbb{x}$ and propagating it through each hidden layer sequentially until arriving at the output $\mathbb{y}$.
+ e.g. $f(x, y, z)=(x+y)z$

+ backpropagation
+ method of computing gradients.
+ information is passed backward through the network.
+ from cost function and outputs to the inputs.
+ take the upstream derivative and apply a local gradient to do the calculation.

+ e.g. $f(x, y, z)=(x+y)z$, with intermediate $w=x+y$

+ $\frac{\partial\mathcal{L}}{\partial{f}}=1$
+ $\frac{\partial\mathcal{L}}{\partial{z}}=\frac{\partial{f}}{\partial{z}}\frac{\partial\mathcal{L}}{\partial{f}}$
+ $\frac{\partial\mathcal{L}}{\partial{w}}=\frac{\partial{f}}{\partial{w}}\frac{\partial\mathcal{L}}{\partial{f}}$
+ $\frac{\partial\mathcal{L}}{\partial{x}}=\frac{\partial{w}}{\partial{x}}\frac{\partial{f}}{\partial{w}}\frac{\partial\mathcal{L}}{\partial{f}}$
+ e.g. two gradient paths converge
1. $x\ra{q_1}\ra\mc{L}$
2. $x\ra{q_2}\ra\mc{L}$
+ $\f{\pt\mc{L}}{\pt{x}}=\f{\pt\mc{L}}{\pt{q_1}}\f{\pt{q_1}}{\pt{x}}+\f{\pt\mc{L}}{\pt{q_2}}\f{\pt{q_2}}{\pt{x}}$
+ e.g. neural network layer

+ $\mbb{h}=\text{ReLU}(\mbf{W}\mbb{x}+\mbb{b}); \mbb{h}\in\mbb{R}^h, \mbb{x}\in\mbb{R}^m, \mbf{W}\in\mbb{R}^{h\times{m}}, \mbb{b}\in\mbb{R}^h$
+ $l=\Vert\mbb{h}-\mbb{z}\Vert^2\Ra\nabla_{\mbb{h}}l=2(\mbb{h}-\mbb{z})$
+ $\f{\pt{l}}{\pt{\mbb{a}}}=\f{\pt{l}}{\pt{\mbb{c}}}=\mbf{I}(\mbb{a}>0)\odot\f{\pt{l}}{\pt{\mbb{h}}}$, with intermediates $\mbb{c}=\mbf{W}\mbb{x}$ and $\mbb{a}=\mbb{c}+\mbb{b}$.
+ $\f{\pt{l}}{\pt{\mbb{x}}}=\mbf{W}^T\f{\pt{l}}{\pt{\mbb{c}}}$
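A minimal NumPy sketch of the forward and backward pass for this layer (the function names and cache layout are illustrative):

```python
import numpy as np

def layer_forward(W, x, b):
    """Forward pass for h = ReLU(Wx + b); returns h and a cache for backprop."""
    a = W @ x + b
    h = np.maximum(a, 0.0)
    return h, (W, x, a)

def layer_backward(dl_dh, cache):
    """Backward pass: apply the local ReLU gradient, then the matrix gradients."""
    W, x, a = cache
    dl_da = dl_dh * (a > 0)        # elementwise mask I(a > 0)
    dl_dW = np.outer(dl_da, x)     # gradient w.r.t. the weights
    dl_db = dl_da                  # gradient w.r.t. the bias
    dl_dx = W.T @ dl_da            # gradient passed to the previous layer
    return dl_dW, dl_db, dl_dx

W = np.array([[1.0, -2.0], [0.5, 0.3]]); b = np.zeros(2)
x = np.array([1.0, 0.5]); z = np.array([1.0, 1.0])
h, cache = layer_forward(W, x, b)
dl_dh = 2.0 * (h - z)              # gradient of the squared-error loss above
dW, db, dx = layer_backward(dl_dh, cache)
```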
+ Why backprop
+ computationally efficient.
+ breaks up the calculation of the gradient into small and simple steps.
+ Derivative properties
+ derivative of a scalar w.r.t. a matrix
+ $y$ is scalar, $\mbf{A}\in\mbb{R}^{m\times{n}}\Ra\nabla_{\mbf{A}}y\in\mbb{R}^{m\times{n}}$.
+ derivative of a vector w.r.t. a vector
+ $\mbb{y}\in\mbb{R}^n, \mbb{x}\in\mbb{R}^m, \mbb{y}=\mbf{W}\mbb{x}\Ra\nabla_{\mbb{x}}\mbb{y}\in\mbb{R}^{m\times{n}}$.
+ $\mbb{x}\in\mbb{R}^m, \mbb{y}\in\mbb{R}^n, \mbb{z}\in\mbb{R}^p\Ra\nabla_{\mbb{x}}\mbb{z}=\nabla_{\mbb{x}}\mbb{y}\nabla_{\mbb{y}}\mbb{z}\in\mbb{R}^{m\times{p}}$.
+ hessian
+ the generalization of the second derivative.
+ $\nabla_{\mbb{x}}(\nabla_{\mbb{x}}f(\mbb{x}))$
+ tensor derivatives
+ $\mbb{y}\in\mbb{R}^m, \mbb{x}\in\mbb{R}^n, \mbf{W}\in\mbb{R}^{m\times{n}}, \mbb{y}=\mbf{W}\mbb{x}\Ra\nabla_{\mbf{W}}\mbb{y}\in\mbb{R}^{m\times{n}\times{m}}$.
### Initializations
+ Random weight initialization
+ small weights: cause all activations to decay to zero. (bad for learning)
+ small weights lead to small activations, which cause the gradients to shrink during backprop.
+ may be appropriate for smaller neural networks in practice.

+ distribution of each layer's activations

+ gradients for the last layer w.r.t. $\mbf{W}$
+ large weights: cause the units to explode.
+ large activations $\ra$ large gradients. (gradients are proportional to the activations)

+ Xavier initialization
+ the variance of the units across all layers ought to be the same, and the same should hold for the backpropagated gradients.
+ initialize each weight in layer $i$ from $N(0, \f{2}{n_{\text{in}}+n_{\text{out}}})$.
+ typically leads to dying ReLU, but it is fine with tanh units.
+ $U(-\sqrt{\f{6}{n_{\text{in}}+n_{\text{out}}}}, \sqrt{\f{6}{n_{\text{in}}+n_{\text{out}}}})$ for ReLU.
+ Summary
+ Prevent vanishing/exploding gradients by keeping variance stable.
+ Scale weights based on the number of input/output neurons.
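A minimal sketch of Xavier (Glorot) initialization as described above (NumPy assumed):

```python
import numpy as np

def xavier_init(n_in, n_out, seed=0):
    """Draw weights from N(0, 2 / (n_in + n_out)) so that activation and
    gradient variances stay roughly constant across layers."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return np.random.default_rng(seed).normal(0.0, std, size=(n_out, n_in))

W1 = xavier_init(n_in=784, n_out=256)
print(W1.std())  # close to sqrt(2 / (784 + 256)) ~= 0.044
```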
### Batch normalization
+ Motivation
+ the distribution of the inputs to each layer changes as learning occurs in the previous layers, which causes the unit activations to be highly variable.
+ gradient descent on one layer assumes the other layers don't change, but those layers may change drastically.
+ makes the training of NN faster and more stable.
+ helps prevent vanishing/exploding gradients
+ acts like a regularizer, often reducing the need for dropout
+ Idea
+ make the output of each layer have standardized statistics (zero mean, unit variance).
+ affine $\ra$ BN $\ra$ ReLU $\ra$ affine $\ra$ BN $\ra$ ReLU.
+ Formula
+ $\hat{x_i}=\f{x_i-\mu_i}{\sqrt{\sigma_i^2+\epsilon}}; \mu_i=\f{1}{m}\S{j=1}{m}x_i^{(j)}, \sigma_i^2=\f{1}{m}\S{j=1}{m}(x_i^{(j)}-\mu_i)^2, \epsilon$ is small.
+ $i$: the $i$th artificial unit.
+ $m$: number of examples in the minibatch.
+ $\epsilon$: make sure to not divide by zero.
+ $y_i=\gamma_i\hat{x}_i+\beta_i$
+ $\gamma, \beta$: learnable parameters.
+ Notes
+ the reason the scaling is on a per unit basis is computational efficiency.
+ the scale and shift layer is inserted in case it is better that the activations not be zero mean and unit variance.
+ typically placed right before the nonlinear activation.
+ $\mbb{h}_i=f(BN(\mbf{W}_i\mbb{h}_{i-1}+\mbb{b}_i))$
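A minimal training-time sketch of the batch-norm formula above (NumPy assumed; the running statistics used at inference are omitted):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each unit over the minibatch, then scale and shift.
    x has shape (m, n_units); gamma and beta are the learnable parameters."""
    mu = x.mean(axis=0)                       # per-unit mean over the batch
    var = x.var(axis=0)                       # per-unit variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize
    return gamma * x_hat + beta               # scale and shift

x = np.random.default_rng(0).normal(5.0, 3.0, size=(64, 10))
y = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per unit
```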
### Regularization
+ Motivation
+ improve model generalization.
+ tend to increase the estimator bias while reducing the estimator variance.
+ prevent overfitting.
+ model size and complexity.
+ Parameter norm penalties
+ modify the cost function with a parameter norm penalty
+ $J(\theta; \mbf{X}, \mbb{y})+\alpha\Omega(\theta); \alpha\geq0$.
+ $\alpha$: hyperparameter that weights the contribution of the norm penalty.
+ $\alpha=0$: no regularization.
+ $\alpha\ra\infty$: minimize $\Omega(\theta)$.
+ L2 regularization (ridge regression / Tikhonov regularization)
+ cause the weights $\mbb{w}$ to have small norm.
+ $\Omega(\theta)=\f{1}{2}\mbb{w}^T\mbb{w}$
+ $\tilde{J}(\mbb{w};\mbf{X},\mbb{y})=J(\mbb{w};\mbf{X},\mbb{y})+\f{\alpha}{2}\mbb{w}^T\mbb{w}$
+ $\nabla_{\mbb{w}}\tilde{J}(\mbb{w}; \mbf{X},\mbb{y})=\alpha\mbb{w}+\nabla_{\mbb{w}}J(\mbb{w}; \mbf{X}, \mbb{y})$
+ $\mbb{w}\la\mbb{w}-\epsilon\nabla_{\mbb{w}}\tilde{J}(\mbb{w}; \mbf{X},\mbb{y})=(1-\epsilon\alpha)\mbb{w}-\epsilon\nabla_{\mbb{w}}J(\mbb{w}; \mbf{X}, \mbb{y})$
+ shrink the weights before performing the usual gradient update.
+ L1 regularization
+ $\Omega(\theta)=\Vert\mbb{w}\Vert_1=\S{i}{}|w_i|$
+ gradient is the same regardless of the size of $\mbb{w}$. (subgradient of $\Vert\mbb{w}\Vert_1$ is $\text{sign}(\mbb{w})$)
+ sparse solutions where $w_i=0$ for several $i$.
+ may be used for feature selection.
+ Dataset augmentation
+ flipped images
+ cropped images
+ brightness adjustment
+ lens correction
+ rotation
+ injecting noise into the network
+ label smoothing (makes the model less confident)
+ these all act as a form of regularization.
+ Multitask learning

+ have the model be trained to perform multiple tasks.
+ share common factors.
+ Transfer learning
+ features trained in one context are used in another with little additional training.
+ if the tasks are similar enough, then the features at later layers of the network ought to be good features for the new task.
+ if little training data is available, but the tasks are similar, just need to train a new linear layer at the output of the pre-trained network.
+ Feature extraction & fine-tuning
+ Ensemble learning
+ train multiple different models, and average their results together at test time.
+ almost always increases performance.
+ bagging
+ procedures
1. construct $k$ datasets using the bootstrap.
2. train $k$ different models using these $k$ datasets.
+ notes
+ in practice, neural networks reach a wide variety of solutions given different initializations, hyperparameters, etc.
+ model averaging is computationally expensive.
+ Dropout
+ computationally inexpensive and effective method for generalization.
+ approximating the bagging procedure for exponentially many models, while only optimizing a single set of parameters.
+ each mask is like a different model. ($N$ hidden units $\ra$ $2^N$ different model configurations)
+ cause units to encode redundant features.
+ procedures

1. on a given training iteration, sample a binary mask (0 or 1) for all input and hidden units.
+ the Bernoulli parameter $p$ is a hyperparameter.
+ typical values are $p=0.8$ for input units and $p=0.5$ for hidden units.
2. apply the mask to all units.
3. perform the forward pass and backward pass, and parameter update.
4. in prediction, multiply each hidden unit by the parameter of its Bernoulli mask, $p$.
+ weight scaling inference rule
+ scale the activations: $p\cdot\text{ReLU}(\cdot)$
+ no additional complexity
+ inverted dropout
+ PyTorch & TensorFlow
+ the scaling by $1/p$ is done in training.
+ so the output has the same expected value as if dropout had never been performed.
+ e.g. with $p=0.5$: activation vector $[a, b, c, d]$, mask $[0, 1, 0, 1]$, scaled output $[0, 2b, 0, 2d]$.
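A minimal sketch of inverted dropout as described above (NumPy assumed; the mask is Bernoulli with keep probability $p$ and survivors are scaled by $1/p$ during training):

```python
import numpy as np

def inverted_dropout(h, p=0.5, train=True, seed=None):
    """Keep each unit with probability p and scale the survivors by 1/p at
    training time, so no rescaling is needed at inference."""
    if not train:
        return h                                  # inference: identity
    rng = np.random.default_rng(seed)
    mask = (rng.random(h.shape) < p) / p          # entries are 0 or 1/p
    return h * mask

h = np.array([1.0, 2.0, 3.0, 4.0])
print(inverted_dropout(h, p=0.5, seed=0))  # random units zeroed, survivors doubled
print(inverted_dropout(h, train=False))    # unchanged at inference
```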
+ Mathematically

### Optimization
+ Gradient descent
+ $\theta\la\theta-\epsilon\nabla_{\theta}J(\theta)$
+ Stochastic gradient descent
+ small batch sizes can be seen to have a regularizing effect.
+ procedures
+ set a learning rate $\epsilon$ and an initial parameter setting $\theta$.
+ set a minibatch size of $m$ examples.
+ until the stopping criterion is met.
1. sample $m$ examples from the training set, $\{\mbb{x^{(1)}}, \mbb{x^{(2)}}, ..., \mbb{x^{(m)}}\}$ and their corresponding outputs $\{\mbb{y^{(1)}}, \mbb{y^{(2)}}, ..., \mbb{y^{(m)}}\}$.
2. compute the gradient estimate $\mbb{g}=\f{1}{m}\S{i=1}{m}\nabla_{\theta}J(\theta; \mbb{x}^{(i)}, \mbb{y}^{(i)})$.
3. update parameters $\theta\la\theta-\epsilon\mbb{g}$.
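A minimal sketch of the minibatch SGD loop above (NumPy assumed; `grad_fn` is a user-supplied callable returning the gradient estimate for a batch, and the linear-regression demo is illustrative):

```python
import numpy as np

def sgd(theta, grad_fn, data, lr=0.01, batch_size=32, n_steps=1000, seed=0):
    """Sample m examples, compute the gradient estimate, take a step."""
    rng = np.random.default_rng(seed)
    X, y = data
    for _ in range(n_steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)  # sample a minibatch
        g = grad_fn(theta, X[idx], y[idx])                        # gradient estimate
        theta = theta - lr * g                                    # parameter update
    return theta

# Usage sketch: mean-squared-error gradient for a linear model y ~ X @ theta.
X = np.random.default_rng(1).normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5])
grad = lambda theta, Xb, yb: Xb.T @ (Xb @ theta - yb) / len(yb)
print(sgd(np.zeros(3), grad, (X, y)))  # approaches [1.0, -2.0, 0.5]
```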
+ Momentum
+ maintain the running mean of the gradients, which then updates the parameters.
+ average away the zigzagging components.
+ pushes the descent to find a local, but not necessarily global, minimum.
+ tends to find shallow, flat local minima.
+ benefits
+ faster convergence
+ smoother path (reduce zig-zag)
+ escape shallow minima (even when the current gradient is near 0)
+ procedures

+ initialize $\mbb{v}=0$.
+ set $\alpha\in[0, 1]$, typical values are $\alpha=0.9$ or $0.99$.
+ until stopping criterion is met.
1. compute gradient $\mbb{g}$.
2. update $\mbb{v}\la\alpha\mbb{v}-\epsilon\mbb{g}$.
3. gradient step $\theta\la\theta+\mbb{v}$.
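A minimal sketch of one momentum update following the procedure above (NumPy assumed):

```python
import numpy as np

def momentum_step(theta, v, g, lr=0.01, alpha=0.9):
    """One SGD-with-momentum update: accumulate velocity, then step."""
    v = alpha * v - lr * g     # exponentially decayed running mean of gradients
    theta = theta + v          # move along the accumulated velocity
    return theta, v

theta, v = np.zeros(3), np.zeros(3)
g = np.array([1.0, -0.5, 0.2])             # gradient from the current minibatch
theta, v = momentum_step(theta, v, g)
```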
+ Nesterov momentum
+ the gradient is calculated at the parameter setting after taking a step along the direction of momentum.
+ improve convergence
+ gradient is calculated at lookahead position
+ procedures

+ initialize $\mbb{v}=0$.
+ until stopping criterion is met.
1. update $\mbb{v}\la\alpha\mbb{v}-\epsilon\nabla_{\theta}J(\theta+\alpha\mbb{v})$.
2. gradient step $\theta\la\theta+\mbb{v}$.
+ an equivalent representation that doesn't require evaluating the gradient at $\theta+\alpha\mbb{v}$:
+ $\tilde{\theta}_{\text{old}}=\theta_{\text{old}}+\alpha\mbb{v}_{\text{old}}$
1. update $\mbb{v}_{\text{new}}=\alpha\mbb{v}_{\text{old}}-\epsilon\nabla_{\tilde{\theta}_{\text{old}}}J(\tilde{\theta}_{\text{old}})$.
2. gradient step $\tilde{\theta}_{\text{new}}=\tilde{\theta}_{\text{old}}+\mbb{v}_{\text{new}}+\alpha(\mbb{v}_{\text{new}}-\mbb{v}_{\text{old}})$.
3. roll forward: set $\mbb{v}_{\text{old}}=\mbb{v}_{\text{new}}, \tilde{\theta}_{\text{old}}=\tilde{\theta}_{\text{new}}$ for the next iteration.
+ Learning rate
+ in the beginning, a larger learning rate is typically better.
+ later in training, $\epsilon$ may need to be small to make appropriate updates to the parameters.
+ approaches
+ annealing: decay rule to the learning rate.
+ adaptive methods: update the learning rate based on the history of gradients.
+ Adaptive gradient (Adagrad)
+ a form of stochastic gradient descent where the learning rate is decreased through division by the historical gradient norms.
+ procedures
+ initialize $\mbb{a}=0$.
+ set $v$ at a small value to avoid division by zero.
+ until stopping criterion is met.
1. compute the gradient $\mbb{g}$.
2. update $\mbb{a}\la\mbb{a}+\mbb{g}\odot\mbb{g}$.
3. gradient step: $\theta\la\theta-\f{\epsilon}{\sqrt{\mbb{a}}+v}\odot\mbb{g}$.
+ cons
+ the learning rate keeps shrinking due to the cumulative sum. (eventually too small)
+ RMSProp
+ augments Adagrad by making the gradient accumulator an exponentially weighted moving average.
+ procedures
+ initialize $\mbb{a}=0$.
+ set $v$ to be sufficiently small.
+ set $\beta$ to be between 0 and 1. (typically a value like 0.99)
+ until stopping criterion is met.
1. compute the gradient $\mbb{g}$.
2. update $\mbb{a}\la\beta\mbb{a}+(1-\beta)\mbb{g}\odot\mbb{g}$.
3. gradient step $\theta\la\theta-\f{\epsilon}{\sqrt{\mbb{a}}+v}\odot\mbb{g}$.
+ pros
+ keep learning rates stable over time
+ cons
+ requires tuning $\beta$ and learning rate
+ RMSProp with momentum
+ procedures
+ initialize $\mbb{a}=0$.
+ set $\alpha, \beta$ to be between 0 and 1.
+ set $v=1e-7$.
+ until stopping criterion is met.
+ compute gradient $\mbb{g}$.
+ accumulate gradient $\mbb{a}\la\beta\mbb{a}+(1-\beta)\mbb{g}\odot\mbb{g}$.
+ momentum $\mbb{v}\la\alpha\mbb{v}-\f{\epsilon}{\sqrt{\mbb{a}}+v}\odot\mbb{g}$.
+ gradient step $\theta\la\theta+\mbb{v}$.
+ in practice, gets to the optimum more quickly than Adagrad.
+ Adam
+ adaptive moment estimation.
+ composed of a momentum-like step, followed by an Adagrad/RMSProp-like step.
+ Adam with no bias correction
+ procedures
+ initialize $\mbb{v}=0$ as the first moment, and $\mbb{a}=0$ as the second moment.
+ set $\beta_1$ and $\beta_2$ to be between 0 and 1. (suggested defaults are $\beta_1=0.9$ and $\beta_2=0.999$)
+ initialize $v$ to be sufficiently small.
+ until stopping criterion is met.
1. compute gradient $\mbb{g}$.
2. first moment update $\mbb{v}\la\beta_1\mbb{v}+(1-\beta_1)\mbb{g}$.
3. second moment update $\mbb{a}\la\beta_2\mbb{a}+(1-\beta_2)\mbb{g}\odot\mbb{g}$.
4. gradient step $\theta\la\theta-\f{\epsilon}{\sqrt{\mbb{a}}+v}\odot\mbb{v}$.
+ Adam with bias correction
+ bias corrections are to account for initialization.
+ without correction, the moment estimates are biased toward zero in the early iterations, which distorts the effective step size. (less stable early on)
+ procedures
+ initialize $\mbb{v}=0$ as the first moment, and $\mbb{a}=0$ as the second moment.
+ set $\beta_1$ and $\beta_2$ to be between 0 and 1. (suggested defaults are $\beta_1=0.9$ and $\beta_2=0.999$)
+ initialize $v$ to be sufficiently small.
+ initialize $t=0$.
+ until stopping criterion is met.
+ compute gradient $\mbb{g}$.
+ time update $t\la{t+1}$.
+ first moment update $\mbb{v}\la\beta_1\mbb{v}+(1-\beta_1)\mbb{g}$.
+ second moment update $\mbb{a}\la\beta_2\mbb{a}+(1-\beta_2)\mbb{g}\odot\mbb{g}$.
+ bias correction in moments $\tilde{\mbb{v}}=\f{1}{1-\beta_1^t}\mbb{v}, \tilde{\mbb{a}}=\f{1}{1-\beta_2^t}\mbb{a}$.
+ gradient step $\theta\la\theta-\f{\epsilon}{\sqrt{\tilde{\mbb{a}}}+v}\odot\tilde{\mbb{v}}$.
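A minimal sketch of one Adam update with bias correction, following the procedure above (NumPy assumed; `eps` plays the role of the small constant $v$, and the first moment is named `m` here to avoid clashing with it):

```python
import numpy as np

def adam_step(theta, g, m, a, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum-like first moment, RMSProp-like second moment,
    bias correction for the zero initialization, then the gradient step."""
    m = beta1 * m + (1 - beta1) * g          # first moment update
    a = beta2 * a + (1 - beta2) * g * g      # second moment update
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    a_hat = a / (1 - beta2 ** t)             # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(a_hat) + eps)
    return theta, m, a

theta, m, a = np.zeros(3), np.zeros(3), np.zeros(3)
g = np.array([0.3, -0.1, 0.05])              # gradient from the current minibatch
theta, m, a = adam_step(theta, g, m, a, t=1)
```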
+ First order methods
+ methods
+ SGD
+ optimizers used to augment the learning rate: Adagrad, RMSProp, Adam.
+ only use the first derivative, take linear steps along the gradient.
+ Second order methods
+ use the curvature of the cost function to know how to take steps.
+ strategy
+ steep: small step
+ flat: large step
+ Challenges in gradient descent
+ exploding gradients
+ small changes in the parameters can drastically change the cost function.
+ usually happens if parameters are repeatedly multiplied together.
+ solutions
+ gradient clipping (upper-bounds the gradient norm; see the sketch at the end of this section)
+ $\mbb{g}\la\f{\mbb{g}}{\Vert\mbb{g}\Vert}\cdot\text{clip}$ when $\Vert\mbb{g}\Vert>\text{clip}$.
+ vanishing gradients
+ repeated multiplication of a matrix $\mbf{W}$ can cause vanishing gradients.
+ solutions
+ architectural decisions
+ appropriate regularization
+ ReLU/Leaky ReLU: avoid saturation
+ Xavier initialization: balanced activations
+ Batch normalization
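A minimal sketch of gradient clipping by norm, referenced above (NumPy assumed):

```python
import numpy as np

def clip_gradient(g, clip=5.0):
    """Rescale g so its norm never exceeds `clip`."""
    norm = np.linalg.norm(g)
    if norm > clip:
        g = g * (clip / norm)
    return g

print(clip_gradient(np.array([3.0, 4.0]), clip=2.5))  # -> [1.5, 2.0], norm 2.5
```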
## CNN

### Motivation
+ Fully connected networks require many parameters.
+ take long to train.
+ may be prone to overfitting.
### Convolutional layer
+ Defines a collection of filters, each with the same depth as the input.
+ input: $(w, h, d)$.
+ filter: $(w_f, h_f, d)$, and typically $w_f<w$.
+ output: $(w-w_f+1, h-h_f+1)$.
+ Multiple filters
+ call each output a slice.
+ the output slices of the convolution operations with each filter are composed together to form a $(w-w_f+1, h-h_f+1, n_f)$ tensor, where $n_f$ is the number of filters.
+ the output will be passed through an activation function and become the input to the next layer.
+ Convolutional layers have sparse interactions.

+ each output is connected to a small number of inputs.
+ pros
+ reduce computational memory.
+ each neuron has $w\cdot{h}\cdot{d}$ weights in a fully connected layer, but only $w_f\cdot{h_f}\cdot{d}$ in a convolutional layer.
+ reduce computational time.
+ for $m$ inputs and $n$ outputs in a hidden layer, a fully connected layer requires $O(mn)$ operations, while a convolutional layer requires only $O(kn)$ if each output is connected to only $k$ inputs.
+ share parameters.
+ every output neuron in a given slice uses the same set of filter parameters.
+ cons
+ information from different parts of the input may not interact.
+ requires composing more layers so that distant parts of the input can interact.
+ Padding
+ output: $(w-w_f+1+2\cdot\text{pad}, h-h_f+1+2\cdot\text{pad})$.
+ optimal amount of zero padding is between $\text{pad}=0$ and the $\text{pad}$ that causes the output to have the same width and height.
+ Stride

+ defines how much the filter moves in the convolution.
+ output: $(\f{w-w_f+2\text{pad}}{\text{stride}}+1, \f{h-h_f+2\text{pad}}{\text{stride}}+1)$.
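A small helper implementing the output-size formulas above (plain Python; illustrative only):

```python
def conv_output_size(w, h, w_f, h_f, pad=0, stride=1):
    """Spatial output size of a convolution with the given padding and stride."""
    return ((w - w_f + 2 * pad) // stride + 1,
            (h - h_f + 2 * pad) // stride + 1)

print(conv_output_size(32, 32, 5, 5, pad=2, stride=1))  # -> (32, 32), size preserved
print(conv_output_size(32, 32, 3, 3, pad=0, stride=2))  # -> (15, 15), downsampled
```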
### Pooling layer
+ Downsampling
+ Output: $(\f{w-w_p}{\text{stride}}+1, \f{h-h_p}{\text{stride}}+1)$ when pooling filter is $(w_p, h_p)$.
+ Notes
+ no additional parameters.
+ keep important features.
+ intended to introduce small spatial invariances.

### CNN architecture

+ Several paired convolutional-ReLU layers.
+ Potentially one max pooling layer.
+ May end with a fully connected-ReLU layer, followed by a softmax classifier.
### Case studies
+ LeNet-5

+ [CONV-POOL]x2 - CONV - FC - OUT
+ AlexNet

+ SqueezeNet
+ lightweight CNN designed to achieve AlexNet-level accuracy with 50x fewer parameters
+ smaller model size / competitive accuracy / faster inference


+ ZFNet
+ smaller filters applied at smaller strides.
+ more filters in deeper layers.
+ VGGNet
+ [CONVx2-POOL]x3 - [CONVx3-POOL]x2 - FCx3 - Softmax
+ all CONV filters are uniform: 3x3 with stride 1, pad 1.
+ all POOL filters are uniform: 2x2 max pool with stride 2.
+ GoogLeNet

+ inception module
+ extract multi-scale features efficiently and in parallel
+ motivation
+ huge variation in the location of information.
+ large kernel for information distributed globally.
+ small kernel for information spread locally.
+ very deep networks are prone to overfitting.
+ have filters with multiple sizes operate on the same level.

+ ResNet
## RNN
### Motivation
+ CNNs lack recurrent connectivity; they have no dynamics.
+ a CNN will always produce the same output given the same input, irrespective of what happened in the past.
### RNN architecture
+ Vanilla RNN

+ $\mbb{h}_t=f(\mbf{W}_{\text{rec}}\mbb{h}_{t-1}+\mbb{b}_{\text{rec}}+\mbf{W}_{\text{in}}\mbb{x}_t)$
+ $\mbb{z}_t=\mbf{W}_{\text{out}}\mbb{h}_t+\mbb{b}_{\text{out}}$
+ $\hat{y}_{t, i}=\text{softmax}_i(\mbb{z}_t)$
+ RNN cost function
+ regression: $\S{t}{}\Vert\hat{\mbb{y}_t}-\mbb{y}_t\Vert^2$.
+ classification: $\S{t}{}\mc{L}_t; \mc{L}_t$ is the softmax loss at time $t$.
+ only care about the output after some time $\mc{T}$: $\S{t\geq\mc{T}}{}\mc{L}_t$.
+ RNN training

+ there are several gradient paths to the parameters when doing backpropagation.
+ Vanishing and exploding gradients
+ backpropagation through time.
+ solutions
+ truncated backpropagation.
+ e.g. TBPTT length = 20: feed time steps 0-19 and backprop through 0-19; then feed 20-39 and backprop through 20-39 (the hidden state from step 19 becomes the input state for step 20).
+ cons
+ cannot fully capture long-term dependencies
+ constrain the gradient norm to a certain value $c$ to address exploding gradients.
+ $\mbb{g}\la\f{c}{\Vert\mbb{g}\Vert}\mbb{g}$ when $\Vert\mbb{g}\Vert>c$
+ weight initialization.
## LSTM
### Motivation
+ Address the problem of vanishing and exploding gradients.
+ store short-term memory for a long period of time.
### Architecture

+ Formula
+ $\mbb{f}_t=\sigma(\mbf{W}_f[\mbb{h_{t-1}}\;\mbb{x}_t]^T+\mbb{b}_f)$
+ $\mbb{i}_t=\sigma(\mbf{W}_i[\mbb{h_{t-1}}\;\mbb{x}_t]^T+\mbb{b}_i)$
+ $\mbb{v}_t=\tanh(\mbf{W}_v[\mbb{h_{t-1}}\;\mbb{x}_t]^T+\mbb{b}_v)$
+ $\mbb{o}_t=\sigma(\mbf{W}_o[\mbb{h_{t-1}}\;\mbb{x}_t]^T+\mbb{b}_o)$
+ $\mbb{c}_t=\mbb{f}_t\odot\mbb{c}_{t-1}+\mbb{i}_t\odot\mbb{v}_t$
+ $\mbb{h}_t=\mbb{o}_t\odot\tanh(\mbb{c}_t)$
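A minimal NumPy sketch of one LSTM step implementing the equations above (stacking the four gate weight matrices into a single `W` is an illustrative layout choice, not the only one):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: compute the forget, input, value, and output gates,
    update the cell state, then read out the new hidden state."""
    z = W @ np.concatenate([h_prev, x_t]) + b   # all four gates in one matmul
    H = len(h_prev)
    f = sigmoid(z[0*H:1*H])                     # forget gate
    i = sigmoid(z[1*H:2*H])                     # input gate
    v = np.tanh(z[2*H:3*H])                     # value (candidate) gate
    o = sigmoid(z[3*H:4*H])                     # output gate
    c_t = f * c_prev + i * v                    # forget, then write in
    h_t = o * np.tanh(c_t)                      # read out
    return h_t, c_t

H, D = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * H, H + D)); b = np.zeros(4 * H)
h_t, c_t = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
```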
+ Cell state

+ think of the cell state as some memory or tape.
+ hold some value and remembers it.
+ three things can be done to the cell state at each time step:
+ forget information
+ write-in information
+ read-out information
+ forgetting information
+ the forget gate $\mbb{f}_t$ drives the update $\mbb{c}_t=\mbb{c}_{t-1}\odot\mbb{f}_t$.
+ $\mbb{f}_t\approx1\ra\mbb{c}_t\approx\mbb{c}_{t-1}$: most information is remembered.
+ $\mbb{f}_t\approx0\ra\mbb{c}_t\approx0$: most information is forgotten.
+ writing in information
+ value gate, $\mbb{v}_t\in(-1, 1)$ tells us the value we want to add to the cell states.
+ input gate, $\mbb{i}_t\in(0, 1)$ tells us how much of the value gate, $\mbb{v}_t$ to write to the cell state.
+ $\mbb{c}_t=\mbb{c}_{t-1}+\mbb{i}_t\odot\mbb{v}_t$.
+ reading out information
+ update the hidden state after updating the cell state by forgetting information and writing in information.
+ output gate, $\mbb{o}_t$, tells us how much of the cell state is going to be read out to the hidden state.
+ if $\mbb{o}_t$ is small, very little will be read out.
+ if $\mbb{o}_t$ is large, a lot of the cell state will be read out.
+ $\mbb{h}_t=\mbb{o}_t\odot\tanh(\mbb{c}_t)$.