# CS231n L4 Introduction to Neural Network

## Introduction

Simpler functions are stacked on top of each other hierarchically to form a complex non-linear function.

![](https://i.imgur.com/wVMCdzS.png)

:::success
Allows multiple templates to be weighted together to get the final score for each class.
:::

![](https://i.imgur.com/zTi2xGe.png)

### Architecture

![](https://i.imgur.com/wwLkrO4.png)
*Hidden layer: vector*

## Convolutional Neural Networks

### History

* Cat Experiment
  * Hierarchical organization: different types of cells respond to different visual stimuli
* Neocognitron
  * Sandwich architecture (SCSCSC...)
  * Simple cells (S): modifiable parameters
  * Complex cells (C): perform pooling
* Gradient-based learning
  * Applied to document recognition using backpropagation
* Deep Convolutional Neural Networks
  * ImageNet Classification

### Introduction

![](https://i.imgur.com/hzfzuJ1.jpg =80%x)
![](https://i.imgur.com/BWHTssd.png =80%x)

Calculate the dot product of the filter and the image patch under it, placing the filter at every position on the image.

![](https://i.imgur.com/D9dvlx0.png =80%x)

ConvNet: a sequence of conv layers, interspersed with activation functions.
Output of the previous conv layer = input of the next conv layer

* Filters in earlier layers: low-level features (like the results of the cat experiment)
* Increasing depth: increasing the number of filters (works well in practice)

:::success
Preserves spatial structure
:::

### Conv Layer (Activation Map)

![](https://i.imgur.com/LM9YjLT.png =40%x)

E.g. stride of 1 with a 3x3 filter (a stride of 3 won't work, because the filter no longer fits the input evenly)
N = input width (or height), F = filter size

![](https://i.imgur.com/3Lsk6LI.png =30%x)
![](https://i.imgur.com/BzbNYok.png =40%x)
*In practice: pad zeros around the border (maintains full-size output -> won't shrink too fast)*

:::success
An activation map is a big sheet of neuron outputs. Each is connected to a small region in the input.
:::

#### Brain View of Conv Layer

Each value shows how strongly this neuron is triggered at a given spatial location in the image.
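
The relation between N, F, stride and zero-padding from the Conv Layer section can be checked with a small sketch. It uses the usual output-size rule, (N + 2 * pad - F) / stride + 1. The function name `conv_output_size`, its parameter names, and the 7x7 input are assumptions made for illustration; they are not taken from the notes.

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size of a conv layer: (N + 2*pad - F) / stride + 1.

    Raises ValueError when the filter does not tile the input evenly.
    """
    span = n + 2 * pad - f
    if span < 0 or span % stride != 0:
        raise ValueError(
            f"filter {f} with stride {stride} does not fit input {n} (pad {pad})"
        )
    return span // stride + 1


# Assumed example: 7x7 input, 3x3 filter.
print(conv_output_size(7, 3, stride=1))         # 5
print(conv_output_size(7, 3, stride=2))         # 3
print(conv_output_size(7, 3, stride=1, pad=1))  # 7, zero-padding keeps the full size
# conv_output_size(7, 3, stride=3) raises ValueError: (7 - 3) is not divisible by 3
```

With a one-pixel zero border, the 3x3 filter at stride 1 returns an output of the same size as the input, which is why padding keeps deep stacks of conv layers from shrinking too fast.
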
Preserves spatial structure, allowing reasoning on top of the activation maps.

### Pooling Layer (Downsampling)

* Makes the representations smaller and more manageable.
* Operates over each activation map independently.
* Doesn't change the input depth.
* Zero-padding is not used during pooling.

![](https://i.imgur.com/TPnErSh.png =60%x)

#### Max Pooling

![](https://i.imgur.com/tDIotWw.png =70%x)

Take the maximum value in each pooling window.
Why max pooling? Each value measures how strongly the filter fired in that region, so the maximum keeps the strongest response.

### Activation Functions

#### Sigmoid

Ranges from 0 to 1, interpreted as the **firing rate** of a neuron.

![](https://i.imgur.com/51JwS22.png =30%x)
![](https://i.imgur.com/6CDABi9.png =40%x)

:::danger
Problems:

* Saturated neurons **kill** the gradients (e.g. x = -10, x = 10...). The local gradient is close to 0, so almost no gradient flows back.
* Sigmoid outputs are not zero-centered. Gradient updates are inefficient (the gradients on all weights of a neuron share the same sign, so updates can only move in the same direction).
![](https://i.imgur.com/v8TVlo9.png =60%x)
* exp() is computationally expensive
:::

#### tanh(x)

Ranges from -1 to 1. **Zero-centered.** But still kills gradients when saturated.

![](https://i.imgur.com/jHo1Kbq.png =40%x)

#### ReLU

![](https://i.imgur.com/7nxH8Bl.png =25%x)

* Doesn't saturate (in the positive region).
* Converges faster than sigmoid/tanh.
* More biologically plausible (closer to the input/output behavior of real neurons).
* **Not zero-centered**

![](https://i.imgur.com/48ZRGEr.png =40%x)

:::danger
Problem:
![](https://i.imgur.com/IzV3mJi.png =60%x)
Learning rate too big -> works well at the beginning, but units can die (stop updating) later on.
:::

#### Leaky ReLU

* Doesn't saturate
* Fast and efficient
* Doesn't die

![](https://i.imgur.com/8GfZlAJ.png =40%x)

#### PReLU

More flexibility: the negative slope is a learnable parameter alpha instead of a hard-coded number (0.01).

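
The activation functions covered so far can be sketched in plain Python as follows. This is only an illustrative sketch: the function names are made up for this example, and `alpha` is passed in by hand here, whereas in PReLU it would be learned during training.

```python
import math


def sigmoid(x):
    """Squashes x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Local gradient of the sigmoid; close to 0 when the unit saturates."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    """Passes positive inputs through, zeroes out the rest."""
    return max(0.0, x)


def prelu(x, alpha):
    """Like ReLU, but with slope alpha in the negative region."""
    return x if x > 0 else alpha * x


def leaky_relu(x):
    """PReLU with the slope hard-coded to 0.01."""
    return prelu(x, 0.01)


print(sigmoid_grad(0.0))          # 0.25, the largest gradient the sigmoid can pass
print(sigmoid_grad(10.0) < 1e-4)  # True: a saturated unit passes almost no gradient
print(relu(-2.0))                 # 0.0
print(leaky_relu(-2.0))           # -0.02
print(prelu(-2.0, 0.25))          # -0.5
```

`math.tanh` from the standard library covers the tanh case; it saturates in the same way as the sigmoid, only centered at zero.
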
![](https://i.imgur.com/5CZ80hB.png =40%x)

#### ELU

Exponential Linear Units

* All benefits of ReLU
* Closer to zero-mean outputs
* **More robust** to noise compared to Leaky ReLU
* **Computation requires exp()**

![](https://i.imgur.com/lfFQnFU.png =50%x)

#### Maxout Neuron

Generalizes ReLU and Leaky ReLU by taking the maximum of two linear functions. This doubles the number of parameters per neuron.

![](https://i.imgur.com/wZ7hyEp.png =40%x)

### Data Preprocessing

![](https://i.imgur.com/b266nTR.png =90%x)
![](https://i.imgur.com/8410sWz.png =90%x)
*Not commonly used in processing images*

:::success
For images, center only. (Don't want to reduce the dimensionality.)
:::

### Weight Initialization

#### Random Assignment

For small networks, initializing the weights with random numbers is okay.
Plain random assignment won't work for deep networks.
If the weights are too small, all activations become 0.
If the weights are too big, the units saturate and the gradients become 0.

![](https://i.imgur.com/F7b4usB.png =70%x)
*Activations = 0*
![](https://i.imgur.com/Ky3MrIQ.png =70%x)
*Gradients = 0*

#### Xavier Initialization

Many inputs -> smaller weights.

![](https://i.imgur.com/ogQas88.png =70%x)
![](https://i.imgur.com/3MJHxTG.png =70%x)

#### For ReLU

Because ReLU kills half of the units (setting them to 0), an additional **/2** has to be added at the end.

![](https://i.imgur.com/p5RZ4XJ.png =70%x)

### Monitoring Training and Adjusting Parameters

1. Preprocess the data
2. Choose the architecture (e.g. 1 hidden layer of 50 neurons)
---Start of sanity check---
3. Check if the loss is reasonable
4. After adding regularization, the loss should go up
5. Try to get a very small loss on a small subset of the data
---End of sanity check---
6. Find the learning rate that makes the loss go down

### Hyperparameter Optimization

Grid Search vs.
Random Search

![](https://i.imgur.com/yMDSxoc.png =60%x)
*Grid: missing important values*
![](https://i.imgur.com/3WJwOeR.png =50%x)
*Changing learning rate*
![](https://i.imgur.com/SpfiR8e.png =50%x)
*Big gap between training / validation accuracy -> overfitting*

### Optimization

Problems with SGD:

* The loss is much more sensitive in some directions than in others (common in high-dimensional problems) -> zigzagging
![](https://i.imgur.com/Ff9PLjj.png =50%x)
* The loss function has a local minimum or saddle point (gradient = 0), where SGD gets stuck
![](https://i.imgur.com/6rqupCQ.png =50%x)

#### First-Order Optimization

![](https://i.imgur.com/8YNASt6.png =50%x)

##### SGD + Momentum

*Build up **velocity** as a running mean of gradients.*

![](https://i.imgur.com/opQKEdz.png =60%x)
![](https://i.imgur.com/wl6mURk.png =50%x)
*Velocity builds up while moving downhill and carries the update through saddle points*
![](https://i.imgur.com/5Cc9I2u.png =80%x)

##### AdaGrad

*Keep a **grad_squared** term (historical sum of squared gradients in each dimension) and divide each update by its square root. This reduces zigzagging.*

:::success
Step size gets smaller over time. (Might get stuck at saddle points)
:::

To solve this problem:
![](https://i.imgur.com/m13KXsN.png =80%x)

##### Adam

*RMSProp with momentum.*

![](https://i.imgur.com/mx1Dw90.png =80%x)

:::info
Learning rate decay over time during training is commonly used.
![](https://i.imgur.com/DocieGa.png =60%x)
:::

#### Second-Order Optimization

![](https://i.imgur.com/9NQBDXv.png =60%x)

Step directly toward the minimum of the quadratic approximation -> doesn't require a learning rate.
E.g. L-BFGS

### Prevent Overfitting

#### Model Ensembles

Train multiple independent models and average their results at test time.
This increases performance by a few percentage points.

Polyak averaging: Instead of using the actual parameter vector, keep a moving average of the parameter vector and use that at test time.
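
Polyak averaging as described above can be sketched in a few lines. This is a minimal sketch under assumptions: the name `polyak_update`, the default `decay=0.995`, and the toy "training update" are illustrative choices, not something the notes specify.

```python
def polyak_update(avg_params, params, decay=0.995):
    """Blend the latest parameters into the running average."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg_params, params)]


params = [0.0, 0.0]
avg_params = list(params)
for _ in range(3):
    params = [p + 1.0 for p in params]  # stand-in for one training update
    # decay=0.5 is used here only to keep the numbers easy to follow
    avg_params = polyak_update(avg_params, params, decay=0.5)

print(params)      # [3.0, 3.0]
print(avg_params)  # [2.125, 2.125]
```

Training keeps updating `params` as usual; only evaluation at test time would read `avg_params`.
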
#### Regularization

*Improve single-model performance.*

Training: Add some randomness
Testing: Average out the randomness

##### Dropout

Randomly set some neurons to 0 in each forward pass. The probability of dropping (a hyperparameter) is usually set to 0.5.

![](https://i.imgur.com/XEKxGGt.png =80%x)
Left: original version; Right: dropped-out version

![](https://i.imgur.com/OU4mB6P.png =80%x)
At test time, multiply the activations by the keep probability (equal to the dropout probability when it is 0.5) => approximates the integral over all dropout masks

![](https://i.imgur.com/b7x6JLc.png =80%x)
Inverted dropout: divide by the keep probability at training time so that no multiplication is needed at test time

##### Batch Normalization

*Makes activations unit Gaussian. Speeds up training (allows higher learning rates) and keeps gradients from vanishing.*

Inserted after fully connected or convolutional layers, and before the nonlinearity (e.g. tanh).

![](https://i.imgur.com/CtLHhUZ.png =70%x)
*Retrieved from https://medium.com/ching-i/batch-normalization-%E4%BB%8B%E7%B4%B9-135a24928f12*
![](https://i.imgur.com/OWa1ZBl.png =40%x)
*Normalizing minibatch (inputs)*
![](https://i.imgur.com/9nes8uy.png =40%x)
*Allows the network to squash the range*
![](https://i.imgur.com/qKHBJ06.png =40%x)
*Scaling and shifting*

##### Data Augmentation

Add random transformations of the input data

* Image crop / flip / rotation
* Color jitter (randomize contrast and brightness)

##### Drop Connect

Rather than zeroing out activations at every forward pass, randomly zero out some of the values of the weight matrices.

##### Fractional Max Pooling

Randomize the pooling regions.

##### Stochastic Depth

Randomly drop layers during training, but use the whole network at test time.

#### Transfer Learning

With transfer learning, training a CNN doesn't require a huge dataset.

![](https://i.imgur.com/Vc38XSx.png =80%x)

For a bigger dataset, lower the learning rate when finetuning (e.g. to 1/10 of the original).
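
As a closing illustration of the "add randomness in training, average it out in testing" idea, the inverted dropout described under Dropout above can be sketched in plain Python. The names `dropout_train`, `dropout_test` and `p_keep` are hypothetical, and the sketch works on a flat list of activations rather than on real layer outputs.

```python
import random


def dropout_train(activations, p_keep=0.5, rng=random):
    """Inverted dropout: keep each unit with probability p_keep and
    divide the survivors by p_keep so the expected output is unchanged."""
    return [a / p_keep if rng.random() < p_keep else 0.0 for a in activations]


def dropout_test(activations):
    """Test time: use every unit as-is; the scaling already happened in training."""
    return list(activations)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden = [1.0, 2.0, 3.0, 4.0]
print(dropout_train(hidden, p_keep=0.5, rng=rng))  # each entry is 0.0 or twice its input
print(dropout_test(hidden))                        # [1.0, 2.0, 3.0, 4.0]
```

Because the division by `p_keep` happens during training, the test-time path stays a plain forward pass.
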
