# CS231n Lecture 2 & 3: Image Classification Pipeline
Image = a big grid of numbers in [0, 255] (one number per pixel, per color channel)

Challenges: viewpoint/angles, illumination, deformation, occlusion (e.g. only part of a cat is visible), background clutter, intraclass variation
## Image Classifier
Data-Driven Approach
1. Collect dataset of images & labels
2. Use ML to train classifier
3. Evaluate the classifier on new (unseen) images
### Nearest Neighbor
Train: memorize all data and labels
Predict: return the label of the most similar training image, i.e. the one with the smallest distance under the chosen metric (the nearest neighbor).
#### Distance Metric
##### L1 distance (Manhattan)
$$d_1(I_1, I_2) = \sum_{p} \left| I_1^p - I_2^p \right|$$

The speed of training and prediction:
train -> O(1) (just memorize the data)
predict -> O(N) (compare against all N training images)
=> Bad: we want the opposite; slow training is fine, but prediction must be fast.
##### L2 distance (Euclidean)
$$d_2(I_1, I_2) = \sqrt{\sum_{p} \left( I_1^p - I_2^p \right)^2}$$
:::success
If the coordinate frame is rotated, L1 distances change, but L2 distances don't.
:::
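A minimal NumPy sketch of the nearest-neighbor classifier described above, using the L1 distance (the `NearestNeighbor` class and method names are illustrative):

```python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        # "Training" is just memorizing all data and labels: O(1)
        self.X_train = X   # shape (N, D): one flattened image per row
        self.y_train = y   # shape (N,)

    def predict(self, X):
        # Prediction compares every test image against all N training images: O(N)
        y_pred = np.zeros(X.shape[0], dtype=self.y_train.dtype)
        for i in range(X.shape[0]):
            dists = np.sum(np.abs(self.X_train - X[i]), axis=1)  # L1 distances
            y_pred[i] = self.y_train[np.argmin(dists)]           # label of the closest image
        return y_pred
```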
### K-Nearest Neighbor
Finding 'K' nearest neighbors instead of a single nearest neighbor.
Instead of copying the label from the single nearest neighbor, take a **majority vote** among the K closest points.
**Never used** for classifying images in the real world:
1. Slow at test time
2. Distance metrics on raw pixels are not informative
3. Curse of dimensionality: densely covering the space requires exponentially many training points (D = 1 needs 4, D = 2 needs 4^2, D = 3 needs 4^3, ...)
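A minimal sketch of the k-NN prediction step described above (assumes integer class labels so `np.bincount` can tally the votes; the function name is illustrative):

```python
import numpy as np

def knn_predict_one(x, X_train, y_train, k=5):
    """Predict one label by majority vote over the k nearest training points."""
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))  # L2 distance to every training point
    nearest = np.argsort(dists)[:k]                       # indices of the k closest points
    return np.bincount(y_train[nearest]).argmax()         # most common label among them
```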
### Linear Classification
*Neural networks are like Legos: linear classifiers are the basic building blocks of the larger network.*
$$f(x, W) = Wx + b$$
b = bias term
* Data-independent preference for some classes over others
* If the dataset is unbalanced, e.g. cats >> dogs, then the bias for the cat class > the bias for the dog class

*Linear classification example: trying to draw lines (decision boundaries) to separate the categories*
:::success
Only allowed to learn one template per category (one row of W per class)
:::
#### Parametric Approach
Input: images (x) and weights (W)
Output: scores for each class
Output = $f(x, W) = Wx + b$, a vector of scores, one per class
:::success
Only the **weights** are needed at test time (the training set can be discarded) => more efficient
:::
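A minimal sketch of the parametric score function above, assuming CIFAR-10-like shapes (32x32x3 images, 10 classes); the function name is illustrative:

```python
import numpy as np

def linear_scores(x, W, b):
    """f(x, W) = Wx + b
    x: flattened image, shape (D,)   e.g. D = 32*32*3 = 3072
    W: weights, shape (C, D)         one row (template) per class
    b: biases, shape (C,)            data-independent preference per class
    """
    return W.dot(x) + b              # class scores, shape (C,)

# Usage with random data (shapes only; the values are meaningless)
D, C = 3072, 10
x = np.random.randn(D)
W = 0.01 * np.random.randn(C, D)
b = np.zeros(C)
scores = linear_scores(x, W, b)      # 10 class scores
```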
## Loss Function
*Tells how good the current classifier is; used to determine the best weights.*
### Multiclass SVM loss
Sum up, over all incorrect classes, the hinge loss against the correct class:
$$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$$
Total loss = average of $L_i$ over all training examples.


E.g. with three training examples: loss(cat image) = 2.9, loss(car image) = 0, loss(frog image) = 12.9, so total loss = (2.9 + 0 + 12.9) / 3 = 5.27
:::success
Values of the losses do not have much meaning.
:::
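A minimal sketch of the per-example hinge loss above (function name illustrative); the last line recomputes the averaged example from the text:

```python
import numpy as np

def svm_loss_one(scores, y):
    """Multiclass SVM loss for one example.
    scores: class scores, shape (C,);  y: index of the correct class
    """
    margins = np.maximum(0, scores - scores[y] + 1)  # max(0, s_j - s_y + 1)
    margins[y] = 0                                   # the correct class contributes nothing
    return margins.sum()

# The numbers from the example above: three per-example losses, averaged
print(np.mean([2.9, 0.0, 12.9]))  # 5.266...
```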
### Softmax Classifier - multinomial logistic regression
$$L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$$
*Loss = -log(probability of the correct class)*
:::success
**Values matter.** Target: get the probability of the correct class to 1 (to lower the loss).
:::
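A minimal sketch of the softmax loss for one example, with the usual max-subtraction for numerical stability (function name illustrative):

```python
import numpy as np

def softmax_loss_one(scores, y):
    """Softmax loss for one example: -log P(correct class).
    scores: class scores, shape (C,);  y: index of the correct class
    """
    shifted = scores - scores.max()                    # shift scores for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))  # softmax probabilities
    return -np.log(probs[y])                           # -> 0 as P(correct class) -> 1
```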
### Recap
:::info
A linear classifier produces scores $s = f(x, W) = Wx + b$; a loss function (SVM or softmax) measures how unhappy we are with those scores on the training data, and the total loss is the average over all training examples.
:::
## Optimization
*To find the best values of weights.*
### Random search
Randomly set the values of weights and put them in the loss function to rate the performance.
Performs badly in practice. Don't use it.
### Follow the slope
1D: Find the derivative of the function.
Multi-D:
* The **gradient** is the vector of partial derivatives along each dimension.
* Slope in any direction is the **dot product** of the direction with the gradient.
* The direction of steepest descent is the **negative gradient**.
#### Finding the gradients
* Numerical: approximate, slow, easy to write
* Analytic: exact, fast, error-prone
In practice: Always use analytic. But check with numerical (gradient check).
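A minimal sketch of a numerical gradient check: compare a slow centered-difference estimate against the analytic gradient of a toy loss (the helper name and the toy loss are illustrative):

```python
import numpy as np

def numerical_gradient(f, W, h=1e-5):
    """Approximate dL/dW with centered finite differences (slow; for checking only)."""
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h; f_plus = f(W)    # f at W + h in this coordinate
        W[idx] = old - h; f_minus = f(W)   # f at W - h
        W[idx] = old                       # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
        it.iternext()
    return grad

# Check against the analytic gradient of f(W) = sum(W^2), which is 2W
W = np.random.randn(3, 4)
num = numerical_gradient(lambda w: np.sum(w ** 2), W)
print(np.max(np.abs(num - 2 * W)))   # should be tiny (~1e-9)
```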
##### Gradient Descent
*Determine where and how far to step next.*
Learning rate (step size): a hyperparameter that determines how big each step toward lower-loss weights is.
###### Stochastic Gradient Descent (SGD)
Calculating the full sum is too expensive when the data set is huge.
Calculate approximate sum using a **minibatch** of examples instead.
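A minimal sketch of the vanilla minibatch SGD loop; `loss_and_gradient` is a hypothetical function that returns the average loss over the batch and the analytic gradient dL/dW:

```python
import numpy as np

def sgd(W, X_train, y_train, loss_and_gradient,
        learning_rate=1e-3, batch_size=256, num_steps=1000):
    for step in range(num_steps):
        # Sample a minibatch instead of summing the loss over the whole dataset
        idx = np.random.choice(X_train.shape[0], batch_size, replace=False)
        loss, dW = loss_and_gradient(W, X_train[idx], y_train[idx])
        W = W - learning_rate * dW    # step in the direction of the negative gradient
    return W
```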
##### Backpropagation
*A tool for computing gradients in complex functions.*
Split the computation into a graph of simple nodes, then compute the gradient of each node backward from the output using local derivatives and the chain rule.

*Red values: gradients*
Several nodes can be grouped into a single gate (e.g. a sigmoid gate), as long as we can write down the grouped gate's local gradient.
Functions of different gates in the backward pass:
* Add gate: gradient distributor (passes the upstream gradient unchanged to both inputs)
* Max gate: gradient router (routes the full gradient to the larger input, 0 to the other)
* Mul gate: gradient switcher (each input's gradient is the upstream gradient times the *other* input's value)
Jacobian matrix: the matrix of all partial derivatives of a vector-valued output with respect to a vector-valued input; entry (i, j) is the derivative of the i-th output with respect to the j-th input.
###### Vectorized Example
The gradient with respect to a variable should have the **same** shape as the variable.

Code implementation:

*Forward: compute the result of an operation and save any intermediates needed later.
Backward: apply the chain rule to compute the gradients.*
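A minimal sketch of that forward/backward contract for a single matrix-multiply node (class name illustrative); note how each gradient has the same shape as the variable it corresponds to:

```python
import numpy as np

class MultiplyGate:
    """One node in the computation graph: out = W.dot(x)."""
    def forward(self, W, x):
        self.W, self.x = W, x          # save intermediates for the backward pass
        return W.dot(x)                # output, shape (C,)

    def backward(self, dout):
        # dout: upstream gradient dL/d(out), shape (C,)
        dW = np.outer(dout, self.x)    # dL/dW, same shape as W: (C, D)
        dx = self.W.T.dot(dout)        # dL/dx, same shape as x: (D,)
        return dW, dx
```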
## Hyperparameters
Q: How to choose the best hyperparameters (e.g. value of K, distance to use)?
A: Try them all and choose the best.
Split the data into training, validation and testing sets.
Only touch the test set (the unseen data) once, at the very end.
### Cross-Validation
Split the training data into folds and rotate which fold serves as the validation set.

*5-fold cross validation*
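A minimal sketch of k-fold cross-validation; `evaluate` is a hypothetical callback (train on the given folds, return validation accuracy) so the sketch stays classifier-agnostic:

```python
import numpy as np

def cross_validate(X, y, hyperparams, evaluate, num_folds=5):
    """Return the hyperparameter value with the best mean validation accuracy."""
    X_folds = np.array_split(X, num_folds)
    y_folds = np.array_split(y, num_folds)
    best, best_acc = None, -1.0
    for h in hyperparams:
        accs = []
        for i in range(num_folds):
            # Fold i is the validation set this round; the rest is training data
            X_tr = np.concatenate(X_folds[:i] + X_folds[i + 1:])
            y_tr = np.concatenate(y_folds[:i] + y_folds[i + 1:])
            accs.append(evaluate(h, X_tr, y_tr, X_folds[i], y_folds[i]))
        if np.mean(accs) > best_acc:
            best, best_acc = h, np.mean(accs)
    return best
```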
## Over-Fitting
Occam's Razor: among competing models, prefer the simplest one (to prevent overfitting).


*Blue line: overfitted. Green line: simpler fit, with regularization added.*
:::success
Do not overfit the training data; the goal of ML is to predict unseen data (the test data) well.
:::
### Regularizations
*A penalty on model complexity, added to the loss*
#### L1 regularization: $R(W) = \sum_k \sum_l |W_{k,l}|$
#### L2 regularization: $R(W) = \sum_k \sum_l W_{k,l}^2$
#### Elastic net (L1 + L2): $R(W) = \sum_k \sum_l \left(\beta W_{k,l}^2 + |W_{k,l}|\right)$
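A minimal sketch of these penalties and of adding one to the data loss (the regularization strength `reg` is a hyperparameter; names illustrative):

```python
import numpy as np

def l1_reg(W):
    return np.sum(np.abs(W))                 # R(W) = sum |W|

def l2_reg(W):
    return np.sum(W ** 2)                    # R(W) = sum W^2

def elastic_net(W, beta=0.5):
    return beta * l2_reg(W) + l1_reg(W)      # R(W) = sum (beta * W^2 + |W|)

def total_loss(data_loss, W, reg=1e-4):
    # Full objective: data loss + lambda * R(W); here with the L2 penalty
    return data_loss + reg * l2_reg(W)
```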
#### Max norm regularization
#### Dropout
## Visualizing Features

*E.g.1 Color Histogram*
***

*E.g.2 Histogram of Oriented Gradients (HoG)*
***

*E.g.3 Bag of Words*