{%hackmd theme-dark %}
# Machine Learning
###### tags: `ML` `大學`
## Types of functions
* Regression: Output a scalar value (e.g. predict PM2.5)
* Classification: Choose the correct class from a set of classes (e.g. AlphaGo picking the next move)
* Structured Learning: Generate structured data (e.g. create documents or images)
## Steps of machine learning
* Guess **Model** function structure based on domain knowledge
* Ex $y=b+wx_1$, where $b$ (bias) and $w$ (weight) are unknown parameters to be derived from data
* Define the **Loss** from training data; the loss $L(b,w)$ measures how good a set of parameter values is
* Label is the correct value
* The difference between the label and the value predicted with a given set of parameters is called $e$
* $e = |y-\hat y|$ => Mean Absolute Error (MAE)
* $e = (y-\hat y)^2$ => Mean Square Error (MSE)
* if $y$ and $\hat y$ are both probability distributions => Cross-entropy
* Plotting the loss for different parameter values gives the **error surface**
* ![](https://i.imgur.com/4oO2vOg.png)
* **Optimization**: the method we use is called Gradient Descent *(a short sketch follows this list)*
* The example shows a single parameter; the real case is $w^*,b^*=\arg \min\limits_{w,b}L$
* Pick an initial value $w_0$
* Compute the derivative $\frac{\partial L}{\partial w}\big|_{w=w_0}$ (the slope of the error surface at $w_0$ in the 1D single-parameter case)
* If the slope is negative, increase $w$
* If the slope is positive, decrease $w$
* $\eta$ is *learning rate*
* Anything we set ourselves is called **hyperparameters**
* Update $w$: $w_1 \leftarrow w_0 - \eta \frac{\partial L}{\partial w}$ (the minus sign automatically moves $w$ in the right direction)
* :::warning
Optimization can create an issue where a local minimum is found instead of the global minimum.
However, in practice this is not really the problem.
:::
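A minimal sketch of the full loop (model, MSE loss, gradient descent) in NumPy; the data `xs`, `ys` and the learning rate are made up purely for illustration:

```python
import numpy as np

# Made-up training data for the model y = b + w * x1
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([3.1, 5.2, 6.8, 9.1])

w, b = 0.0, 0.0      # initial guesses for the unknown parameters
eta = 0.01           # learning rate (a hyperparameter)

for step in range(1000):
    y_pred = b + w * xs              # model
    L = np.mean((y_pred - ys) ** 2)  # MSE loss L(b, w)
    # Gradients of the MSE loss with respect to w and b
    grad_w = np.mean(2 * (y_pred - ys) * xs)
    grad_b = np.mean(2 * (y_pred - ys))
    # Gradient descent update: step against the slope
    w = w - eta * grad_w
    b = b - eta * grad_b

print(w, b, L)
```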
## More flexible model
* Model bias is the bias introduced when the model is too simple to represent the real data
* A non-linear function can be approximated by a piecewise linear function, given enough segments
* To represent a function, we can use Sigmoid Function $y=c \frac{1}{1+e^{-(b+wx_1)}}$ => $c\,sigmoid(b+wx_1)$
* ![](https://i.imgur.com/o5KbzLM.png)
* Summary
* ![](https://i.imgur.com/4DVgTNp.png)
* ![](https://i.imgur.com/JTzgwVP.png)
* Batch, update and epoch
* We split the N training examples into smaller groups called batches; the loss of each batch is used to update the parameters $\theta$ once. An epoch is one pass in which every batch has been used once.
* Eg: 10,000 examples (N=10,000) Batch size is 10 (B = 10)
* 1 epoch contains 1000 updates
* Eg: 1,000 examples (N=1000) Batch size is 100 (B = 100)
* 1 epoch contains 10 updates
* ReLU (Rectified Linear Unit)
* $c\max(0,\,b+wx_1)$
* Sigmoid and ReLU used here are called **activation functions** *(see the sketch below)*
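A minimal sketch of the more flexible model $y=b+\sum_i c_i\,\mathrm{sigmoid}(b_i+w_ix_1)$ in NumPy; the parameter values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

# Made-up parameters for three sigmoid "pieces"
c = np.array([ 2.0, -1.0,  0.5])
b = np.array([-1.0,  0.0,  2.0])
w = np.array([ 1.0,  3.0, -2.0])
b0 = 0.5                           # overall bias

def flexible_model(x1, activation=sigmoid):
    # y = b0 + sum_i c_i * activation(b_i + w_i * x1)
    return b0 + np.sum(c * activation(b + w * x1))

print(flexible_model(1.5))         # sum of sigmoids
print(flexible_model(1.5, relu))   # same structure using ReLU instead
```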
## General Guide
![](https://i.imgur.com/0mNtJvE.png)
## Local minima and saddle point
* To tell whether a critical point (gradient $=0$) is a local minimum or a saddle point, we need to know the shape of the loss function around it
* It is not feasible to find the true function $L(\theta)$. However, we can approximate $L(\theta)$ around $\theta=\theta'$ with a Taylor series approximation:
$$
L(\theta)\approx L(\theta')+(\theta-\theta')^Tg+\frac{1}{2}(\theta-\theta')^TH(\theta-\theta')
$$
Gradient $g$ is a vector, $g=\nabla L(\theta')$, $g_i=\frac{\partial L(\theta')}{\partial \theta_i}$
Hessian $H$ is a matrix, $H_{ij}=\frac{\partial^2}{\partial\theta_i\partial\theta_j}L(\theta')$
![](https://i.imgur.com/sipxU3r.png)
* At a critical point the term $(\theta-\theta')^Tg$ is 0, and the remaining term $\frac{1}{2}(\theta-\theta')^TH(\theta-\theta')$ determines the type of critical point
* Let $v=(\theta-\theta')$, $L(\theta)\approx L(\theta')+\frac{1}{2}v^THv$
* For all $v$, $v^THv>0$: around $\theta'$, $L(\theta)>L(\theta')$: local minimum
* $H$ is positive definite; all eigenvalues are positive
* For all $v$, $v^THv<0$: around $\theta'$, $L(\theta)<L(\theta')$: local maximum
* $H$ is negative definite; all eigenvalues are negative
* For some $v$, $v^THv>0$, and for some $v$, $v^THv<0$: saddle point
* Some eigenvalues are positive, some are negative
* When on saddle point, we can use $H$ to find the directions we should update.
* Find the eigenvector of the negative eigenvalue of H
* Update the parameter along the direction of that eigenvector
* Rarely used in practice, as the computation cost is high *(a small sketch follows this list)*
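A small sketch of this classification using the eigenvalues of $H$; the toy loss $L=w_1^2-w_2^2$ (a saddle at the origin) is made up for illustration:

```python
import numpy as np

# Hessian of the toy loss L = w1^2 - w2^2 at the critical point (0, 0)
H = np.array([[ 2.0,  0.0],
              [ 0.0, -2.0]])

eigvals, eigvecs = np.linalg.eigh(H)

if np.all(eigvals > 0):
    print("local minimum")         # H positive definite
elif np.all(eigvals < 0):
    print("local maximum")         # H negative definite
else:
    print("saddle point")          # mixed signs
    # Escape direction: eigenvector of a negative eigenvalue
    v = eigvecs[:, np.argmin(eigvals)]
    print("update along", v)
```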
## Batch and Momentum
* A large batch size means fewer updates per epoch than a small batch size
* With parallel computation, a large batch takes about the same time per update until the batch becomes really large
* The time for one epoch is shorter with a large batch
* The gradient from a small batch tends to jump around (noisy)
* Optimization tends to be better with a small batch size, perhaps because the noisy updates make it easier to escape sharp local minima and settle in flatter ones
* Momentum adds additional "momentum" when updating parameters according to gradient
* Momentum is computed from the direction and magnitude of the previous movement
* $m^i$ is actually a weighted sum of all previous gradients $g^0,g^1,\dots$ *(see the sketch after this list)*
* ![](https://i.imgur.com/QK13j80.png)
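A minimal sketch of the momentum update; the stand-in gradient function and the values of `eta` and `lam` are made up for illustration:

```python
import numpy as np

def grad(theta):
    # Stand-in gradient of some loss (here L = sum of theta^2)
    return 2.0 * theta

theta = np.array([3.0, -1.0])
m = np.zeros_like(theta)    # movement starts at zero
eta, lam = 0.1, 0.9         # learning rate and momentum weight

for t in range(200):
    g = grad(theta)
    # New movement = lam * previous movement - eta * current gradient,
    # so m is a decaying weighted sum of all previous gradients
    m = lam * m - eta * g
    theta = theta + m

print(theta)                # close to the minimum at 0
```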
## Adaptive learning rate
* Most of the time, a critical point is not the reason the loss gets stuck
* Gradient descent performs badly when the learning rate does not match the current landscape
* Oscillating between the two walls of a valley and never descending into the minimum *(Need lower learning rate)*
![](https://i.imgur.com/7hMBbmp.png)
* Never reaching the low point *(Need higher learning rate)*
![](https://i.imgur.com/05nopPh.png)
* We can assign a different learning rate to each parameter
* Recall fixed learning rate
$$
\theta^{t+1}_i\leftarrow\theta^t_i-\eta g^t_i\\
g^t_i=\frac{\partial L}{\partial\theta_i}|_{\theta=\theta^t}
$$
* For one parameter
$$
\theta^{t+1}_i\leftarrow\theta^t_i-\frac{\eta}{\sigma^t_i} g^t_i
$$
* $\sigma$ can be computed by Root Mean Square
1. $\theta^1_i\leftarrow\theta^0_i-\frac{\eta}{\sigma^0_i}g^0_i$, $\sigma^0_i=\sqrt{(g^0_i)^2}=|g^0_i|$
2. $\theta^2_i\leftarrow\theta^1_i-\frac{\eta}{\sigma^1_i}g^1_i$, $\sigma^1_i=\sqrt{\frac{1}{2}[(g^0_i)^2+(g^1_i)^2]}$
* This method is used in **Adagrad**
$$
\theta^{t+1}_i\leftarrow\theta^t_i-\frac{\eta}{\sigma^t_i}g^t_i \\
\sigma^t_i=\sqrt{\frac{1}{t+1}\sum^{t}_{n=0}(g^n_i)^2}
$$
* This method still does not handle the case where the same parameter needs different learning rates at different times
* **RMSProp** adds a hyperparameter $\alpha$ that controls how much the previous $\sigma$ (versus the current gradient) contributes to the new $\sigma$
$$
\sigma^t_i=\sqrt{\alpha(\sigma^{t-1}_i)^2+(1-\alpha)(g^t_i)^2}
$$
* **Adam** is the optimization method that combines RMSProp and Momentum
* **Learning Rate Scheduling**
* Switch from a fixed $\eta$ to a time-dependent learning rate $\eta^t$
$$
\theta^{t+1}_i\leftarrow\theta^t_i-\frac{\eta^t}{\sigma^t_i}g^t_i
$$
* Decay
* As training progresses, the learning rate decreases as we close in on the destination
* Warm Up
* Start with a low learning rate, grow to the full learning rate, then decay
* Summary
$$
\theta^{t+1}_i\leftarrow\theta^t_i-\frac{\eta^t}{\sigma^t_i}m^t_i
$$
* $\sigma^t_i$: RMS of the gradients (magnitude only)
* $m^t_i$: momentum, a weighted sum of previous gradients (direction included)
* $\eta^t$: learning rate scheduling *(a sketch of this combined update follows)*
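A minimal sketch of this summary update, using an RMSProp-style $\sigma$, an exponentially weighted momentum $m$, and a simple decaying $\eta^t$; the stand-in gradient, the hyperparameter values, and the small `eps` added for numerical stability are all assumptions for illustration:

```python
import numpy as np

def grad(theta):
    # Stand-in gradient with very different scales per parameter
    return np.array([0.01, 1.0]) * 2.0 * theta

theta = np.array([10.0, 10.0])
sigma = np.zeros_like(theta)
m = np.zeros_like(theta)
eta0, alpha, lam, eps = 1.0, 0.99, 0.9, 1e-8

for t in range(500):
    g = grad(theta)
    # RMS of past gradients (magnitude only)
    sigma = np.sqrt(alpha * sigma**2 + (1 - alpha) * g**2)
    # Momentum: weighted sum of past gradients (direction included)
    m = lam * m + (1 - lam) * g
    # Learning rate scheduling: simple decay over time
    eta_t = eta0 / np.sqrt(t + 1)
    theta = theta - eta_t / (sigma + eps) * m

print(theta)
```

With the per-parameter $\sigma$, both parameters take steps of similar size even though their raw gradients differ by two orders of magnitude.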
## Classification
* We can still use the regression approach for classification. However, representing classes with a single scalar implicitly assumes some classes are more similar to each other than others
* eg: Define 1->Class A, 2->Class B, 3->Class C
* This assumes class A is closer to class B than class C
* To mitigate this, we use one-hot vector to represent each class
* Class A: $\hat{y}=\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$, Class B: $\hat{y}=\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$, Class C: $\hat{y}=\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$
* Now $y$ is a vector instead of a scalar
* Since the values in $y$ can be arbitrary, we need to normalize them before computing the loss against $\hat{y}$
* Apply $y'=softmax(y)$, $y'_i=\frac{\exp(y_i)}{\sum\limits_{j}\exp(y_j)}$
* In binary classification, applying softmax is actually equivalent to applying sigmoid
* We can use cross-entropy to calculate our loss instead of using mean square error (MSE)
* Recall MSE $e=\sum\limits_{i}{(\hat{y_i}-y'_i)^2}$
* Cross-entropy $e=-\sum\limits_{i}{\hat{y_i}\ln{y'_i}}$
* Recall Loss $L=\frac{1}{N}\sum\limits_{n}{e_n}$
* Minimizing cross-entropy is the same as maximizing likelihood
* ![](https://i.imgur.com/pCICq5p.png)
* In this example, if the initial point is at the top left and no adaptive learning rate is used, MSE may get stuck with no good path toward lower loss
* Using cross-entropy gives a usable gradient on the high-loss side, so training can still make progress *(see the sketch below)*
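A minimal sketch of softmax followed by cross-entropy for one example; the raw outputs and the one-hot label below are made up:

```python
import numpy as np

def softmax(y):
    y = y - np.max(y)              # subtract max for numerical stability
    e = np.exp(y)
    return e / np.sum(e)

def cross_entropy(y_hat, y_prime):
    # e = -sum_i y_hat_i * ln(y'_i)
    return -np.sum(y_hat * np.log(y_prime))

y = np.array([3.0, 1.0, -2.0])     # arbitrary network outputs
y_hat = np.array([1.0, 0.0, 0.0])  # one-hot label for Class A

y_prime = softmax(y)
print(y_prime)                     # normalized to a probability distribution
print(cross_entropy(y_hat, y_prime))
```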
## Normalization
* If two dimensions have very different scales (e.g. $x_1$ takes values 1, 2, ... while $x_2$ takes 1000, 2000, ...), the same change applied to one weight can change the loss far more than when applied to the other weight
* By performing normalization to feature $x$, we attempt to turn all dimension to the same scale and level
* Feature normalization
* Example: for feature vectors $x^1,x^2,...x^R$
* We take all scalar from the same dimension across all feature vectors: $x^1_1,x^2_1,...,x^R_1$
* For each dimension $i$ calculate mean $m_i$ and standard deviation $\sigma_i$
* Set the value to $\tilde{x}^r_i\leftarrow\frac{x^r_i-m_i}{\sigma_i}$
* $\tilde{x}$ represents values that have been normalized
* After normalization, each dimension has mean 0 and variance 1 *(see the sketch below)*
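A minimal sketch of feature normalization in NumPy; the feature vectors are made up and stacked as rows of `X`:

```python
import numpy as np

# R = 4 made-up feature vectors, 2 dimensions with very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0],
              [4.0, 4000.0]])

m = X.mean(axis=0)             # mean m_i of each dimension
sigma = X.std(axis=0)          # standard deviation sigma_i of each dimension
X_tilde = (X - m) / sigma      # x~_i = (x_i - m_i) / sigma_i

print(X_tilde.mean(axis=0))    # ~0 in every dimension
print(X_tilde.std(axis=0))     # ~1 in every dimension
```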
* In deep learning, the outputs of one layer, which are the inputs to the next, may also need to be normalized
* Whether normalization is done before or after the activation function does not matter much in practice
* With sigmoid, normalizing before the activation function can be more beneficial because of the large slope near 0
* Performing feature normalization here requires $\mu$ and $\sigma$ to be computed from, and applied to, the outputs of the weights (the pre-activation values)
* This indicates that a change to one of the weight or feature vector will cause the **entire** output to change, rather than the single neuron.
* ![](https://i.imgur.com/YDXTPnw.png)
* It is not feasible to place all training data onto the computing device at once, therefore only the batch that's being processed is considered. This is called **Batch Normalization**
* Only applicable to larger batches
* When testing, it is not feasible to create a batch, most of the time we want the output of a single sample. However in the normalization process, a $\mu$ and $\sigma$ is required.
* Compute moving average of $\mu$ and $\sigma$ during training
* $\bar{\mu}\leftarrow p \bar{\mu}+(1-p)\mu^t$
* $p$ is a hyperparameter *(a sketch of this moving average follows)*
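A minimal sketch of keeping moving averages of $\mu$ and $\sigma$ during training and reusing them for a single test sample; the random batches are stand-ins and `p` is the hyperparameter above:

```python
import numpy as np

p = 0.99                               # moving-average hyperparameter
mu_bar = np.zeros(2)
sigma_bar = np.ones(2)

for t in range(100):                   # pretend training loop over batches
    batch = np.random.randn(32, 2)     # stand-in for one batch of activations
    mu_t = batch.mean(axis=0)
    sigma_t = batch.std(axis=0)
    # Moving averages updated after every batch
    mu_bar = p * mu_bar + (1 - p) * mu_t
    sigma_bar = p * sigma_bar + (1 - p) * sigma_t

# At test time there is no batch, so use the moving averages instead
x_test = np.random.randn(2)
x_norm = (x_test - mu_bar) / sigma_bar
print(x_norm)
```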
## Convolutional Neural Network (CNN)
* CNN is commonly used for image-based neural networks.
* With an image as input, a fully connected network may not be feasible (every neuron would need a weight for each input dimension) and is not necessary.
* In image recognition, a neuron often does not need the whole image; detecting important local patterns is enough to draw a conclusion.
* We define areas called **Receptive field**. Each neuron only needs to see a single receptive field
* Receptive field can overlap, different neuron can read from the same receptive field, receptive field can be different in size/channels/shapes...etc
* The design is completely based on domain knowledge of the problem
* A receptive field is most commonly set up with a kernel size of 3x3 across all channels, each receptive field has a set of neurons (eg 64 neurons)
* The offset to the next receptive field is called the stride. It is commonly set to a small value, smaller than the kernel size, so that receptive fields overlap.
* If a receptive field exceeds the boundary of the image, padding is added to the area outside (e.g. filled with 0).
* Neurons with the same function (e.g. neurons detecting a beak but responsible for different receptive fields) can share parameters (parameter sharing): their weights and biases are the same
* Two neurons with the same receptive field would not share parameters, since their outputs would then always be identical.
* For neurons sharing parameters, each set of shared parameters is called a **filter**
* A convolutional layer is a layer that uses receptive fields and parameter sharing to constrain the model for images
* Pooling is a technique to reduce the size: after applying a filter, we can apply max pooling to keep only the maximum value in each area of the resulting sub-image.
* Pooling can hurt performance, as small details may be removed in the process
* When the convolutions and poolings are done, we **flatten** the result into a feature vector and feed it into fully connected layers to get the result.
* ![](https://i.imgur.com/M2oWwLY.png)
* CNN is not invariant to scaling and rotation. Data augmentation is needed to train the network to recognize different sizes and orientations *(a minimal layer-order sketch follows)*.
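A minimal sketch of the layer order described above (convolution → pooling → flatten → fully connected), assuming PyTorch; the channel counts, input size (3×32×32) and number of classes are made-up examples:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    # 64 filters with a 3x3 kernel across all 3 input channels,
    # stride 1 and padding 1 so receptive fields overlap
    nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                 # max pooling: keep the max in each 2x2 area
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                    # flatten into a feature vector
    nn.Linear(128 * 8 * 8, 10),      # fully connected layer -> 10 classes
)

x = torch.randn(1, 3, 32, 32)        # one 32x32 RGB image
print(model(x).shape)                # torch.Size([1, 10])
```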
## Confusion Matrix
![](https://i.imgur.com/9I1v2s0.png)
![](https://i.imgur.com/mLeMMfo.png)
![](https://i.imgur.com/ikJ55yS.png)
The diagonal line represents random chance (weights and bias chosen completely at random). If a resulting curve falls below the random-chance line, just swap the predicted classes (invert the classifier) so the curve lands above it.
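A minimal sketch of building a confusion matrix from predictions; the labels and predictions below are made up:

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 1, 2, 2])   # made-up ground-truth classes
y_pred = np.array([0, 1, 1, 1, 0, 2, 2])   # made-up model predictions

num_classes = 3
cm = np.zeros((num_classes, num_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1       # rows: true class, columns: predicted class

print(cm)
```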