{%hackmd theme-dark %}
# Machine Learning
###### tags: `ML` `大學`

## Types of functions
* Regression: output a scalar (e.g. predicting PM2.5)
* Classification: pick a class from a set of options (e.g. AlphaGo choosing the next move)
* Structured Learning: generate structured data (e.g. documents, images)

## Steps of machine learning
* Guess a **Model** function structure based on domain knowledge
    * E.g. $y=b+wx_1$, where $b$ (bias) and $w$ (weight) are unknown parameters to be learned from data
* Define the **Loss** from training data; the loss $L(b,w)$ measures how good a set of parameter values is
    * The label is the correct value
    * The difference between the label and the value predicted with a given set of parameters is called $e$
        * $e = |y-\hat y|$ => Mean Absolute Error (MAE)
        * $e = (y-\hat y)^2$ => Mean Square Error (MSE)
        * If $y$ and $\hat y$ are both probability distributions => Cross-entropy
    * A plot of the loss over different parameter values is called the **error surface**
        * ![](https://i.imgur.com/4oO2vOg.png)
* **Optimization**: the method we use is called gradient descent
    * The example below uses a single parameter; the real case is $w^*,b^*=\arg \min\limits_{w,b}L$
    * Pick an initial value $w_0$
    * Compute the derivative at $w=w_0$ (the slope of the error surface at $w_0$ in the 1-D single-parameter case)
        * If the slope is negative, increase $w$ by $\eta \frac{\partial L}{\partial w}$
        * If the slope is positive, decrease $w$ by $\eta \frac{\partial L}{\partial w}$
        * $\eta$ is the *learning rate*
            * Anything we set ourselves is called a **hyperparameter**
    * Update $w$: $w_1 \leftarrow w_0 - \eta \frac{\partial L}{\partial w}$

:::warning
Optimization can run into an issue where only a local minimum is found instead of the global minimum. However, in practice this is rarely the real problem.
:::

## More flexible model
* Model bias is the bias introduced when the model is not flexible enough to represent the real data
* A non-linear function can be approximated by a piecewise linear function, given enough pieces
* Each piece can be represented with the sigmoid function $y=c\,\frac{1}{1+e^{-(b+wx_1)}}=c\,\mathrm{sigmoid}(b+wx_1)$
    * ![](https://i.imgur.com/o5KbzLM.png)
* Summary
    * ![](https://i.imgur.com/4DVgTNp.png)
    * ![](https://i.imgur.com/JTzgwVP.png)
* Batch, update and epoch
    * When updating the parameter vector $\theta$, we split the training data into smaller pieces called batches; each batch is used to update the parameters once. An epoch is one pass in which every batch has been used once. (A minimal training-loop sketch follows at the end of this section.)
    * E.g. 10,000 examples (N = 10,000), batch size 10 (B = 10)
        * 1 epoch contains 1,000 updates
    * E.g. 1,000 examples (N = 1,000), batch size 100 (B = 100)
        * 1 epoch contains 10 updates
* ReLU (Rectified Linear Unit)
    * $c\,\max(0,b+wx_1)$
* Sigmoid and ReLU here are called **activation functions**
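To tie the three steps together (define a model, define a loss, optimize with gradient descent) along with the batch/update/epoch bookkeeping, here is a minimal NumPy sketch; the synthetic data, the learning rate, and all variable names are arbitrary choices for illustration.

```python
import numpy as np

# Toy data following a rough linear trend (synthetic, for illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=10_000)
y_label = 1.5 * x + 0.8 + rng.normal(scale=0.5, size=x.size)

w, b = 0.0, 0.0          # initial guesses for the unknown parameters
eta = 0.01               # learning rate (a hyperparameter)
batch_size = 10          # B = 10, so one epoch = N / B = 1,000 updates

for epoch in range(3):
    # Shuffle once per epoch, then walk through the data batch by batch.
    order = rng.permutation(x.size)
    updates = 0
    for start in range(0, x.size, batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y_label[idx]

        # Model: y = b + w * x1, loss: MSE over the batch.
        y_pred = b + w * xb
        e = y_pred - yb
        # Gradients of the batch loss with respect to w and b.
        grad_w = 2 * np.mean(e * xb)
        grad_b = 2 * np.mean(e)

        # Gradient descent update: theta <- theta - eta * dL/dtheta.
        w -= eta * grad_w
        b -= eta * grad_b
        updates += 1

    loss = np.mean((b + w * x - y_label) ** 2)
    print(f"epoch {epoch}: {updates} updates, MSE = {loss:.4f}, w = {w:.3f}, b = {b:.3f}")
```

With N = 10,000 and B = 10, each epoch performs 1,000 updates, matching the bookkeeping above.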
## General Guide
![](https://i.imgur.com/0mNtJvE.png)

## Local minima and saddle point
* To distinguish a local minimum from a saddle point at a critical point (gradient = 0), we need to know what the loss function looks like around that point
* It is not feasible to find the exact form of $L(\theta)$. However, we can approximate $L(\theta)$ around $\theta=\theta'$ (Taylor series approximation):
$$
L(\theta)\approx L(\theta')+(\theta-\theta')^Tg+\frac{1}{2}(\theta-\theta')^TH(\theta-\theta')
$$
    * The gradient $g$ is a vector: $g=\nabla L(\theta')$, $g_i=\frac{\partial L(\theta')}{\partial \theta_i}$
    * The Hessian $H$ is a matrix: $H_{ij}=\frac{\partial^2}{\partial\theta_i\partial\theta_j}L(\theta')$
    * ![](https://i.imgur.com/sipxU3r.png)
* At a critical point the term $(\theta-\theta')^Tg$ is 0, so the remaining term $\frac{1}{2}(\theta-\theta')^TH(\theta-\theta')$ determines the type of critical point
* Let $v=(\theta-\theta')$, so $L(\theta)\approx L(\theta')+\frac{1}{2}v^THv$
    * For all $v$, $v^THv>0$: around $\theta'$, $L(\theta)>L(\theta')$: local minimum
        * $H$ is positive definite; all eigenvalues are positive
    * For all $v$, $v^THv<0$: around $\theta'$, $L(\theta)<L(\theta')$: local maximum
        * $H$ is negative definite; all eigenvalues are negative
    * For some $v$, $v^THv>0$ and for some $v$, $v^THv<0$: saddle point
        * Some eigenvalues are positive, some are negative
* At a saddle point we can use $H$ to find a direction in which to update
    * Find an eigenvector of a negative eigenvalue of $H$
    * Update the parameters along the direction of that eigenvector
    * Rarely used in practice because the computation cost is high

## Batch and Momentum
* A large batch size means fewer updates per epoch than a small batch size
* With parallel computation, a large batch is not noticeably slower per update until the batch becomes really large
    * The time for one epoch is therefore shorter with a large batch
* Gradients computed on small batches tend to jump around (noisy)
* Optimization tends to be better with a small batch size, perhaps because the noisy updates make it easier to escape a sharp local minimum and settle in a flatter one
* Momentum adds extra "momentum" to each parameter update on top of the gradient (see the sketch after this section)
    * The momentum is computed from the previous movement direction and amount
    * $m^i$ is in fact a weighted sum of all previous gradients $g^0,g^1,\dots$
    * ![](https://i.imgur.com/QK13j80.png)
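A minimal sketch of gradient descent with momentum, assuming a made-up 1-D loss function; `lam` plays the role of the momentum coefficient and `m` is the accumulated movement.

```python
import numpy as np

# Hypothetical 1-D loss with more than one valley; for illustration only.
def loss(theta):
    return theta ** 4 - 3 * theta ** 2 + theta

def grad(theta):
    return 4 * theta ** 3 - 6 * theta + 1

theta = 2.0      # initial parameter
eta = 0.01       # learning rate
lam = 0.9        # momentum coefficient (how much of the previous movement is kept)
m = 0.0          # movement of the previous step

for step in range(200):
    g = grad(theta)
    # Movement = lam * previous movement - eta * current gradient,
    # so m is a decaying weighted sum of all previous gradients g^0, g^1, ...
    m = lam * m - eta * g
    theta = theta + m

print(f"theta = {theta:.4f}, loss = {loss(theta):.4f}")
```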
## Adaptive learning rate
* Most of the time a critical point is not the real reason the loss gets stuck
* Gradient descent performs badly when the learning rate does not match the current landscape
    * Oscillating between the two high walls of a valley and never falling into the minimum *(needs a lower learning rate)* ![](https://i.imgur.com/7hMBbmp.png)
    * Crawling along a flat direction and never reaching the low point *(needs a higher learning rate)* ![](https://i.imgur.com/05nopPh.png)
* We can assign a different learning rate to each parameter
    * Recall the fixed learning rate
    $$
    \theta^{t+1}_i\leftarrow\theta^t_i-\eta g^t_i,\quad
    g^t_i=\frac{\partial L}{\partial\theta_i}\Big|_{\theta=\theta^t}
    $$
    * For one parameter
    $$
    \theta^{t+1}_i\leftarrow\theta^t_i-\frac{\eta}{\sigma^t_i} g^t_i
    $$
    * $\sigma$ can be computed as a root mean square of the gradients
        1. $\theta^1_i\leftarrow\theta^0_i-\frac{\eta}{\sigma^0_i}g^0_i$, $\sigma^0_i=\sqrt{(g^0_i)^2}=|g^0_i|$
        2. $\theta^2_i\leftarrow\theta^1_i-\frac{\eta}{\sigma^1_i}g^1_i$, $\sigma^1_i=\sqrt{\frac{1}{2}[(g^0_i)^2+(g^1_i)^2]}$
    * This is the method used in **Adagrad**
    $$
    \theta^{t+1}_i\leftarrow\theta^t_i-\frac{\eta}{\sigma^t_i}g^t_i,\quad
    \sigma^t_i=\sqrt{\frac{1}{t+1}\sum^{t}_{k=0}(g^k_i)^2}
    $$
    * This method does not handle the case where the same parameter needs different learning rates at different times
    * **RMSProp** adds a parameter $\alpha$ that controls how much the previous $\sigma$ (the past gradients) affects the new $\sigma$
    $$
    \sigma^t_i=\sqrt{\alpha(\sigma^{t-1}_i)^2+(1-\alpha)(g^t_i)^2}
    $$
    * **Adam** is the optimizer that combines RMSProp and momentum
* **Learning Rate Scheduling**
    * Switch from a fixed $\eta$ to a time-dependent learning rate $\eta^t$
    $$
    \theta^{t+1}_i\leftarrow\theta^t_i-\frac{\eta^t}{\sigma^t_i}g^t_i
    $$
    * Decay
        * As training progresses, the learning rate is reduced because we are getting close to the destination
    * Warm up
        * Start with a low learning rate, grow to the full learning rate, then decay
* Summary
$$
\theta^{t+1}_i\leftarrow\theta^t_i-\frac{\eta^t}{\sigma^t_i}m^t_i
$$
    * $\sigma^t_i$: RMS of the gradients (magnitude only)
    * $m^t_i$: momentum, a weighted sum of the previous gradients (direction included)
    * $\eta^t$: learning rate scheduling

## Classification
* We can still use the regression approach for classification. However, encoding classes as a single scalar introduces a bias where some classes look closer to each other than others
    * E.g. define 1 -> Class A, 2 -> Class B, 3 -> Class C
        * This assumes class A is closer to class B than to class C
* To avoid this, we use a one-hot vector to represent each class
    * Class A: $\hat{y}=\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$, Class B: $\hat{y}=\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$, Class C: $\hat{y}=\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$
    * Now $y$ is a vector instead of a scalar
* Since the values in $y$ can be arbitrary, we normalize them before computing the loss against $\hat{y}$
    * Apply $y'=\mathrm{softmax}(y)$, $y'_i=\frac{\exp(y_i)}{\sum\limits_{j}\exp(y_j)}$
    * In binary classification, softmax and sigmoid are actually equivalent
* We can use cross-entropy instead of mean square error (MSE) to compute the loss (see the sketch after this section)
    * Recall MSE: $e=\sum\limits_{i}{(\hat{y}_i-y'_i)^2}$
    * Cross-entropy: $e=-\sum\limits_{i}{\hat{y}_i\ln{y'_i}}$
    * Recall the loss: $L=\frac{1}{N}\sum\limits_{n}{e_n}$
    * Minimizing cross-entropy is the same as maximizing likelihood
    * ![](https://i.imgur.com/pCICq5p.png)
        * In this example, if the starting point is at the top left, MSE is almost flat there, so without an adaptive learning rate training gets stuck with no good path toward lower loss
        * Cross-entropy keeps a usable gradient on the high-loss side
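A small NumPy sketch of softmax followed by cross-entropy, using made-up network outputs and a one-hot label for Class A.

```python
import numpy as np

def softmax(y):
    """Normalize arbitrary scores y into a probability distribution y'."""
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the result of softmax.
    z = np.exp(y - np.max(y))
    return z / np.sum(z)

def cross_entropy(y_hat, y_prime):
    """e = -sum_i y_hat_i * ln(y'_i), with y_hat a one-hot label."""
    return -np.sum(y_hat * np.log(y_prime))

# Made-up network outputs for a 3-class problem.
y = np.array([2.0, 0.5, -1.0])
y_prime = softmax(y)

# One-hot label for Class A.
y_hat = np.array([1.0, 0.0, 0.0])

print("softmax(y) =", np.round(y_prime, 3))                       # sums to 1
print("cross-entropy e =", round(cross_entropy(y_hat, y_prime), 3))
print("MSE e =", round(float(np.sum((y_hat - y_prime) ** 2)), 3))
```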
## Normalization
* If two input dimensions have very different scales (e.g. $x_1$ takes values 1, 2, ... while $x_2$ takes values 1000, 2000, ...), the same change applied to one weight can change the loss far more than when applied to the other weight
* By normalizing the feature $x$, we bring all dimensions to a similar scale and range (see the sketch after this section)
* Feature normalization
    * Example: for feature vectors $x^1,x^2,\dots,x^R$
    * Take the scalars from the same dimension across all feature vectors: $x^1_1,x^2_1,\dots,x^R_1$
    * For each dimension $i$, compute the mean $m_i$ and standard deviation $\sigma_i$
    * Set the value to $\tilde{x}^r_i\leftarrow\frac{x^r_i-m_i}{\sigma_i}$
        * $\tilde{x}$ denotes a normalized value
    * After normalization, each dimension has mean 0 and variance 1
* In deep learning, the output passed from one layer to the next may also need to be normalized
    * Whether normalization is applied before or after the activation function does not matter much in practice
    * With sigmoid, normalizing before the activation may be slightly better because the sigmoid has its largest slope near 0
    * Normalizing here requires computing the mean $\mu$ and $\sigma$ over the outputs of the weight layer and applying them
    * This means a change to one weight or one feature vector changes the **entire** set of outputs, not just a single neuron's output
    * ![](https://i.imgur.com/YDXTPnw.png)
* It is not feasible to put all the training data on the computing device at once, so only the batch currently being processed is considered. This is called **Batch Normalization**
    * Only applicable to larger batches
* At test time it is not feasible to form a batch; most of the time we want the output for a single sample, yet the normalization step still requires a $\mu$ and $\sigma$
    * Compute a moving average of $\mu$ and $\sigma$ during training
        * $\bar{\mu}\leftarrow p \bar{\mu}+(1-p)\mu^t$
        * $p$ is a hyperparameter
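A minimal NumPy sketch of feature normalization and of the moving-average statistics that batch normalization keeps for test time; the data, the batch split, and the value of `p` are arbitrary choices for illustration.

```python
import numpy as np

# Made-up feature vectors x^1 ... x^R with two dimensions on very
# different scales, just to illustrate feature normalization.
rng = np.random.default_rng(0)
x = np.stack([rng.uniform(1, 10, 64),          # dimension 1: values ~1..10
              rng.uniform(1000, 10000, 64)],   # dimension 2: values ~1000..10000
             axis=1)                           # shape (R, 2)

# Feature normalization: per-dimension mean m_i and standard deviation sigma_i.
m = x.mean(axis=0)
sigma = x.std(axis=0)
x_tilde = (x - m) / sigma                      # each dimension now has mean 0, variance 1
print(x_tilde.mean(axis=0), x_tilde.var(axis=0))

# Batch-normalization-style running statistics for test time:
# keep a moving average of the batch means (and similarly for sigma).
p = 0.9                                        # hyperparameter
mu_bar = np.zeros(2)
for batch in np.split(x, 8):                   # 8 batches of 8 examples
    mu_t = batch.mean(axis=0)
    mu_bar = p * mu_bar + (1 - p) * mu_t       # mu_bar <- p * mu_bar + (1 - p) * mu^t
print("moving average of mu:", mu_bar)
```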
## Convolution Neural Network (CNN)
* CNN is a network architecture commonly used for image tasks
* With an image as input, a fully connected network may be infeasible (every neuron needs a weight for every input dimension) and is often unnecessary
    * In image recognition, a neuron usually does not need to see the whole image to detect the patterns that matter
* We define areas called **receptive fields**; each neuron only needs to look at a single receptive field
    * Receptive fields can overlap, different neurons can read from the same receptive field, and receptive fields can differ in size, channels, shape, etc.
    * The design is based entirely on domain knowledge of the problem
    * A receptive field is most commonly set up with a 3x3 kernel across all channels, and each receptive field has a set of neurons (e.g. 64 neurons)
    * The offset to the next receptive field is called the stride. It is commonly set to a small value, smaller than the kernel size, so that receptive fields overlap
    * If a receptive field extends beyond the boundary of the image, padding is added to the missing area (e.g. fill with 0)
* Neurons with the same function (e.g. neurons that detect a beak but are responsible for different receptive fields) can share parameters (parameter sharing): their weights and biases are the same
    * Two neurons with the same receptive field would not share parameters, because their outputs would then be identical
    * Each set of parameter-sharing neurons is called a **filter**
* A convolutional layer is a layer that uses receptive fields and parameter sharing to constrain the model for images
* Pooling is a technique for reducing the size of the output. After applying a filter, max pooling keeps only the maximum value in each area of the resulting feature map
    * Pooling can hurt performance, since small details may be lost in the process
* When the convolutions and poolings are done, we **flatten** the result into a feature vector and feed it into fully connected layers to get the final result (a minimal sketch of this pipeline appears at the end of the note)
    * ![](https://i.imgur.com/M2oWwLY.png)
* A CNN is not invariant to scaling and rotation; data augmentation is needed to train the network to recognize objects at different sizes and orientations

## Confusion Matrix
![](https://i.imgur.com/9I1v2s0.png)
![](https://i.imgur.com/mLeMMfo.png)
![](https://i.imgur.com/ikJ55yS.png)
The diagonal line represents random chance (weights and biases chosen completely at random). If the resulting curve falls below the random-chance line, just invert the classifier's outputs.
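To make the receptive-field / parameter-sharing / stride / padding / pooling / flatten pipeline from the CNN section above concrete, here is a minimal single-channel NumPy sketch with a made-up 8x8 image and one random 3x3 filter; a real CNN uses many filters, multiple channels, and learned weights.

```python
import numpy as np

def conv2d(image, kernel, bias=0.0, stride=1, pad=1):
    """Slide one filter (shared weights) over every receptive field."""
    # Zero padding so receptive fields may extend past the image boundary.
    img = np.pad(image, pad, mode="constant", constant_values=0.0)
    k = kernel.shape[0]
    h = (img.shape[0] - k) // stride + 1
    w = (img.shape[1] - k) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            # One receptive field: a k x k patch of the (padded) image.
            patch = img[i * stride:i * stride + k, j * stride:j * stride + k]
            # Same weights and bias for every receptive field (parameter sharing).
            out[i, j] = np.sum(patch * kernel) + bias
    return out

def max_pool(feature_map, size=2):
    """Keep only the maximum value in each size x size area."""
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = feature_map[i * size:(i + 1) * size,
                                    j * size:(j + 1) * size].max()
    return out

# Made-up 8x8 single-channel "image" and a random 3x3 filter.
rng = np.random.default_rng(0)
image = rng.uniform(0, 1, (8, 8))
kernel = rng.normal(size=(3, 3))

feature = np.maximum(0.0, conv2d(image, kernel))   # convolution + ReLU
pooled = max_pool(feature)                         # max pooling
flat = pooled.flatten()                            # flatten -> feed fully connected layers
print(feature.shape, pooled.shape, flat.shape)     # (8, 8) -> (4, 4) -> (16,)
```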