# Deep Learning
(https://www.deeplearningbook.org/)
> [TOC]
## Course 1 - Neural Networks
* **Vectorization significantly speeds up the algorithms; reduce the use of explicit for loops.**
* **Always keep matrix dimensions in mind.**
### Basic Flow of a Neural Network
1. Implement helper functions to calculate activations and their derivatives.
2. Implement a random initializer. *(Random initialization is important to break symmetry.)*
3. Implement forward propagation. *(Make sure this only takes X and the parameters as input, as it will also be used to make predictions.)*
4. Calculate the cost.
5. Implement backpropagation.
6. Implement functions to update the parameters. *(Can be incorporated into the main loop.)*
7. Implement the main loop.
**Make sure to plot the cost to check that it trends downward.** A minimal sketch of this flow is given below.
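A minimal NumPy sketch of this flow for a tiny 2-layer binary classifier. The layer sizes, learning rate and synthetic data here are illustrative placeholders, not from the course notebooks:
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def init_params(n_x, n_h, n_y):
    # Small random weights break symmetry; biases can start at zero.
    return {"W1": np.random.randn(n_h, n_x) * 0.01, "b1": np.zeros((n_h, 1)),
            "W2": np.random.randn(n_y, n_h) * 0.01, "b2": np.zeros((n_y, 1))}

def forward(X, p):
    # Takes only X and the parameters, so it can also be reused for prediction.
    Z1 = p["W1"] @ X + p["b1"]; A1 = np.tanh(Z1)
    Z2 = p["W2"] @ A1 + p["b2"]; A2 = sigmoid(Z2)
    return A2, (X, A1)

def cost(A2, Y):
    A2 = np.clip(A2, 1e-12, 1 - 1e-12)          # avoid log(0)
    return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

def backward(p, cache, A2, Y):
    X, A1 = cache
    m = Y.shape[1]
    dZ2 = A2 - Y                                 # gradient of BCE + sigmoid
    dW2 = dZ2 @ A1.T / m; db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (p["W2"].T @ dZ2) * (1 - A1 ** 2)      # tanh'(z) = 1 - tanh(z)^2
    dW1 = dZ1 @ X.T / m;  db1 = dZ1.sum(axis=1, keepdims=True) / m
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

# Main loop: forward -> cost -> backward -> update; plot `costs` afterwards.
X = np.random.randn(2, 100); Y = (X.sum(axis=0, keepdims=True) > 0) * 1.0
params, costs, lr = init_params(2, 4, 1), [], 0.5
for i in range(1000):
    A2, cache = forward(X, params)
    costs.append(cost(A2, Y))
    grads = backward(params, cache, A2, Y)
    for k in params:
        params[k] -= lr * grads[k]
```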
### How NN's Work (Trivial)
Neural networks do something similar to a linear classifier/regressor: they try to learn a function and reduce the loss across all training samples as much as they can.
Logistic regression can be considered the simplest neural network.
### Cross Entropy Cost function
$$
\mathcal{L}(y,\hat{y}) = -(y\log{\hat{y}} + (1-y)\log{(1-\hat{y})})
$$
For m training examples,
$$
J = {1\over m}\sum_{i=1}^m-\left(y^{(i)}\log{\hat{y}^{(i)}} + (1-y^{(i)})\log{(1-\hat{y}^{(i)})}\right)
$$
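A small NumPy sketch of this cost, assuming `yhat` and `y` are row vectors of shape (1, m):
```python
import numpy as np

def cross_entropy_cost(yhat, y, eps=1e-12):
    yhat = np.clip(yhat, eps, 1 - eps)           # avoid log(0)
    losses = -(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))
    return losses.mean()                          # average over the m examples

yhat = np.array([[0.9, 0.2, 0.7]])
y    = np.array([[1.0, 0.0, 1.0]])
print(cross_entropy_cost(yhat, y))
```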
### Updating Parameters
The learnable parameters $W^{[l]}$ and $b^{[l]}$ are updated as such,
$$
W^{[l]} = W^{[l]} - \alpha {\partial J \over \partial W^{[l]}}
$$
$$
b^{[l]} = b^{[l]} - \alpha {\partial J \over \partial b^{[l]}}
$$
This update is what reduces the cost function; these are the parameters the neural network learns.
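A hedged sketch of this update step, assuming `params` and `grads` are dictionaries keyed `"W1", "b1", ...` as produced by forward/backward propagation:
```python
def update_parameters(params, grads, alpha):
    L = len(params) // 2                              # number of layers
    for l in range(1, L + 1):
        params[f"W{l}"] -= alpha * grads[f"W{l}"]     # W[l] := W[l] - alpha * dJ/dW[l]
        params[f"b{l}"] -= alpha * grads[f"b{l}"]     # b[l] := b[l] - alpha * dJ/db[l]
    return params
```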
### Activation Functions
Activation functions introduce non-linearity; without them, stacked layers collapse into a single linear function.
Useful activation functions for hidden layers are the $\tanh$ function and **ReLU**.
*ReLU - Rectified Linear Unit*
For the output layer, the activation function depends on the type of output you need; for example, a **Sigmoid** function can be used if the output needs to be between 0 and 1 (binary classification).
For multi-class classification, **Softmax** activation can be used.
Usually a layer will have a flow like
Base_function->Activation_function
Linear->ReLU
$$
\mathrm{ReLU}(x) = \max(0,x)
$$
$$
\tanh{(x)} = {e^x - e^{-x} \over e^x+e^{-x}}
$$
$$
\sigma(x) = {1 \over 1+e^{-x} }
$$
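The same three activations written out in NumPy, for reference:
```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def tanh(x):                       # same as np.tanh, written out for clarity
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
```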
## Course 2 - Improving Deep Nets
**In all the descent algorithms below, b is updated in the same way as W.**
### Regularization
To regularize (reduce overfitting),
we change the cost function to
$$
J(y,\hat{y},W) = {1\over m}\sum_{i=1}^m-\left(y^{(i)}\log{\hat{y}^{(i)}} + (1-y^{(i)})\log{(1-\hat{y}^{(i)})}\right) + {\lambda \over 2m} \sum_{l=1}^L {||W^{[l]}||}_F^2
$$
$\lambda$ is the regularization hyperparameter.
Larger $\lambda$ means **stronger regularization**.
The norm in the added term, $||W^{[l]}||_F$, is the **Frobenius norm** of the weight matrix; it is the matrix analogue of the L2 norm.
The L1 norm can also be used, but it makes the model sparse, just as in a Lasso regressor; use the L1 norm if you want a parsimonious model.
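A sketch of the added L2 (Frobenius) penalty, assuming the usual `"W1", "b1", ...` parameter dictionary:
```python
import numpy as np

def l2_penalty(params, lambd, m):
    L = len(params) // 2
    frob = sum(np.sum(np.square(params[f"W{l}"])) for l in range(1, L + 1))
    return (lambd / (2 * m)) * frob

# During backprop, the corresponding extra gradient term is (lambd / m) * W[l].
```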
### Dropout Regularization
Dropout regularization ignores the outputs of some neurons in a layer with some probability. This regularizes the network because the next layer cannot rely too heavily on any particular neurons in the previous layer.
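A sketch of inverted dropout for one layer's activations (the `keep_prob` value here is just an example):
```python
import numpy as np

def dropout_forward(A, keep_prob=0.8):
    mask = (np.random.rand(*A.shape) < keep_prob)   # which neurons to keep
    A = A * mask / keep_prob                        # scale up to preserve the expected value
    return A, mask                                  # the mask is reused in backprop

# At test time no dropout is applied (all neurons are used, no scaling needed).
```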
### Weight Initialization
**Xavier/Glorot initialization** - multiply the random initialization by
$$
\sqrt{1 \over n}
$$
> where n is the size of the previous layer *(or, for the first layer, the number of input features)*
**He initialization** *(use when using ReLU)* - multiply by
$$
\sqrt{2 \over n}
$$
The $\sqrt{1 \over n}$ scaling is the variant usually paired with $\tanh$ activations.
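A sketch of these scaled initializations in NumPy (passing `n_prev` explicitly is an assumption about how the shapes are handed in):
```python
import numpy as np

def xavier_init(n, n_prev):
    return np.random.randn(n, n_prev) * np.sqrt(1.0 / n_prev)

def he_init(n, n_prev):               # preferred with ReLU activations
    return np.random.randn(n, n_prev) * np.sqrt(2.0 / n_prev)
```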
### Exponentially Weighted Moving Averages
$V_{dW} = \beta{V_{dW}}+(1-\beta)dW$
This approximately corresponds to averaging over the last $1\over{1-\beta}$ values.
#### Bias Correction
To correct the bias introduced at early time steps by the exponentially weighted average, use ${V_{dW}\over{1-\beta^t}}$ instead of $V_{dW}$.
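A sketch of the weighted average with bias correction applied to an arbitrary stream of values (random numbers here, purely for illustration):
```python
import numpy as np

beta, v = 0.9, 0.0
values = np.random.randn(100)
for t, x in enumerate(values, start=1):
    v = beta * v + (1 - beta) * x
    v_corrected = v / (1 - beta ** t)   # undo the bias toward 0 at small t
```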
### Gradient Descent With Momentum
Using Exponentially weighted averages.
Calculate $V_{dW}$ as follows,
$V_{dW} = \beta{V_{dW}}+(1-\beta)dW$
then update parameters as,
$$
W = W - \alpha{V_{dW}}
$$
$\beta$ is usually 0.9
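A sketch of the momentum update for a single parameter matrix `W`; `dW` is the gradient from backprop on the current mini-batch, and the defaults are the common values quoted above:
```python
def momentum_step(W, dW, v_dW, alpha=0.01, beta=0.9):
    v_dW = beta * v_dW + (1 - beta) * dW   # exponentially weighted average of gradients
    W = W - alpha * v_dW
    return W, v_dW
```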
### RMS Prop
Using Exponentially weighted averages.
Calculate $S_{dW}$ as follows,
$S_{dW} = \beta{S_{dW}}+(1-\beta)dW^2$ *(the square is element-wise)*
then update parameters as,
$$
W = W - \alpha{ dW\over \sqrt{S_{dW}} + \epsilon}
$$
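A sketch of the RMSProp update for a single parameter matrix `W` (the decay value 0.9 is a common choice, not specified in the notes):
```python
import numpy as np

def rmsprop_step(W, dW, s_dW, alpha=0.001, beta=0.9, eps=1e-8):
    s_dW = beta * s_dW + (1 - beta) * dW ** 2        # element-wise square
    W = W - alpha * dW / (np.sqrt(s_dW) + eps)
    return W, s_dW
```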
### Adaptive Moment Estimation (ADAM)
Using Exponentially weighted averages.
Calculate $V_{dW}$ as follows,
$V_{dW} = \beta_{1}{V_{dW}}+(1-\beta_{1})dW$
and calculate $S_{dW}$ as follows,
$S_{dW} = \beta_{2}{S_{dW}}+(1-\beta_{2})dW^2$
Then correct them using bias correction
$V_{dW}^{corrected} = {V_{dW} \over 1-\beta_{1}^t}$
$S_{dW}^{corrected} = {S_{dW} \over 1-\beta_{2}^t}$
where t is the current iteration number.
Then update parameters as,
$$
W = W - {\alpha{V_{dW}^{corrected}} \over \sqrt{S_{dW}^{corrected}} + \epsilon}
$$
Usually *(As suggested by authors of ADAM)*
$\beta_1$ = 0.9
$\beta_2$ = 0.999
$\epsilon$ = $10^{-8}$
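A sketch of the Adam update for one parameter matrix, using the defaults quoted above; `t` is the current iteration number, starting at 1:
```python
import numpy as np

def adam_step(W, dW, v_dW, s_dW, t, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    v_dW = beta1 * v_dW + (1 - beta1) * dW           # momentum term
    s_dW = beta2 * s_dW + (1 - beta2) * dW ** 2      # RMSProp term (element-wise square)
    v_hat = v_dW / (1 - beta1 ** t)                  # bias correction
    s_hat = s_dW / (1 - beta2 ** t)
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, v_dW, s_dW
```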
### Batch Norm(alization)
Batch norm normalizes the outputs (usually the pre-activation z values) of each hidden layer over the current mini-batch before feeding them to the next layer, then rescales and shifts them with learnable parameters $\gamma$ and $\beta$.
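A sketch of batch norm at training time on one layer's pre-activations `Z` of shape (units, batch size); `gamma` and `beta` here are the learnable scale and shift:
```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)   # zero mean, unit variance per unit
    return gamma * Z_norm + beta             # then rescale and shift
```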
### Softmax
It is used for multi-class classification.
When using softmax, the loss function changes to
$$
J = -{1\over m}\sum_{i=1}^m\sum_{j=1}^{C} y_j^{(i)}\log{a_j^{(i)}}
$$
where the inner sum runs over the C output classes.
This term only pushes the network to increase the probability of the required (correct) class;
the binary cross-entropy term $-(1-y)\log{(1-\hat{y})}$, which would explicitly push down the probabilities of the unwanted classes, is omitted *(see Points to Remember below)*.
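A sketch of softmax and its cost, assuming `Z` and the one-hot `Y` both have shape (classes, m):
```python
import numpy as np

def softmax(Z):
    e = np.exp(Z - Z.max(axis=0, keepdims=True))   # subtract max for numerical stability
    return e / e.sum(axis=0, keepdims=True)

def softmax_cost(A, Y, eps=1e-12):
    m = Y.shape[1]
    return -np.sum(Y * np.log(A + eps)) / m        # only the correct-class term survives
```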
## Course 4 - Convolutional NN's
ConvNets, instead of using fully connected units, learn filters that are convolved with the input image (technically cross-correlated) to detect patterns in the image.
They are extremely effective for image tasks such as classification and detection.
### Formula to keep in mind
$n$ = size of image
$f$ = size of filter
$s$ = stride
$p$ = padding
$$
n'= {n+2p-f\over s} +1
$$
*If the result is not an integer, it is rounded down (floored).*
*The "convolution" here is technically cross-correlation (the filter is not flipped), but since the network learns the filter values anyway, it doesn't matter.*
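A tiny sketch of the formula:
```python
def conv_output_size(n, f, s=1, p=0):
    return (n + 2 * p - f) // s + 1   # floor division handles the rounding down

print(conv_output_size(n=6, f=3, s=1, p=0))   # -> 4
```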
### ResNets
ResNets add skip connections that pass the output of one layer to a layer 2 or 3 steps deeper in the network. This makes it easy for a block to learn the identity function and prevents the drop in performance that otherwise occurs when more layers are added to a very deep network.
### Inception Nets
Inception Nets let the network apply several different filter sizes (and pooling) in parallel within the same layer and concatenate the results, instead of committing to a single filter size.
### Face Recognition
To achieve one-shot learning in face recognition, we need to learn a similarity function.
To do this we first need a model that encodes images into a fixed-length vector *(this model should be very accurate)*;
the encodings from this model are then compared.
**(An encoder whose outputs are meant to be compared like this is called a Siamese neural network.)**
*https://en.wikipedia.org/wiki/Siamese_neural_network*
#### **Triplet Loss**
It is the loss function used to train a Siamese network for face recognition.
$$
L(A,P,N) = \max( ||f(A) - f(P)||^2-||f(A) - f(N)||^2+\alpha,\ 0)
$$
where,
$A =$ **Anchor**
$P =$ **Positive**
$N =$ **Negative**
$f =$ **Encoding Function**
It is recommended that the triplets (A, P, N) be chosen carefully so that N and P are not trivially easy to tell apart, i.e. pick "hard" triplets.
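A sketch of the triplet loss for a batch of encodings, assuming `fA`, `fP`, `fN` are arrays of shape (batch, encoding_dim) produced by the encoder:
```python
import numpy as np

def triplet_loss(fA, fP, fN, alpha=0.2):
    pos = np.sum((fA - fP) ** 2, axis=1)   # ||f(A) - f(P)||^2
    neg = np.sum((fA - fN) ** 2, axis=1)   # ||f(A) - f(N)||^2
    return np.mean(np.maximum(pos - neg + alpha, 0.0))
```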
An alternative to the triplet loss is to feed the encodings from the trained model into a logistic regression,
$$
\hat{y} = \sigma\left(\sum_k W_k|f(x^{(i)})_k-f(x^{(j)})_k|+b\right)
$$
to tell whether images $x^{(i)}$ and $x^{(j)}$ are of the same person (1) or not (0).
Additionally, instead of the direct element-wise difference of $f(x^{(i)})$ and $f(x^{(j)})$, you can use the
$\chi$-squared distance:
$$
(f(x^{(i)})-f(x^{(j)}))^2\over {f(x^{(i)})+f(x^{(j)})}
$$
## Course 5 - Sequence Models
### RNN's
Recurrent neural networks use the same set of parameters W and b at every time step. They are usually not very deep; typically at most about 3 recurrent layers are stacked.
At each timestep there are these parameters:
1. $W_{aa}$ - multiplied by the previous time step's activation $a^{<t-1>}$ to contribute to the current activation.
2. $W_{ax}$ - multiplied by the current time step's input $x^{<t>}$ to contribute to the current activation.
3. $b_{a}$
4. $W_{ya}$ - multiplied by the current time step's activation to get the current output $\hat{y}^{<t>}$.
5. $b_{y}$
$$
a^{<t>} = \tanh(W_a[a^{<t-1>},x^{<t>}]+b_{a})
$$
$$
\hat{y}^{<t>} = \sigma(W_y{a^{<t>}}+b_{y})
$$
$W_{aa}$ and $W_{ax}$ can be concatenated horizontally to get $W_a$, and $a^{<t-1>}$ and $x^{<t>}$ can be stacked vertically; the two can then be multiplied directly.
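A sketch of one RNN time step using this stacked form; the shapes are assumptions (`Wa` is (n_a, n_a + n_x), `Wy` is (n_y, n_a)), and the output uses a sigmoid as in the formula above:
```python
import numpy as np

def rnn_step(a_prev, x_t, Wa, ba, Wy, by):
    concat = np.vstack([a_prev, x_t])          # stack a^{<t-1>} on top of x^{<t>}
    a_t = np.tanh(Wa @ concat + ba)
    y_t = 1 / (1 + np.exp(-(Wy @ a_t + by)))   # sigmoid output, e.g. for binary y
    return a_t, y_t
```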
### GRU (Gated Recurrent Unit)
A Gated Recurrent Unit has an extra memory cell $c$.
At every time step the memory cell can be overwritten by a candidate cell $\tilde{c}$:
$$
\tilde{c}^{<t>} = \tanh(W_c[c^{<t-1>},x^{<t>}]+ b_c)
$$
*Here $W_c$ and $b_c$ are new parameters*
The gate (update gate) is calculated as:
$$
\Gamma_u = \sigma(W_u[c^{<t-1>},x^{<t>}]+b_u)
$$
Now whether or not $c$ is updated to $\tilde{c}$ is decided by:
$$
c^{<t>} = \Gamma_u\tilde{c}^{<t>} + (1-\Gamma_u)c^{<t-1>}
$$
In this way the GRU can keep information in its memory cell for a long time, whereas in a plain RNN the information decays over time steps.
A full GRU also uses a relevance gate $\Gamma_r$, which enters the calculation of $\tilde{c}$ as:
$$
\tilde{c}^{<t>} = \tanh(W_c[\Gamma_r*c^{<t-1>},x^{<t>}]+ b_c)
$$
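A sketch of one full GRU step following the equations above (shapes follow the RNN sketch, with `c` playing the role of the hidden state):
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_step(c_prev, x_t, Wc, bc, Wu, bu, Wr, br):
    concat = np.vstack([c_prev, x_t])
    gamma_u = sigmoid(Wu @ concat + bu)                    # update gate
    gamma_r = sigmoid(Wr @ concat + br)                    # relevance gate
    c_tilde = np.tanh(Wc @ np.vstack([gamma_r * c_prev, x_t]) + bc)
    c_t = gamma_u * c_tilde + (1 - gamma_u) * c_prev       # keep or overwrite memory
    return c_t
```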
### LSTM (Long Short Term Memory)
Just like a GRU, an LSTM uses gates, but instead of two it uses three:
$\Gamma_u$ - update gate
$\Gamma_f$ - forget gate
$\Gamma_o$ - output gate
They are used just as their names suggest:
$$
\tilde{c}^{<t>} = \tanh(W_c[a^{<t-1>},x^{<t>}]+ b_c)
$$
$$
\Gamma_u = \sigma(W_u[a^{<t-1>},x^{<t>}]+b_u)
$$
$$
\Gamma_f = \sigma(W_f[a^{<t-1>},x^{<t>}]+b_f)
$$
$$
\Gamma_o = \sigma(W_o[a^{<t-1>},x^{<t>}]+b_o)
$$
$$
c^{<t>} = \Gamma_u\tilde{c}^{<t>} + \Gamma_fc^{<t-1>}
$$
$$
a^{<t>} = \Gamma_o\tanh(c^{<t>})
$$
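A sketch of one LSTM step with the three gates, following the equations above; `a_prev` and `c_prev` are the previous activation and memory cell:
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, Wc, bc, Wu, bu, Wf, bf, Wo, bo):
    concat = np.vstack([a_prev, x_t])
    c_tilde = np.tanh(Wc @ concat + bc)
    gamma_u = sigmoid(Wu @ concat + bu)        # update gate
    gamma_f = sigmoid(Wf @ concat + bf)        # forget gate
    gamma_o = sigmoid(Wo @ concat + bo)        # output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev
    a_t = gamma_o * np.tanh(c_t)
    return a_t, c_t
```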
### Bi-Directional RNN's
A bi-directional RNN runs one recurrent pass forward and one backward over the sequence and combines the two activations at each time step, so the output can use both past and future context.
### Deep RNN's
Very deep RNNs are rarely used; they are just simple RNNs stacked vertically to form layers, with each layer's activations feeding the layer above at the same time step.
---
> [name=ZeoDarkflame]
# Points to Remember
## Why the 2nd term in the cost function can be omitted when using Softmax
In a multi-class classification problem, the one-hot label has only a single positive class (the correct one) and C-1 negative classes. Removing the 2nd term, which is responsible for reducing the probabilities of the negative (incorrect) classes, makes the cost cheaper to compute, reducing the computation roughly by a factor of $1\over C$ where C is the number of classes.
> Consider the full cost function with both terms for a 10-class output. A label has 9 zeros, so instead of evaluating the second term for those 9 entries, we can simply optimise the position of the 1 rather than also trying to drive the other 9 terms to zero. This reduces the computation to about $1\over 10$ of the original, so for C classes the computation decreases by a factor of $1\over C$.
## Why the cost may rise on a seemingly good model when the 2nd term of the softmax cost is included
When argmax (taking the class with the maximum probability) is applied to the output, a vector such as
[0.01,0.25,0.26,0.06,0.064 .....] (consider 10 classes)
can still give the correct prediction, but with the probabilities this close it is not a good result (compared to the one-hot output we want).
So the actual cost may be rising:
the first term, which only cares about the desired output approaching 1, decreases, but the other outputs do not go to zero very well, since their gradients approach zero as the individual outputs approach zero.
Adding the second term is like adding an extra penalty that is in any case very hard to minimize because of the vanishing gradients described above.
###### tags: `Deep Learning`