# WEEK 7 (4-8/1/2021)

## Intuition of DEEP LEARNING (Mon, 4/1/2021)

### AI vs ML vs DL

![](https://i.imgur.com/xsvT94h.png)

==**Machine learning**==: a program developed by letting a computer learn from its experience, rather than by manually coding every individual step

==**Deep learning**==: a modern area of machine learning, consisting of neural networks with multiple layers in between

![](https://i.imgur.com/ETpP1ze.png)

### Why deep learning?

Learn the underlying features directly from data

![](https://i.imgur.com/s43XH7L.png)

**WHY NOW?**
1. Big Data (larger datasets, easier collection and storage)
2. Hardware (GPUs)
3. Software (new models, toolboxes, techniques, ...)

### Artificial neural network

![](https://i.imgur.com/QBYha88.png)
![](https://i.imgur.com/tEv5h42.png)

**==The activation function==** (e.g. sigmoid)

Purpose: to ==**introduce non-linearities**== into the network

!!! ALL activation functions are ==non-linear==

Sigmoid and tanh disadvantage: the gradient saturates toward zero ---> almost no update on the weights ---> vanishing gradient problem

### ==**Rectified linear unit (ReLU)**== ---> helps avoid the vanishing gradient problem

### ==**The cost function**==
* ==Binary cross entropy== can be used with models that output a probability between 0 and 1
* ==Mean squared error loss== can be used with regression models that output continuous real numbers

### Training Neural Networks

#### ==**Loss optimization**==: find the network weights that achieve the lowest loss

### Gradient descent

![](https://i.imgur.com/3FzFSsE.png)

### Hyperparameter tuning
* Small learning rate: converges slowly and gets stuck in local minima
* Large learning rate: overshoots, becomes unstable and diverges
* Weight initialization (for breaking symmetry)
* Number of hidden layers
* Number of neurons per hidden layer
* Batch size ...

### Logistic regression

![](https://i.imgur.com/pYJSlIo.png)

## Mechanics of TensorFlow (Tue, 5/1/2021)

### **==TENSORFLOW==**
* end-to-end open source platform for ML
* comprehensive, flexible ecosystem of tools, libraries, resources

**TensorFlow offers**:
* similar to NumPy, but with **GPU/TPU support**
* supports **distributed computing** across multiple devices and servers
* TF can extract the **computation graph** from a Python function, then optimize it
* computation graphs can be exported to a **portable format**, so you can run them in another environment (e.g. a mobile device)

### **==Computation Graph==** (TF builds graphs to describe computations)

STEPS:
* Build the graph, e.g. design the NN architecture
* Initialize a session
* Data in & out: send data in and out of the graph in the session to do the computation

![](https://i.imgur.com/XelNF1H.png)

### TensorFlow basics

#### Tensor

Constant tensor (is immutable)
```
x = tf.constant([[5, 2], [1, 3]])
print(x)
```
Output
```
tf.Tensor(           # a tf.Tensor is constant and immutable
[[5 2]
 [1 3]], shape=(2, 2), dtype=int32)
```

#### ==NumPy Compatibility==
```
x.numpy()  # get values as a NumPy array
```
---
```
x.dtype
x.shape
```
---
```
tf.ones(shape=(2, 1))
tf.zeros(shape=(2, 1))
```
---
```
tf.random.normal(shape=(2, 2), mean=0., stddev=1.)
```
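A minimal sketch (not from the lecture) pulling the tensor basics above together; the names `x`, `y`, `arr` are just illustrative:
```
import numpy as np
import tensorflow as tf

# constant tensors support NumPy-style math, but can run on a GPU/TPU
x = tf.constant([[5., 2.], [1., 3.]])
y = tf.ones(shape=(2, 2))

print(x + y)              # element-wise addition
print(tf.matmul(x, y))    # matrix multiplication
print(tf.reduce_mean(x))  # mean over all elements -> scalar tensor

# conversion works both ways
arr = np.array([[1., 2.], [3., 4.]])
t = tf.convert_to_tensor(arr)  # NumPy -> Tensor
back = t.numpy()               # Tensor -> NumPy
```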
---

#### ==Variables==: special tensors used to store mutable state
```
initial_value = tf.random.normal(shape=(2, 2))
a = tf.Variable(initial_value)
```
* To assign a new value to a Variable, use `.assign()`
```
new_value = tf.random.normal(shape=(2, 2))
a.assign(new_value)
```
---
* `+=` / `-=` equivalents for a Variable
```
added_value = tf.random.normal(shape=(2, 2))
a.assign_add(added_value)  # ADD
a.assign_sub(added_value)  # SUBTRACT
```
---
#### ==Computing gradients with **GradientTape**==
```
a = tf.constant([3.])
b = tf.constant([4.])

with tf.GradientTape() as tape:
    tape.watch(a)  # start recording the history of operations applied to `a`
                   # (tf.Variables are watched automatically; constant tensors must be watched explicitly)
    c = tf.sqrt(tf.square(a) + tf.square(b))  # do some math using `a`

# What's the gradient of `c` with respect to `a`?
dc_da = tape.gradient(c, a)
```

### **==Keras API==**

![](https://i.imgur.com/UaJSjZ7.png)

Using Keras to load the dataset:
```
fashion_mnist = tf.keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test_full, y_test) = fashion_mnist.load_data()
```
---
Creating a validation set (and scaling pixel values to [0, 1]):
```
X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test_full / 255.0
```
---
```
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
```

#### **The Sequential API**
* composed of a single stack of layers connected sequentially

!!! **==EACH neuron has 1 bias==**

**==NOTE==**: `sparse_categorical_crossentropy` is used when the target class is an integer, e.g. 0 to 9. If instead we had one-hot vectors, then we would need to use the `categorical_crossentropy` loss instead.

#### **Functional API**

![](https://i.imgur.com/DJvTR1h.png)

#### **Subclassing API**: build a custom forward pass

## Neural Network from scratch (Wed, 6/1/2021)

### The cost $J$ (binary cross-entropy) is as follows:

$$J = - \frac{1}{m}\sum\left( Y \log\left(A^{[2]}\right) + (1-Y)\log\left(1- A^{[2]}\right) \right)$$

### STEPS to build a neural network
1. Define the neural network structure (# of input units, # of hidden units, etc.)
2. Initialize the model's parameters
3. Loop:
    - Implement forward propagation
    - Compute loss
    - Implement backward propagation to get the gradients (with the help of TensorFlow's GradientTape)
    - Update parameters (one step of gradient descent)

### Initialize the model's parameters
* Initialize the weight matrices with random values.
    - Use `tf.random.uniform(shape=(a, b), minval=0, maxval=0.01)` to randomly initialize a matrix of shape (a, b).
* Initialize the bias vectors as zeros.
    - Use `tf.zeros(shape=(a, b))` to initialize a matrix of shape (a, b) with zeros.

## Intro to TensorFlow/Keras (Thu, 7/1/2021)

* train: repeat ---> shuffle ---> batch ---> prefetch
* testing: batch ---> prefetch

(see the pipeline and training-step sketches below)

!!! **==weights and biases need to be tf.Variables to be watched by GradientTape==** !!!
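A minimal sketch of those two `tf.data` pipelines, reusing the Fashion MNIST splits from Tuesday's notes; the batch and buffer sizes are illustrative assumptions:
```
import tensorflow as tf

batch_size = 32

# training pipeline: repeat ---> shuffle ---> batch ---> prefetch
train_ds = (tf.data.Dataset.from_tensor_slices((X_train, y_train))
            .repeat()                     # loop over the data indefinitely
            .shuffle(buffer_size=10000)   # shuffle within a buffer of examples
            .batch(batch_size)            # group examples into mini-batches
            .prefetch(1))                 # prepare the next batch while the current one trains

# test pipeline: batch ---> prefetch (no repeating or shuffling needed)
test_ds = (tf.data.Dataset.from_tensor_slices((X_test, y_test))
           .batch(batch_size)
           .prefetch(1))
```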
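And a sketch of one manual gradient-descent step for a single dense layer, showing why the parameters must be `tf.Variable`s (the tape records operations on them automatically). The layer sizes, casts and softmax cross-entropy loss are assumptions for a Fashion-MNIST-style classifier, not code from the course:
```
import tensorflow as tf

n_inputs, n_outputs = 28 * 28, 10
learning_rate = 0.01

# parameters are tf.Variables, so GradientTape watches them automatically
W = tf.Variable(tf.random.uniform(shape=(n_inputs, n_outputs), minval=0, maxval=0.01))
b = tf.Variable(tf.zeros(shape=(1, n_outputs)))

def train_step(X_batch, y_batch):
    with tf.GradientTape() as tape:
        X = tf.reshape(tf.cast(X_batch, tf.float32), (-1, n_inputs))  # flatten the images
        logits = tf.matmul(X, W) + b                                  # forward propagation
        loss = tf.reduce_mean(                                        # compute loss
            tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=tf.cast(y_batch, tf.int32), logits=logits))
    dW, db = tape.gradient(loss, [W, b])                              # backward propagation
    W.assign_sub(learning_rate * dW)                                  # update parameters
    b.assign_sub(learning_rate * db)                                  # (one gradient-descent step)
    return loss
```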
## Fine-tuning DNN / cheatsheet (Fri, 8/1/2021)

### Mini-batch
* Stochastic gradient descent: batch size 1, i.e. one example at a time is used to take a single step
* Gradient descent: batch size m, i.e. all the training data is taken into consideration to take a single step
* **==Mini-batch gradient descent==**: a fixed-size batch of training examples is used to take a single step

![](https://i.imgur.com/QB9AsCo.png)

### Learning rate

#### **==Impact of learning rate==**

![](https://i.imgur.com/Ct1poa4.png)

---> start with a **small** learning rate, then increase gradually

##### Vanishing/exploding gradient problems

![](https://i.imgur.com/GBXOJx6.png)

##### Learning rate decay

![](https://i.imgur.com/SKZFLhA.png)

### Prevent Overfitting

==**Training loss / Validation loss**==

![](https://i.imgur.com/8oDePFG.png)

==**TECHNIQUES TO FIX**== (a Keras sketch combining several of these is at the end of this section)
1. Early stopping: stop training before we have a chance to overfit
   ![](https://i.imgur.com/VRyjRGd.png)
2. Dropout: during training, randomly set some activations to 0 ---> forces the network not to rely on any single node
3. Weight regularization (L1 & L2):
   ![](https://i.imgur.com/Q9n358A.png)
4. Data augmentation: horizontal flips, color, contrast
5. Normalize inputs: rescale them to a standard normal distribution before passing them through the activation function

### Optimizers (how parameters are updated)

#### Faster optimizers (see the configuration sketch below)
* **==Momentum==**: speeds up GD and helps it escape local minima
* **==Adam==**: has both momentum and learning rate decay
* RMSprop
* SGD
* SGD + Momentum
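A minimal sketch of how these optimizers (plus a learning-rate decay schedule, as in the subsection above) can be set up in Keras; the hyperparameter values are illustrative assumptions, not recommendations from the lecture:
```
import tensorflow as tf

# plain SGD, SGD with momentum, RMSprop, Adam
sgd      = tf.keras.optimizers.SGD(learning_rate=0.01)
sgd_mom  = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
rmsprop  = tf.keras.optimizers.RMSprop(learning_rate=0.001)
adam     = tf.keras.optimizers.Adam(learning_rate=0.001)  # momentum + adaptive per-parameter rates

# learning rate decay: shrink the step size as training progresses
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=10000, decay_rate=0.9)
sgd_decay = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
```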
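And a sketch combining dropout, L2 weight regularization and early stopping on the Fashion MNIST splits from Tuesday's notes; the layer sizes, rates and patience value are assumptions for illustration:
```
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(0.01)),  # L2 weight penalty
    tf.keras.layers.Dropout(0.3),  # randomly zero 30% of activations, during training only
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # integer targets 0-9
              metrics=["accuracy"])

# early stopping: halt when validation loss stops improving and keep the best weights
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)

history = model.fit(X_train, y_train,
                    validation_data=(X_valid, y_valid),
                    epochs=100, callbacks=[early_stop])
```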