# WEEK 7 (4-8/1/2021)
## Intuition of DEEP LEARNING (Mon, 4/1/2021)
### AI vs ML vs DL

==**Machine learning**==: a program developed by allowing a computer to learn from its experience, rather than by manually coding every individual step
==**Deep learning**==: a modern area of machine learning consisting of neural networks with multiple layers in between

### Why deep learning? learn underlying features directly from data

**WHY NOW?**
1. Big Data (larger dataset, easier collection and storage)
2. Hardware (GPUs)
3. Software (new models, toolboxes, techniques,...)
### Artificial neural network
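A single neuron computes a weighted sum of its inputs plus a bias, then applies an activation function (standard definition; the notation is assumed here):

$$a = \sigma\!\left(\sum_i w_i x_i + b\right)$$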


**==The activation function==** (e.g. sigmoid)
Purpose: to ==**introduce non-linearities**== into the network
!!! ALL activation functions are ==non-linear==
Sigmoid and Tanh disadvantage: their gradients approach zero in the saturated regions ---> almost no weight update ---> vanishing gradient problem
### **==Rectified linear unit (ReLU)==**
---> mitigates the vanishing gradient problem
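A small sketch (using GradientTape, covered on Tuesday) of why sigmoid saturates while ReLU keeps its gradient; the sample inputs are made up:
```
import tensorflow as tf

x = tf.Variable([-5.0, 0.0, 5.0])

# sigmoid: gradient shrinks toward zero for large |x| (saturation)
with tf.GradientTape() as tape:
    y = tf.sigmoid(x)
print(tape.gradient(y, x))   # values near 0 at x = -5 and x = 5

# ReLU: constant gradient of 1 for positive inputs, so it does not vanish
with tf.GradientTape() as tape:
    y = tf.nn.relu(x)
print(tape.gradient(y, x))   # [0., 0., 1.]
```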
### ==**The cost function**==
* ==Binary cross entropy== can be used with models that output a probability between 0 and 1
* ==Mean squared error loss== can be used with regression models that output continuous real numbers (both are sketched below)
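A quick sketch of both losses through `tf.keras.losses` (the sample targets and predictions are made up):
```
import tensorflow as tf

# binary cross entropy: targets are 0/1, predictions are probabilities in (0, 1)
bce = tf.keras.losses.BinaryCrossentropy()
print(bce([0., 1., 1.], [0.1, 0.8, 0.9]).numpy())

# mean squared error: targets and predictions are continuous real numbers
mse = tf.keras.losses.MeanSquaredError()
print(mse([1.5, 2.0], [1.2, 2.4]).numpy())
```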
### Training Neural Networks
#### ==**Loss optimization**==: find the network weights that achieve the lowest loss
### Gradient descent
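The basic idea (standard update rule, with learning rate $\eta$ and loss $J$): each weight takes a small step against its gradient,

$$w \leftarrow w - \eta \,\frac{\partial J}{\partial w}$$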

### Hyperparameters tuning
* A small learning rate converges slowly and can get stuck in local minima
* A large learning rate overshoots, becomes unstable and diverges
* Weight initialization for Breaking Symmetry
* Number of Hidden layers
* Number of neurons per hidden layer
* Batch size ...
### Logistic regression

## Mechanics of Tensorflow (Tue, 5/1/2021)
### **==TENSORFLOW==**
* end-to-end open source platform for ML
* comprehensive, flexible ecosystem of tools, libraries, and resources
**TensorFlow offers**:
* similar to Numpy, but with **GPU/TPU support**
* supports **distributed computing** across multiple devices and servers
* TF can extract the **computation graph** from a Python function, then optimize it
* Computation graphs can be exported to a **portable format**, so you can run them in another environment (e.g. a mobile device)
### **==Computation Graph==** (TF builds graphs to describe computations)
STEPS:
* Build the graph, e.g. design the NN architecture
* Initialize a session
* Data in & out: feed data into the graph and fetch results within the session to perform the computation (a TF2 sketch follows)
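In TF2 the graph is usually extracted automatically from a Python function with `tf.function` (minimal sketch; the function name is made up):
```
import tensorflow as tf

@tf.function            # traces the Python function into a computation graph on first call
def hypotenuse(a, b):
    return tf.sqrt(tf.square(a) + tf.square(b))

print(hypotenuse(tf.constant(3.0), tf.constant(4.0)))   # tf.Tensor(5.0, ...)
```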

### TensorFlow basics
#### Tensor
Constant tensor (is immutable)
```
x = tf.constant([[5, 2], [1, 3]])
print(x)
```
Output
```
tf.Tensor( #tf.Tensor is constant and immutable
[[5 2]
[1 3]], shape=(2, 2), dtype=int32)
```
#### ==NumPy Compatibility==
```
x.numpy() # get values as a Numpy array
```
---
```
x.dtype   # data type of the tensor
x.shape   # shape of the tensor
```
---
```
tf.ones(shape=(2, 1))
tf.zeros(shape=(2, 1))
```
---
```
tf.random.normal(shape=(2, 2), mean=0., stddev=1.)
```
---
#### ==Variables==: special tensors used to store mutable state
```
initial_value = tf.random.normal(shape=(2, 2))
a = tf.Variable(initial_value)
```
* To assign a new value to a Variable, use `.assign()`
```
new_value = tf.random.normal(shape=(2, 2))
a.assign(new_value)
```
---
* In-place add / subtract for a Variable
```
added_value = tf.random.normal(shape=(2, 2))
a.assign_add(added_value)  # add in place
a.assign_sub(added_value)  # subtract in place
```
---
#### ==Computing gradients with **GradientTape**==
```
a = tf.constant([3.])
b = tf.constant([4.])
with tf.GradientTape() as tape:
    tape.watch(a)  # record operations applied to `a`; constants must be watched explicitly (tf.Variable objects are watched automatically)
    c = tf.sqrt(tf.square(a) + tf.square(b))  # do some math using `a`
# What's the gradient of `c` with respect to `a`?
dc_da = tape.gradient(c, a)
```
### **==Keras API==**

Using Keras to load dataset:
```
fashion_mnist = tf.keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test_full, y_test) = fashion_mnist.load_data()
```
---
Creating a validation set:
```
X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test_full / 255.0
```
---
```
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
"Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
```
#### **The sequential API**
* composed of a single stack of layers connected sequentially
!!! **==EACH neuron has 1 bias==**
**==NOTE==**: `sparse_categorical_crossentropy` is used when the target class is an integer, e.g. 0 to 9. If instead we had one-hot vectors, we would need to use the `categorical_crossentropy` loss instead.
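A minimal Sequential sketch for the Fashion MNIST data loaded above (the 300/100 hidden-layer sizes and 5 epochs are arbitrary choices):
```
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),      # 28x28 image -> vector of 784
    tf.keras.layers.Dense(300, activation="relu"),       # each neuron has its own bias
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")      # one probability per class
])
model.compile(loss="sparse_categorical_crossentropy",    # integer targets 0-9
              optimizer="sgd",
              metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=5,
                    validation_data=(X_valid, y_valid))
```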
#### **Functional API**
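A minimal sketch of the same kind of classifier with the Functional API: layers are called like functions on tensors, which also allows non-sequential topologies (sizes are illustrative):
```
inputs = tf.keras.Input(shape=(28, 28))
x = tf.keras.layers.Flatten()(inputs)
x = tf.keras.layers.Dense(100, activation="relu")(x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
```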

#### **Subclassing API**: build a custom forward pass
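A minimal subclassing sketch (the class name and layer sizes are made up); the forward pass is whatever you write in `call()`:
```
class MyModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.flatten = tf.keras.layers.Flatten()
        self.hidden = tf.keras.layers.Dense(100, activation="relu")
        self.out = tf.keras.layers.Dense(10, activation="softmax")

    def call(self, inputs):               # custom forward pass
        x = self.flatten(inputs)
        x = self.hidden(x)
        return self.out(x)

model = MyModel()
```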
## Neural Network from scratch (Wed, 6/1/2021)
### The cost $J$ (binary cross-entropy) is as follows:
$$J = - \frac{1}{m}\sum\left( Y \log\left(A^{[2]}\right) + (1-Y)\log\left(1- A^{[2]}\right) \right) \tag{6}$$
### STEPS to build a neural network
1. Define the neural network structure ( # of input units, # of hidden units, etc).
2. Initialize the model's parameters
3. Loop (one iteration is sketched after this list):
- Implement forward propagation
- Compute loss
- Implement backward propagation to get the gradients (with the help of Tensorflow GradientTape)
- Update parameters (one step of gradient descent)
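A minimal sketch of one iteration for a 2-layer network (the tanh hidden activation, variable names, and learning rate are assumptions; W1, b1, W2, b2 are the `tf.Variable` parameters initialized as in the next subsection):
```
def train_step(X, Y, W1, b1, W2, b2, lr=0.01):
    with tf.GradientTape() as tape:
        # forward propagation
        Z1 = tf.matmul(W1, X) + b1
        A1 = tf.tanh(Z1)
        Z2 = tf.matmul(W2, A1) + b2
        A2 = tf.sigmoid(Z2)
        # compute loss: binary cross-entropy, equation (6)
        m = tf.cast(tf.shape(Y)[1], tf.float32)
        J = -tf.reduce_sum(Y * tf.math.log(A2) + (1 - Y) * tf.math.log(1 - A2)) / m
    # backward propagation: gradients of the loss w.r.t. the parameters
    grads = tape.gradient(J, [W1, b1, W2, b2])
    # update parameters: one step of gradient descent
    for param, grad in zip([W1, b1, W2, b2], grads):
        param.assign_sub(lr * grad)
    return J
```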
### Initialize the model's parameters
* initialize the weight matrices with random values (see the sketch after this list).
- Use: ```tf.random.uniform(shape=(a, b), minval=0, maxval=0.01)``` to randomly initialize a matrix of shape (a,b).
* initialize the bias vectors as zeros.
- Use: ```tf.zeros(shape=(a, b))``` to initialize a matrix of shape (a,b) with zeros.
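Putting both rules together for a 2-layer network (layer sizes are illustrative; wrapping everything in `tf.Variable` makes the parameters trainable and watched by GradientTape):
```
n_x, n_h, n_y = 2, 4, 1    # example sizes: input, hidden, output units

W1 = tf.Variable(tf.random.uniform(shape=(n_h, n_x), minval=0, maxval=0.01))
b1 = tf.Variable(tf.zeros(shape=(n_h, 1)))
W2 = tf.Variable(tf.random.uniform(shape=(n_y, n_h), minval=0, maxval=0.01))
b2 = tf.Variable(tf.zeros(shape=(n_y, 1)))
```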
## Intro to TensorFlow/Keras (Thu, 7/1/2021)
train: repeat ---> shuffle ---> batch ---> prefetch
test: batch ---> prefetch
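A minimal `tf.data` sketch of both pipelines (buffer/batch sizes are arbitrary; `X_train` etc. as prepared in the Keras section):
```
train_ds = (tf.data.Dataset.from_tensor_slices((X_train, y_train))
            .repeat()                       # loop over the data indefinitely
            .shuffle(buffer_size=10_000)    # shuffle within a buffer
            .batch(32)
            .prefetch(1))                   # prepare the next batch while the current one trains

test_ds = (tf.data.Dataset.from_tensor_slices((X_test, y_test))
           .batch(32)
           .prefetch(1))
```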
!!!**==weights and biases need to be tf.Variable objects to be watched automatically by GradientTape==**!!!
## Fine tuning DNN/cheatsheet (Fri, 8/1/2021)
### Mini batch
* Stochastic gradient descent: batch size 1, i.e. one example at a time per step
* Gradient descent: batch size m, i.e. all the training data is used for a single step
* **==Mini-Batch gradient descent==**: a batch of a fixed number of training examples per step (see the sketch below)
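In Keras this is just the `batch_size` argument of `fit()` (assuming the compiled model from Tuesday's notes):
```
model.fit(X_train, y_train, epochs=5, batch_size=32)   # mini-batch gradient descent
# batch_size=1            -> stochastic gradient descent
# batch_size=len(X_train) -> (full-)batch gradient descent
```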

### Learning rate
#### **==Impact of learning rate==**

---> start with a **small** learning rate, then increase it gradually
##### Vanishing/exploding gradient problems

##### Learning rate decay
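One way to get learning rate decay in Keras (the numbers are illustrative):
```
# learning rate shrinks by `decay_rate` every `decay_steps` steps
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=10_000,
    decay_rate=0.9)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)
```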

### Prevent Overfitting
==**Training loss/ Validation loss**==

==**TECHNIQUES TO FIX**==
1. Early stopping: stop training before we have a chance to overfit

2. Dropout: during training, randomly set some activations to 0 ---> forces the network not to rely on any single node
3. Weight regularization - L1 & L2 (combined with dropout and early stopping in the sketch after this list)

4. Data augmentation: horizontal flips, color, contrast
5. Normalize inputs: rescale them to a standard (normal) distribution before passing them through the activation function
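A sketch combining techniques 1-3 in Keras (regularization factor, dropout rate, and patience are arbitrary choices):
```
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(300, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(0.01)),  # L2 weight penalty
    tf.keras.layers.Dropout(0.2),    # randomly zero 20% of activations, training time only
    tf.keras.layers.Dense(10, activation="softmax")
])
early_stopping = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_valid, y_valid), callbacks=[early_stopping])
```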
### Optimizers (how parameters are updated)
#### Faster optimizers
* **==Momentum==** helps gradient descent escape local minima (Keras constructors for these are sketched after this list)
* **==Adam==** (combines momentum with adaptive per-parameter learning rates)
* RMSprop
* SGD
* SGD + Momentum
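The corresponding Keras constructors (learning rates shown here are just example values):
```
tf.keras.optimizers.SGD(learning_rate=0.01)                 # plain SGD
tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)   # SGD + Momentum
tf.keras.optimizers.RMSprop(learning_rate=0.001)
tf.keras.optimizers.Adam(learning_rate=0.001)               # momentum + adaptive learning rates
```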