# WEEK 8 (11-15/1/2021)
## Convolutional Neural Network (MON, 11/1/2021)
Computer vision problems:
* Single object: classification, classification + localisation (where the object is)
* Multiple objects: object detection, instance segmentation (where the object is at the pixel level)
* Neural style transfer
Audio problems: music, speech,...
### Disadvantage of dense/fully connected layers:
For large, high-resolution images with millions of features/pixels, even the first dense layer needs far too many parameters to train ---> risk of ==OVERFITTING==
### Convolutional Neural Network

#### ==**Convolution Operation**==:

---> ==**Summary**==:
* Convolutions are useful layers for processing images
* Scan images to detect useful features
* Are element-wise products and sums (see the sketch below)
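A minimal NumPy sketch of the "element-wise products and sums" idea, using a hypothetical 4x4 input and a 3x3 filter (stride 1, no padding):

```python
import numpy as np

# Hypothetical 4x4 input image and 3x3 filter (stride 1, no padding)
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)  # simple vertical-edge detector

out_h = image.shape[0] - kernel.shape[0] + 1   # 4 - 3 + 1 = 2
out_w = image.shape[1] - kernel.shape[1] + 1
output = np.zeros((out_h, out_w))

for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + 3, j:j + 3]
        output[i, j] = np.sum(patch * kernel)  # element-wise product, then sum

print(output.shape)  # (2, 2)
```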
##### **Stride**: how far the filter jumps to its next position while scanning the input (default = 1)
---> can reduce overfitting (skips over noisy pixels) and improves run time
##### **Padding**: same (output has the same spatial shape as the input), valid (no padding)
---> padding treats border pixels as equally important

(figure: convolution WITH PADDING)

##### FORMULA TO CALCULATE OUTPUT SHAPE

output_size = floor((n + 2p - f) / s) + 1, where n = input size, f = filter size, p = padding, s = stride

Formula for padding=same (so that output size = input size when stride = 1): p = (f - 1) / 2
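A quick check of the formula with Keras, assuming a hypothetical 28x28x1 input and 3x3 filters:

```python
import tensorflow as tf

x = tf.random.normal((1, 28, 28, 1))  # hypothetical 28x28 grayscale image

# valid padding, stride 1: floor((28 + 0 - 3) / 1) + 1 = 26
valid = tf.keras.layers.Conv2D(8, kernel_size=3, strides=1, padding="valid")(x)
print(valid.shape)    # (1, 26, 26, 8)

# same padding, stride 1: output spatial size equals input size
same = tf.keras.layers.Conv2D(8, kernel_size=3, strides=1, padding="same")(x)
print(same.shape)     # (1, 28, 28, 8)

# valid padding, stride 2: floor((28 - 3) / 2) + 1 = 13
strided = tf.keras.layers.Conv2D(8, kernel_size=3, strides=2, padding="valid")(x)
print(strided.shape)  # (1, 13, 13, 8)
```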

#### **Max Pooling layer**

---> reduce computational cost, keep the most important features
!!! not popular as the ==last== pooling layer (use global average pooling there instead)
#### **Average Pooling layer**

The average pooling method smooths out the image, so sharp features may not be identified when this pooling method is used.
#### **Global average pooling** better than **global ==max== pooling**

Example: a 7x7x5 feature map ---> output of shape 5 (one average per channel)
---> used for the ==**LAST**== pooling layer
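A small sketch of the pooling layers above, including a hypothetical 7x7x5 feature map fed to global average pooling:

```python
import tensorflow as tf

feature_map = tf.random.normal((1, 28, 28, 5))  # hypothetical feature map

# Max pooling keeps the strongest activation in each window
max_pooled = tf.keras.layers.MaxPooling2D(pool_size=2)(feature_map)
print(max_pooled.shape)  # (1, 14, 14, 5)

# Average pooling smooths each window
avg_pooled = tf.keras.layers.AveragePooling2D(pool_size=2)(feature_map)
print(avg_pooled.shape)  # (1, 14, 14, 5)

# Global average pooling: 7x7x5 -> one value per channel (used as the LAST pooling layer)
last_map = tf.random.normal((1, 7, 7, 5))
gap = tf.keras.layers.GlobalAveragePooling2D()(last_map)
print(gap.shape)  # (1, 5)
```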
### CNN advantages
* Parameter sharing: we force sets of parameters to be equal, i.e. the same filter is reused at every position of the image ---> far fewer parameters
* Sparsity of connections: in each layer, an output value in the hidden layer depends only on a few input values. This helps when training with smaller training sets and makes the model less prone to overfitting.
* Capturing translation invariance: the image can shift a bit and the network still recognises it
## Applications of CNN
### Localization

### Object detection
Detecting multiple objects by sliding CNN across the image

### Semantic segmentation
Each pixel is classified according to the class of the object it belongs to

* Self-driving cars (e.g. Uber)
* Image processing (image signal processor-ISP)
==**USUALLY**==: as we go deeper, the first two output dimensions (height, width) shrink while the third (channels) grows, often doubling
## Tensorflow Keras cheatsheet (Tue, 12/01/2021)
### ==**ORDER OF MODEL**==:
1. CNN
2. BatchNorm
3. Activation
4. Dropout at the end (see the sketch below)
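A minimal sketch of this ordering as a reusable Keras block (the function name and shapes are made up):

```python
import tensorflow as tf

def conv_block(x, filters, dropout_rate=0.25):
    """Conv -> BatchNorm -> Activation -> Dropout, in that order."""
    x = tf.keras.layers.Conv2D(filters, kernel_size=3, padding="same")(x)  # 1. CNN
    x = tf.keras.layers.BatchNormalization()(x)                            # 2. BatchNorm
    x = tf.keras.layers.Activation("relu")(x)                              # 3. Activation
    x = tf.keras.layers.Dropout(dropout_rate)(x)                           # 4. Dropout at the end
    return x

inputs = tf.keras.Input(shape=(32, 32, 3))   # hypothetical input shape
outputs = conv_block(inputs, filters=32)
model = tf.keras.Model(inputs, outputs)
```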
### Batch normalization
To calculate the non-trainable parameters of batch norm: add up all the parameters of the BatchNorm layers and divide by 2 (each layer has 4 parameters per channel: gamma and beta are trainable, the moving mean and moving variance are not).
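A quick check of that count, using a hypothetical BatchNorm over a 7x7x64 feature map (64 channels x 4 params = 256 total):

```python
import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
bn.build((None, 7, 7, 64))

print(bn.count_params())              # 256 total parameters
print(len(bn.trainable_weights))      # 2 tensors: gamma, beta           -> 128 params
print(len(bn.non_trainable_weights))  # 2 tensors: moving mean/variance  -> 128 params (= 256 / 2)
```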
==**Popular pipeline**==
```
# train: map -> cache -> repeat -> shuffle -> batch -> prefetch
train_dataset = train_dataset.map(preprocess).cache().repeat().shuffle(1024).batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)
# test: no caching/shuffling/repeating needed
test_dataset = test_dataset.map(preprocess).batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)
```
Split train and validation sets using ImageDataGenerator (sketch below):
1. Set the right subset: subset='validation' for val, subset='training' for train
2. Set the same seed for both
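A minimal sketch, assuming the images live in a hypothetical data/ directory with one sub-folder per class:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)

# Same directory, same seed; only the subset differs
train_gen = datagen.flow_from_directory(
    "data/", target_size=(224, 224), batch_size=32,
    subset="training", seed=42)
val_gen = datagen.flow_from_directory(
    "data/", target_size=(224, 224), batch_size=32,
    subset="validation", seed=42)
```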
---
TUESDAY NOTE IN NOTE
---
## RNN & LSTM (WED, 13/1/2021)
### Feed Forward Networks (DNN)
#### Disadvantages:
* Assume all inputs (and outputs) are independent of each other
* Does not work with sequential (time-series) data (e.g. predicting the next word in a sentence, reading books, understanding lyrics)
* Need to understand previous output to make predictions
* Doesn't share features learned across different positions of text
### RNN (Recurrent Neural Network)
==**Standard RNN**==

==**RNN types**==
1. one to one
2. one to many
3. many to one
4. many to many
#### RNN applications
* Speech recognition
* Music generation
* Sentiment classification
* DNA sequence analysis
* Machine translation
* Video activity recognition
* Named entity recognition
(**One hot encoder**: quickly turn words into vectors)
##### **==Main problems of RNN==**
##### ==**PROBLEM OF LONG-TERM DEPENDENCIES**==
* Take longer to train, take more resources
* Vanishing gradient problem (worse than in a DNN) because we have to do Backpropagation Through Time (all the way back to t=0)
### LSTM (Long Short Term Memory)

#### ==**3 main gates**== (regulators)

* Step 1- FORGET GATE:
    * decides what details to throw away from the cell state
    * decided by the ==Sigmoid Function==
    * looks at the previous hidden state h_{t-1} and the current input x_t, and outputs a number between 0 (omit this) and 1 (keep this) for each number in the cell state C_{t-1}
* Step 2- INPUT GATE:
    * decides which values from the input should be stored in the memory
    * a Sigmoid decides which values to let through (0, 1), i.e. which values we'll update
    * a Tanh function creates candidate values C~_t (ranging from -1 to 1), whose importance is weighted by i_t
    * combine these two to create an update to the state

* Step 3- REMOVE AND ADD new info to cell state:
    * after f_t (what to forget) and i_t (what to add) are decided
    * drop the old info and add the new info to the cell state
    * new cell state: C_t = f_t * C_{t-1} + i_t * C~_t, i.e. the candidate values scaled by how much we decided to update each state value

* Step 4- OUTPUT GATE:
    * the input and the memory of the cell state are used to decide the output
    * a Sigmoid decides which values to let through (0, 1)
    * a Tanh gives weightage to the values (ranging from -1 to 1) to decide the level of importance of the memory, and is multiplied with the output of the Sigmoid
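Putting the four steps together, these are the standard LSTM gate equations (σ = sigmoid, ⊙ = element-wise product), matching the steps above:

$$
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i) && \text{(input gate)}\\
\tilde{C}_t &= \tanh(W_C\,[h_{t-1}, x_t] + b_C) && \text{(candidate values)}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(new cell state)}\\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(C_t) && \text{(hidden state / output)}
\end{aligned}
$$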

---> Summary
1. RNN/LSTM/GRU are good for ==time series data==
2. LSTM/GRU are better than plain RNN
3. Easy to write using TensorFlow (sketch below) but slow to train
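A minimal Keras sketch; vocab_size, embedding_dim and max_len are hypothetical values:

```python
import tensorflow as tf

vocab_size, embedding_dim = 10_000, 64  # hypothetical values

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.LSTM(64),                        # swap in GRU(64) or SimpleRNN(64) to compare
    tf.keras.layers.Dense(1, activation="sigmoid"),  # e.g. sentiment classification (many-to-one)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```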
### Tokenizer
1. Filter out special characters (!, >, :,...)
2. Get rid of words it doesn't know ---> use an OOV token (oov_token) so they map to a placeholder instead
**==Before the embedding layer: (batch_size, max_len)
After the embedding layer: (batch_size, max_len, embedding_size)==** (sketch below)
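A minimal sketch with toy sentences and hypothetical max_len / embedding_size values:

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ["I love deep learning!", "RNNs read text sequentially."]
max_len, embedding_size = 6, 8  # hypothetical values

# Special characters are filtered out; unknown words map to the OOV token
tokenizer = Tokenizer(num_words=1000, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, maxlen=max_len)        # shape: (batch_size, max_len)

embedded = tf.keras.layers.Embedding(1000, embedding_size)(padded)
print(padded.shape, embedded.shape)  # (2, 6) and (2, 6, 8): (batch_size, max_len, embedding_size)
```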
---
## Time-Series (THU, 14/1/2021)
### Time-Series: a series of data points ordered in time
---> How well can we predict?
* How well do we understand the factors that contribute to it?
* How much data is available
* The forecast can affect the thing we are trying to forecast

#### ==**Components of a Time Series**==

1. Trend
2. Seasonality
3. Noise (unforeseen factors that randomly influence the data ---> no way to predict white noise, it's completely random)
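A common way to see these components is a decomposition plot, e.g. with statsmodels; a sketch using a made-up monthly series (trend + yearly seasonality + noise):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical monthly series: upward trend + yearly seasonality + noise
idx = pd.date_range("2015-01", periods=72, freq="M")
y = pd.Series(np.arange(72) + 10 * np.sin(2 * np.pi * idx.month / 12)
              + np.random.normal(0, 1, 72), index=idx)

result = seasonal_decompose(y, model="additive", period=12)
result.plot()  # panels: observed, trend, seasonal, residual (noise)
```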
#### **==Stationary vs Non-stationary==**
* Non-stationary if the ==**MEAN**== or ==**VARIANCE**== changes over time
* For a stationary series, the covariance should not be a function of time (e.g. the spread becoming narrower as time increases indicates non-stationarity)


==**SEASONALITY IS NON-STATIONARY**==
---> **==Autocorrelation==**: used to classify whether data is stationary or not (sketch below)
* If the autocorrelation stays high (decays slowly) the data is non-stationary; peaks at regular intervals in the autocorrelation curve mark the **==seasonality lag==**
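A sketch of checking this with an autocorrelation plot, reusing the hypothetical series y from the decomposition sketch above:

```python
from statsmodels.graphics.tsaplots import plot_acf

# `y` is the hypothetical monthly series built in the decomposition sketch above.
# A non-stationary / seasonal series shows autocorrelation that decays slowly,
# with repeated peaks at the seasonality lag (here every 12 steps).
plot_acf(y, lags=36)
```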
#### Applications:
Stats, pattern recognition, mathematical finance, econometrics, weather, earthquake prediction, astronomy,...
#### **==Time series forecasting for imputed data==**
## **==REVIEW==**
Structured vs unstructured data
Deep learning is especially good for unstructured data (images, text, audio)
### **CNN**:
Pooling: given the input shape, pool_size and stride ---> output shape
Convolution: given the input shape and a filter (f, s) ---> output shape
==Overfitting==: val accuracy is lower than train accuracy
---> more data, less complex model, early stopping, regularization, dropout, augmentation,... (sketch below)
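A sketch of a few of these remedies in Keras (early stopping, dropout, L2 regularization), on a made-up model:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # regularization
    tf.keras.layers.Dropout(0.3),                                              # dropout
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                              restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=[early_stop])
```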
==Underfitting==: accuracy is low even on the training set
---> train longer, use a more complex model, less regularization
Validation: ==unseen data== used to tune hyperparameters during training
Training: data used to fit the model
Test: assess the final performance on ==unseen data==
### Image data augmentation (done in memory to avoid overflowing the disk)
---> more variation in the data, harder for the model to memorise the data
### Transfer learning: exclude top layers
Dropout: if the rate is too high ---> accuracy goes down
Weight and bias initialization: the bias can be zero, the weights cannot!!
weights = 0 ---> symmetry problem ---> FIX: initialise the weights to small random numbers different from zero
==VERY BIG weights== ---> ==vanishing gradient problem==, e.g. the sigmoid saturates and its gradient approaches zero
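A sketch of the usual fix in Keras (small random weights, zero biases; these are in fact the Dense layer defaults):

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(
    64,
    kernel_initializer=tf.keras.initializers.GlorotUniform(),  # small random weights, breaks symmetry
    bias_initializer=tf.keras.initializers.Zeros(),            # bias can safely start at zero
)
```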
Batch learning: b = m (the whole training set)
Mini-batch learning: b = 32 (typical)
Stochastic learning: b = 1
Dimensionality reduction (PCA): curse of dimensionality
Activation function: graphs
Confusion matrix
Loss function:
* Regression: MSE, MAE, RMSE
* Classification: binary cross entropy, cross entropy
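A sketch of how these losses map to model.compile, on made-up models:

```python
import tensorflow as tf

# Regression: MSE (or MAE) as the loss, RMSE/MAE as extra metrics
reg_model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
reg_model.compile(optimizer="adam", loss="mse",
                  metrics=[tf.keras.metrics.RootMeanSquaredError(), "mae"])

# Binary classification: binary cross entropy; multi-class: (sparse) categorical cross entropy
clf_model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
clf_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
```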
Primary requirements:
Secondary requirements:
Human level performance: as good as human
to improve model:
* complex model
* decrease regularization
### RNN
**many to one**: predict the next word, sentiment analysis.
**many to many**: named entity recognition, part-of-speech tagging (POS tagging)
## NLP
Word embedding:
* GloVe: embedding dimension a = 50, 100, or 300
* vocab_size: m (e.g. 10,000)
* ---> m >> a
When using GloVe, unknown words (not in the embedding vocabulary) get a neutral (e.g. zero) vector