---
title: WEEK 8 (11-15/1/2021)
---

# WEEK 8 (11-15/1/2021)
## Convolutional Neural Network (MON, 11/1/2021)
Computer vision problems:
* Single object: classification, classification + localisation (where the object is)
* Multiple objects: object detection, instance segmentation (where the object is at the pixel level)
* Neural style transfer

Audio problems: music, speech,...

### Disadvantage of dense/fully connected layers
For large, high-resolution images with millions of features/pixels, even the first dense layer would have far too many parameters to train ---> risk of ==OVERFITTING==
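A quick back-of-the-envelope check (the image and layer sizes here are illustrative assumptions, not from the lecture):
```python
# A minimal sketch: parameter count of a single dense layer
# on a flattened 224x224x3 image (sizes are illustrative).
h, w, c = 224, 224, 3          # input image shape
n_inputs = h * w * c           # 150,528 features after flattening
n_units = 1000                 # assumed width of the first dense layer

# Each unit has one weight per input plus a bias.
n_params = n_inputs * n_units + n_units
print(f"{n_params:,} parameters")  # ~150.5 million -> easy to overfit
```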

### Convolutional Neural Network
![](https://i.imgur.com/iojGDQS.png)

#### ==**Convolution Operation**==:
![](https://i.imgur.com/IrsL5cT.png)

---> ==**Summary**==:
* Convolutions are useful layers for processing images
* They scan images to detect useful features
* They are element-wise products and sums (see the sketch below)
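A minimal NumPy sketch of that scanning, multiplying and summing (the 5x5 image and vertical-edge filter are made up for illustration):
```python
import numpy as np

# Slide a filter over the input and take element-wise products
# and sums at each position (stride 1, no padding).
def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_filter = np.array([[1., 0., -1.]] * 3)  # simple vertical-edge detector
print(conv2d(image, edge_filter).shape)      # (3, 3)
```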

##### **Stride**: how far the filter jumps to its next position while scanning the input (default = 1)
---> can help prevent overfitting by skipping over noisy pixels, and improves run time

##### **Padding**: same (output shape equals input shape), valid (no padding)
---> treats all pixels (including those at the edges) as equally important
![](https://i.imgur.com/RKOeobO.png)
WITH PADDING
![](https://i.imgur.com/hUrBqKv.png)

##### FORMULA TO CALCULATE OUTPUT SHAPE
![](https://i.imgur.com/pdpEyYK.png)
Formula for padding=same:
![](https://i.imgur.com/2sZluzU.png)
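For reference, the standard formulas the images show (n = input size, f = filter size, p = padding, s = stride):

$$
n_{out} = \left\lfloor \frac{n_{in} + 2p - f}{s} \right\rfloor + 1
$$

and for padding=same (stride 1):

$$
p = \frac{f - 1}{2}
$$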

#### **Max Pooling layer**
![](https://i.imgur.com/fItaw35.png)
---> reduces computational cost, keeps the most important features
!!! not popular for the ==last== pooling layer (see global average pooling below)

#### **Average Pooling layer**
![](https://i.imgur.com/0863SoT.png)
Average pooling smooths out the image, so sharp features may not be identified when this pooling method is used.

#### **Global average pooling** better than **global ==max== pooling**
![](https://i.imgur.com/nqlKH8V.png)
e.g. input of SHAPE 7x7x5 ---> output of shape 5 (one average per channel)
---> use for the ==**LAST**== layer (see the sketch below)
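A minimal Keras sketch of the two pooling styles (the feature-map sizes are illustrative):
```python
import tensorflow as tf

# Max pooling inside the network, global average pooling as the
# LAST pooling step before the classification head.
x = tf.random.normal((1, 14, 14, 5))                    # a batch of one feature map
pooled = tf.keras.layers.MaxPool2D(pool_size=2)(x)      # -> (1, 7, 7, 5)
gap = tf.keras.layers.GlobalAveragePooling2D()(pooled)  # -> (1, 5): one average per channel
print(pooled.shape, gap.shape)
```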

### CNN advantages
* Parameter sharing: sets of parameters are forced to be equal --- the same filter weights are reused at every position of the image
* Sparsity of connections: in each layer, each output value depends only on a few input values. This helps when training with smaller training sets and is less prone to overfitting.
* Capturing translation invariance: the image can shift a bit and the predictions stay stable

## Applications of CNN
### Localization
![](https://i.imgur.com/QlCsWIZ.jpg)


### Object detection 
Detecting multiple objects by sliding a CNN across the image
![](https://i.imgur.com/3phGDg9.png)

### Semantic segmentation 
Each pixel is classified according to the class of the object it belongs to 
![](https://i.imgur.com/fTKBq7s.png)

* Self-driving cars (e.g. Uber)
* Image processing (image signal processor - ISP)


==**USUALLY**==: going deeper into the network, the first two (spatial) dimensions of the output shrink while the third (channel) dimension grows (often doubles)


## Tensorflow Keras cheatsheet (Tue, 12/01/2021)
### ==**ORDER OF MODEL**==:
1. Conv (CNN layer)
2. BatchNorm
3. Activation
4. Dropout at the end (see the sketch below)
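A minimal Keras sketch of this ordering (the filter count and dropout rate are illustrative assumptions):
```python
import tensorflow as tf
from tensorflow.keras import layers

# Conv -> BatchNorm -> Activation, with Dropout at the end of the block.
block = tf.keras.Sequential([
    layers.Conv2D(32, 3, padding="same", input_shape=(32, 32, 3)),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Dropout(0.25),
])
block.summary()
```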

### Batch normalization
To calculate the non-trainable parameters of batch norm: each BatchNorm layer has 4 parameters per channel (gamma, beta, moving mean, moving variance); only gamma and beta are trainable, so add up all BatchNorm params and divide by 2 (see the sketch below).
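A quick check of that rule in Keras (the channel count is an arbitrary example):
```python
import tensorflow as tf

# A BatchNorm layer over 32 channels has 4 params per channel:
# gamma + beta (trainable), moving mean + variance (non-trainable).
bn = tf.keras.layers.BatchNormalization()
bn.build((None, 32))
print(bn.count_params())                                   # 128 total
print(sum(w.shape[0] for w in bn.non_trainable_weights))   # 64 = 128 / 2
```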

==**Popular pipeline**==
```python
train_dataset = (train_dataset
                 .map(preprocess)    # apply preprocessing to every example
                 .cache()            # keep preprocessed examples in memory
                 .repeat()           # loop the data for multiple epochs
                 .shuffle(1024)      # shuffle with a 1024-example buffer
                 .batch(BATCH_SIZE)
                 .prefetch(tf.data.experimental.AUTOTUNE))  # overlap data prep with training

test_dataset = (test_dataset
                .map(preprocess)
                .batch(BATCH_SIZE)
                .prefetch(tf.data.experimental.AUTOTUNE))
```

Splitting train and validation with ImageDataGenerator (see the sketch below):
1. Set the right subset: "validation" for the validation generator, "training" for the training generator
2. Set the same seed for both
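A minimal sketch of that split (the `data/` directory, image size and split fraction are hypothetical):
```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# One generator, two flows -- same seed, different subset.
datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)

train_gen = datagen.flow_from_directory(
    "data/",                 # hypothetical image directory
    target_size=(150, 150),
    subset="training",
    seed=42,
)
val_gen = datagen.flow_from_directory(
    "data/",
    target_size=(150, 150),
    subset="validation",
    seed=42,                 # same seed so the split is consistent
)
```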
---

TUESDAY NOTE IN NOTE

---
## RNN & LSTM (WED, 13/1/2021)

### Feed Forward Networks (DNN)
#### Disadvantages: 
* Assume all inputs (and outputs) are independent of each other
* Do not work with sequential (time-series) data (e.g. predicting the next word in a sentence, reading books, understanding lyrics)
* Sequential tasks need the previous outputs to make predictions
* Don't share features learned across different positions of the text

### RNN (Recurrent Neural Network)
==**Standard RNN**==
![](https://i.imgur.com/wBqwLUz.png)

==**RNN types**==
1. one to one
2. one to many
3. many to one
4. many to many

#### RNN applications
* Speech recognition
* Music generation 
* Sentiment classification 
* DNA sequence analysis
* Machine translation 
* Video activity recognition 
* Named entity recognition
(**One-hot encoder**: quickly turns words into vectors)

##### **==Main problems of RNN==**
##### ==**PROBLEM OF LONG-TERM DEPENDENCIES**==
* Take longer to train, take more resources
* Vanishing gradient problem (worse than in DNNs) because of Backpropagation Through Time (all the way back to t=0)
### LSTM (Long Short Term Memory)
![](https://i.imgur.com/gfmBWTi.png)
#### ==**3 main gates**==(regulators)
![](https://i.imgur.com/xUGh3SI.png)
* Step 1 - FORGET GATE:
    * decides what details to throw away from the cell state
    * decided by the ==Sigmoid Function==
    * looks at the previous state h_{t-1} and the current input x_t, and outputs a number between 0 (omit this) and 1 (keep this) for each number in the cell state C_{t-1}
* Step 2 - INPUT GATE:
    * decides which values from the input should be stored in the memory
    * a sigmoid decides which values to let through (0,1), i.e. which values we'll update (i_t)
    * a tanh function creates candidate values C~_t, ranging from -1 to 1, whose importance is weighted by i_t
    * combine these 2 to create an update to the state
![](https://i.imgur.com/8msa9Cy.png)

* Step 3 - REMOVE AND ADD new info to the cell state:
    * after f_t (what to forget) and i_t (what to add) are decided
    * drop the old info and add the new info to the cell state
    * each candidate value C~_t is scaled by how much we decided to update that state value
![](https://i.imgur.com/QFWC3pJ.png)

* Step 4 - OUTPUT GATE:
    * the input and the memory of the cell state are used to decide the output
    * a sigmoid decides which values to let through (0,1)
    * a tanh weighs the values of the cell state, deciding their level of importance from -1 to 1, and is multiplied with the output of the sigmoid
![](https://i.imgur.com/QnPO9F9.png)
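Putting the four steps together, these are the standard LSTM equations (σ = sigmoid, * = element-wise product):

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) && \text{(candidate values)} \\
C_t &= f_t * C_{t-1} + i_t * \tilde{C}_t && \text{(cell state update)} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
h_t &= o_t * \tanh(C_t) && \text{(hidden state / output)}
\end{aligned}
$$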

---> Summary
1. RNN/LSTM/GRU are good for ==time series data==
2. LSTM/GRU are better than plain RNN
3. Easy to write in TensorFlow but slow to train

### Tokenizer
1. Filters out special characters (!, >, :, ...)
2. Gets rid of words it doesn't know ---> use an `oov_token` (see the sketch below)

**==Before the embedding layer: (batch_size, max_len)
After the embedding layer: (batch_size, max_len, embedding_size)==**
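A minimal Keras sketch tying the Tokenizer, `oov_token` and the embedding shapes together (the texts and sizes are made up):
```python
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Tokenizer with an oov_token for unknown words, then an
# Embedding layer and the shapes before/after it.
texts = ["the cat sat", "the dog ran fast"]
max_len, embedding_size = 5, 8

tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
seqs = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=max_len)
print(seqs.shape)                       # (batch_size, max_len)

emb = tf.keras.layers.Embedding(input_dim=100, output_dim=embedding_size)
print(emb(seqs).shape)                  # (batch_size, max_len, embedding_size)
```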

---
## Time-Series (THU, 14/1/2021)
### Time-Series: a series of data points ordered in time
---> How well can we predict?
* How well do we understand the factors that contribute to it?
* How much data is available?
* The forecast can affect the thing we are trying to forecast
![](https://i.imgur.com/fLeqxGY.png)


#### ==**Components of a Time Series**==
![](https://i.imgur.com/wGw7MZL.png)

1. Trend
2. Seasonality
3. Noise (unforeseen factors that randomly influence the data ---> no way to predict white noise, it's completely random)

#### **==Stationary vs Non-stationary==**
* A series is ==**non-stationary**== if its ==**MEAN**== or ==**VARIANCE**== changes over time
* For a stationary series, the covariance should not be a function of time (e.g. the spread should not become tighter as time increases)
![](https://i.imgur.com/Gf4UKwq.png)
![](https://i.imgur.com/MxpuWY8.png)
==**SEASONALITY IS NON-STATIONARY**==

---> **==Autocorrelation==**: used to check whether data is stationary or not
* If the autocorrelation stays high (decays slowly), the series is likely non-stationary; peaks in the autocorrelation curve reveal the **==seasonality lag==** (see the sketch below)
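A minimal pandas sketch of that check (the two toy series are made up: white noise vs an upward trend):
```python
import numpy as np
import pandas as pd

# Compare the lag-1 autocorrelation of white noise (stationary)
# vs a trending series (non-stationary).
rng = np.random.default_rng(0)
noise = pd.Series(rng.normal(size=200))
trend = pd.Series(np.arange(200) + rng.normal(size=200))

print(noise.autocorr(lag=1))  # near 0 -> drops off quickly, stationary
print(trend.autocorr(lag=1))  # near 1 -> stays high, non-stationary
```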

#### Applications:
Stats, pattern recognition, mathematical finance, econometrics, weather, earthquake prediction, astronomy,...

#### **==Time series forecasting for imputed data==**

## **==REVIEW==**
structured vs unstructured data
---> deep learning is good for unstructured data

### **CNN**:
pooling: input shape, pool_size, stride ---> output shape
conv: input shape, filter size f, stride s ---> output shape

==Overfitting==: val accuracy is lower than train accuracy
---> more data, less complex model, early stopping, regularization, dropout, augmentation, ...

==Underfitting==:
---> train longer, more complex model, less regularization

Validation: ==unseen data== used for tuning
Training: data the model learns from
Test: ==unseen data== used to assess the final performance

### ImageDataGenerator augmentation (done in memory to avoid overflowing the disk)
---> more variation in the data, harder-to-learn data

### Transfer learning: exclude the top layers (see the sketch below)
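A minimal Keras sketch (the choice of MobileNetV2 and the 10-class head are illustrative assumptions):
```python
import tensorflow as tf

# Load a pretrained network WITHOUT its top layers, freeze it,
# and add a new classifier head on top.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # keep the pretrained features fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),  # hypothetical 10 classes
])
```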

Dropout: if the rate is too high ---> accuracy goes down

Weight and bias initialization: the bias can be zero, the weights cannot!!
weights = 0 ---> symmetry problem ---> FIX: initialize with small numbers different from zero

==VERY LARGE weights== ---> ==vanishing gradient problem==, e.g. the sigmoid's gradient approaches zero (saturation)

Batch learning: b = m
Mini-batch learning: b = 32
Stochastic learning: b = 1

Dimensionality reduction with PCA: curse of dimensionality

Activation function: graphs

Confusion matrix

Loss function (see the sketch below):
* Regression: MSE, MAE, RMSE
* Classification: binary cross entropy, cross entropy
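The same losses as Keras objects (note RMSE ships as a metric, not a built-in loss):
```python
import tensorflow as tf

# Picking the loss per task type.
mse = tf.keras.losses.MeanSquaredError()             # regression
mae = tf.keras.losses.MeanAbsoluteError()            # regression
rmse = tf.keras.metrics.RootMeanSquaredError()       # regression metric
bce = tf.keras.losses.BinaryCrossentropy()           # binary classification
cce = tf.keras.losses.CategoricalCrossentropy()      # multi-class classification
```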

Primary requirements:
Secondary requirements: 

Human-level performance: model as good as a human
To improve the model:
* more complex model
* decrease regularization

### RNN 
**many to one**: predicting the next word, sentiment analysis

**many to many**: named entity recognition, part-of-speech tagging (POS tagging)

## NLP 

word embedding:
* GloVe: embedding dimensions of 50, 100, 300 (call this a)
* vocab_size: m (e.g. 10,000)
* ---> m >> a

using GloVe: words not in its vocabulary (unknown words) get mapped to a neutral vector






