# WEEK 8 (11-15/1/2021)

## Convolutional Neural Network (MON, 11/1/2021)

Computer vision problems:
* Single object: classification, classification + localisation (where the object is)
* Multiple objects: object detection, instance segmentation (where the object is in terms of pixel values)
* Neural style transfer

Audio problems: music, speech, ...

### Dense/fully connected layers
Disadvantage: large, high-resolution images have millions of features/pixels, so even the first dense layer ---> too many parameters to train ---> risk of ==OVERFITTING==

### Convolutional Neural Network
![](https://i.imgur.com/iojGDQS.png)

#### ==**Convolution Operation**==
![](https://i.imgur.com/IrsL5cT.png)

---> ==**Summary**==:
* Convolutions are useful layers for processing images
* They scan images to detect useful features
* They are element-wise products and sums

##### **Stride**: how far the filter jumps to the next position while scanning the input (default = 1)
---> helps prevent overfitting by skipping over noisy pixels, improves run time

##### **Padding**: same (output shape equals input shape), valid (no padding)
---> treats all pixels as equally important
![](https://i.imgur.com/RKOeobO.png)
WITH PADDING
![](https://i.imgur.com/hUrBqKv.png)

##### FORMULA TO CALCULATE OUTPUT SHAPE
![](https://i.imgur.com/pdpEyYK.png)
Formula for padding=same:
![](https://i.imgur.com/2sZluzU.png)

#### **Max Pooling layer**
![](https://i.imgur.com/fItaw35.png)
---> reduces computational cost, keeps the most important features
!!! not popular for the ==last== layer

#### **Average Pooling layer**
![](https://i.imgur.com/0863SoT.png)
Average pooling smooths out the image, so sharp features may not be identified when this pooling method is used.

#### **Global average pooling**: better than **global ==max== pooling**
![](https://i.imgur.com/nqlKH8V.png)
SHAPE: 7x7x5
---> used for the ==**LAST**== layer

### CNN advantages
* Parameter sharing: sets of parameters are forced to be equal (the same filter is reused across the whole image)
* Sparsity of connections: in each layer, each output value depends only on a few input values. This helps when training with smaller training sets and is less prone to overfitting.
* Capturing translation invariance: the image can shift a bit without throwing off the predictions

## Applications of CNN

### Localization
![](https://i.imgur.com/QlCsWIZ.jpg)

### Object detection
Detecting multiple objects by sliding a CNN across the image
![](https://i.imgur.com/3phGDg9.png)

### Semantic segmentation
Each pixel is classified according to the class of the object it belongs to
![](https://i.imgur.com/fTKBq7s.png)
* Uber self-driving cars
* Image processing (image signal processor - ISP)

==**USUALLY**==: the first two dimensions of the output shrink while the third dimension doubles

## Tensorflow Keras cheatsheet (Tue, 12/01/2021)

### ==**ORDER OF MODEL**== (see the sketch at the end of this cheatsheet):
1. CNN
2. BatchNorm
3. Activation
4. Dropout at the end

### Batch normalization
To calculate the non-trainable parameters for batch norm: add up all the params of the batch norm layers and divide by 2

==**Popular pipeline**==
```
train_dataset = (train_dataset
                 .map(preprocess)
                 .cache()
                 .repeat()
                 .shuffle(1024)
                 .batch(BATCH_SIZE)
                 .prefetch(tf.data.experimental.AUTOTUNE))
test_dataset = (test_dataset
                .map(preprocess)
                .batch(BATCH_SIZE)
                .prefetch(tf.data.experimental.AUTOTUNE))
```

Split val and train using ImageDataGenerator (see the sketch below):
1. Set the right `subset`: validation for val, training for train
2. Use the same seed for both
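A minimal sketch of the ImageDataGenerator split described above, assuming images live in a hypothetical `data/train/` directory organised by class (path, image size, and batch size are placeholders):

```python
import tensorflow as tf

# validation_split reserves a fraction of the images for the validation subset
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255,
    validation_split=0.2,
)

# Same directory and same seed for both generators; only `subset` differs
train_gen = datagen.flow_from_directory(
    "data/train/", target_size=(224, 224), batch_size=32,
    subset="training", seed=42,
)
val_gen = datagen.flow_from_directory(
    "data/train/", target_size=(224, 224), batch_size=32,
    subset="validation", seed=42,
)
```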
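A minimal sketch of the Conv → BatchNorm → Activation → Dropout ordering from the cheatsheet above; the filter counts, dropout rate, and input shape are arbitrary:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    # Conv2D without a built-in activation so BatchNorm sits between conv and activation
    layers.Conv2D(32, kernel_size=3, padding="same"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPooling2D(pool_size=2),                  # spatial dims shrink: 32x32 -> 16x16
    layers.Conv2D(64, kernel_size=3, padding="same"),  # filters double: 32 -> 64
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.GlobalAveragePooling2D(),                   # global average pooling as the last pooling layer
    layers.Dropout(0.5),                               # dropout at the end
    layers.Dense(10, activation="softmax"),
])
model.summary()
```

In `model.summary()`, half of each BatchNormalization layer's parameters (the moving mean and variance) are non-trainable, which is where the divide-by-two rule above comes from.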
---

TUESDAY NOTE IN NOTE

---

## RNN & LSTM (WED, 13/1/2021)

### Feed Forward Networks (DNN)
#### Disadvantages:
* Assume all inputs (and outputs) are independent of each other
* Do not work well with sequential (time-series) data (e.g. predicting the next word in a sentence, reading books, understanding lyrics)
* Need to understand previous outputs to make predictions
* Don't share features learned across different positions of the text

### RNN (Recurrent Neural Network)
==**Standard RNN**==
![](https://i.imgur.com/wBqwLUz.png)

==**RNN types**==
1. one to one
2. one to many
3. many to one
4. many to many

#### RNN applications
* Speech recognition
* Music generation
* Sentiment classification
* DNA sequence analysis
* Machine translation
* Video activity recognition
* Named entity recognition

(**One hot encoder**: quickly turn words into vectors)

##### **==Main problems of RNN==**
##### ==**PROBLEM OF LONG-TERM DEPENDENCIES**==
* Take longer to train, take more resources
* Vanishing gradient problem (worse than DNN) because of Backpropagation Through Time (all the way back to t=0)

### LSTM (Long Short Term Memory)
![](https://i.imgur.com/gfmBWTi.png)

#### ==**3 main gates**== (regulators)
(see the NumPy sketch at the end of this section)
![](https://i.imgur.com/xUGh3SI.png)
* Step 1 - FORGET GATE:
    * discovers which details should be thrown away from the cell state
    * decided by the ==Sigmoid Function==
    * looks at the previous state h_{t-1} and the current input x_t and outputs a number between 0 (omit this) and 1 (keep this) for each number in the cell state C_{t-1}
* Step 2 - INPUT GATE:
    * discovers which values from the input should be used to update the memory
    * a sigmoid decides which values to let through (0 to 1), i.e. which values we'll update
    * a tanh creates candidate values C~_t, ranging from -1 to 1, which weight the importance of what could be added
    * combine these two to create an update to the state
![](https://i.imgur.com/8msa9Cy.png)
* Step 3 - REMOVE AND ADD new info to the cell state:
    * after f_t (what to forget) and i_t (what to add) are decided
    * drop the old info and add the new info to the cell state
    * the candidate values are scaled by how much we decided to update each state value, giving C_t
![](https://i.imgur.com/QFWC3pJ.png)
* Step 4 - OUTPUT GATE:
    * the input and the memory of the cell state are used to decide the output
    * a sigmoid decides which values to let through (0 to 1)
    * a tanh gives weightage to the values, deciding their level of importance (-1 to 1), and is multiplied with the output of the sigmoid
![](https://i.imgur.com/QnPO9F9.png)

---> **Summary**
1. RNN/LSTM/GRU are good for ==time series data==
2. LSTM/GRU are better than plain RNN
3. Easy to write using TensorFlow, but slow to train

### Tokenizer
1. Filters out special characters (!, >, :, ...)
2. Gets rid of words it doesn't know

**==Before the embedding layer: (batch_size, max_len)
After the embedding layer: (batch_size, max_len, embedding_size)==**
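A minimal sketch of those shapes using the Keras `Tokenizer`, an `Embedding` layer, and an LSTM; the vocabulary size, `max_len`, and embedding size are arbitrary placeholders:

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["the movie was great", "the movie was terrible"]

# The tokenizer strips punctuation and only keeps words seen during fit
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=20)   # shape: (batch_size, max_len)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),  # -> (batch_size, max_len, embedding_size)
    tf.keras.layers.LSTM(32),                        # many-to-one: keep only the last hidden state
    tf.keras.layers.Dense(1, activation="sigmoid"),  # e.g. sentiment classification
])
print(model(padded).shape)                           # (2, 1)
```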
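Going back to the four gate steps above, a minimal NumPy sketch of a single LSTM time step; the parameter names and shapes are my own, not from the lecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the stacked parameters of all four gates."""
    z = W @ x_t + U @ h_prev + b          # pre-activations for the four gates
    f_z, i_z, o_z, g_z = np.split(z, 4)
    f_t = sigmoid(f_z)                    # Step 1: forget gate - what to throw away
    i_t = sigmoid(i_z)                    # Step 2: input gate - which values to update
    c_tilde = np.tanh(g_z)                #         candidate values in (-1, 1)
    c_t = f_t * c_prev + i_t * c_tilde    # Step 3: drop old info, add new info
    o_t = sigmoid(o_z)                    # Step 4: output gate
    h_t = o_t * np.tanh(c_t)              #         new hidden state
    return h_t, c_t

# Tiny usage example with random parameters
n_in, n_hidden = 3, 5
rng = np.random.default_rng(0)
h, c = lstm_step(
    rng.normal(size=n_in), np.zeros(n_hidden), np.zeros(n_hidden),
    rng.normal(size=(4 * n_hidden, n_in)),
    rng.normal(size=(4 * n_hidden, n_hidden)),
    np.zeros(4 * n_hidden),
)
print(h.shape, c.shape)   # (5,) (5,)
```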
---

## Time-Series (THU, 14/1/2021)

### Time-Series: a series of data points ordered in time

---> How well can we predict?
* How well do we understand the factors that contribute to it?
* How much data is available?
* The forecast can affect the thing we are trying to forecast

![](https://i.imgur.com/fLeqxGY.png)

#### ==**Components of a Time Series**==
![](https://i.imgur.com/wGw7MZL.png)
1. Trend
2. Seasonality
3. Noise (unforeseen factors that wrongly influence the data ---> no way to predict white noise, it is completely random)

#### **==Stationary vs Non-stationary==**
* Non-stationary if the ==**MEAN**== or ==**VARIANCE**== changes over time
* The covariance should not be a function of time (e.g. if the spread becomes tighter as time increases, the series is non-stationary)
![](https://i.imgur.com/Gf4UKwq.png)
![](https://i.imgur.com/MxpuWY8.png)
==**SEASONALITY IS NON-STATIONARY**==

---> **==Autocorrelation==**: used to check whether data is stationary or not
* When the data is stationary the autocorrelation is high; a repeating peak in the autocorrelation curve marks the **==seasonality lag==**

#### Applications
Statistics, pattern recognition, mathematical finance, econometrics, weather, earthquake prediction, astronomy, ...

#### **==Time series forecasting for imputed data==**

## **==REVIEW==**

Structured vs unstructured data: deep learning is good for unstructured data

### **CNN**
Pooling: input, pool_size, stride ---> output
Input, CNN (f, s) ---> output

==Overfitting==: val acc is lower than train acc ---> more data, less complex model, early stopping, regularization, dropout, augmentation, ...
==Underfitting==: ---> train longer, use a more complex model, less regularization

Validation: ==unseen data==
Training:
Test: assess the performance on ==unseen data==

### ImageDataAugmentation (in memory, to avoid overflowing the disk)
---> more variation in the data, harder to learn the data

### Transfer learning: exclude the top layers

Dropout: if the rate is too high ---> accuracy goes down

Weight and bias initialization: the bias can be zero, the weights cannot!!
weights = 0 ---> symmetry problem ---> FIX: initialize to small numbers different from zero
==VERY LARGE weights== ---> ==vanishing gradient problem==, e.g. the sigmoid saturates and its gradient approaches zero

Batch learning: b = m
Mini-batch learning: b = 32
Stochastic learning: b = 1

Dimensionality reduction (PCA): curse of dimensionality

Activation functions: graphs
Confusion matrix
Loss functions:
* Regression: MSE, MAE, RMSE
* Classification: binary cross entropy, cross entropy

Primary requirements:
Secondary requirements:
Human level performance: as good as a human

To improve the model:
* more complex model
* decrease regularization

### RNN
**many to one**: predict the next word, sentiment analysis.
**many to many**: named entity recognition, part-of-speech tagging (POS tagging)

## NLP
Word embedding:
* GloVe dimensions: 50, 100, 300, ... (call this a)
* vocab_size: m (e.g. 10,000)
* ---> m >> a
Using GloVe: unknown words ---> neutral vector
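A minimal sketch of using GloVe with Keras, where words missing from GloVe keep a neutral all-zeros row; the file `glove.6B.100d.txt` and the tiny `word_index` (normally `tokenizer.word_index`) are assumptions for illustration:

```python
import numpy as np
import tensorflow as tf

EMBEDDING_DIM = 100                  # a: GloVe vector size (50/100/300)
VOCAB_SIZE = 10_000                  # m: vocabulary size, m >> a
word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}   # e.g. tokenizer.word_index

# Parse the GloVe text file: each line is a word followed by its vector
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Rows default to zeros, so unknown words get a neutral vector
embedding_matrix = np.zeros((VOCAB_SIZE, EMBEDDING_DIM))
for word, i in word_index.items():
    if i < VOCAB_SIZE and word in embeddings_index:
        embedding_matrix[i] = embeddings_index[word]

embedding_layer = tf.keras.layers.Embedding(
    VOCAB_SIZE, EMBEDDING_DIM,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,                 # keep the pre-trained vectors frozen
)
```

Set `trainable=True` to fine-tune the GloVe vectors instead of keeping them frozen.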