---
tags: Master
---
# DEEP
## 8 - CNN
- supervised learning
- deal with locally structured data (images, language, etc VS arbitrary input features)
- learn hierarchy of features
- each layer extracts features from output of previous
- train all layers jointly
- convolution is inspired by nature
- working with inputs of variable size
two cell types (in the visual cortex):
- simple cells: respond to edge-like patterns
- complex cells: larger receptive fields, locally invariant to exact position of pattern
- convolution is an operation on two functions:
- input function
- kernel function
- output of convolution is referred to as "feature map"
- convolution is a filter operation for images
- a kernel matrix is applied to the image
- kernel size is arbitrary
- different sizes give different output dimensions
- different kernels extract different features
- the value of a central pixel is determined by summing the weighted values of its neighbours
- CNNs are multi-layer neural networks
- **local connectivity**: neurons in a layer are connected to only a few neurons of the previous layer
- **share weight params across spatial positions (kernels' weights)**:
- learning shift-invariant filter kernels
- reduce number of params (compared to fully connected)
### CNN architecture
- Multiple layers of feature extractors
- low-level layers extract local features
- high-level layers extract global patterns
- Typical classification
- input is a matrix or vector
- output is class prediction
- Some core operations
- Convolution
- main layer
- set of filters
- each filter is applied to the input, producing a feature map (stacked into a multi-channel output)
- filters are learned in a supervised manner by backpropagation on the classification error
- with N the input dimension, K the kernel dimension, and S the stride, the dimension of the feature map is $(N-K)/S + 1$
- to keep the output the same dimension as the input, use zero padding
- Non-linearity
- apply some activation function (sigmoid, relu, tanh, etc)
- increase non-linearity
- Pooling
- non-linear down-sampling (max, avg, ...)
- reduce spatial size of the representation
- reduce number of parameters and computation
- reduce overfitting
- Convolution + non-linearity + pooling = a "layer"
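The feature-map size formula above can be checked with a quick sketch; the padding term P (generalizing the note's formula to $(N-K+2P)/S+1$) is a standard extension, not from the notes:

```python
def feature_map_size(n: int, k: int, s: int = 1, p: int = 0) -> int:
    """Convolution output size: (N - K + 2P) / S + 1."""
    return (n - k + 2 * p) // s + 1

# 32x32 input, 5x5 kernel, stride 1, no padding -> 28x28
assert feature_map_size(32, 5) == 28
# zero padding of (K-1)/2 with stride 1 keeps the size unchanged ("same" padding)
assert feature_map_size(32, 5, s=1, p=2) == 32
# stride 2 roughly halves the map: (32 - 3 + 2)/2 + 1 = 16
assert feature_map_size(32, 3, s=2, p=1) == 16
```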
## 9 - CNN 2
1. **AlexNet**
- similar to LeNet
- convolutional layers followed by fully connected layers
- way bigger (~60M params)
- more data (10^6 images)
- GPU speed
- dropout and data augmentation
- won the ImageNet challenge (ILSVRC 2012)
- 4096-dimensional feature layer
- it learns **general features** (shown by plotting t-SNE of feature layer)
- **better than an external feature-extractor module**
- also works as a feature extractor for other tasks
2. ILSVRC 2013: **Zeiler and Fergus**
- similar to AlexNet
- **huge work to understand internals of CNNs**
- use a second **deconvolutional** network to visualize what a layer has learned as an image
- start from feature map
- unpool: max not reversible => keep memory of max locations
- rectify: use ReLU
- filter: inverse of convolution is transposed convolution
- useful for adversarial examples
- can also generate class model visualization (maximize for given class)
- applied to early layers you get low-level patterns
- higher in the hierarchy you get complex patterns (faces, text, etc.)
3. VGGNet
- similar to AlexNet
- same structure but much deeper
4. GoogLeNet
- no fully connected layers
- 22 layers
- only 4M parameters
- wider convolutions
- parallel paths with different receptive field sizes
- 1x1 convolutions
- intermediate softmax and loss to better train the network even at mid layers
5. ResNet
- up to 152 layers
- only 2M parameters
- small kernels (1x1 or 3x3)
- add identity bypass
- connections that skip a layer if not needed (residual connections/ residual block)
- too many layers without residuals gives poor performance
- not because of vanishing gradients
- not because of overfitting
- not because of representation power (more layers -> more powerful)
- in general the gradient flows better, so training is faster and better
- it resembles a shallower model (easier to train)
- batch normalization
- with many layers it is easy to get vanishing gradients
- batch norm helps mitigate this
- General principles derived:
- reduce filter size
- small kernels
- skip connections (residual blocks)
6. Resnet mods
- pre-activation ResNet
- same components as the original, in a different order
- batch norm and activation (ReLU) before the convolution
- a more direct path for the gradient
- wide resnet
- same block but more feature maps
- DenseNet
- every layer is connected to every following layer
- the input of a layer is the concatenation of the feature maps of all previous layers
- encourages feature reuse
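The core idea of a residual block (output = F(x) + x) can be sketched in NumPy; the two-matrix transform here is a stand-in for the real convolutional block, not the exact ResNet layout:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """y = relu(x + F(x)) with F(x) = w2 @ relu(w1 @ x).
    If w1 and w2 are ~0, the block reduces to (almost) the identity."""
    return relu(x + w2 @ relu(w1 @ x))

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
# zero weights: the skip connection passes the input straight through
w_small = np.zeros((8, 8))
y = residual_block(x, w_small, w_small)
assert np.allclose(y, relu(x))
```

This is why deep residual stacks train well: an unneeded layer can be "skipped" by driving its weights toward zero.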
## 10 - CNN 3
### Object proposal - selective search
- detect objects at any scale
- grouping criteria
- detect differences in color, texture etc
- needs to be fast
- bottom-up grouping of image regions
- unsupervised clustering
- initial segmentation (lots of fractured regions)
- some iterations of aggregation
- at each iteration search candidate objects in the segmentations
### Feature learning with CNN
1. R-CNN
- region proposal (selective search) + CNN
- use selective search to find good object candidates
- use these bounding boxes as input for a pre-trained CNN (AlexNet)
- the CNN is fast enough that lots of candidates can be used per image (at different scales)
- ~2k candidates per image
- inputs need to have a fixed dimension
- necessary to scale, crop, etc. the bounding boxes
- issue: multi-stage training
- fine-tuning on the current dataset
- train a classifier (e.g. SVM)
- refining stage: bounding-box regression
- no learning for selective search (it is a fixed algorithm)
- issue: for one image you need to process lots of proposals
- problem on memory
- problem on computational power
- issue: test is also slow
- find proposals
- run the CNN on 2k proposals
2. Fast R-CNN
- designed to be faster
- still using selective search
- add a new layer applied to each region of interest
- RoI pooling layer
- reshapes each region proposal to a fixed size
- no need to feed each box to the CNN: the image is processed just once
- a feature map for each object proposal
- optimization is a trade-off between the classification and regression losses
- the regression loss is the bounding-box regression loss
3. Faster R-CNN
- the detection network itself proposes objects
- no selective search
- instead, a new network (the Region Proposal Network) predicts region proposals
- higher performance
- because proposal generation is trained too
4. Mask R-CNN
- pixel-level segmentation
- similar to Faster R-CNN but:
- after identifying regions of interest, they pass through a RoIAlign layer
- more precise than the plain RoI pooling layer
- finds multiple bounding boxes and warps them into a fixed dimension
- as in Faster R-CNN, this is used for the classification and regression losses
- the output of the RoIAlign layer is fed to 2 CNNs to output a binary mask for each RoI
5. YOLO - you only look once
- different approach
- regression
- one convolutional network predicts both the boxes and the class of each box
- divide the image into an SxS grid
- m bounding boxes for each grid cell
- class probabilities for each box
- only boxes with probability higher than a threshold are used to identify objects
- much faster than previous approaches
- issue: spatial constraints
- difficult to detect small objects in the image
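The thresholding step above can be sketched as follows; the `(class_prob, box)` tuple format is a simplification of YOLO's actual confidence × class-score computation:

```python
def filter_boxes(boxes, threshold=0.5):
    """Keep only predicted boxes whose class probability exceeds the threshold."""
    return [(p, b) for (p, b) in boxes if p > threshold]

# hypothetical predictions: (probability, (x, y, w, h))
preds = [(0.9, (10, 10, 50, 50)), (0.3, (0, 0, 5, 5)), (0.7, (20, 20, 40, 40))]
kept = filter_boxes(preds)
assert [p for p, _ in kept] == [0.9, 0.7]
```

In the real pipeline, a non-maximum-suppression pass typically follows to merge overlapping survivors.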
### Segmentation
- pixel level prediction
- e.g. per-pixel depth estimation
- e.g. finding the pixels that define the borders of objects
- fully convolutional networks
- the output is not a class
- the output is an image with the same dimensions as the original
- upsampling layers are used to obtain this output
- a form of deconvolution
- basically a heatmap that highlights some objects
- this network can be trained in a supervised manner
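One simple upsampling option can be sketched as nearest-neighbour repetition (FCNs usually learn transposed convolutions instead; this is just the cheapest illustration):

```python
import numpy as np

def upsample_nearest(x, factor=2):
    """Repeat each pixel `factor` times along both spatial axes."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

fm = np.array([[1, 2],
               [3, 4]])
up = upsample_nearest(fm)
assert up.shape == (4, 4)
assert (up[:2, :2] == 1).all()  # each source pixel becomes a 2x2 block
```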
## 11 - RNN
Problem: the previous solutions have fixed-size inputs and outputs
Feed-forward nets are limited to single-to-single mappings
This does not work for document classification, sentiment analysis, image captioning
Solution: keep hidden memory state
Allows more mappings:
- multiple-to-single (document classification, sentiment analysis)
- single-to-multiple (image captioning)
- sequence-to-sequence / multiple-to-multiple (translation, video frame prediction)
Two representations:
- rolled: more compact
- unrolled: easier for backpropagation
### Vanilla RNN cell
- $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$, output $y_t = W_{hy} h_t$
- uses tanh: its derivative is simple, but it suffers from vanishing gradient
- the same cell (same weights) is repeated at every timestep
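One step of the vanilla cell, sketched in NumPy (weight shapes and the 5-step unroll are illustrative):

```python
import numpy as np

def rnn_step(x, h_prev, W_xh, W_hh, b):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t + b): the same weights at every timestep."""
    return np.tanh(W_hh @ h_prev + W_xh @ x + b)

rng = np.random.default_rng(0)
W_xh = rng.standard_normal((4, 3))   # input -> hidden
W_hh = rng.standard_normal((4, 4))   # hidden -> hidden (the recurrent weights)
b = np.zeros(4)
h = np.zeros(4)
for t in range(5):                   # "unrolled" over 5 timesteps
    h = rnn_step(rng.standard_normal(3), h, W_xh, W_hh, b)
assert h.shape == (4,) and np.all(np.abs(h) <= 1)  # tanh keeps values in [-1, 1]
```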
### Backpropagation Through Time (BPTT):
- most common method for RNNs
- similar to traditional backpropagation:
- treat unfolded RNN as a single feed-forward network (**forward pass**)
- calculate weight updates for each cell copy in the unfolded network, then sum or average weight updates and apply to the RNN weights (**backward pass**)
Problems:
- long chains mean a lot of derivatives => vanishing or exploding gradients
- can be computationally expensive as number of timesteps increases
### Truncated BPTT
A cheaper version of BPTT:
- split the chain into segments of length k1
- backprop within a segment, going back at most k2 timesteps
- move forward and repeat
### LSTM cell
Idea: add a **memory cell** that is not subject to matrix multiplication or squishing, thereby avoiding gradient decay
Composed of:
- **neuron with self-recurrent connection**: state can remain constant
- **gates** to modulate interactions with environment
- **input**: allow or block change to the current state
- **output**: allow an effect on other neurons
- **forget**: modulate self-recurrent connection, allowing to remember/forget previous state as needed
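The gate equations can be sketched in NumPy; this is a minimal version of the standard LSTM step with biases omitted and a single stacked weight matrix (an implementation convenience, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step. W maps [h_prev; x] to the four gate pre-activations."""
    z = W @ np.concatenate([h_prev, x])
    i, f, o, g = np.split(z, 4)                        # input, forget, output gates + candidate
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # memory cell: additive, no squashing of c_prev
    h = sigmoid(o) * np.tanh(c)                        # output gate modulates what other neurons see
    return h, c

rng = np.random.default_rng(0)
H, X = 4, 3
W = rng.standard_normal((4 * H, H + X))
h, c = lstm_step(rng.standard_normal(X), np.zeros(H), np.zeros(H), W)
assert h.shape == (H,) and c.shape == (H,)
```

Note how `c` is updated additively: that is the path along which gradients can flow without decaying.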
### GRU cell
- **reset gate**: modulates the previous hidden state
- **update gate**: weights the new value against the previous one
### Multi-layer RNNs
...
### Bi-directional RNNs
## Generative Models
- unsupervised approach
- learn the hidden structure of the data
### Autoencoders
- from the input data x obtain a vector z of features
- z is usually smaller than x
- we want the features to be meaningful
- train such that z can be used to reconstruct the original data
- encode → decode the image
- a regression loss (no labels needed) can be computed to train the model
- e.g. $\|x - \hat{x}\|^2$
- $\hat{x}$ is the decoded image
- x is the original image
- after training the model you can discard the decoder part and initialize a supervised model using z
- there are ways to obtain a better z
- e.g. add noise to the original image (denoising autoencoder)
- autoregressive models
- given a training set with some distribution, we want to generate new samples from that distribution
- p_data -> distribution of the training data
- p_model -> distribution we want to learn (similar to p_data)
- there are 2 branches:
- explicit density estimation: define p_model and sample data from it
- implicit density estimation: sample without explicitly defining p_model
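The autoencoder's reconstruction loss above can be sketched with a tiny linear encoder/decoder (plain matrices standing in for the real networks; shapes are illustrative):

```python
import numpy as np

def reconstruction_loss(x, x_hat):
    """||x - x_hat||^2: trainable without any labels."""
    return float(np.sum((x - x_hat) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W_enc = rng.standard_normal((3, 8))   # z is smaller than x (3 < 8): a bottleneck
W_dec = rng.standard_normal((8, 3))
z = W_enc @ x                          # encode
x_hat = W_dec @ z                      # decode
assert reconstruction_loss(x, x) == 0.0
assert reconstruction_loss(x, x_hat) >= 0.0
```

Training would adjust `W_enc`/`W_dec` by gradient descent on this loss; afterwards the decoder can be discarded and z reused for a supervised task.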
### pixel RNN
- explicit density model
- basic idea: the probability of a pixel depends on all the pixels before it:

$p(x) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})$

- training simply maximizes this likelihood
- this approach becomes a sequence problem
- start from the top-left corner; each pixel is generated one at a time, row by row
- the dependency on previous pixels is modelled with a two-dimensional LSTM
- the context used to compute the probability of a pixel can vary:
- horizontally (row by row)
- diagonally
- ...
- different approach: proceed with a diagonal neural network
- a pixel depends on the pixel above it in the same column plus the one before it in the same row
- the context matters because it affects the computational power required
- with the diagonal approach you can parallelize (compute along the diagonal)
- this is faster and more efficient
- main drawback: generation is still sequential, one pixel at a time
- particularly good at image completion
### pixel CNN
- same problem as pixel RNN
- use convolutional layers to model the context
- a pixel is determined by the pixels around it (within the kernel)
- start from the top-left corner
- training is faster
- convolutions can be parallelized
- kernels are learned during training
- generation is still sequential
- slow
- particularly good at image completion
- some improvements over the years
- dropout
- gated convolutional layers
- ...
- pixel CNN++
- logistic loss for training
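The "only pixels before it" constraint is enforced with a masked convolution kernel; a sketch of building the mask (the type-A variant, which also zeroes the centre, is an assumption about which mask is meant):

```python
import numpy as np

def causal_mask(k, mask_type="A"):
    """Zero out kernel weights at (type A) or after (type B) the centre, in raster-scan order."""
    mask = np.ones((k, k))
    c = k // 2
    mask[c, c + (mask_type == "B"):] = 0.0  # centre row: blind from the centre onwards
    mask[c + 1:, :] = 0.0                    # all rows below the centre
    return mask

m = causal_mask(3)
assert m.tolist() == [[1, 1, 1],
                      [1, 0, 0],
                      [0, 0, 0]]
```

Multiplying the learned kernel by this mask before every convolution guarantees each output only sees already-generated pixels.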
### variational autoencoders
- the exact density is not computed (intractable)
- a variational approximation of the distribution is used instead
- optimize a lower bound on the likelihood
- basic idea: the image $\hat{x}$ is generated from a latent factor z
- z has a prior distribution p(z)
- so $\hat{x}$ has a distribution p(x|z)
- the distribution of z is set to something simple (Gaussian)
- find the optimal parameters θ for these 2 distributions
- problem: the posterior is intractable:

$p(z|x) = \frac{p(x|z)\, p(z)}{p(x)}, \qquad p(x) = \int p(x|z)\, p(z)\, dz$

- so we find another distribution q(z|x) close to p(z|x)
- this is the distribution of a certain encoder
- remember that these two models (encoder q(z|x) and decoder p(x|z)) are probabilistic
- we can only find the parameters of these distributions
- we can use the distributions and the optimal parameters to sample what we need
- q(z|x) -> gives z
- p(x|z) -> gives x (generates image)
- considering the log likelihood we obtain:

$\log p_\theta(x) = \mathbb{E}_{z \sim q_\phi(z|x)}\!\big[\log p_\theta(x|z)\big] - D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big) + D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big)$
- the last term of the equation is intractable
- optimize only the lower bound (the first two terms; the last one is >= 0)
- first term: maximize the likelihood of the image being reconstructed
- second term: KL divergence, makes the encoder distribution close to the prior
- the generated data depends strictly on the components of z (z1, z2, ...) sampled from the distribution
- changing z changes the output (person smiling/angry, digit 0-9, etc.)
- if you have labels, nothing changes
- simply add y to the distributions
- p(x|z) -> p(x|z,y)
- main drawback: the generated images tend to be blurry
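Sampling z from the encoder's Gaussian is typically done with the reparameterization trick, which keeps the sampling step differentiable in the encoder's outputs; a minimal sketch (shapes illustrative):

```python
import numpy as np

def sample_z(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I); gradients can flow through mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu, log_var = np.zeros(2), np.zeros(2)   # a unit Gaussian posterior
zs = np.array([sample_z(mu, log_var, rng) for _ in range(2000)])
assert abs(zs.mean()) < 0.1              # empirical mean is close to mu = 0
```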
## GAN
- no explicit density function is modelled
- focus on the ability to sample from the model
- sampling from the high-dimensional training distribution directly is hard
- sample from a simple distribution instead
- random noise
- transform it to the training distribution with a neural net
- two networks
- generator
- try to fool the discriminator generating real like images
- discriminator
- distinguish between real and fake images
- train the networks jointly on a minimax objective:

$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}\!\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\!\big[\log\big(1 - D(G(z))\big)\big]$
- discriminator wants D(x) close to 1 (real) and D(G(z)) close to 0 (fake)
- maximization
- gradient ascent
- generator wants D(G(z)) close to 1 (generated image real)
- minimization
- gradient descent
- problem: when the image is clearly fake, the gradient of $\log(1 - D(G(z)))$ is close to 0 (flat)
- difficult to train the generator into generating better samples
- fix: switch to gradient ascent on the generator too
- separate in 2 parts
- for discriminator same as before
- for the generator, maximize instead:

$\max_G \; \mathbb{E}_{z \sim p(z)}\!\big[\log D(G(z))\big]$
- Full training: alternate gradient steps on the discriminator and on the generator at every iteration
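The two objectives can be sketched with scalar toy values, where `d_real` and `d_fake` stand for the discriminator's "probability of real" on a real and a generated image (a sketch, not a full training loop):

```python
import math

def d_loss(d_real, d_fake):
    """Discriminator ascends log D(x) + log(1 - D(G(z))); written here as a loss to minimize."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def g_loss(d_fake):
    """Non-saturating generator loss: maximize log D(G(z)) instead of minimizing log(1 - D(G(z)))."""
    return -math.log(d_fake)

# a confident discriminator (real -> 0.9, fake -> 0.1) has lower loss than a guessing one
assert d_loss(0.9, 0.1) < d_loss(0.5, 0.5)
# the generator's loss drops as it fools the discriminator more often
assert g_loss(0.8) < g_loss(0.2)
```

The non-saturating `g_loss` has a large gradient precisely when `d_fake` is small, which is the regime where the original formulation went flat.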
### DCGAN
- the first GANs were fully connected
- deep convolutional architecture
- able to generate higher-resolution images
- Discriminator:
- no pooling
- leaky ReLU for activations
- one fully connected before softmax output
- batch normalization
- used in the generator as well
- it is possible to perform arithmetic on latent representations
- the sum or difference of latent representations gives a new z
- use that z to produce a new image
### evaluating GAN
- samples are too realistic to be reliably judged by people
- problem: the density function is not known (unlike pixel-based and variational models)
- Inception Score
- a GAN needs to satisfy 2 things
- Saliency
- a human needs to be able to determine what is in the image
- Diversity
- the set of generated images needs to contain lots of different objects

$IS = \exp\Big(\mathbb{E}_{x}\big[D_{KL}\big(p(y|x) \,\|\, p(y)\big)\big]\Big)$

- p(y|x) is the posterior given by a classifier (e.g. InceptionNet, a human)
- if the classifier is able to understand the image, the entropy of p(y|x) is low
- p(y) is the marginal distribution of classes
- if there are lots of different objects, the entropy of p(y) is high
- problem:
- overfitting to the training set scores well
- a single image per class scores well (mode dropping)
- Frechet Inception Distance
- compare the distributions of real data and generated images passed through a classifier (InceptionNet etc.)
- correlated with visual quality of sample
- same overfitting problem
- in general GANs are difficult to train
- the generator loss does not correlate with the quality of the images
- very sensitive to hyperparameters
- the loss function has been changed over time to improve this
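The Inception Score can be computed from a matrix of classifier posteriors p(y|x), one row per generated image; a sketch with made-up posteriors:

```python
import numpy as np

def inception_score(p_yx):
    """IS = exp( E_x[ KL( p(y|x) || p(y) ) ] ); p_yx has one row of class posteriors per image."""
    p_y = p_yx.mean(axis=0)                                    # marginal class distribution p(y)
    kl = np.sum(p_yx * (np.log(p_yx) - np.log(p_y)), axis=1)   # per-image KL divergence
    return float(np.exp(kl.mean()))

# confident AND diverse posteriors: each image is a different class -> high score
diverse = np.eye(4) * 0.96 + 0.01
# every image classified the same way (mode collapse) -> score is exactly 1
collapsed = np.tile([[0.97, 0.01, 0.01, 0.01]], (4, 1))
assert inception_score(diverse) > inception_score(collapsed)
assert abs(inception_score(collapsed) - 1.0) < 1e-9
```

The collapsed case shows the known weakness: the score only measures the *marginal* class spread, so one perfect image per class already maximizes it.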
### Mode collapse
- the generator ends up modelling only a small subset of the training data
- the idea: if the generator manages to fool the discriminator with a certain image, it will produce that same image over and over
- but the best image to fool the discriminator changes over time
- the network gets locked in a cycle and does not converge
- Feature Matching:
- add a new loss for feature matching
- new goal: beat the opponent while matching features of real images
- Minibatch Discriminator
- append a vector to the dense layer of the discriminator that measures similarity between images in the batch
- if the similarity o(x) is too large, we are mode collapsing
- similar images are produced over and over
- the discriminator can then recognize mode collapse and penalize the generator
- Virtual Batch Norm
- with batch normalization, generated images within a batch are not independent
- compute a reference batch before training
- normalization parameters are computed on the reference batch and applied to the current batch
- higher computational complexity
- Label smoothing
- the discriminator does not output only 1 or 0 (real, fake)
- smooth the targets between 0 and 1 (e.g. 0.9 for real)
- gives better gradients
- Label information
- modify architecture to use labels
- labels as input for both the generator and the discriminator
- with z and y the generator produces an image x
- then x with its label y is given to the discriminator
- same for real images with their y
### Conditional GAN
- same loss as original GAN
- minimization and maximization
- use labels
- labels as an extension of the latent space
- D(x) -> D(x|y)
- G(z) -> G(z|y)
### InfoGAN
- the label is not given; it is treated as a latent factor or code
- the latent code defines the object represented in the image
- the noise affects variations of that object
- e.g. the code defines the object as a face
- the noise defines illumination, pose, etc.
- I(c; x) measures how much we know about c if we know x
- it needs to be maximized
- the cost function becomes:

$\min_G \max_D \; V(D, G) - \lambda\, I\big(c;\, G(z, c)\big)$

- the old function minus the mutual information I
- the mutual information cannot be optimized directly
- lower bound
- the inputs to the generator are both the noise z and the code c
- the outputs of the discriminator are both real/fake and Q(c|x)
- this distribution is used to train the network
- the latent code c should not be lost through the generator-discriminator process
- we maximize the mutual information lower bound, with Q given by the discriminator's output
### Wasserstein GAN
- change cost function
- it is too easy to train the discriminator compared to the generator
- the Kullback-Leibler divergence has vanishing-gradient problems
- when the discriminator reaches optimal training, the KL-based loss reaches a flat zone
- gradient updates become too small in this case
- making it impossible to effectively train the generator
- introduce the Wasserstein distance
- not really a distance
- not prone to vanishing gradient
- better training
- more related to quality of images produced
### LSGAN
- use least square loss for both discriminator and generator
- better quality of images
- more robust to mode collapsing
### self attention GAN
- self attention for both generator and discriminator
- details are generated using cues from all feature locations
- the discriminator checks that distant portions of the image are consistent with each other
### Cycle Gan
- given two sets of images from two domains, without paired examples (e.g., locations in summer, locations in winter)
- objective: learn a mapping between the two domains
- 3 networks
- G converts a real image x to y
- F reconstructs the image from y back to $\hat{x}$
- a reconstruction (cycle-consistency) loss between x and $\hat{x}$ trains these 2 networks
- a discriminator is also used
- the process is applied in both directions
## Reinforcement Learning
- agent interacting with environment to get a reward
- rewards are scalar values
- take actions to maximize the reward
- from state s, take action a, determined by the policy π(s)
- move to a new state s' based on P(s'|s,a)
- get reward r(s'), update the policy
- loop
- actions affect the environment
- rewards are not differentiable
### Markov Decision Process
- framework for making decisions in a stochastic environment
- stochastic: each move has a certain probability
- no move is certain
- goal: find a policy
- a map that gives the optimal action for each state
- we use dynamic programming: the Bellman equation
- components:
- states s
- first state s0
- action a
- transition model P(s'|s,a)
- reward function r(s)
- policy π(s) -> function which maps each state to an action
- discount factor γ
- the cumulative reward is the sum of the rewards over all states visited
- this summation can be infinite
- use the discount to make it converge
- $r_{max} / (1 - \gamma)$ bounds the cumulative reward
- the algorithm always converges
- the probability of going to s' depends only on the current state s (Markov property)
- Goal: find π* that maximizes the cumulative reward
- $V^\pi(s)$ is the value function of state s with respect to π
- the expected cumulative reward from state s following π
- $V^*(s)$ is the optimal value of a state
- its value when following the best policy π*
- Regular Bellman equation:

$V^*(s) = r(s) + \gamma \max_a \sum_{s'} P(s'|s,a)\, V^*(s')$

- from a state there are several possible next states (stochastic)
- the relation is recursive
- hard to compute
- the optimal policy picks the action a that maximizes the summation above
- computing it requires the transition model P(s'|s,a)
- better to define a state-action value Q
- $Q^\pi(s,a)$ is the same as V but fixing the first action a
- $Q^*(s,a)$ is the optimal value when following the best policy:

$Q^*(s,a) = r(s) + \gamma \sum_{s'} P(s'|s,a)\, \max_{a'} Q^*(s',a')$

- find the optimal Q values, then read off the optimal policy:

$\pi^*(s) = \arg\max_a Q^*(s,a)$
- in practice this gives a table
- for each state and each action we have a certain value
- the table is known only if the agent tries the actions
- initially set to all 0s
- from a state s try an action a
- update Q(s,a) using the maximum future value
- repeat until you have the whole table
- now you can find the best policy
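The tabular procedure above can be sketched on a tiny chain MDP; the 3-state chain, its rewards, and the pure-exploration loop are made up for illustration:

```python
import random

# chain: s0 -> s1 -> s2 (reaching s2 gives reward 1); action 0 = left, 1 = right
N_STATES, N_ACTIONS, GAMMA, ALPHA = 3, 2, 0.9, 0.5

def step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # the table, initially all zeros
rng = random.Random(0)
for _ in range(500):
    s = rng.randrange(N_STATES - 1)                # random (exploring) start state
    a = rng.randrange(N_ACTIONS)                   # random action: pure exploration
    s_next, r = step(s, a)
    # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
    Q[s][a] += ALPHA * (r + GAMMA * max(Q[s_next]) - Q[s][a])

policy = [max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(N_STATES)]
assert policy[0] == 1 and policy[1] == 1           # learned policy: always go right
```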
### Deep Q Learning
- computing $Q^*(s,a)$ exactly is possible only in small state spaces
- Atari games and the like are too complex
- approximate it with a parametric function $Q_w(s,a)$, trained to minimize:

$L(w) = \mathbb{E}\big[\big(r + \gamma \max_{a'} Q_w(s',a') - Q_w(s,a)\big)^2\big]$
- stochastic gradient descent training:
- replace the expectation by simply sampling
- sample (s, a, s') using the behaviour distribution and the transition model
- training is problematic
- the targets are moving
- the policy may change rapidly with small changes to the parameters
- drastic changes in the data distribution
- solutions
- freeze the target Q
- keep the parameters of $Q_w$ used in the target fixed; update them only once in a while
- experience replay
- take action $a_t$ according to an ε-greedy policy, store $(s_t, a_t, r_{t+1}, s_{t+1})$
- small probability: random action
- otherwise: best action according to the current policy
- store the experience in a memory buffer
- randomly sample mini-batches of experiences
- update the parameters
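The two stabilization tricks can be sketched together: ε-greedy action selection feeding a replay buffer (buffer size, ε, and the dummy transitions are illustrative):

```python
import random
from collections import deque

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon take a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
buffer = deque(maxlen=1000)                    # experience replay memory
for t in range(50):
    a = epsilon_greedy([0.1, 0.9], epsilon=0.1, rng=rng)
    buffer.append((t, a, 0.0, t + 1))          # store (s_t, a_t, r_{t+1}, s_{t+1})

batch = rng.sample(list(buffer), k=8)          # random mini-batch breaks temporal correlations
assert len(batch) == 8
assert all(len(exp) == 4 for exp in batch)
```

Sampling uniformly from the buffer is what decorrelates consecutive updates; the frozen target network would be used when computing the regression target for each sampled transition.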
### Policy Gradient Methods
- don't use Q values to represent the policy
- parametrize the policy π directly
- learn a function giving the distribution of actions a given state s
- objective: maximize the expected return $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]$
- "The gradient of an expectation is transformed to an expectation of gradients, so we can sample using Monte Carlo"
**Finding the policy gradient:**

$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big] = \mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\big] = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\Big(\textstyle\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\, R(\tau)\Big]$
- we do not need to know about the environment dynamics *p* => **Model-Free Algorithm**
- two options:
- estimate gradient using N trajectories
- single-step variant: in state s, sample action a using the current policy, get the reward, update, repeat
- suffers from high variance and slow convergence (due to the stochastic gradient)
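The $\nabla_\theta \log \pi_\theta$ term that everything above rests on can be checked numerically for a softmax policy; a minimal sketch (the 3-action parameter vector is made up):

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_log_pi(theta, a):
    """For a softmax policy, grad_theta log pi(a) = onehot(a) - pi(theta)."""
    g = -softmax(theta)
    g[a] += 1.0
    return g

theta = np.array([0.5, -1.0, 2.0])
a, eps = 0, 1e-6
analytic = grad_log_pi(theta, a)
# finite-difference check of each component against the analytic gradient
for i in range(3):
    tp, tm = theta.copy(), theta.copy()
    tp[i] += eps; tm[i] -= eps
    numeric = (np.log(softmax(tp)[a]) - np.log(softmax(tm)[a])) / (2 * eps)
    assert abs(numeric - analytic[i]) < 1e-4
```

In REINFORCE, each sampled trajectory contributes this gradient at every timestep, weighted by the trajectory's return R(τ), which is the source of the high variance mentioned above.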