---
tags: Master
---
# DEEP
## 8 - CNN
- supervised learning
- deal with locally structured data (images, language, etc VS arbitrary input features)
- learn hierarchy of features
- each layer extracts features from output of previous
- train all layers jointly
- convolution is inspired by nature
- working with inputs of variable size
two cell types (in the visual cortex):
- simple cells: respond to edge-like patterns
- complex cells: larger receptive fields, locally invariant to exact position of pattern
- convolution is an operation on two functions:
- input function
- kernel function
- output of convolution is referred to as "feature map"
- convolution is a filter operation for images
- a kernel matrix is applied to the image
- kernel size is arbitrary
- different sizes give different output dimensions
- different kernels extract different features
- the value of a central pixel is determined by summing the weighted values of its neighbours
- CNNs are multi-layer neural networks
- **local connectivity**: neurons in a layer are connected to only a few neurons of the previous layer
- **share weight params across spatial positions (kernels' weights)**:
- learning shift-invariant filter kernels
- reduce number of params (compared to fully connected)
### CNN architecture
- Multiple layers of feature extractors
- low-level layers extract local features
- high-level layers extract global patterns
- Typical classification
- input is a matrix or vector
- output is class prediction
- Some core operations
- Convolution
- main layer
- set of filters
- each filter is applied to the input, producing a feature map (stacked into a multi-channel output)
- filters are learned in a supervised manner by backpropagation on the classification error
- with N the input dimension, K the kernel dimension, and S the stride, the dimension of the feature map is $(N-K)/S + 1$
- to keep the output the same dimension as the input, use zero padding
- Non-linearity
- apply some activation function (sigmoid, relu, tanh, etc)
- increase non-linearity
- Pooling
- non-linear down-sampling (max, avg, ...)
- reduce spatial size of the representation
- reduce number of parameters and computation
- reduce overfitting
- Convolution + non-linearity + pooling = a "layer"
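The feature-map size formula above can be checked with a quick sketch; the padding term P (generalizing the note's formula to $(N-K+2P)/S+1$) is a standard extension, not from the notes:

```python
def feature_map_size(n: int, k: int, s: int = 1, p: int = 0) -> int:
    """Convolution output size: (N - K + 2P) / S + 1."""
    return (n - k + 2 * p) // s + 1

# 32x32 input, 5x5 kernel, stride 1, no padding -> 28x28
assert feature_map_size(32, 5) == 28
# zero padding of (K-1)/2 with stride 1 keeps the size unchanged ("same" padding)
assert feature_map_size(32, 5, s=1, p=2) == 32
# stride 2 roughly halves the map: (32 - 3 + 2)/2 + 1 = 16
assert feature_map_size(32, 3, s=2, p=1) == 16
```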
## 9 - CNN 2
1. **AlexNet**
- similar to LeNet
- convolutional layers followed by fully connected layers
- way bigger (~60M params)
- more data (10^6 images)
- GPU speed
- dropout and data augmentation
- won the ImageNet challenge (ILSVRC 2012)
- 4096-dimensional feature layer
- it learns **general features** (shown by plotting t-SNE of feature layer)
- **better than an external feature-extractor module**
- also works as a feature extractor for other tasks
2. ILSVRC 2013: **Zeiler and Fergus**
- similar to AlexNet
- **huge work to understand internals of CNNs**
- use a second **deconvolutional** network to visualize what a layer has learned as an image
- start from feature map
- unpool: max not reversible => keep memory of max locations
- rectify: use ReLU
- filter: inverse of convolution is transposed convolution
- useful for adversarial examples
- can also generate class model visualization (maximize for given class)
- applied to early layers you get low-level patterns
- higher in the hierarchy you get complex patterns (faces, text, etc.)
3. VGGNet
- similar to AlexNet
- same structure but much deeper
4. GoogLeNet
- no fully connected layers
- 22 layers
- only 4M parameters
- wider convolutions
- parallel paths with different receptive field sizes
- 1x1 convolutions
- intermediate softmax and loss to better train the network even at mid layers
5. ResNet
- up to 152 layers
- only 2M parameters
- small kernels (1x1 or 3x3)
- add identity bypass
- connections that skip a layer if not needed (residual connections/ residual block)
- too many layers without residuals gives poor performance
- not because of vanishing gradients
- not because of overfitting
- not because of representation power (more layers -> more powerful)
- in general the gradient flows better, so training is faster and better
- it resembles a shallower model (easier to train)
- batch normalization
- with many layers it is easy to get vanishing gradients
- batch norm helps mitigate this
- General principles derived:
- reduce filter size
- small kernels
- skip connections (residual blocks)
6. Resnet mods
- pre-activation ResNet
- same components as the original, in a different order
- batch norm and activation (ReLU) before the convolution
- a more direct path for the gradient
- wide resnet
- same block but more feature maps
- DenseNet
- every layer is connected to every following layer
- the input of a layer is the concatenation of the feature maps of all previous layers
- encourages feature reuse
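The core idea of a residual block (output = F(x) + x) can be sketched in NumPy; the two-matrix transform here is a stand-in for the real convolutional block, not the exact ResNet layout:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """y = relu(x + F(x)) with F(x) = w2 @ relu(w1 @ x).
    If w1 and w2 are ~0, the block reduces to (almost) the identity."""
    return relu(x + w2 @ relu(w1 @ x))

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
# zero weights: the skip connection passes the input straight through
w_small = np.zeros((8, 8))
y = residual_block(x, w_small, w_small)
assert np.allclose(y, relu(x))
```

This is why deep residual stacks train well: an unneeded layer can be "skipped" by driving its weights toward zero.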
## 10 - CNN 3
### Object proposal - selective search
- detect objects at any scale
- grouping criteria
- detect differences in color, texture etc
- needs to be fast
- bottom-up grouping of image regions
- unsupervised clustering
- initial segmentation (lots of fractured regions)
- some iterations of aggregation
- at each iteration search candidate objects in the segmentations
### Feature learning with CNN
1. R-CNN
- region proposal (selective search) + CNN
- use selective search to find good object candidates
- use these bounding boxes as input for a pre-trained CNN (AlexNet)
- the CNN is fast enough that lots of candidates can be used per image (at different scales)
- ~2k candidates per image
- inputs need to have a fixed dimension
- necessary to scale, crop, etc. the bounding boxes
- issue: multi-stage training
- fine-tuning on the current dataset
- train a classifier (e.g. SVM)
- refining stage: bounding-box regression
- no learning for selective search (it is a fixed algorithm)
- issue: for one image you need to process lots of proposals
- problem on memory
- problem on computational power
- issue: test is also slow
- find proposals
- run the CNN on 2k proposals
2. Fast R-CNN
- designed to be faster
- still using selective search
- add a new layer applied to each region of interest
- RoI pooling layer
- reshapes each region proposal to a fixed size
- no need to feed each box to the CNN: the image is processed just once
- a feature map for each object proposal
- optimization is a trade-off between the classification and regression losses
- the regression loss is the bounding-box regression loss
3. Faster R-CNN
- the detection network itself proposes objects
- no selective search
- instead, a new network (the Region Proposal Network) predicts region proposals
- higher performance
- because proposal generation is trained too
4. Mask R-CNN
- pixel-level segmentation
- similar to Faster R-CNN but:
- after identifying regions of interest, they pass through a RoIAlign layer
- more precise than the plain RoI pooling layer
- finds multiple bounding boxes and warps them into a fixed dimension
- as in Faster R-CNN, this is used for the classification and regression losses
- the output of the RoIAlign layer is fed to 2 CNNs to output a binary mask for each RoI
5. YOLO - you only look once
- different approach
- regression
- one convolutional network predicts both the boxes and the class of each box
- divide the image into an SxS grid
- m bounding boxes for each grid cell
- class probabilities for each box
- only boxes with probability higher than a threshold are used to identify objects
- much faster than previous approaches
- issue: spatial constraints
- difficult to detect small objects in the image
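The thresholding step above can be sketched as follows; the `(class_prob, box)` tuple format is a simplification of YOLO's actual confidence × class-score computation:

```python
def filter_boxes(boxes, threshold=0.5):
    """Keep only predicted boxes whose class probability exceeds the threshold."""
    return [(p, b) for (p, b) in boxes if p > threshold]

# hypothetical predictions: (probability, (x, y, w, h))
preds = [(0.9, (10, 10, 50, 50)), (0.3, (0, 0, 5, 5)), (0.7, (20, 20, 40, 40))]
kept = filter_boxes(preds)
assert [p for p, _ in kept] == [0.9, 0.7]
```

In the real pipeline, a non-maximum-suppression pass typically follows to merge overlapping survivors.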
### Segmentation
- pixel level prediction
- e.g. per-pixel depth estimation
- e.g. finding the pixels that define the borders of objects
- fully convolutional networks
- the output is not a class
- the output is an image with the same dimensions as the original
- upsampling layers are used to obtain this output
- a form of deconvolution
- basically a heatmap that highlights some objects
- this network can be trained in a supervised manner
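One simple upsampling option can be sketched as nearest-neighbour repetition (FCNs usually learn transposed convolutions instead; this is just the cheapest illustration):

```python
import numpy as np

def upsample_nearest(x, factor=2):
    """Repeat each pixel `factor` times along both spatial axes."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

fm = np.array([[1, 2],
               [3, 4]])
up = upsample_nearest(fm)
assert up.shape == (4, 4)
assert (up[:2, :2] == 1).all()  # each source pixel becomes a 2x2 block
```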
## 11 - RNN
Problem: the previous solutions have fixed-size inputs and outputs
Feed-forward nets are limited to single-to-single mappings
This does not work for document classification, sentiment analysis, image captioning
Solution: keep hidden memory state
Allows more mappings:
- multiple-to-single (document classification, sentiment analysis)
- single-to-multiple (image captioning)
- sequence-to-sequence / multiple-to-multiple (translation, video frame prediction)
Two representations:
- rolled: more compact
- unrolled: easier for backpropagation
### Vanilla RNN cell
- $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$, output $y_t = W_{hy} h_t$
- uses tanh: its derivative is simple, but it suffers from vanishing gradient
- the same cell (same weights) is repeated at every timestep
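One step of the vanilla cell, sketched in NumPy (weight shapes and the 5-step unroll are illustrative):

```python
import numpy as np

def rnn_step(x, h_prev, W_xh, W_hh, b):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t + b): the same weights at every timestep."""
    return np.tanh(W_hh @ h_prev + W_xh @ x + b)

rng = np.random.default_rng(0)
W_xh = rng.standard_normal((4, 3))   # input -> hidden
W_hh = rng.standard_normal((4, 4))   # hidden -> hidden (the recurrent weights)
b = np.zeros(4)
h = np.zeros(4)
for t in range(5):                   # "unrolled" over 5 timesteps
    h = rnn_step(rng.standard_normal(3), h, W_xh, W_hh, b)
assert h.shape == (4,) and np.all(np.abs(h) <= 1)  # tanh keeps values in [-1, 1]
```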
### Backpropagation Through Time (BPTT):
- most common method for RNNs
- similar to traditional backpropagation:
- treat unfolded RNN as a single feed-forward network (**forward pass**)
- calculate weight updates for each cell copy in the unfolded network, then sum or average weight updates and apply to the RNN weights (**backward pass**)
Problems:
- long chains mean a lot of derivatives => vanishing or exploding gradients
- can be computationally expensive as number of timesteps increases
### Truncated BPTT
A cheaper version of BPTT:
- split the chain into segments of length k1
- backprop within a segment, going back at most k2 timesteps
- move forward and repeat
### LSTM cell
Idea: add a **memory cell** that is not subject to matrix multiplication or squishing, thereby avoiding gradient decay
Composed of:
- **neuron with self-recurrent connection**: state can remain constant
- **gates** to modulate interactions with environment
- **input**: allow or block change to the current state
- **output**: allow an effect on other neurons
- **forget**: modulate self-recurrent connection, allowing to remember/forget previous state as needed
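The gate equations can be sketched in NumPy; this is a minimal version of the standard LSTM step with biases omitted and a single stacked weight matrix (an implementation convenience, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step. W maps [h_prev; x] to the four gate pre-activations."""
    z = W @ np.concatenate([h_prev, x])
    i, f, o, g = np.split(z, 4)                        # input, forget, output gates + candidate
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # memory cell: additive, no squashing of c_prev
    h = sigmoid(o) * np.tanh(c)                        # output gate modulates what other neurons see
    return h, c

rng = np.random.default_rng(0)
H, X = 4, 3
W = rng.standard_normal((4 * H, H + X))
h, c = lstm_step(rng.standard_normal(X), np.zeros(H), np.zeros(H), W)
assert h.shape == (H,) and c.shape == (H,)
```

Note how `c` is updated additively: that is the path along which gradients can flow without decaying.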
### GRU cell
- **reset gate**: modulates the previous hidden state
- **update gate**: weights the new value against the previous one
### Multi-layer RNNs
...
### Bi-directional RNNs
## Generative Models
- unsupervised approach
- learn the hidden structure of the data
### Autoencoders
- from the input data x obtain a vector z of features
- z is usually smaller than x
- we want the features to be meaningful
- train such that z can be used to reconstruct the original data
- encode → decode the image
- a regression loss (no labels needed) can be computed to train the model
- e.g. $\|x - \hat{x}\|^2$
- $\hat{x}$ is the decoded image
- x is the original image
- after training the model you can discard the decoder part and initialize a supervised model using z
- there are ways to obtain a better z
- e.g. add noise to the original image (denoising autoencoder)
- autoregressive models
- given a training set with some distribution, we want to generate new samples from that distribution
- p_data -> distribution of the training data
- p_model -> distribution we want to learn (similar to p_data)
- there are 2 branches:
- explicit density estimation: define p_model and sample data from it
- implicit density estimation: sample without explicitly defining p_model
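The autoencoder's reconstruction loss above can be sketched with a tiny linear encoder/decoder (plain matrices standing in for the real networks; shapes are illustrative):

```python
import numpy as np

def reconstruction_loss(x, x_hat):
    """||x - x_hat||^2: trainable without any labels."""
    return float(np.sum((x - x_hat) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W_enc = rng.standard_normal((3, 8))   # z is smaller than x (3 < 8): a bottleneck
W_dec = rng.standard_normal((8, 3))
z = W_enc @ x                          # encode
x_hat = W_dec @ z                      # decode
assert reconstruction_loss(x, x) == 0.0
assert reconstruction_loss(x, x_hat) >= 0.0
```

Training would adjust `W_enc`/`W_dec` by gradient descent on this loss; afterwards the decoder can be discarded and z reused for a supervised task.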
### pixel RNN
- explicit density model
- basic idea: the probability of a pixel depends on all the pixels before it:

$p(x) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})$

- training simply maximizes this likelihood
- this approach becomes a sequence problem
- start from the top-left corner; each pixel is generated one at a time, row by row
- the dependency on previous pixels is modelled with a two-dimensional LSTM
- the context used to compute the probability of a pixel can vary:
- horizontally (row by row)
- diagonally
- ...
- different approach: proceed with a diagonal neural network
- a pixel depends on the pixel above it in the same column plus the one before it in the same row
- the context matters because it affects the computational power required
- with the diagonal approach you can parallelize (compute along the diagonal)
- this is faster and more efficient
- main drawback: generation is still sequential, one pixel at a time
- particularly good at image completion
### pixel CNN
- same problem as pixel RNN
- use convolutional layers to model the context
- a pixel is determined by the pixels around it (within the kernel)
- start from the top-left corner
- training is faster
- convolutions can be parallelized
- kernels are learned during training
- generation is still sequential
- slow
- particularly good at image completion
- some improvements over the years
- dropout
- gated convolutional layers
- ...
- pixel CNN++
- logistic loss for training
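The "only pixels before it" constraint is enforced with a masked convolution kernel; a sketch of building the mask (the type-A variant, which also zeroes the centre, is an assumption about which mask is meant):

```python
import numpy as np

def causal_mask(k, mask_type="A"):
    """Zero out kernel weights at (type A) or after (type B) the centre, in raster-scan order."""
    mask = np.ones((k, k))
    c = k // 2
    mask[c, c + (mask_type == "B"):] = 0.0  # centre row: blind from the centre onwards
    mask[c + 1:, :] = 0.0                    # all rows below the centre
    return mask

m = causal_mask(3)
assert m.tolist() == [[1, 1, 1],
                      [1, 0, 0],
                      [0, 0, 0]]
```

Multiplying the learned kernel by this mask before every convolution guarantees each output only sees already-generated pixels.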
### variational autoencoders
- the exact density is not computed (intractable)
- a variational approximation of the distribution is used instead
- optimize a lower bound on the likelihood
- basic idea: the image $\hat{x}$ is generated from a latent factor z
- z has a prior distribution p(z)
- so $\hat{x}$ has a distribution p(x|z)
- the distribution of z is set to something simple (Gaussian)
- find the optimal parameters θ for these 2 distributions
- problem: the posterior is intractable:

$p(z|x) = \frac{p(x|z)\, p(z)}{p(x)}, \qquad p(x) = \int p(x|z)\, p(z)\, dz$

- so we find another distribution q(z|x) close to p(z|x)
- this is the distribution of a certain encoder
- remember that these two models (encoder q(z|x) and decoder p(x|z)) are probabilistic
- we can only find the parameters of these distributions
- we can use the distributions and the optimal parameters to sample what we need
- q(z|x) -> gives z
- p(x|z) -> gives x (generates image)
- considering the log likelihood we obtain:

$\log p_\theta(x) = \mathbb{E}_{z \sim q_\phi(z|x)}\!\big[\log p_\theta(x|z)\big] - D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big) + D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big)$
- the last term of the equation is intractable
- optimize only the lower bound (the first two terms; the last one is >= 0)
- first term: maximize the likelihood of the image being reconstructed
- second term: KL divergence, makes the encoder distribution close to the prior
- the generated data depends strictly on the components of z (z1, z2, ...) sampled from the distribution
- changing z changes the output (person smiling/angry, digit 0-9, etc.)
- if you have labels, nothing changes
- simply add y to the distributions
- p(x|z) -> p(x|z,y)
- main drawback: the generated images tend to be blurry
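Sampling z from the encoder's Gaussian is typically done with the reparameterization trick, which keeps the sampling step differentiable in the encoder's outputs; a minimal sketch (shapes illustrative):

```python
import numpy as np

def sample_z(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I); gradients can flow through mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu, log_var = np.zeros(2), np.zeros(2)   # a unit Gaussian posterior
zs = np.array([sample_z(mu, log_var, rng) for _ in range(2000)])
assert abs(zs.mean()) < 0.1              # empirical mean is close to mu = 0
```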
## GAN
- no explicit density function is modelled
- focus on the ability to sample from the model
- sampling from the high-dimensional training distribution directly is hard
- sample from a simple distribution instead
- random noise
- transform it to the training distribution with a neural net
- two networks
- generator
- try to fool the discriminator generating real like images
- discriminator
- distinguish between real and fake images
- train the networks jointly on a minimax objective:

$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}\!\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\!\big[\log\big(1 - D(G(z))\big)\big]$
- discriminator wants D(x) close to 1 (real) and D(G(z)) close to 0 (fake)
- maximization
- gradient ascent
- generator wants D(G(z)) close to 1 (generated image real)
- minimization
- gradient descent
- problem: when the image is clearly fake, the gradient of $\log(1 - D(G(z)))$ is close to 0 (flat)
- difficult to train the generator into generating better samples
- fix: switch to gradient ascent on the generator too
- separate in 2 parts
- for discriminator same as before
- for the generator, maximize instead:

$\max_G \; \mathbb{E}_{z \sim p(z)}\!\big[\log D(G(z))\big]$
- Full training: alternate gradient steps on the discriminator and on the generator at every iteration
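The two objectives can be sketched with scalar toy values, where `d_real` and `d_fake` stand for the discriminator's "probability of real" on a real and a generated image (a sketch, not a full training loop):

```python
import math

def d_loss(d_real, d_fake):
    """Discriminator ascends log D(x) + log(1 - D(G(z))); written here as a loss to minimize."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def g_loss(d_fake):
    """Non-saturating generator loss: maximize log D(G(z)) instead of minimizing log(1 - D(G(z)))."""
    return -math.log(d_fake)

# a confident discriminator (real -> 0.9, fake -> 0.1) has lower loss than a guessing one
assert d_loss(0.9, 0.1) < d_loss(0.5, 0.5)
# the generator's loss drops as it fools the discriminator more often
assert g_loss(0.8) < g_loss(0.2)
```

The non-saturating `g_loss` has a large gradient precisely when `d_fake` is small, which is the regime where the original formulation went flat.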
### DCGAN
- the first GANs were fully connected
- deep convolutional architecture
- able to generate higher-resolution images
- Discriminator:
- no pooling
- leaky ReLU for activations
- one fully connected before softmax output
- batch normalization
- used in the generator as well
- it is possible to perform arithmetic on latent representations
- the sum or difference of latent representations gives a new z
- use that z to produce a new image
### evaluating GAN
- samples are too realistic to be reliably judged by people
- problem: the density function is not known (unlike pixel-based and variational models)
- Inception Score
- a GAN needs to satisfy 2 things
- Saliency
- a human needs to be able to determine what is in the image
- Diversity
- the set of generated images needs to contain lots of different objects

$IS = \exp\Big(\mathbb{E}_{x}\big[D_{KL}\big(p(y|x) \,\|\, p(y)\big)\big]\Big)$

- p(y|x) is the posterior given by a classifier (e.g. InceptionNet, a human)
- if the classifier is able to understand the image, the entropy of p(y|x) is low
- p(y) is the marginal distribution of classes
- if there are lots of different objects, the entropy of p(y) is high
- problem:
- overfitting to the training set scores well
- a single image per class scores well (mode dropping)
- Frechet Inception Distance
- compare the distributions of real data and generated images passed through a classifier (InceptionNet etc.)
- correlated with visual quality of sample
- same overfitting problem
- in general GANs are difficult to train
- the generator loss does not correlate with the quality of the images
- very sensitive to hyperparameters
- the loss function has been changed over time to improve this
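The Inception Score can be computed from a matrix of classifier posteriors p(y|x), one row per generated image; a sketch with made-up posteriors:

```python
import numpy as np

def inception_score(p_yx):
    """IS = exp( E_x[ KL( p(y|x) || p(y) ) ] ); p_yx has one row of class posteriors per image."""
    p_y = p_yx.mean(axis=0)                                    # marginal class distribution p(y)
    kl = np.sum(p_yx * (np.log(p_yx) - np.log(p_y)), axis=1)   # per-image KL divergence
    return float(np.exp(kl.mean()))

# confident AND diverse posteriors: each image is a different class -> high score
diverse = np.eye(4) * 0.96 + 0.01
# every image classified the same way (mode collapse) -> score is exactly 1
collapsed = np.tile([[0.97, 0.01, 0.01, 0.01]], (4, 1))
assert inception_score(diverse) > inception_score(collapsed)
assert abs(inception_score(collapsed) - 1.0) < 1e-9
```

The collapsed case shows the known weakness: the score only measures the *marginal* class spread, so one perfect image per class already maximizes it.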
### Mode collapse
- the generator ends up modelling only a small subset of the training data
- the idea: if the generator manages to fool the discriminator with a certain image, it will produce that same image over and over
- but the best image to fool the discriminator changes over time
- the network gets locked in a cycle and does not converge
- Feature Matching:
- add a new loss for feature matching
- new goal: beat the opponent while matching features of real images
- Minibatch Discriminator
- append a vector to the dense layer of the discriminator that measures similarity between images in the batch
- if the similarity o(x) is too large, we are mode collapsing
- similar images are produced over and over
- the discriminator can then recognize mode collapse and penalize the generator
- Virtual Batch Norm
- with batch normalization, generated images within a batch are not independent
- compute a reference batch before training
- normalization parameters are computed on the reference batch and applied to the current batch
- higher computational complexity
- Label smoothing
- the discriminator does not output only 1 or 0 (real, fake)
- smooth the targets between 0 and 1 (e.g. 0.9 for real)
- gives better gradients
- Label information
- modify architecture to use labels
- labels as input for both the generator and the discriminator
- with z and y the generator produces an image x
- then x with its label y is given to the discriminator
- same for real images with their y
### Conditional GAN
- same loss as original GAN
- minimization and maximization
- use labels
- labels as an extension of the latent space
- D(x) -> D(x|y)
- G(z) -> G(z|y)
### InfoGAN
- the label is not given; it is treated as a latent factor or code
- the latent code defines the object represented in the image
- the noise affects variations of that object
- e.g. the code defines the object as a face
- the noise defines illumination, pose, etc.
- I(c; x) measures how much we know about c if we know x
- it needs to be maximized
- the cost function becomes:

$\min_G \max_D \; V(D, G) - \lambda\, I\big(c;\, G(z, c)\big)$

- the old function minus the mutual information I
- the mutual information cannot be optimized directly
- lower bound
- the inputs to the generator are both the noise z and the code c
- the outputs of the discriminator are both real/fake and Q(c|x)
- this distribution is used to train the network
- the latent code c should not be lost through the generator-discriminator process
- we maximize the mutual information lower bound, with Q given by the discriminator's output
### Wasserstein GAN
- change cost function
- it is too easy to train the discriminator compared to the generator
- the Kullback-Leibler divergence has vanishing-gradient problems
- when the discriminator reaches optimal training, the KL-based loss reaches a flat zone
- gradient updates become too small in this case
- making it impossible to effectively train the generator
- introduce the Wasserstein distance
- not really a distance
- not prone to vanishing gradient
- better training
- more related to quality of images produced
### LSGAN
- use least square loss for both discriminator and generator
- better quality of images
- more robust to mode collapsing
### self attention GAN
- self attention for both generator and discriminator
- details are generated using cues from all feature locations
- the discriminator checks that distant portions of the image are consistent with each other
### Cycle Gan
- given two sets of images from two domains, without paired examples (e.g., locations in summer, locations in winter)
- objective: learn a mapping between the two domains
- 3 networks
- G converts a real image x to y
- F reconstructs the image from y back to $\hat{x}$
- a reconstruction (cycle-consistency) loss between x and $\hat{x}$ trains these 2 networks
- a discriminator is also used
- the process is applied in both directions
## Reinforcement Learning
- agent interacting with environment to get a reward
- rewards are scalar values
- take actions to maximize the reward
- from state s, take action a, determined by the policy π(s)
- move to a new state s' based on P(s'|s,a)
- get reward r(s'), update the policy
- loop
- actions affect the environment
- rewards are not differentiable
### Markov Decision Process
- framework for making decisions in a stochastic environment
- stochastic: each move has a certain probability
- no move is certain
- goal: find a policy
- a map that gives the optimal action for each state
- we use dynamic programming: the Bellman equation
- components:
- states s
- first state s0
- action a
- transition model P(s'|s,a)
- reward function r(s)
- policy π(s) -> function which maps each state to an action
- discount factor γ
- the cumulative reward is the sum of the rewards over all states visited
- this summation can be infinite
- use the discount to make it converge
- $r_{max} / (1 - \gamma)$ bounds the cumulative reward
- the algorithm always converges
- the probability of going to s' depends only on the current state s (Markov property)
- Goal: find π* that maximizes the cumulative reward
- $V^\pi(s)$ is the value function of state s with respect to π
- the expected cumulative reward from state s following π
- $V^*(s)$ is the optimal value of a state
- its value when following the best policy π*
- Regular Bellman equation:

$V^*(s) = r(s) + \gamma \max_a \sum_{s'} P(s'|s,a)\, V^*(s')$

- from a state there are several possible next states (stochastic)
- the relation is recursive
- hard to compute
- the optimal policy picks the action a that maximizes the summation above
- computing it requires the transition model P(s'|s,a)
- better to define a state-action value Q
- $Q^\pi(s,a)$ is the same as V but fixing the first action a
- $Q^*(s,a)$ is the optimal value when following the best policy:

$Q^*(s,a) = r(s) + \gamma \sum_{s'} P(s'|s,a)\, \max_{a'} Q^*(s',a')$

- find the optimal Q values, then read off the optimal policy:

$\pi^*(s) = \arg\max_a Q^*(s,a)$
- in practice this gives a table
- for each state and each action we have a certain value
- the table is known only if the agent tries the actions
- initially set to all 0s
- from a state s try an action a
- update Q(s,a) using the maximum future value
- repeat until you have the whole table
- now you can find the best policy
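The tabular procedure above can be sketched on a tiny chain MDP; the 3-state chain, its rewards, and the pure-exploration loop are made up for illustration:

```python
import random

# chain: s0 -> s1 -> s2 (reaching s2 gives reward 1); action 0 = left, 1 = right
N_STATES, N_ACTIONS, GAMMA, ALPHA = 3, 2, 0.9, 0.5

def step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # the table, initially all zeros
rng = random.Random(0)
for _ in range(500):
    s = rng.randrange(N_STATES - 1)                # random (exploring) start state
    a = rng.randrange(N_ACTIONS)                   # random action: pure exploration
    s_next, r = step(s, a)
    # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
    Q[s][a] += ALPHA * (r + GAMMA * max(Q[s_next]) - Q[s][a])

policy = [max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(N_STATES)]
assert policy[0] == 1 and policy[1] == 1           # learned policy: always go right
```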
### Deep Q Learning
- computing $Q^*(s,a)$ exactly is possible only in small state spaces
- Atari games and the like are too complex
- approximate it with a parametric function $Q_w(s,a)$, trained to minimize:

$L(w) = \mathbb{E}\big[\big(r + \gamma \max_{a'} Q_w(s',a') - Q_w(s,a)\big)^2\big]$
- stochastic gradient descent training:
- replace the expectation by simply sampling
- sample (s, a, s') using the behaviour distribution and the transition model
- training is problematic
- the targets are moving
- the policy may change rapidly with small changes to the parameters
- drastic changes in the data distribution
- solutions
- freeze the target Q
- keep the parameters of $Q_w$ used in the target fixed; update them only once in a while
- experience replay
- take action $a_t$ according to an ε-greedy policy, store $(s_t, a_t, r_{t+1}, s_{t+1})$
- small probability: random action
- otherwise: best action according to the current policy
- store the experience in a memory buffer
- randomly sample mini-batches of experiences
- update the parameters
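The two stabilization tricks can be sketched together: ε-greedy action selection feeding a replay buffer (buffer size, ε, and the dummy transitions are illustrative):

```python
import random
from collections import deque

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon take a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
buffer = deque(maxlen=1000)                    # experience replay memory
for t in range(50):
    a = epsilon_greedy([0.1, 0.9], epsilon=0.1, rng=rng)
    buffer.append((t, a, 0.0, t + 1))          # store (s_t, a_t, r_{t+1}, s_{t+1})

batch = rng.sample(list(buffer), k=8)          # random mini-batch breaks temporal correlations
assert len(batch) == 8
assert all(len(exp) == 4 for exp in batch)
```

Sampling uniformly from the buffer is what decorrelates consecutive updates; the frozen target network would be used when computing the regression target for each sampled transition.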
### Policy Gradient Methods
- don't use Q values to represent the policy
- parametrize the policy π directly
- learn a function giving the distribution of actions a given state s
- objective: maximize the expected return $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]$
- "The gradient of an expectation is transformed to an expectation of gradients, so we can sample using Monte Carlo"
**Finding the policy gradient:**

$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big] = \mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\big] = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\Big(\textstyle\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\, R(\tau)\Big]$
- we do not need to know about the environment dynamics *p* => **Model-Free Algorithm**
- two options:
- estimate gradient using N trajectories
- single-step variant: in state s, sample action a using the current policy, get the reward, update, repeat
- suffers from high variance and slow convergence (due to the stochastic gradient)
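The $\nabla_\theta \log \pi_\theta$ term that everything above rests on can be checked numerically for a softmax policy; a minimal sketch (the 3-action parameter vector is made up):

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_log_pi(theta, a):
    """For a softmax policy, grad_theta log pi(a) = onehot(a) - pi(theta)."""
    g = -softmax(theta)
    g[a] += 1.0
    return g

theta = np.array([0.5, -1.0, 2.0])
a, eps = 0, 1e-6
analytic = grad_log_pi(theta, a)
# finite-difference check of each component against the analytic gradient
for i in range(3):
    tp, tm = theta.copy(), theta.copy()
    tp[i] += eps; tm[i] -= eps
    numeric = (np.log(softmax(tp)[a]) - np.log(softmax(tm)[a])) / (2 * eps)
    assert abs(numeric - analytic[i]) < 1e-4
```

In REINFORCE, each sampled trajectory contributes this gradient at every timestep, weighted by the trajectory's return R(τ), which is the source of the high variance mentioned above.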