---
tags: Master
---

# Deep Learning

[TOC]

## Introduction 🖖

![](https://i.imgur.com/py683Qk.jpg)
Image from [Neural Network Zoo](https://www.asimovinstitute.org/neural-network-zoo/?utm_content=buffer607bd&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer) (has an overview of each NN)

**Deep Learning:** learning with deep neural networks (several hidden layers)
- generally better than other ML methods **for structured input/output**
  - `image`, `speech`, ...
- `hidden layers` automatically compute relevant features

## History 🏺

![](https://i.imgur.com/ZA6Kfqw.png)

**Perceptron (**1957**):** a `non-linear` parameterized function with restricted output range.
![](https://i.imgur.com/4pAQFes.png)
- 👁️‍🗨️ the **Electronic Brain (**1943**)** by McCulloch & Pitts is like a Perceptron, but without the bias and threshold function
- **Activation Functions:** see later chapters, traditional `Sigmoid` vs modern `ReLU`

:snowflake: **First AI Winter (**1969-1986**):** `XOR` cannot be solved by a perceptron, dark age

**Multi-Layer Perceptron (**1986**):** can solve non-linearly separable problems
- only `feed-forward` (no cycles)
- the `Rosenblatt` algorithm for Perceptron learning is not applicable $\implies$ use **Backpropagation**

![](https://i.imgur.com/pQu93L1.png)

:snowflake: **Second AI Winter (**1990-2006**):** NNs cannot exploit many layers
- `Overfitting` problem
- `Vanishing gradient` problem
- lack of data and computational power (`no GPU`)

**SVM (**1995**):** became very popular
![](https://i.imgur.com/M9bO51m.png)
- similar accuracies to Neural Networks
- fewer heuristics and parameters

**Recurrent Long Short-Term Memory Networks (**1997**):**
- deep neural networks with loops

**LeNet (**1998**):** one of the first Convolutional Neural Networks
![](https://i.imgur.com/ys2b3Ib.png)

**DNN Pre-training (**2006**):** discarded by 2010 due to technological progress favoring back-propagation from scratch

**AlexNet (**2012**):** CNN similar to `LeNet`, but deeper and with stacked convolutional layers
- some technical improvements: `ReLU`, `dropout`, `data augmentation`

![](https://i.imgur.com/gKYlpEX.png)
![](https://i.imgur.com/cbC17zN.png)

## ML Recap 🎡

Traditional programming: $\text{data} + \text{program}=\text{output}$
Machine Learning: $\text{data} + \text{output} = \text{program}$
- algorithms on data give **knowledge**

### Types of learning

![](https://i.imgur.com/7Eb3V4a.jpg)

**Supervised (inductive) Learning:** given training data + desired outputs (`labels`)
- goal: predict the outputs for new, unseen inputs

**Unsupervised Learning:** only training data (without desired outputs)
- anomaly detection
- dimensionality reduction

**Semi-supervised Learning:** given training data + `few labels`

**Reinforcement Learning:** rewards from a `sequence of actions`
- 🎯 Goal: maximize the expected **cumulative reward**

![](https://i.imgur.com/lidaWvr.png)

### Designing a Learning System

1. **Experience:** choose `training data`
2. **Model / Target Function:** choose what to learn
3. **Function representation:** choose how to represent the target function
   - SUPERVISED:
     - Instance-based functions
       - **Non-Parametric**
         - `K-nearest neighbors`: simple, lazy learner, curse of dimensionality
         - Kernel regression
         - Local regression
     - **Numerical** functions
       - `Linear regression`
       - `Neural networks`: $f(\textbf{x};\Theta)$
       - `SVMs`: $f(\textbf{x})=b+\sum\limits_i a_ik(\textbf{x},\textbf{x}^{(i)})$
     - Probabilistic Graphical Models
       - **Parametric**
         - `Naïve Bayes`
         - `Bayesian networks`
         - Hidden Markov Models (`HMMs`)
         - Probabilistic Context-Free Grammars (PCFGs)
         - Markov networks
     - **Symbolic** functions
       - `Decision trees`
       - Rules in propositional logic
       - Rules in first-order predicate logic
     - **Aggregation**
       - Bagging (bootstrap + aggregation)
       - AdaBoost
       - Random forests
   - UNSUPERVISED:
     - **Clustering**
       - `K-means` clustering
       - Spectral clustering
       - Gaussian mixture model (`GMM`)
     - **Dimensionality reduction**
       - Principal component analysis (`PCA`)
       - Factor analysis
4. **Optimizer:** choose the `learning algorithm` to infer the target function from experience
   - `Gradient descent`
     - Perceptron learning rule
     - **Backpropagation**
   - Dynamic Programming
     - HMM Learning
     - PCFG Learning
   - Divide and Conquer
     - Decision tree induction
     - Rule learning
   - Evolutionary Computation
     - Genetic Algorithms (GAs)
     - Genetic Programming (GP)
     - Neuro-evolution
5. **Evaluation Metric:** define the metric to be optimized
   - Accuracy
   - Precision & Recall
   - Mean Squared Error (MSE)
   - Likelihood
   - Posterior probability
   - Margin
   - Entropy
   - KL divergence
   - Minimum quantization error
   - ...

👁️‍🗨️ **Discriminative vs Generative Models**
- **Discriminative:** SVM, CNN, ...
  - focus on the `decision boundary`
  - map $\textbf{x} \rightarrow y$ directly
  - designed for `supervised data`
- **Generative:** Naive Bayes Classifier, Variational Autoencoders, ...
  - probabilistic model of `each class`
  - for `unsupervised data`
  - use of **sampling**

### Training and Testing

**Generalization:** ability to perform well on previously unobserved inputs.
- in ML we want to `minimize the training error` while keeping the `generalization error` (test error) low as well

👁️‍🗨️ `Assumption:` training and test examples are **i.i.d.** (independent and identically distributed)

**Assumption Violations**
- `collective classification:` if examples are not independent
- `transfer learning:` if the test distribution is different from the training one

**Underfitting & Overfitting:** find a good balance to minimize the gap between train and test error
- `underfitting` $\implies$ training error too big
- `overfitting` $\implies$ test error too big

![](https://i.imgur.com/asPGUoK.png)
- `Capacity:` ability to fit a wide variety of functions (e.g. polynomial vs linear for regression)

**No Free Lunch Theorem:** averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points.
- 👁️‍🗨️ Simplifying assumptions $\implies$ model bias $\implies$ there are no universally good ML algorithms $\implies$ ML must be designed for a specific task

**Regularization:** any modification we make to a learning algorithm that is intended to `reduce its generalization error` but not its training error.
- typically a function `on model parameters`
- Example: linear regression with a high-order polynomial & weight decay $\mathcal{L}(\textbf{w})=\text{MSE}_{train}+\lambda\textbf{w}^T\textbf{w}$
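To make the weight-decay example concrete, here is a minimal NumPy sketch of the regularized loss and its gradient (the function name, toy data, learning rate and $\lambda$ are all illustrative choices, not from the course):

```python
import numpy as np

def ridge_loss_and_grad(w, X, y, lam):
    """MSE on (X, y) plus the L2 weight-decay penalty lam * w^T w."""
    n = X.shape[0]
    err = X @ w - y                              # residuals
    loss = (err @ err) / n + lam * (w @ w)
    grad = 2.0 * (X.T @ err) / n + 2.0 * lam * w
    return loss, grad

# toy usage: a few gradient-descent steps
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)
w = np.zeros(5)
for _ in range(500):
    loss, grad = ridge_loss_and_grad(w, X, y, lam=0.1)
    w -= 0.05 * grad
```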
### ML in Practice

**Which Algorithm?**
- `Size` of dataset
- Is the task difficult? (linearly separable?)
- `Prior knowledge` of the data distribution?
- Special requirements about speed?
  - model building vs prediction time
  - online vs `batch learning`
- Care about `interpretability`?

**Hyper-parameters:** model settings that cannot be estimated from data (unlike the model parameters)
- help estimate the `model parameters`
- e.g.: learning rate, C for SVMs, k for KNN, ...

## Feedforward Networks 🔱

![](https://i.imgur.com/gRIFgOi.gif)

### Logistic Regression
- used when the dependent variable is `categorical`

**Maximum Likelihood principle:** given a known distribution function of the data, pick the parameters $\theta$ that maximize the likelihood function
- `Likelihood function:` $p(\textbf{x}; \theta)$, the plausibility of the given data
- can be generalized to estimate a conditional probability for prediction: $p(\textbf{y}|\textbf{x};\theta)$

![](https://i.imgur.com/IeZ0Tbo.png)
![](https://i.imgur.com/wv3cQqP.png)
- the maximum can be obtained by zeroing the gradient w.r.t. $\theta$: $\nabla_\theta \sum\limits_{j=1}^n \ln p(\textbf{x}_j |\theta) = 0$
  - local or global maxima depending on the form of the distribution

**Sigmoid:** $\sigma(z)=\dfrac{1}{1+e^{-z}}$, squashes the output to the range $[0,1]$

**Logistic function (**single neuron**):**
![](https://i.imgur.com/kDo4qbE.png)

**Cross-entropy (**log-likelihood**):**
![](https://i.imgur.com/XflHOJZ.png)
- jointly concave in all components $\rightarrow$ gradient-based optimization finds the global optimum

### Feedforward Networks

Also called **multi-layer perceptrons** (`MLP`).
![](https://i.imgur.com/uVrD9af.png)

🎯 Goal: `approximate` some unknown **ideal function** $f^*:\mathcal{X}\rightarrow\mathcal{Y}$
- **ideal classifier:** $y=f^*(\textbf{x})$ always predicts the true outcome

A **feed-forward network** defines a `parametric mapping` $y=f(\textbf{x},\theta)$
- `optimize` the parameters $\theta$ to approximate the ideal $f^*$ from the given data
- Example: _linear model_ $f^{(i)}(\textbf{x};\textbf{w},b)=\textbf{x}^T\textbf{w}+b$
- $f$ is a composition of functions $\rightarrow$ **directed acyclic graph**
  - $f^{(i)}$ is the $i$-th layer
- `Hidden layers:` the output of intermediate layers is not specified

![](https://i.imgur.com/g5pdP4y.png)

👁️‍🗨️ Neural Networks: most optimization objectives become `non-convex` $\implies$ **no convergence guaranteed**.

**Modeling choices:** need to choose
- `Cost function` (**cross-entropy**, squared loss, ...)
- `Form of output` (e.g.: linear model)
- `Activation functions` (ReLU, Tanh, ...)
- `Architecture` (hidden units, number of layers, ...)
- `Optimizer` (gradient descent, ...)

### Cost Functions & Output Units

**Output Units:** their choice is tightly coupled with the choice of the cost function
- **Linear:** $\hat{y} = \textbf{W}^T \textbf{x} + \textbf{b}$
  - often used to produce the mean of a `conditional Gaussian distribution` $p(y|x)=\mathcal{N}(y;\hat{y},I)$
  - **Maximize Log-likelihood** $\implies$ **Minimize Squared Error**
  - 👍 linear units do not saturate
  - 👎 an output gradient close to $0$ is problematic
- **Softmax:** $\text{softmax}(\textbf{z})_i=\dfrac{\text{exp}(z_i)}{\sum_j \text{exp}(z_j)}$
  - can also use $\text{log softmax}(\textbf{z})_i=z_i-\log\sum_j \text{exp}(z_j)$
    - 👍 the $z_i$ term never saturates, easier learning
  - maximizing log-likelihood encourages competition
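A minimal NumPy sketch of softmax and log-softmax as defined above; the max-subtraction trick is the standard way to keep the exponentials from overflowing (the function names and test values are mine):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: shift by max(z) before exponentiating."""
    z = z - np.max(z)          # does not change the result, avoids overflow
    e = np.exp(z)
    return e / e.sum()

def log_softmax(z):
    """log softmax(z)_i = z_i - log(sum_j exp(z_j)), stabilized the same way."""
    z = z - np.max(z)
    return z - np.log(np.sum(np.exp(z)))

print(softmax(np.array([2.0, 1.0, 0.1])))       # sums to 1
print(log_softmax(np.array([1000.0, 1000.0])))  # no overflow
```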
### Hidden Units

Steps:
1. accept input $\textbf{x}$
2. compute the affine transformation $\textbf{z} = \textbf{W}^T \textbf{x} + \textbf{b}$
3. apply an **element-wise non-linear function** $g(z)$

👁️‍🗨️ But what function $g$ to choose?

**Activation Functions (**for hidden units**):** introduce `non-linearity` in the network
- :warning: activation functions **must be non-linear**, otherwise the whole network would remain linear

![](https://i.imgur.com/qKemJnb.png)
- **ReLU:** :heavy_check_mark: Rectified Linear Unit, the `default starting point`
  - many modifications exist
  - 👍 large consistent gradients when active (`no saturation`)
  - 👍 `faster optimization` than sigmoid
  - 👎 non-zero-centered output
  - 👎 `units die` (no updates when inactive)
- **ELU:** Exponential Linear Unit
  - 👍 benefits of ReLU + **does not get killed**
  - 👎 needs to exponentiate
- **Maxout Units:** generalize ReLU but do not fit in the (dot product $\rightarrow$ nonlinearity) mold
- **Sigmoid/logistic Units:** ❌ sensitive only near zero, `easily saturated` across the rest of the domain
- **Tanh Units:** related to Sigmoid, but better
  - 👍 zero-centered outputs $[-1,1]$
  - 👎 also `saturates`
- **SoftPlus:** smooth version of the rectifier
  - 👎 performs worse than rectifiers

### Universal Approximation Theorem

A **2-layer net with linear output** and some `squashing non-linearity in the hidden units` can approximate any continuous function over a compact domain to arbitrary accuracy (**given enough hidden units!**)
- but it does not guarantee that the training algorithm can learn the function

How many hidden units for a shallow network? How deep should the network be?

## Backpropagation ♻

Network optimization, a brief history:
- Idea N.1: `randomly perturb one weight`, if it improves performance, save the change
  - 👎 very inefficient
- Idea N.2: `perturb weights in parallel` and correlate the performance gain with the weight changes
  - 👎 very inefficient & hard to implement
- Idea N.3: `randomly perturb hidden units` $\rightarrow$ fewer activities than weights
  - 👎 still very inefficient
- **Backpropagation idea:** compute `how fast the error changes` with the change of a hidden activity
  - 👍 easy: get error derivatives for hidden units $\rightarrow$ then get error derivatives for the weights incoming to hidden units

[...]
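To make the backpropagation idea concrete, a minimal NumPy sketch of a 2-layer network trained with the chain rule, hand-derived; everything here (sizes, learning rate, the toy task) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy binary task: is the sum of the 3 features positive?
X = rng.normal(size=(64, 3))
y = (X.sum(axis=1) > 0).astype(float).reshape(-1, 1)
W1 = rng.normal(scale=0.5, size=(3, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)

for step in range(200):
    # forward pass
    z1 = X @ W1 + b1
    h1 = np.maximum(z1, 0.0)                 # ReLU hidden units
    z2 = h1 @ W2 + b2
    p = 1.0 / (1.0 + np.exp(-z2))            # sigmoid output
    # backward pass: for sigmoid + cross-entropy, dL/dz2 = p - y
    n = X.shape[0]
    dz2 = (p - y) / n
    dW2 = h1.T @ dz2; db2 = dz2.sum(axis=0)
    dh1 = dz2 @ W2.T
    dz1 = dh1 * (z1 > 0)                     # ReLU gradient mask
    dW1 = X.T @ dz1; db1 = dz1.sum(axis=0)
    # gradient-descent update
    for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        param -= 0.5 * grad
```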
## Regularization 👮‍♂️

**Regularization:** any modification we make to a learning algorithm that is intended to `reduce its generalization error` but not its training error.

![](https://i.imgur.com/wPz9hx4.png)

🎯 Goal: `avoid overfitting`, perform well on unseen inputs (`generalization`).
- Approach 1: **more data**
- Approach 2: a model with the **right capacity** (expressive level)
- Approach 3: **average many models** (e.g.: `bagging`)

**Avoiding overfitting in neural networks:** usually a combination of the following methods
- `Architecture:` limit hidden layers and number of units
- `Early stopping`
- `Weight decay:` penalize large weights using penalties or constraints
- `Noise:` add noise to weights or activations

REGULARIZATION METHODS OVERVIEW:
- **Lp Regularization (**Tikhonov's**):** the most common, introduces a penalty that considers the `norm of the parameters` (weights)
  - $$\widetilde{L}(\theta)=L(\theta)+\lambda\Omega(\theta)$$
  - $\Omega$ is the parameter norm penalty function, **typically penalizes only the weights**
  - $\lambda$ is the penalty factor (tunable parameter): increase $\implies$ ➕ bias, ➖ variance
  - **L1 (**`Lasso regression`**):** very popular, **can drive coefficients to exactly zero**, thus removable from the model
    - **sparser weights** $\rightarrow$ used a lot as a `feature selection` mechanism
    - $$\widetilde{L}(\theta)=\sum\limits_{i=1}^NL(f(x_i,\theta),y_i)+\frac{\lambda}{2}\sum\limits_l||w_l||_1$$
    - the penalty passes inside the gradient-descent update rule
  - **L2 (**`Ridge regression`**):**
    - $$\widetilde{L}(\theta)=\sum\limits_{i=1}^NL(f(x_i,\theta),y_i)+\frac{\lambda}{2}\sum\limits_l||w_l||^2$$
    - also passes inside the gradient-descent update rule

![](https://i.imgur.com/KLnUiFm.png)
- **Data Augmentation:** `data is synthesized` from existing data, with various transformations
  - 👍 classification: easiest way to regularize a model, just train it on more data
  - :heavy_check_mark: very useful for **object recognition**: new transformed data `improves accuracy`
  - 👎 not easily generalized to other problems

![](https://i.imgur.com/vKY5wnM.png)
- **Early Stopping:** stop the training right before the validation loss starts to increase
  - number of training steps as a hyper-parameter
- **Multi-Task Learning:** generalization by pooling examples from several tasks (each task can help the other tasks)
  - `task-specific parameters:` only benefit from the examples of their task
  - `generic parameters:` benefit from the pooled data of all tasks

![](https://i.imgur.com/RDBUTdZ.png)
- **Parameter Sharing:** force certain parameters to be close to one another
  - 👁️‍🗨️ used a lot in `CNNs`: share parameters across multiple image locations because many image properties are **invariant to translation**

![](https://i.imgur.com/HFNoMx8.png)
- **Model Ensembles:** train many models, then make them `vote` on the output for test examples
  - `Bagging & Boosting:` 👎 time and memory expensive

![](https://i.imgur.com/uuOyqNt.png)
- **Dropout:** :heavy_check_mark: reduces network redundancy by breaking the co-adaptation tendency of NNs (note: co-adaptation is not unique to dropout, but this is where I read about it) — a minimal sketch follows this overview
![](https://i.imgur.com/2g7igZs.png)
  - 👁️‍🗨️ `co-adaptation:` when two or more neurons in the network begin to detect the same feature repeatedly
  - works by randomly zeroing some neurons
  - `exponential` number of learned models (ensembles) that share parameters
  - 👍 makes it practical to apply _bagging_ to many large networks
![](https://i.imgur.com/hOSrBHi.png)
- **Batch Normalization:** similar idea to input normalization, but for hidden layers
  - regularizes by adding some noise to hidden-layer activations, like dropout, but less impactful
  - 👍 speeds up training by a lot
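A minimal NumPy sketch of (inverted) dropout as described above; the helper name and the rescale-at-train-time convention are assumptions of this sketch:

```python
import numpy as np

def dropout(h, p_drop, train=True, rng=np.random.default_rng()):
    """Inverted dropout: zero units with probability p_drop, rescale the
    survivors by 1/(1-p_drop) so the expected activation is unchanged."""
    if not train or p_drop == 0.0:
        return h                       # at test time the full network is used
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = np.ones((2, 4))
print(dropout(h, p_drop=0.5))          # roughly half the units zeroed
```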
## Optimization 📉

GRADIENT DESCENT OPTIMIZATION SUMMARY:
![](https://i.imgur.com/cdIoJWm.png)
![](https://i.imgur.com/sZrDcd4.png)
- where $\odot$ is the **Hadamard product** (element-wise matrix multiplication)

![](https://i.imgur.com/ti34I66.gif)
![](https://i.imgur.com/kIxAXm5.gif)

`Local minima:` the optimization problem for neural networks is _non-convex_.

**Saddle Point:** the gradient is $0$ even if not at a minimum, very common in higher dimensions

### Batch Size

![](https://i.imgur.com/ozBdGdj.png)

**BGD (**Batch Gradient Descent**):** averages gradients from many inputs
- 👍 stable gradient estimates
- 👎 one update only after computing gradients on the whole training set

![](https://i.imgur.com/srhVBbE.png)

**SGD (**Stochastic Gradient Descent**):** update after each training input
- faster updates

![](https://i.imgur.com/y7ottEy.png)

**Mini-batch:** in the middle
- 👍 the batch size can be tuned to one's own hardware

### Momentum

![](https://i.imgur.com/pL71cbf.png)

$$\textbf{v}\leftarrow\alpha\textbf{v}-\epsilon\nabla_\theta\big(L(f(\textbf{x}^{(i)};\theta),\textbf{y}^{(i)})\big) \quad \alpha\in[0,1)$$

$$\text{update rule:} \quad \theta\leftarrow\theta+\textbf{v}$$
- $\alpha$ weights the importance of previous gradients for the current update

**Momentum:** `accelerates the descent` for dimensions whose gradients point in the same direction.
- 👍 can help against the weak points of SGD (high curvature, small gradients, noisy gradients, ...)
- 👁️‍🗨️ but can also skip a local minimum (tends to `overshoot`)

**Nesterov Accelerated Gradient (**`NAG`**):** first update the parameters (momentum step), then calculate the gradient
- 👍 helps with over-shooting

$$\textbf{v}\leftarrow\alpha\textbf{v}-\epsilon\nabla_\theta\big(L(f(\textbf{x}^{(i)};\theta+\alpha\textbf{v}),\textbf{y}^{(i)})\big) \quad \alpha\in[0,1)$$

![](https://i.imgur.com/7adiHWv.png)

### Adaptive Learning Rate

The learning rate can be steadily reduced while approaching a local minimum, or adapted based on the gradient history (e.g. `AdaGrad`).
- usable alongside momentum techniques
- **AdaGrad:** downscale each parameter's update by the **square root of the sum of squares of its whole gradient history** ($\textbf{r}$)
  - good when the objective is convex
  - 👎 `too aggressive:` premature decrease of the learning rate

![](https://i.imgur.com/lnXBP6L.png)
- **RMSProp:** AdaGrad modification, accumulates an exponentially decaying average of the squared gradient
  - $$\text{Accumulate:} \quad \textbf{r}\leftarrow\rho\textbf{r}+(1-\rho)\hat{\textbf{g}}\odot\hat{\textbf{g}}$$
  - where $\rho$ is the decay rate
- **ADAM (**ADAptive Moments**):** like **RMSProp + Momentum**, but with `bias correction` (a minimal sketch of the update rules follows)

![](https://i.imgur.com/Lg0gboW.png)
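A minimal NumPy sketch of the momentum and ADAM update rules summarized above (the hyper-parameter values are common defaults, not prescriptions):

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, lr=0.01, alpha=0.9):
    """v <- alpha*v - lr*grad ; theta <- theta + v (heavy-ball momentum)."""
    v[:] = alpha * v - lr * grad
    theta += v

def adam_step(theta, m, r, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """ADAM: decaying averages of the gradient (m) and of the squared
    gradient (r), with bias correction for the zero initialization."""
    m[:] = b1 * m + (1 - b1) * grad
    r[:] = b2 * r + (1 - b2) * grad * grad     # Hadamard product
    m_hat = m / (1 - b1 ** t)                  # bias-corrected moments
    r_hat = r / (1 - b2 ** t)
    theta -= lr * m_hat / (np.sqrt(r_hat) + eps)

# usage on a toy quadratic f(theta) = ||theta||^2 (gradient = 2*theta)
theta = np.array([5.0, -3.0]); m = np.zeros(2); r = np.zeros(2)
for t in range(1, 1001):
    adam_step(theta, m, r, grad=2 * theta, t=t)
```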
### Batch Normalization

A method to `easily re-parameterize` a deep network, proposed to solve the **internal covariate shift problem**
- each layer's input distribution changes as the network parameters are updated $\implies$ layers keep trying to adapt to a new distribution $\implies$ learning slow-down

> Recently, some scholars have argued that batch normalization does not reduce internal covariate shift, but rather smooths the objective function, which in turn improves the performance.
> — Wikipedia

- 👁️‍🗨️ **Moral:** BN works very well, but probably not for the original arguments

![](https://i.imgur.com/wU6GaSd.png)
- 👍 reduces the strong dependence on initialization
- 👍 also allows faster learning rates without divergence
- 👍 also regularizes the model (less need for dropout)

![](https://i.imgur.com/WzifbwM.png)
- **normalize** node activations
- **scale and shift** (bias) to help against the limited expressive power introduced by output standardization
  - also learned with _back-propagation_

NB: the layers are normally inserted after convolutional or fully connected layers and before the non-linearity:
![](https://i.imgur.com/MvaIe5g.png)

## CNN 👓

**CNN (**`Convolutional Neural Networks`**):** feed-forward networks that use _convolution_ instead of general matrix multiplication in `at least one layer`.
- in traditional image categorization, feature extraction and classification are separate and independent; in CNNs they are fused together
- 👁️‍🗨️ supervised learning: need labeled data
- 👍 flexible to many applications (classification, object detection, segmentation, regression, ...)
  - allow inputs of variable size
- a type of `feed-forward` network, but with units having a **spatial arrangement**
- **feature maps:** the result of a `convolution`

![](https://i.imgur.com/m6jq6au.png)
- **convolutional filter:** small set of weights (usually 3x3 or 5x5), applied at each position of a layer
  - $\implies$ **local feature extraction:** a unit in a 2D grid can only receive input from units of the previous layer at a similar location
  - also **parameter sharing:** the `same weights are applied` to the inputs of each unit in a layer
- ⭐ a CNN is trained to learn the weights of those filters, making them highlight the most important features for classification
- **hierarchy of filters** to compose complex features from simpler ones (e.g. pixels to edges to shapes)
- fully connected layers for the **final classification**

### CNN Layer

Convolutions and the related sequence of operations
- easy concatenation: outputs are still images, but many more than the input (a `set of filters` is trained in the convolutional layer)

![](https://i.imgur.com/nQtldzE.png)
![](https://i.imgur.com/hoo0Lk5.gif)

#### Convolution

![](https://i.imgur.com/folnXOt.png)

**Convolution** is an operation $S$ on $2$ functions:
$$S(i,j)=(I*K)(i,j)=\sum\limits_m\sum\limits_nI(m,n)K(i-m,j-n)$$
- $I$ gives the input pixel
- $K$ is the `Kernel` matrix that multiplies the pixels in an area around the given input pixel
- the output of $S$ is a `feature map`

**Convolution output size =** $(N-K)/S+1$
- $N$ is the input size
- $K$ is the kernel size
- $S$ is the `stride` (kernel shifting amount)
  - e.g.: stride $2$ applies the convolution every two pixels

STRIDE = 2:
![](https://i.imgur.com/IQiMCyh.gif)

**Padding:** add $0$s to the borders so that the kernel can be applied to all input pixels
- depends on the kernel size

![](https://i.imgur.com/yLWFwDP.png)

#### Non-linearity

👍 **ReLU:** most used non-linear activation function in CNNs
![](https://i.imgur.com/PhAOSlD.png)

#### Spatial Pooling

Progressively reduce the spatial size of the representation to **control overfitting** and **reduce the amount of computation** (fewer parameters)
![](https://i.imgur.com/l3eFtQI.png)
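A minimal NumPy sketch of the output-size formula and of a naive convolutional layer (square inputs/kernels assumed; the padding term in the formula is the standard extension of the $(N-K)/S+1$ rule above):

```python
import numpy as np

def conv_output_size(n, k, stride=1, pad=0):
    """Output size = (N - K + 2*P) / S + 1 (per spatial dimension)."""
    return (n - k + 2 * pad) // stride + 1

def conv2d(image, kernel, stride=1):
    """Naive 'valid' cross-correlation, as used in CNN layers
    (true convolution would flip the kernel first)."""
    n, k = image.shape[0], kernel.shape[0]
    out = conv_output_size(n, k, stride)
    S = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i*stride:i*stride+k, j*stride:j*stride+k]
            S[i, j] = np.sum(patch * kernel)   # one feature-map value
    return S

img = np.arange(25.0).reshape(5, 5)
edge = np.array([[1., 0., -1.]] * 3)           # crude vertical-edge filter
print(conv2d(img, edge, stride=2).shape)       # (2, 2): (5-3)/2+1 = 2
```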
### Understanding CNNs

**CNNs are very fragile to adversarial attacks**, so we need a way to understand what is happening inside a CNN.
- many visualization attempts

**AlexNet:** born as an image classifier, found use as a general-purpose `feature extractor` in many other problems
- gives 4096-dimensional features
- features from a given layer can be compressed to a lower-dimensional space while preserving pairwise distances (`t-SNE`)

![](https://i.imgur.com/gZ40n3y.png)
- AlexNet's first layers (see $\text{DeCAF}_1$) give confused representations
- AlexNet's last layers (see $\text{DeCAF}_6$) separate classes better than other specialized feature extractors (`LLC`, `GIST`, ...)

**Zeiler and Fergus:** attach a `deconvolutional network` to reconstruct images from the feature maps of a given layer
- 👁️‍🗨️ used only as a probe (no learning)
- their method allowed them to design a better version of AlexNet

![](https://i.imgur.com/sP1xnqz.png)

**Class Model Visualization:** numerically generate an image for a class
- e.g., an optimization problem w.r.t. the image: $\text{argmax}_IS_c(I)-\lambda||I||_2^2$
  - $S_c$ is the score for the class given image $I$ as predicted by the network
  - the last term is a regularizer
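The class-model-visualization objective above can be optimized by plain gradient ascent on the input image. A minimal PyTorch sketch, where the model, the image size and all hyper-parameters are placeholders (the stand-in classifier is untrained, so the output is meaningless; a real run would use a trained network):

```python
import torch
from torch import nn

def class_model_visualization(model, class_idx, steps=200, lr=1.0, lam=0.01):
    """Gradient ascent on the input image I to maximize S_c(I) - lam*||I||^2,
    where model(I) returns unnormalized class scores."""
    img = torch.zeros(1, 3, 224, 224, requires_grad=True)
    for _ in range(steps):
        score = model(img)[0, class_idx]        # S_c(I)
        objective = score - lam * (img ** 2).sum()
        objective.backward()
        with torch.no_grad():
            img += lr * img.grad                # ascend the objective
            img.grad.zero_()
    return img.detach()

# stand-in for a real trained classifier
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10))
preferred = class_model_visualization(model, class_idx=3, steps=20)
```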
**Architecture Trends over the years:**
- fewer parameters, go deeper
- reduce filter sizes

![](https://i.imgur.com/Q2N4Xea.png)

Core Principles for Improved CNNs:
- **identity propagation:** having shorter subpaths through the network seems to make training easier in deeper CNNs
  - `Residuals`, `Dense Blocks`

![](https://i.imgur.com/tC76Jgd.png)
- **feature reuse** is definitely a good idea (`parallel branches`)
- use `1x1 convolutions` to reduce and expand the number of feature maps judiciously (popular since GoogleNet)

### Object Detection

**CNNs are flexible** and can be successfully applied to other tasks with some `architecture re-design`
- classification, object detection, segmentation, structural regression, ...
- modern architectures permit simultaneous **detection** and **instance segmentation**

![](https://i.imgur.com/U92bY4d.png)

Object detection brings new challenges:
- structured output (label and bounding-box coordinates simultaneously for every object)
- data scarcity
- poor localization with deep CNNs because of the large receptive field
  - shallow models are better suited

#### Object Proposal

Can be used as a starting point for CNNs (but it is `sub-optimal`).

**Selective Search:** a bottom-up unsupervised segmentation grouping for `object proposal`
- combination of many grouping criteria (color, texture, brightness, ...)
- `object classification` can be applied to the proposed regions (R-CNN)
- 👍 detects objects at any scale
- must be fast

![](https://i.imgur.com/r0i2FYB.png)

There are many other object proposal algorithms (`Objectness`, ...)

#### R-CNN Family

👁️‍🗨️ Algorithms based on `region proposals` (look only at parts of the image).

![](https://i.imgur.com/W5CMWFZ.png)

**R-CNN (**`Regions with CNN`**):** use an object proposer to obtain many bounding boxes (candidates), and feed them to a CNN classifier
- proposal-method agnostic: can use any proposer, not only selective search
- 👁️‍🗨️ the **dependence on external algorithms** is sub-optimal for CNNs, but already a big step up
- 👎 a region proposal must be adapted (dilated and scaled) in order to fit as CNN input
- 👎 processes all object proposals (~2k) for each image, many of which are useless
- :x: slow even after training, due to the many proposals

![](https://i.imgur.com/PCKhgEP.png)

**Fast R-CNN:** first feed the whole image, then do region proposal on the convolved feature map and feed it to an additional ad-hoc layer
- :heavy_check_mark: convolution only once per image, simultaneously process all the bounding boxes in one network
- 👎 still based on region proposals
- 👍 **RoI (**Region of Interest**) Pooling layer:** pools the input RoI (e.g.: get the max of each sub-section) to a fixed-size output feature map (see the sketch after this section)
- uses a **multi-task loss:** optimization is a trade-off between the classification loss and the localization loss of the bounding box

![](https://i.imgur.com/ixKW2ds.png)

**Faster R-CNN:** feed the whole input image to the CNN, a separate network learns region proposals from the feature maps
- 👍 no external object-proposal algorithm, a **separate network is used to predict the region proposals**
- the RoI pooling layer takes the self-learned region proposals

![](https://i.imgur.com/9vzP8zH.png)

**Mask R-CNN:** extends `Faster R-CNN` for pixel-level **instance segmentation**
- an additional network branch predicts segmentation masks from each RoI
- 👁️‍🗨️ can't use RoI Pooling because pixel-level localization is wanted, **RoI Align** is used instead

![](https://i.imgur.com/gIFDtF4.png)
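A minimal NumPy sketch of the RoI max-pooling idea used by Fast R-CNN (the box format, grid size and toy shapes are assumptions of this sketch):

```python
import numpy as np

def roi_max_pool(fmap, roi, out_h=2, out_w=2):
    """Pool an RoI of a feature map (C, H, W) to a fixed (C, out_h, out_w)
    grid by taking the max of each sub-section; roi = (y1, x1, y2, x2)."""
    y1, x1, y2, x2 = roi
    region = fmap[:, y1:y2, x1:x2]
    C, H, W = region.shape
    ys = np.linspace(0, H, out_h + 1, dtype=int)   # bin edges along height
    xs = np.linspace(0, W, out_w + 1, dtype=int)   # bin edges along width
    out = np.zeros((C, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[:, i, j] = region[:, ys[i]:ys[i+1], xs[j]:xs[j+1]].max(axis=(1, 2))
    return out

fmap = np.random.rand(64, 32, 32)                  # toy conv feature map
print(roi_max_pool(fmap, (4, 6, 19, 27)).shape)    # (64, 2, 2), fixed size
```

Whatever the size of the proposed region, the output grid is fixed, which is what lets the following fully connected layers accept any proposal.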
#### YOLO

![](https://i.imgur.com/b5AFX8K.gif)

**You Only Look Once:** a `single convolutional network` predicts the bounding boxes and the class probabilities for those boxes.
- 👍 not region-proposal based, uses the whole image (fewer false positives in the background)
- :heavy_check_mark: real-time object detection
- 👎 struggles with small objects
- subdivide the image into a grid $\rightarrow$ assign class probabilities to each cell and predict bounding boxes $\rightarrow$ combine the results

![](https://i.imgur.com/bGP9q7g.png)

### Semantic Segmentation

Assign one label to `each pixel` of the image.
![](https://i.imgur.com/ryTKXZV.gif)
- popular for `autonomous driving` and `medical imaging`
- another structured-output prediction problem
- 👁️‍🗨️ `end-to-end learning` (no auxiliary steps of a traditional pipeline) is crucial for good performance

💡 **Idea 1:** switch from a typical CNN to a `fully convolutional CNN` (no fully connected layers)
- the output is not a class vector, but a feature map giving per-pixel class probabilities
- :heavy_check_mark: fast without fully connected layers, convolutions are parallelizable

💡 **Idea 2:** append `up-sampling layers` that learn from ground-truth masks
- 👁️‍🗨️ up-sampling is a form of `deconvolution`
- output at the same resolution as the input image

![](https://i.imgur.com/eQwsxMz.png)
- :heavy_check_mark: can use any CNN as a back-bone

## Sequence Models

### RNN 🧶

![](https://i.imgur.com/y9LK3b0.gif)

**RNN (**`Recurrent Neural Network`**):** networks that exhibit **temporal dynamic behavior**.
- developed to deal with variable-size data
- designed to deal with input spaces that have temporal structure (`sequences`)
  - e.g.: **NLP (**`Natural Language Processing`**)**, speech processing (signal in time), ...
- 👁️‍🗨️ `attention` mechanisms are very popular with sequences too (improve performance, counter the vanishing gradient, ...)
- :warning: **Weakness:** in practice RNNs are not very good at learning dependencies distant in time (the longer the chain, the more probable it is to lose the information)

RECURRENT GRAPH and UNROLLED GRAPH for RNN representation:
![](https://i.imgur.com/q733VMC.png)

$$\text{Recurrence:} \quad h_t=f(\textbf{X}_t,h_{t-1})$$
- $h_t$ is the hidden representation at time $t$
- $\textbf{X}_t$ is an $m \times n_{inputs}$ input matrix at time $t$
  - $m$ is the mini-batch size
  - $n_{inputs}$ is the number of input features
- $h_{t-1}$ is the old state, an $m \times n_{neurons}$ matrix containing the hidden state of the previous time step for all instances
- weight matrices: $W_x, W_h, W_y$, also called $U,W,V$ or $W_{xh},W_{hh},W_{yh}$
  - 👁️‍🗨️ they are `shared across time` in order to reduce the number of parameters
  - weights are matrices because we can use `mini-batches` of mono-dimensional data and parallelize a bit
  - $W_x$ is an $n_{inputs} \times n_{neurons}$ matrix containing the connection weights between the `input and the hidden layer`
  - $W_h$ is an $n_{neurons} \times n_{neurons}$ matrix containing the connection weights between `consecutive hidden states`
  - $W_y$ is an $n_{neurons} \times n_{outputs}$ matrix containing the connection weights between the `hidden layer and the output`

**Variable-length** input/output: can `mask` unneeded outputs or do an `average pooling` of them all
- Masking example:
![](https://i.imgur.com/DWdkWHA.png)
- `N->1:` document analysis (sequence classification), sentiment analysis, ...
- `1->N:` image captioning (feed CNN output as input), ...
- `N->N:` machine translation, video frame prediction, ...

MOST POPULAR **RNN CELLS**:
![](https://i.imgur.com/tg4MSKL.jpg)
- 👁️‍🗨️ the yellow elements are neural network layers

**Multi-layer RNNs:** can go deep (many hidden layers) by re-applying RNN cells
![](https://i.imgur.com/GlmC8EI.png)

**Bi-directional RNNs:** some layers can process the input sequence backward
- popular in speech recognition

![](https://i.imgur.com/gAKWdlW.png)

#### Vanilla RNN

![](https://i.imgur.com/ssEpxFQ.png)

$$h_t=\tanh(W_xx_t+W_hh_{t-1})$$

**BPTT (**`Back-propagation Through Time`**):** follows the idea of traditional backprop
- treat the unfolded network as one big feed-forward network
- 👎 **vanishing gradient problem**, since BPTT multiplies gradients over all previous states

![](https://i.imgur.com/9klfbyl.jpg)

**Truncated BPTT:** split big sequences into many separate small ones
- $\forall$ sequence: unroll the RNN to the right size, back-propagate, roll up and update the weights, repeat
- 👍 reduces computational complexity
- 👎 loses temporal dependencies longer than the split size
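A minimal NumPy sketch of the recurrence above, with the weight shapes just listed (toy sizes and random data; a bias term is added as a common convention):

```python
import numpy as np

def rnn_step(X_t, h_prev, W_x, W_h, b):
    """One vanilla-RNN time step: h_t = tanh(X_t W_x + h_{t-1} W_h + b).
    X_t is (m, n_inputs), h_prev is (m, n_neurons); the same weights are
    shared across all time steps."""
    return np.tanh(X_t @ W_x + h_prev @ W_h + b)

m, n_in, n_neu = 4, 3, 5                      # mini-batch of 4 sequences
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.1, size=(n_in, n_neu))
W_h = rng.normal(scale=0.1, size=(n_neu, n_neu))
b = np.zeros(n_neu)
h = np.zeros((m, n_neu))                      # initial hidden state
for X_t in rng.normal(size=(10, m, n_in)):    # unroll over 10 time steps
    h = rnn_step(X_t, h, W_x, W_h, b)
```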
#### LSTM

**Long Short-Term Memory:** improvement of the vanilla RNN, aims to manage the vanishing gradient problem.
- vanishing gradient solution: add a **memory cell** $C$ not subject to matrix multiplication (a neuron with a self-recurrent connection)
  - the gradient flow from $C_t$ to $C_{t-1}$ only involves back-propagation through addition and point-wise multiplication

![](https://i.imgur.com/7AvAy5h.png)

👁️‍🗨️ $\sigma$ indicates a Sigmoid activation function: its $[0,1]$ output makes it a natural choice for implementing a gate
- `forget gate:` $f_t$ selectively forgets parts of the memory cell
- `input gate:` $i_t$ selectively chooses parts of the candidate for the cell update
- `input gate 2:` $\widetilde{C}_t$ or $g_t$ is the vanilla-RNN part (the candidate values)
- `output gate:` $o_t$ modulates how much information from $C_t$ should pass on

ANOTHER VIEW:
![](https://i.imgur.com/f4pd4s4.png)
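A minimal NumPy sketch of one LSTM step implementing the four gates above (packing all gates into a single weight matrix is a common convention, not something from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step; W is (n_in + n_neu, 4*n_neu): the four gate blocks
    (forget f, input i, candidate g, output o) computed in one matmul."""
    n = h_prev.shape[1]
    z = np.concatenate([x_t, h_prev], axis=1) @ W + b
    f = sigmoid(z[:, :n])            # forget gate: what to drop from C
    i = sigmoid(z[:, n:2*n])         # input gate: what to add to C
    g = np.tanh(z[:, 2*n:3*n])       # candidate values (the "vanilla" part)
    o = sigmoid(z[:, 3*n:])          # output gate: what to expose as h
    C = f * C_prev + i * g           # memory cell: only + and *, gradient-friendly
    h = o * np.tanh(C)
    return h, C

rng = np.random.default_rng(0)
m, n_in, n_neu = 2, 3, 4
W = rng.normal(scale=0.1, size=(n_in + n_neu, 4 * n_neu)); b = np.zeros(4 * n_neu)
h = np.zeros((m, n_neu)); C = np.zeros((m, n_neu))
h, C = lstm_step(rng.normal(size=(m, n_in)), h, C, W, b)
```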
#### GRU

**Gated Recurrent Unit:** variation of the LSTM, uses fewer gates.
- combines the forget and input gates into a single "update gate"
- uses only the hidden state to transfer information

![](https://i.imgur.com/GVV0aWe.png)
- `reset gate:` $r_t$ decides how much of the past information must be forgotten
- `update gate:` $z_t$ decides how much the past information should influence the future neurons

### Sequence Models With Attention

**Attention:** general-purpose mechanism to make the network `dynamically focus` on the relevant information when predicting a specific output.
- popular for image captioning and neural machine translation

![](https://i.imgur.com/jhQGeVm.gif)

**Beam search:** like `sampling` (given an image, feed the predicted word as input to the next state until a certain phrase length or STOP is encountered), but **keeps the $k$ top-scoring candidates**.

**BLEU (**`Bi-Lingual Evaluation Understudy`**):** algorithm for evaluating the quality of machine-translated text.
- very popular metric to measure the quality of captions, extended to other tasks
- considers `n-grams` of size 1-4

#### Image Captioning

![](https://i.imgur.com/Xb82Q4x.png)
- the RNN predicts both a `word` and an `attention map` $\alpha$
- the attention map represents a _probability distribution_ (the pixel weights must sum to $1$)

**Context vector:**
- **Soft attention:** $z_t=\sum_i\alpha_{i,t}a_i$, each element is a `weighted average` (by $\alpha$) of the pixels of a given feature map
  - trained with gradient descent, most used
- **Hard attention:** `sample one pixel` from each feature map using the distribution given by the attention map
  - cannot use gradient descent, needs reinforcement learning

![](https://i.imgur.com/OjtwZOo.png)

#### Neural Machine Translation

**Seq2Seq:** use LSTM cells in a many-to-many RNN model.
- established the power of neural machine translation over traditional statistical approaches
- 👍 can produce accurate translations even for long sequences
- 👎 but the fixed-size vector of Seq2Seq for decoding is a `bottleneck` for long sequences
- encoder and decoder RNNs are `jointly trained`
- there is also a GRU version, more lightweight
- two deep LSTM networks:
  1. `encoder:` maps the input to a **context vector** (of `fixed size`)
  2. `decoder:` has a hard time translating long sentences from the compressed vector

![](https://i.imgur.com/bk6g6KW.png)

**Translation with Global Attention:** use attention to learn word alignment
- 💡 idea: translation is an `alignment problem`

![](https://i.imgur.com/TsQqSQN.png)
- have a context vector for each moment $t$
- the context vector pays more attention to specific input words when decoding specific output words in the target language
- $e_{tj}$ measures the likelihood that a certain input word matches a certain output word

![](https://i.imgur.com/savaTsM.png)

**Local Attention:** analogous to hard attention, but differentiable
- **aligned position** $p_t= L_s\cdot \sigma\big(v_p^T \tanh(W_ps_t)\big)$
  - $L_s$ is the length of the input sequence
- adds an exponential term to the compatibility function
  - allows considering only a `small window` around the predicted aligned position
- the context vector is the same

![](https://i.imgur.com/Odg5ay3.png)

##### Google Neural Machine Translation
- 👍 parallelizes on `multiple GPUs`
- 👍 `bi-directional LSTMs` only for the first layer (to save on computation time)
- 👍 `residual connections` across layers solve the gradient explosion / vanishing problems thanks to the "shortcut connection" between input and output

![](https://i.imgur.com/CMXhu4i.png)

#### Other Uses

**Residual attention networks:** the attention mask can serve as a `feature selector` during forward inference
- it also acts as a gradient-update filter during back-propagation

![](https://i.imgur.com/HQYd2fJ.png)

**Monocular depth estimation:** good depth maps from a combination of features at different scales
![](https://i.imgur.com/UDskc0l.png)

### Convolutional Sequence Models

🎯 **Goal:** get rid of RNNs to `improve performance`.

💡 Idea: use a **1D CNN**:
- 👍 `hierarchical processing`: bottom-up vs left-right
- 👍 `homogeneous:` elements are processed in the same way (order does not count)
- $\implies$ 👍 `scalable computation:` parallelize on GPUs
- exploits local dependencies (within the kernel window)

![](https://i.imgur.com/ReFMhfk.gif)
- 👁️‍🗨️ use of **dilated convolutions**

#### CNN Based Seq2Seq

:heavy_check_mark: Better performance than using RNNs (LSTM, GRU, ...)

![](https://i.imgur.com/DrWjBvk.png)
- the bi-directional LSTM encoder is replaced by a `CNN encoder` (**attention stack** + **context stack**)
- the LSTM decoder is replaced by a `CNN decoder` (bottom left)
- all CNNs are 15 layers deep

### TNN

**Transformer Neural Network:** new architecture based on a **self-attention mechanism**
- for sequence **transduction** (neural machine translation, speech recognition, text-to-speech, ...)
- no recurrence and no convolution: uses only attention and `fully-connected layers`
- 👍👍 better than RNNs and CNNs: able to capture **longer-term dependencies**
  - more parallelizable
  - faster training
  - better accuracy
- 📈 gaining popularity: used by OpenAI and by DeepMind in `AlphaStar` (Starcraft)

![](https://i.imgur.com/STowI97.png)
- ENCODER: takes the whole sequence as input and outputs an encoded sequence of the same length
- DECODER: predicts one word at a time, conditioned on the encoder's output and the previously predicted words

**Self-attention layer:** helps the encoder/decoder capture the context of relevant words as it encodes a specific word
![](https://i.imgur.com/fAa1ffk.gif)
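A minimal NumPy sketch of single-head scaled dot-product self-attention, the core operation of the Transformer; the projection matrices and toy sizes are illustrative assumptions:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X (T, d_model):
    each position attends to every other position in one shot."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (T, T) compatibilities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # context vectors

rng = np.random.default_rng(0)
T, d = 6, 8                                         # toy sequence length/width
X = rng.normal(size=(T, d))
out = self_attention(X, *(rng.normal(scale=0.3, size=(d, d)) for _ in range(3)))
print(out.shape)                                    # (6, 8)
```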
## Unsupervised Learning 🔮

![](https://i.imgur.com/ESLUDN7.png)

### Non-probabilistic Models

#### Autoencoders

**Autoencoder:** network that learns to encode its input and reconstruct the original data from the encoding

🎯 Goal: **compression**, automatically encode data in order to keep only the most meaningful information.
- non-probabilistic unsupervised model
- can be interpreted as a non-linear extension of PCA
- **symmetrical** structure (the weights don't have to be tied)

![](https://i.imgur.com/aUlW4UY.png)
- `latent layer:` at the center, where the information is most compressed (between the encoding and decoding phases)
  - must be smaller than the input in order to achieve `dimensionality reduction`
- `loss metric:` measures the discrepancy between input and output (ideally they should be the same)
- 💭 memo: no use of labels

👁️‍🗨️ Autoencoders can be used to `initialize a supervised model`: after training, just throw away the decoder and predict labels from the learned features (encoder fine-tuning)
- 👍 very useful for semi-supervised learning (few data with labels)

**Denoising Autoencoders:** a smaller latent layer can be used to force the autoencoder to learn an intelligent representation of the data, but `adding noise` to the input also improves robustness.

![](https://i.imgur.com/svuQfUV.png)
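A minimal PyTorch sketch of a (denoising) autoencoder as described above; all layer sizes, the noise level and the stand-in data are illustrative assumptions:

```python
import torch
from torch import nn

# fully-connected autoencoder for flattened 28x28 inputs
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(),
                        nn.Linear(128, 32))             # latent layer: 32 < 784
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(),
                        nn.Linear(128, 784), nn.Sigmoid())
model = nn.Sequential(encoder, decoder)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                                  # input-vs-output discrepancy

x = torch.rand(16, 784)                                 # stand-in mini-batch
for _ in range(10):
    x_noisy = x + 0.1 * torch.randn_like(x)             # denoising variant:
    recon = model(x_noisy)                              # corrupt the input,
    loss = loss_fn(recon, x)                            # reconstruct the clean target
    opt.zero_grad(); loss.backward(); opt.step()
# after training, `encoder` alone can initialize a supervised model
```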
### Generative Models

Also called **Probabilistic Models:** given training data (with unknown probability distribution), learn to generate new samples from a similar distribution.
- **Explicit density estimation:** learn a distribution function $p_{\text{model}}(x)$ and directly use it
  - `PixelRNN/CNN` (tractable density), `Variational Autoencoder` (approximate density), ...
- **Implicit density estimation:** learn a model that can sample from $p_{\text{model}}(x)$ without defining it
  - `GAN`, ...

Some use cases
- images: realistic faces, sketch-to-life, super-resolution, colorization, ...
- planning: time-series generation
- inference of `latent representations` (associable to attributes, e.g.: female face, smiling, ...)

CONTEMPLATED GENERATIVE MODELS:
![](https://i.imgur.com/nBGBNjt.png)

#### PixelRNN & PixelCNN

They are **autoregressive models:** explicit density models that turn the modeling problem into a `sequence problem`:
$$p(\textbf{x})=\prod\limits_{i=1}^np(x_i|x_1,...,x_{i-1})$$
- use the `chain rule` to decompose the likelihood of image $\textbf{x}$ into a product of conditional pixel distributions
- $\implies$ **sequentially predict pixels** instead of predicting the image at once (like GAN, VAE, ...)
- **learning problem:** `maximize the likelihood` of the training data
  - can be used as a performance metric
- ❕ likelihood-based models do not suffer from the problems seen in GANs (mode collapse, lack of diversity, ...)
- neural networks can express the complex distributions over pixel values
- 👍 good for **image completion**
- 👎 do not work well on bigger images (tricks exist)

![](https://i.imgur.com/tIr6m2o.png)
- PixelCNN is much faster to train than the Diagonal Bi-LSTM (convolutions are parallelizable during training)
- 👎 **generation is sequential** in any case: slow and non-parallelizable
- can explicitly compute the likelihood $p(\textbf{x})$

`PixelCNN++:` some improvements...

#### VAE

**VAE (**`Variational Autoencoder`**):** **probabilistic** autoencoders
- 👍 allows generating new similar data
- 👎 but `blurry`
- uses an **explicit density model** to learn the parameters of the probability distribution given the input samples
- relies on Bayesian mathematics
- **explicit regularization:** solves the overfitting problems of traditional Autoencoders

![](https://i.imgur.com/YYe67Fa.png)

... some math ... tells how to deal with the **intractable** starting problem ...
- :heavy_check_mark: **attribute-driven:** its biggest strength; the latent factors $z$ represent some features (e.g.: degree of smile, face rotation, ...)
  - can be used to generate the desired output (e.g.: get attributes from text)

#### GAN

**GAN (**`Generative Adversarial Networks`**):**
- implicit density model, just focuses on the model's ability to sample
- 👍 sharp results
- 👎 tricky, slow training (see the training part below)
- 👁️‍🗨️ born as a `fully-connected` architecture, many modifications followed

![](https://i.imgur.com/nv7xWDN.png)
- **generator network:** learns to generate items similar to the input data (e.g. images) from random noise
- **discriminator network:** learns to distinguish between real items and generated ones (likelihood score in the range $[0,1]$)

The two networks are `jointly trained` (**adversarial game**):
$$\text{min}_{\theta_g}\text{max}_{\theta_d}\Big[\mathbb{E}_{x\sim p_{\text{data}}}\text{log}D_{\theta_d}(x)+\mathbb{E}_{z\sim p(z)}\text{log}\Big(1-D_{\theta_d}\big(G_{\theta_g}(z)\big)\Big)\Big]$$
- $D_{\theta_d}$ is the discriminator network parametrized by $\theta_d$
  - `gradient ascent:` wants to maximize the objective (correctly guess real data + discriminate fake data)
- $G_{\theta_g}$ is the generator network parametrized by $\theta_g$
  - takes the noise $z$ as input
  - wants to minimize the objective by fooling the discriminator, $D(G(z))\rightarrow1$
  - :warning: `gradient descent` on $\text{min}_{\theta_g}\mathbb{E}_{z\sim p(z)}\text{log}(1-D_{\theta_d}(G_{\theta_g}(z)))$ does not work well (🟦 line, the gradient is low when the sample is likely fake)
  - :heavy_check_mark: `gradient ascent` on $\text{max}_{\theta_g}\mathbb{E}_{z\sim p(z)}\text{log}(D_{\theta_d}(G_{\theta_g}(z)))$ is much more robust (🟩 line, the gradient is low when the sample is already good)

![image-20200823175942430](Deep Learning.assets/image-20200823175942430.png)

**Alternated training**, but the discriminator updates more often:
![](https://i.imgur.com/9SwGv9Z.png)
- after training, only the generator is kept (obviously)

:warning: The traditional GAN optimization problem is `unstable`, whether using gradient ascent or descent for the generator.
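A minimal PyTorch sketch of the alternated training above, using the more robust non-saturating generator objective; the tiny fully-connected networks, the stand-in data and all hyper-parameters are placeholders:

```python
import torch
from torch import nn

# placeholder generator/discriminator; real architectures are task-specific
G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(64, 2) + 3.0                 # stand-in "real" data
for _ in range(100):
    # --- discriminator step: maximize log D(x) + log(1 - D(G(z)))
    z = torch.randn(64, 16)
    fake = G(z).detach()                        # don't backprop into G here
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # --- generator step, non-saturating trick: maximize log D(G(z))
    # (instead of minimizing log(1 - D(G(z))), whose gradient vanishes
    #  when the sample is confidently rejected)
    z = torch.randn(64, 16)
    loss_g = bce(D(G(z)), torch.ones(64, 1))    # pretend the fakes are real
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```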
**Evaluating GAN performance:**
- can ask `real people` to judge the samples, but this is not accurate enough
- as an implicit model, a GAN cannot directly compute sample likelihoods
- can use the **Inception Score:** a good generator should be able to output a `variety` of recognizable classes
  - 👎 cannot detect overfitting
- can use the **Fréchet Inception Distance:** fit simple distributions to statistics of feature activations for real and generated data; estimate the divergence parametrically
  - better than the Inception Score
  - 👎 still cannot detect overfitting

**GAN Training**
- ISSUES:
  - `non-convergence` and `instability`
  - the `loss function` does not correlate with sample quality
  - behavior very sensitive to `hyper-parameter selection`
  - ❌ the base GAN suffers from **mode collapse:** the generator ends up modeling only a small subset of the training data
    - `cat and mouse game:` the class preference changes over time because the networks counteract each other
    - $\implies$ the generator does not converge $\implies$ no model convergence
    - but image quality often improves...

![](https://i.imgur.com/n7vGcJv.png)
- TRICKS: still in beta, under research
  - **historical averaging:** penalizes parameter divergence from recent values
  - **feature matching:** append to the _discriminator_ loss another element for feature matching
    - 👍 counters mode collapse: the `changed loss function` expands the goal from beating the opponent to matching features in the real input data
  - **virtual batch normalization:** the bias created by standard batch normalization overwhelms the randomness; use a reference batch to combine with the current one
    - 👍 alleviates the dependence among samples
    - 👎 heavy computation: feed-forward for both the reference batch and the current one
  - **mini-batch discrimination:** compute the similarity of the image $x$ with the images in the same batch, and append the similarity $o(x)$ to one of the dense layers in the discriminator
    - 👍 penalizes mode collapse
    - 👍 faster training than feature matching?
  - **label smoothing:** switch from boolean outputs to the real space and penalize the discriminator if too confident (e.g.: $D(\text{real image})>0.9$) — a minimal snippet follows this list
    - Why? In GANs `overconfidence hurts badly`: a confident discriminator that uses few features to decide can have those features heavily exploited by the greedy generator, without long-term benefit.
    - 👍 improves training stability
  - use **label information** as part of the latent space (`CGAN`)
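A minimal PyTorch snippet of one-sided label smoothing for the discriminator; the 0.9 target follows the example above, while the batch size and the stand-in outputs are arbitrary:

```python
import torch
from torch import nn

bce = nn.BCELoss()
d_real = torch.rand(64, 1)                 # stand-in discriminator outputs on real data
# one-sided label smoothing: real targets are 0.9 instead of 1.0, so the
# discriminator is penalized for extreme confidence; fake targets stay 0
loss_real = bce(d_real, torch.full((64, 1), 0.9))
```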
- ๐Ÿ‘ improves training stability - use **label informations** as part of the latent space (`CGAN`) Comparison of many GAN implementations, with cost functions: [link](https://github.com/hwalsuklee/tensorflow-generative-model-collections) ##### DCGAN **Deep Convolutional GAN:** extension of GAN that uses CNNs both for generator and discriminator - ๐Ÿ‘ enabled **generation of larger resolution images** - ๐Ÿ‘ use of `batch normalization` for helping training - up-sampling with transposed convolution (deconvolution) - discriminator architecture: - no pooling, input reduction only by `strided convolutions` - `Leaky ReLu` (because sparse gradients of base ReLu cause training problems in GAN) - just `one FC layer` at the end, before the `softmax output` ![](https://i.imgur.com/ScRfOk9.png) ๐Ÿ‘โ€๐Ÿ—จ Base DCGAN suffers from **repetition artifacts** (from large scale to checkerboard pattern) - deconvolution is the culprit: need adequate ratio between stride and kernel size - can adopt other techniques ![](https://i.imgur.com/yZsjrGy.png) ![](https://i.imgur.com/15aEmeP.png) **Disentanglement** between attributes of an object: can do arithmetics between latent representations in order to have the desired combo of features - (e.g.: smiling woman - woman + man = smiling man) ##### CGAN **Conditional GAN:** extension of GAN, augment the latent space by appending label information - for both, generator and discriminator - uses loss of original GAN ##### InfoGAN **Information Maximizing GAN:** extension of GAN, label not given, discovers meaningful latent factors in unsupervised way - IDEA: given a latent factor $c$, the input noise is used to generate variations of the same class ![](https://i.imgur.com/6cSXGyC.png) - modified loss function: append a weighted `mutual information` measure ##### Wasserstein GAN Based on the ides that discriminator is easier to train than generator? - uses a completely different loss function, avoids vanishing gradient problem - ๐Ÿ‘`improved training:` loss closely related to quality of output Bla bla bla... ##### LSGAN **Least Square GAN:** another attempt to improve the loss, for both discriminator and generator. - ๐Ÿ‘ better images than DCGAN - ๐Ÿ‘ more robust to mode collapsing ![](https://i.imgur.com/XaadD6R.png) ##### Recent Applications **Pix2Pix GAN:** `Image2Image translation`, learn correspondence between images of different domains - based on CGAN ![](https://i.imgur.com/vqsPG4B.png) - the generator learns to transform input images as wanted ![](https://i.imgur.com/0MfD7eo.jpg) **CycleGAN:** learn cross domain transfer `without pairing` training data - ๐Ÿ‘ improvement over `Pix2Pix` - can generate images in either direction - tries to separate the style from the content ![](https://i.imgur.com/edrAoGV.png) - an `additional generator` learns to revert the domain transfer - ๐Ÿ‘โ€๐Ÿ—จ enforces a style transfer rather than a random generation in the other domain - one discriminator for each domain **Caption2Image:** can embed text as label input, generate images starting from random noise. **SAGAN (**`Self-Attention GAN`**):** details can be generated using cues from all feature locations. - ๐Ÿ‘ better generation for high resolution images **SRGAN (**`Super-Resolution GAN`**)** ![](https://i.imgur.com/xVMjbtN.png) **Progressive GAN** and **BigGAN:** ultra-high resolution output [FULL GAN LIST](https://github.com/hindupuravinash/the-gan-zoo) (VAE-GAN, ....) 
## Reinforcement Learning 💪

An **agent** is rewarded for a `sequence of actions` within an **environment**.

![](https://i.imgur.com/LGJpsT7.png)
- 🎯 Goal: maximize the expected **cumulative reward** by learning the best `policy` $\pi^*$
- **policy:** a mapping function $\pi(s)$ that gives the optimal action in each state
- 👁️‍🗨️ the agent's `experience` improves the policy

![](https://i.imgur.com/TVTkbgK.png)
- `sparse rewards` are common

We are interested in a `stochastic transition model:` the **Markov Decision Process** to learn is represented as a matrix, and `dynamic programming` is used.
- an MDP is defined by the tuple $(\mathcal{S},\mathcal{A},\mathcal{R},\mathbb{P},\gamma)$
  - $\mathcal{S}$ set of possible states
  - $\mathcal{A}$ set of possible actions
  - $\mathcal{R}$ distribution of the reward given $(s,a)$
  - $\mathbb{P}$ transition probability model $P(s'|s,a)$
  - $0<\gamma\le1$ discount factor; values close to $1$ promote long-term rewards
- policy $\pi:\mathcal{S}\rightarrow\mathcal{A}$
- `Markov assumption:` in $P(s'|s,a)$ only the last state counts, not the whole history

**State value function:** the `expected cumulative reward` of following policy $\pi$ from state $s$.
$$V^{\pi}(s)=\mathbb{E}\Big[\sum\limits_{t\ge 0}\gamma^tr(s_t)|\pi,s_0=s\Big]$$
- `optimal value:` $V^*(s)=\text{max}_{\pi}V^{\pi}(s)$ is the cumulative reward achievable by following the best policy
  - measures state goodness $\rightarrow$ used to find a policy
- `optimal policy:` $\pi^*(s)=\text{argmax}_a\sum\limits_{s'}P(s'|s,a)\cdot V^*(s')$
  - 👎 needs to know the transition model $P$, not convenient

**Bellman equation for V:** recursive relationship between the optimal values of successive states
$$V^*(s)=r(s)+\gamma\text{ max}_{a}\sum\limits_{s'}P(s'|s,a)\cdot V^*(s')$$
- computed using the optimal policy

**Q-value function:** measures the expected cumulative reward from a starting state and a starting action
$$Q^{\pi}(s,a)=\mathbb{E}\Big[\sum\limits_{t\ge 0}\gamma^tr(s_t)|\pi,s_0=s,a_0=a\Big]$$
- `optimal value:` $Q^*(s,a)=\text{max}_{\pi}Q^{\pi}(s,a)$
  - measures state-action `goodness`
- `optimal policy:` $\pi^*(s)=\text{argmax}_aQ^*(s,a)$

**Bellman equation for the Q-value:**
$$Q^*(s,a)=r(s)+\gamma\sum\limits_{s'}P(s'|s,a)\cdot\text{max}_{a'}Q^*(s',a')\\= \mathbb{E}_{s'\sim P(\cdot|s,a)}\Big[r(s)+\gamma\text{ max}_{a'}Q^*(s',a')|s,a\Big]$$

![](https://i.imgur.com/DUyZQN9.png)

👁️‍🗨️ Q-value recap: given a state and an action, return the maximum expected future reward, **representable as a matrix**.

### Q-Learning

The point is that the reward matrix $R$ is available only to the environment; the agent must learn $Q$ by taking actions:
- $Q$ will get close to $R$ over time
- the policy $\pi$ can be obtained from $Q$
- 👎 not practical with big state spaces

**Algorithm** (a runnable sketch follows this section):
1. initialize $Q$ with zeros
2. select a random initial state $s_0$
3. $\forall$ `episode` (set of actions that starts in an initial state and ends in the goal state):
   - while $s\ne s_{\text{goal}}$:
     - select a random action $a$ for the current $s$ and consider going to the next state $s'$
     - update $Q(s,a)=R(s,a)+\gamma\text{ max}_{a'}\Big[Q(s',a')\Big]$

![](https://i.imgur.com/fAT4cfQ.png)

❗ Once $Q$ has converged, following the highest-scoring choices gives an **optimal policy** $\pi$ ❗
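A runnable NumPy sketch of the tabular algorithm above; the deterministic toy environment (taking action $a$ moves the agent to state $a$, `np.nan` marks invalid actions) is an assumption of this sketch, not part of the notes:

```python
import numpy as np

def q_learning(R, gamma=0.8, episodes=500, rng=np.random.default_rng(0)):
    """Tabular Q-learning on a reward matrix R (n_states x n_actions)."""
    n_states, n_actions = R.shape
    Q = np.zeros_like(R)                         # 1. initialize Q with zeros
    goal = n_states - 1
    for _ in range(episodes):
        s = rng.integers(n_states)               # 2. random initial state
        while s != goal:                         # 3. play one episode
            valid = np.flatnonzero(~np.isnan(R[s]))
            a = rng.choice(valid)                # random action for current s
            s_next = a                           # toy deterministic transition
            Q[s, a] = R[s, a] + gamma * Q[s_next].max()   # update rule
            s = s_next
    return Q

# toy 3-state chain: from state s you may stay or step to s+1;
# reaching state 2 (the goal) pays 100
R = np.array([[0.0,    0.0,    np.nan],
              [np.nan, 0.0,    100.0],
              [np.nan, np.nan, 0.0]])
Q = q_learning(R)
policy = np.nanargmax(np.where(np.isnan(R), np.nan, Q), axis=1)  # greedy policy
```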
### Value-Based Methods

The agent wants to optimize a value function.
- 👎 in large action spaces the Q-value function can get very complicated

#### DQN

**Deep Q-Learning:** state spaces can be huge, so **approximate** the Q-values with a parametric function: $Q^*(s,a)\approx Q_w(s,a)$

![](https://i.imgur.com/qALnD4t.png)

`Output:` the Q-value of all possible actions from the given state (corresponds to a row of the $Q$ matrix).

**Loss function:** changes at each iteration since the target function $y$ is not static
$$L_i(w_i)=\mathbb{E}_{s,a\sim \rho}\Big[\big(y_i(s,a)-Q_{w_i}(s,a)\big)^2\Big]$$
- $s,a$ are sampled from a `behavior distribution` $\rho$
- $y$ is a Bellman equation, but with a $w$-parametrized Q-value
- 👁️‍🗨️ it is possible to do a `gradient update`

**Training is unstable:**
- non-stationary target function
- experiences are correlated and dependent on the policy (which changes rapidly)

**Training solutions:**
- use a `frozen target Q-network` that every $c$ iterations copies the weights of the `prediction network`
  - so we do not chase a moving target

![](https://i.imgur.com/EhEnwBY.png)
- use **experience replay:** store recent agent experiences in a table, then sample random mini-batches from it to optimize the loss function
  - lowers correlation, input more i.i.d.
  - 👍 bigger performance improvement

SOME USE CASES
![](https://i.imgur.com/6gerTgB.png)
- when learning actions from images (e.g.: video games), convolutional layers are often used

### Policy Gradient Methods

Parametrize the policy $\pi$ and directly optimize it: $\pi_{\theta}(s,a)\approx P(a|s)$ gives probability distributions over the actions from the current state
- we want a **stochastic policy representation**: actions are sampled from the learned distribution
  - can be the only way to reach optimality in some problems
  - mathematically convenient
- bypasses the computation of the $Q$ table
- 👍 convenient with large and continuous action spaces
- :heavy_check_mark: AlphaGo can play Go (huge state and action space, long sequences, sparse rewards)

**Objective function:** find the best parameters $\theta^*$ that maximize the expected reward
$$J(\theta)=\mathbb{E}\Big[\sum\limits_{t\ge 0}\gamma^tr_t|\pi_\theta \Big] = \mathbb{E}_\tau[r(\tau)]$$
- $\tau$ is a `trajectory`: a sequence of (state, action, reward)

**The gradient:** $\nabla_\theta J(\theta)=\mathbb{E}_\tau\big[r(\tau)\nabla_\theta\log p(\tau;\theta)\big]$
- $p(\tau;\theta)$ is the probability of trajectory $\tau$ under the policy with parameters $\theta$
- stochastic approximation $\rightarrow$ sample $N$ trajectories

![](https://i.imgur.com/DCkrxEo.png)
- 👍 easy for a Softmax policy
- 👍 easy for a Gaussian policy

#### REINFORCE

A policy gradient algorithm
![](https://i.imgur.com/ucT1AjT.png)

**Reducing variance:** policy gradients suffer from `high variance` and slow convergence
- a stochastic policy may take different actions in different episodes: the sampled rewards may be conflicting $\implies$ `difficult convergence`
- `solution_1:` consider **causality** (the future should not change past decisions) and **discount the reward**, considering only the **cumulative future reward** (a sketch follows this section)
  - change $r(\tau_i)$ to $\sum\limits_{t'\ge t}\gamma^{t'-t}r_{t'}$: the observed cumulative reward after taking action $a_t$ in state $s_t$
- `solution_2:` add a **baseline** function $V$ to promote actions that are better than expected
  - change $r(\tau_i)$ to the `advantage function` $Q^{\pi_\theta}(s_t,a_t)-V^{\pi_\theta}(s_t)$: tells how much an action is better than expected
  - this idea leads to the actor-critic model
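A minimal NumPy sketch of solution_1 above: the discounted cumulative future reward of each step, computed backwards in one pass (the function name is mine):

```python
import numpy as np

def returns_to_go(rewards, gamma=0.99):
    """G_t = sum_{t' >= t} gamma^(t'-t) * r_t' for every step of a trajectory;
    this replaces the whole-trajectory reward r(tau) in the policy-gradient
    estimate (the 'causality' trick)."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

print(returns_to_go([0.0, 0.0, 1.0], gamma=0.5))   # [0.25, 0.5, 1.0]
```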
### Actor-Critic

A critic (the `Q-function`) evaluates an action of the actor (the `policy`).

Advantage function:
$$A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s) \approx r+\gamma V^{\pi_{\theta}}(s')-V^{\pi_\theta}(s)$$
- $\implies$ it suffices to learn the value function $V$ (have a neural network predict it, optimize with gradient ascent)
- combination of value-based and policy-based methods
- 👍 alleviates the task of the critic: it only has to learn the (state, action) pairs generated by the policy
- 👍 can use Q-learning tricks (experience replay, ...)

![](https://i.imgur.com/mLLMM22.png)

EXAMPLE: **AlphaGo**
- a `policy network` initialized by supervised learning on human games, improved by `policy gradient`
- a `value network` trained to predict the outcome of the game based on self-play

## Fooling Neural Networks 💫

A small **adversarial perturbation** can lead to a completely different outcome.

![](https://i.imgur.com/xP4Hera.png)

**Generating preferred inputs:** some methods
![](https://i.imgur.com/oc6l8oP.png)

### Adversarial Perturbations

Idea from `supernormal stimuli`: when a stimulus causes a stronger response than the stimulus for which the response evolved (e.g.: junk food)
- 👁️‍🗨️ adversarial examples are not specific to image-based CNNs, or to neural networks at all; it is a more general issue (even SVMs suffer from it)
- ❗❗❗ neural networks are vulnerable to this because of the `linearity` of the operations that take place inside them ❗❗❗
- :warning: adversarial examples can survive being printed on paper and even work as 3D-printed objects

**Fooling Approaches:** we want to fool with the smallest possible perturbation

`Approach_1:` start with a correctly classified image $x$, find the smallest perturbation $r$ with valid values, minimizing $||r||_2$, such that $x+r$ is misclassified.
- non-convex optimization problem

`Approach_2:` do gradient ascent/descent $\implies$ add a small fraction of the weights of the desired class to the image
- 👍 easier approach
- :arrow_double_down: decrease the score (increase the loss) of the correct class $y^*$
  - **Fast gradient sign method:** do gradient ascent on the cost function in order to increase the model error (see the sketch after this section)
    - $x\leftarrow x+\epsilon\text{ sgn}(\frac{\partial L(x,y^*)}{\partial x})$
  - **Iterative gradient sign method:** small steps until misclassified
- :arrow_double_up: increase the score (decrease the loss) of an incorrect class $\widehat{y}$
  - for `linear transformations`, can add a small multiple $r$ of the target-class weights to the test example: $r=\epsilon\text{ sgn}(w)$
    - the higher the dimensionality, the easier it is to cause big changes in the output
  - **Least-likely class method:** identify the class $\widehat{y}$ with the smallest initial score, and do gradient descent towards that class
    - $x\leftarrow x-\epsilon\text{ sgn}(\frac{\partial L(x,\widehat{y})}{\partial x})$
    - 👍 most effective

![](https://i.imgur.com/10fLdOK.png)
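A minimal PyTorch sketch of the fast gradient sign method above; `model`, `loss_fn`, $\epsilon$ and the stand-in classifier are placeholders (a real attack would target a trained network):

```python
import torch

def fgsm(model, x, y_true, loss_fn, eps=0.007):
    """Fast gradient sign method: one gradient-ascent step on the loss
    w.r.t. the input, x <- x + eps * sign(dL/dx)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y_true)
    loss.backward()
    with torch.no_grad():
        x_adv += eps * x_adv.grad.sign()        # increase the model error
    return x_adv.clamp(0, 1).detach()           # keep valid pixel values

# usage with a stand-in classifier
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(1, 3, 32, 32); y = torch.tensor([3])
x_adv = fgsm(model, x, y, torch.nn.CrossEntropyLoss())
```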
**Universal Adversarial Perturbations:** find an `image-independent` perturbation vector that causes misclassification for all images in a network.
- :warning: after the minimum universal perturbation is found using training data, the fooling rate stays high even on unseen data
- :warning: it also generalizes well across multiple models
- :warning: if noticeable, it can also confuse humans

![](https://i.imgur.com/jkxKOPi.png)

**Black-box Adversarial Perturbations:** can also fool networks we don't have access to
- `query abuse:` associate inputs with outputs many times in order to **learn a substitute network**

### Adversarial Training

Defenses against adversarial samples:
- `augment` the training set with adversarial samples
- input `pre-processing` to disrupt adversarial perturbations
- train a `specialized model` to detect adversarial attacks
- use highly `non-linear modules` in the network in order to reduce linearity

👁️‍🗨️ A detector is much harder to fool than a classifier. (Images of physical objects are even harder to fool with.)

🌞 END 🌞