# 100 Day Challenge - ML Advanced Topics
## Paper Note
### [Neural Architecture Search: A Survey](https://arxiv.org/pdf/1808.05377.pdf)
###### `NAS`
**NAS Overview**
- All possible neural network architectures form a space.
- NAS searches this space for the optimal architecture for a given dataset and task.
- NAS Abstract Overview
![](https://i.imgur.com/pUwYKgG.png)
- Three Major Components:
- **Search Space**
- **Search Strategy**
- **Performance Estimation**
**Search Space**
- The search space defines which architectures can be represented in principle.
- Prior knowledge about a specific task helps reduce the search space, but it also introduces human bias, which can prevent the discovery of novel architectures.
- Common Search Spaces:
1. Chain-Structured Networks
2. Multi-Branch Networks
3. Cell/Block-Based Networks
- How to choose **meta-architectures** ?
**Search Strategy**
- The search strategy details how to explore the search space.
- Convergence needs to be fast, while avoiding premature convergence to local optima.
- Historical Search Strategies
1. Random search
2. Bayesian optimization
3. Evolutionary methods
4. Reinforcement learning(RL)
5. Gradient-based methods.
6. Hierarchical search
**Performance Estimation Strategy**
- The process of estimating a found architecture's predictive performance on unseen data.
- The simplest approach is to train the candidate architecture on training data and evaluate it on validation data. -> Computationally expensive
- Solution 1: Lower Fidelities of the actual performance
- shorter training times
- training on a subset of the data
- lower-resolution images
- fewer filters per layer.
- Drawback: introduces bias into the estimate and can change the relative ranking of architectures.
- Solution 2: Learning Curve Extrapolation
- extrapolate initial learning curves and terminate those predicted to perform poorly to speed up the architecture search process.
- Solution 3 - Network Morphisms
- Initialize the weights of novel architectures based on weights of other architectures that have been trained before.
- Allow search spaces without an inherent upper bound on the architecture’s size
- Strict network morphisms can only make architectures larger and may thus lead to overly complex architectures.
- Solution 4 - One-Shot Architecture Search
- treats all architectures as different subgraphs of a supergraph (the one-shot model) and shares weights between architectures that have edges of this supergraph in common.
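The weight-sharing idea can be sketched with a toy supergraph: each position offers several candidate operations, every architecture is a subgraph choosing one op per position, and all architectures that pick the same op at the same position reuse the same parameters (a minimal NumPy sketch with made-up ops and sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared weights: one parameter set per (position, op) pair in the supergraph.
# Every architecture that uses op `k` at position `i` reuses the same array.
num_positions, dim = 3, 4
ops = ["op_a", "op_b", "op_c"]  # stand-ins for e.g. conv3x3 / conv5x5 / pool
shared = {(i, op): rng.normal(size=(dim, dim))
          for i in range(num_positions) for op in ops}

def forward(arch, x):
    """Run input through the subgraph selected by `arch` (one op per position)."""
    for i, op in enumerate(arch):
        x = np.tanh(shared[(i, op)] @ x)
    return x

x = rng.normal(size=dim)
arch1 = ("op_a", "op_b", "op_c")
arch2 = ("op_a", "op_c", "op_b")  # shares op_a at position 0 with arch1
y1, y2 = forward(arch1, x), forward(arch2, x)
# Parameter count grows with positions x ops, not with the (exponential)
# number of architectures -- that is the point of weight sharing.
```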
---
### [DARTS: Differentiable Architecture Search](https://arxiv.org/pdf/1806.09055.pdf)
###### `NAS`
- Purpose: Find a search strategy
- Method: continuous relaxation -> can apply gradient-based search
- Three Major Methods for NAS search:
- Evolution
2. Reinforcement Learning
- Continuous Relaxation (Proposed in this paper)
- [Code](https://github.com/quark0/darts)
- Search for a computation cell as the building block of the final architecture.
- A cell is a directed acyclic graph consisting of an ordered sequence of N nodes.
- Continuous Relaxation:
![](https://i.imgur.com/I3YuH8j.png)
- Differential Architecture Search Algorithm:
![](https://i.imgur.com/wHvN5Sc.png)
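The continuous relaxation can be illustrated with a toy mixed operation: each edge computes a softmax-weighted sum of candidate operations, with the architecture parameters $\alpha$ learned by gradient descent and discretized at the end (a minimal NumPy sketch; the candidate ops here are stand-ins, not the paper's conv/pool primitives):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Candidate operations on an edge (toy stand-ins for conv / pool / zero).
ops = [lambda x: np.maximum(x, 0.0),   # "relu-like" op
       lambda x: 0.5 * x,              # "scaling" op
       lambda x: np.zeros_like(x)]     # "zero" op

def mixed_op(x, alpha):
    """DARTS relaxation: weight each candidate op's output by softmax(alpha)."""
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, ops))

x = np.array([-1.0, 2.0])
alpha = np.array([2.0, 0.0, -2.0])  # architecture parameters (learned jointly)
y = mixed_op(x, alpha)
# After search, discretize by keeping the op with the largest alpha:
best = int(np.argmax(alpha))
```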
---
### [Accelerating neural architecture search using performance prediction](https://arxiv.org/pdf/1705.10823.pdf)
###### `NAS`
- Learning curve-based approach
> Human experts are quite adept at recognizing and terminating suboptimal model configurations by inspecting their partial learning curves. In this paper we seek to emulate this behavior and automatically identify and terminate subpar model configurations in order to speed up both meta-modeling and hyperparameter optimization methods.
- Find learning curve for networks obtained through NAS Search, and early-terminate training if learning curve is not good.
- Some interesting related topics:
- **Neural Network Performance Prediction**: How to predict performance "DURING" training process?
- Meta-modeling: define meta-modeling as an algorithmic approach for designing neural network architectures from scratch.
- Hyperparameter Optimization: we define hyperparameter optimization as an algorithmic approach for finding optimal values of design-independent hyperparameters such as learning rate and batch size, along with a limited search through the network design space.
- This research tried to do NN Performance Prediction by modeling learning curves. How?
![](https://i.imgur.com/V5RfPQG.png)
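The paper trains regression models on features of the partial learning curve; as a much simpler stand-in for the same idea, one can fit a saturating curve $acc(t) \approx a + b/t$ by least squares and extrapolate to the final epoch (a toy sketch, not the paper's actual predictor):

```python
import numpy as np

def predict_final(partial_acc, total_epochs):
    """Fit acc(t) ~ a + b/t (linear least squares in 1/t) to a partial
    learning curve and extrapolate to the final epoch."""
    t = np.arange(1, len(partial_acc) + 1, dtype=float)
    X = np.stack([np.ones_like(t), 1.0 / t], axis=1)
    coef, *_ = np.linalg.lstsq(X, np.asarray(partial_acc, float), rcond=None)
    a, b = coef
    return a + b / total_epochs

# Toy learning curve saturating near 0.9:
true = lambda t: 0.9 - 0.5 / t
partial = [true(t) for t in range(1, 6)]        # first 5 epochs observed
est = predict_final(partial, total_epochs=100)
keep_training = est > 0.8   # early-terminate candidates predicted to be subpar
```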
---
### [Understanding and simplifying one-shot architecture search](http://proceedings.mlr.press/v80/bender18a/bender18a.pdf)
###### `NAS`
- Also focus on NAS search strategies
- Aim to understand weight sharing for one-shot architecture search
- Can efficiently identify promising architectures without either hypernetworks or RL.
- Weight Sharing: rather than training thousands of separate models from scratch, one can train a single large network capable of emulating any architecture in the search space.
- Ex.
![](https://i.imgur.com/5V9PCzF.png)
- As the example above shows, we can apply either a 3x3 convolution, a 5x5 convolution, or a max-pooling layer at a particular position in the network.
- Instead of training three separate models, we can train a single model containing all three operations (the one-shot model).
- The size of the one-shot model grows only linearly with the number of candidate operations.
- Only used to rank architectures in the search space.
- One-Shot Model - working steps:
1. Design a search space that allows us to represent a wide variety of architectures using a single one-shot model.
2. Train the one-shot model to make it predictive of the validation accuracies of the architectures.
3. Evaluate candidate architectures on the validation set using the pre-trained one shot model.
4. Re-train the most promising architectures from scratch and evaluate their performance on the test set
- One-shot Diagram:
![](https://i.imgur.com/IjCpyGJ.png)
- Argues that NAS search needs only gradient descent; neither a hypernetwork nor reinforcement learning is required.
- Main comparison: the SMASH paper.
- Still, it is a heavily simplified one-shot model...
---
### [SMASH: one-shot model architecture search through hypernetworks](https://arxiv.org/pdf/1708.05344.pdf)
###### `NAS`
- "HyperNet" approach: Use a neural net to generate weights of the main model.
- One-Shot HyperNet: candidates are not fully trained until ranking is done, to avoid high computational cost.
- Limitation: Only consider connectivity patterns and unit per layer. Cannot address other hyperparameters such as regularization, learning rate, weight initialization, or data augmentation.
- Training process:
> At each training step, we randomly sample a network architecture, generate the weights for that architecture using a HyperNet, and train the entire system end-to-end through backpropagation.
- SMASH Algorithm:
![](https://i.imgur.com/qNgxzdv.png)
- Two core components:
- Method for sampling architecture:
- Use a memory bank view of feed-forward networks that permits sampling complex, branching topologies, and encoding said topologies as binary vectors.
![](https://i.imgur.com/iqzBJdR.png)
- Method for sampling weights for a given architecture (HyperNet)
- A HyperNet is a neural net used to parameterize the weights of another network, the main network.
---
### [Efficient architecture search by network transformation](https://pdfs.semanticscholar.org/65d5/5e12ce62d8f2fef6e450d3afb815af81a591.pdf)
###### `NAS`
- Reinforcement learning approach: use a reinforcement learning agent as the meta-controller for growing the network depth or layer width with function-preserving transformations.
- Limitation: experiment only on architecture space of the plain convolutional neural networks (no skip-connections, branching etc.)
- In theory, EAS can explore spaces beyond plain CNNs (e.g. with skip-connections), but this is not tested in the experiments.
- EAS RL-based Meta-controller overview:
![](https://i.imgur.com/F89yUUc.png)
- Actor Networks:
- Net2Wider Network:
![](https://i.imgur.com/KZAI3ei.png)
- Net2Deeper Network:
![](https://i.imgur.com/hLAK2D8.png)
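The function-preserving idea behind Net2Wider can be sketched for a two-layer network: duplicate a hidden unit's incoming weights and split its outgoing weights in half, so the widened network computes exactly the same function (a minimal NumPy sketch, not the paper's implementation):

```python
import numpy as np

def net2wider(W1, b1, W2, unit):
    """Function-preserving widening (Net2WiderNet-style): duplicate one hidden
    unit and halve its outgoing weights so the network output is unchanged.
    Applies to a 2-layer net  h = f(x @ W1 + b1),  y = h @ W2."""
    W1w = np.concatenate([W1, W1[:, unit:unit + 1]], axis=1)  # copy incoming
    b1w = np.concatenate([b1, b1[unit:unit + 1]])
    W2w = np.concatenate([W2, W2[unit:unit + 1, :]], axis=0)  # copy outgoing
    W2w[unit, :] *= 0.5                                       # split the
    W2w[-1, :] *= 0.5                                         # contribution
    return W1w, b1w, W2w

rng = np.random.default_rng(0)
W1, b1, W2 = rng.normal(size=(3, 4)), rng.normal(size=4), rng.normal(size=(4, 2))
x = rng.normal(size=(5, 3))
f = lambda z: np.maximum(z, 0)
y_before = f(x @ W1 + b1) @ W2
W1w, b1w, W2w = net2wider(W1, b1, W2, unit=1)
y_after = f(x @ W1w + b1w) @ W2w   # same output, one more hidden unit
```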
---
### [Understanding Locally Competitive Networks](https://arxiv.org/abs/1410.1165)
###### `Activation`
- Activation Functions mentioned in this paper:
- rectified linear (ReL (Glorot et al., 2011))
- maxout (Goodfellow et al., 2013a)
- LWTA (Srivastava et al., 2013)
- They depart from conventional wisdom in that they are not continuously differentiable (and sometimes not even continuous) and are piecewise linear.
- Something good about them...
- Can be trained faster and better than sigmoidal networks
- Not requiring unsupervised training for weight initialization (Glorot et al., 2011)
- Better gradient flow (Goodfellow et al., 2013a)
- Mitigation of catastrophic forgetting (Srivastava et al., 2013; Goodfellow et al., 2014)
![](https://i.imgur.com/wVCEXi5.png)
- Figure 1: Comparison of rectified linear units (ReLUs), local winner-take-all (LWTA), and maxout activation functions.
- This paper treats them as 'models of models'.
- Locally Competitive Networks: neural networks with activation functions such as rectified linear, maxout, and LWTA are locally competitive: local competition among units in the network decides which parts of it get activated or trained for a particular input example.
---
### [Fast and Accurate Deep Network Learning by Exponential Linear Units](https://arxiv.org/abs/1511.07289)
###### `Activation`
- ELU: Exponential Linear Unit
- Formula:
![](https://i.imgur.com/ku2IJvf.png)
- Comparing with other activation functions, e.g. rectified linear units (ReLUs), leaky ReLUs (LReLUs) and parametrized ReLUs (PReLUs):
- Similarity:
1. Can alleviate gradient vanishing via the identity for positive values.
- Difference:
1. In contrast to ReLUs, ELUs have negative values which allows them to push mean unit activations closer to zero like batch normalization but with lower computational complexity.
2. Mean shifts toward zero speed up learning by bringing the normal gradient closer to the unit natural gradient because of a reduced bias shift effect.
3. LReLUs and PReLUs have negative values, too, but they do not ensure a noise-robust deactivation state.
4. ELUs saturate to a negative value with smaller inputs and thereby decrease the forward propagated variation and information.
5. In experiments, ELUs lead not only to faster learning, but also to significantly better generalization performance than ReLUs and LReLUs on networks with more than 5 layers.
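The ELU formula from the figure is straightforward to implement (a minimal NumPy sketch):

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: identity for x > 0; saturates smoothly to -alpha for large
    negative x, giving a noise-robust deactivation state."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

# Positive inputs pass through unchanged (identity alleviates vanishing
# gradients); negative inputs saturate toward -alpha, so mean activations
# are pushed closer to zero than with ReLU.
```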
---
### [Group Normalization](https://arxiv.org/abs/1803.08494)
###### `Normalization`
- Group Normalization (GN) is proposed as an alternative to Batch Normalization in scenarios where small batch size is required.
- Drawback of BN:
- BN’s error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation.
![](https://i.imgur.com/7ingqpQ.png)
- In applications that require a small batch size due to memory constraints, BN's performance is limited.
- GN Basic Idea:
- divides the channels into groups and computes within each group the mean and variance for normalization.
- Independent of batch size, and its accuracy is stable across a wide range of batch sizes.
- Other approaches to avoid small batch size problem:
- [Layer Normalization (LN)](https://arxiv.org/abs/1607.06450)
- [Instance Normalization (IN)](https://arxiv.org/abs/1607.08022)
- Formula:
![](https://i.imgur.com/EsbIBGh.png)
- Python(Tensorflow) Code:
```
def GroupNorm(x, gamma, beta, G, eps=1e-5):
    # x: input features with shape [N, C, H, W]
    # gamma, beta: scale and offset, with shape [1, C, 1, 1]
    # G: number of groups for GN
    N, C, H, W = x.shape
    x = tf.reshape(x, [N, G, C // G, H, W])
    mean, var = tf.nn.moments(x, [2, 3, 4], keep_dims=True)
    x = (x - mean) / tf.sqrt(var + eps)
    x = tf.reshape(x, [N, C, H, W])
    return x * gamma + beta
```
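For reference, the same computation in plain NumPy, which makes it easy to check that each (sample, group) slab is normalized independently of the batch size (my own sketch, not from the paper):

```python
import numpy as np

def group_norm_np(x, gamma, beta, G, eps=1e-5):
    """NumPy GroupNorm: normalize over (channels-in-group, H, W) per sample."""
    N, C, H, W = x.shape
    xg = x.reshape(N, G, C // G, H, W)
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    xg = (xg - mean) / np.sqrt(var + eps)
    return xg.reshape(N, C, H, W) * gamma + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 4, 4))
gamma = np.ones((1, 8, 1, 1))
beta = np.zeros((1, 8, 1, 1))
y = group_norm_np(x, gamma, beta, G=4)
# Each (sample, group) slab now has ~zero mean and ~unit variance,
# regardless of the batch size N.
```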
---
### [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167)
###### `Normalization`
- Problem - internal covariate shift: the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change.
- Proposed solution: normalizing layer inputs for each training mini-batch.
- Good: higher learning rates, less careful about initialization
- Internal Covariate Shift: the change in the distributions of internal nodes of a deep network in the course of training.
- Batch Normalization Transform:
![](https://i.imgur.com/JNdOZoo.png)
- Training Batch-Normalized Network:
![](https://i.imgur.com/47SQcip.png)
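The training-time transform can be sketched in NumPy: normalize each feature over the mini-batch, then scale and shift with learnable $\gamma$, $\beta$ (inference-time running averages are omitted):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """BN transform for one mini-batch: normalize each feature over the
    batch axis, then scale and shift with learnable gamma, beta."""
    mu = x.mean(axis=0)                   # mini-batch mean
    var = x.var(axis=0)                   # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 10))  # mini-batch of activations
y = batch_norm_train(x, gamma=np.ones(10), beta=np.zeros(10))
# y now has ~zero mean and ~unit variance per feature; at inference time,
# running averages of mu/var collected during training are used instead.
```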
---
### [Speeding up the Hyperparameter Optimization of Deep Convolutional Neural Networks](https://arxiv.org/abs/1807.07362)
###### `HYPERPARAMETER`
- Main idea: using a lower-dimensional representation of the original data to quickly identify promising areas in the hyperparameter space.
- Use different search algorithms for hyper-parameter optimization:
- IIS (Main proposed approach)
- Random search
- Tree of parzen estimators (TPEs)
- Sequential model-based algorithm configuration (SMAC)
- Genetic algorithm (GA)
- Argument: hyperparameter optimization problems usually have a low effective dimensionality: even though there is a significant number of hyperparameters, often only a subset of them has a measurable impact on performance.
- This work focuses on CNNs.
- IIS Approach:
![](https://i.imgur.com/0LzEPGB.png)
---
### [A Walk with SGD](https://arxiv.org/pdf/1802.08770.pdf)
###### `GradientDescent`
- Study based on empirical observations.
- Study the DNN loss surface along the trajectory of SGD
- Discovery:
- The loss interpolation between parameters before and after each training iteration’s update is roughly convex with a minimum (valley floor) in between for most of the training.
- 'Bouncing between walls at a height' mechanism:
- For most of the training update steps, SGD moves in valley like regions of the loss surface by jumping from one valley wall to another at a height above the valley floor.
- Helps SGD traverse larger distances with small batch sizes and large learning rates
- A small batch size injects noise facilitating exploration. This mechanism is crucial for generalization.
- Question:
> Currently there is little theory proving they help in improving generalization in DNNs. This raises the question of whether optimization algorithms that are designed with the goal of solving the aforementioned high dimensional problems also help in finding minima that generalize well, or put differently, what attributes allow optimization algorithms to find such good minima in the non-convex setting.
- SGD trajectory:
1. The loss interpolation between parameters before and after each iteration’s update is roughly convex with a minimum (valley floor) in between. SGD bounces off walls of a valley-like-structure at a height above the floor.
2. Learning rate controls the height at which SGD bounces above the valley floor.
3. Batch size controls gradient stochasticity which facilitates exploration (visible from larger parameter distance from initialization for small batch-size).
4. The valley floor along SGD’s path has many ups and downs (barriers) which may hinder exploration. Thus using a large learning rate helps avoid encountering barriers along SGD’s path by maintaining a large height above valley floor (thus moving over the barriers instead of crossing them).
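The per-iteration interpolation diagnostic is easy to reproduce on a toy quadratic loss: take one large-learning-rate gradient step and evaluate the loss along the line between the old and new parameters; the minimum (valley floor) lies strictly between them (my own sketch, using plain gradient descent rather than SGD):

```python
import numpy as np

A = np.diag([1.0, 10.0])             # toy quadratic loss  L(w) = 0.5 w^T A w
loss = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w

w = np.array([1.0, 1.0])
lr = 0.15                            # large enough to overshoot the sharp direction
w_next = w - lr * grad(w)

# Interpolate the loss between parameters before and after the update:
lams = np.linspace(0.0, 1.0, 21)
interp = np.array([loss((1 - lam) * w + lam * w_next) for lam in lams])
floor = lams[np.argmin(interp)]      # location of the "valley floor"
# The interpolated loss is roughly convex with its minimum strictly between
# the endpoints: the update jumps over the floor to the opposite valley wall.
```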
- Some discovery from related paper:
- The stochastic fluctuation in the stochastic differential equation simulated by SGD is governed by the ratio of learning rate to batch size.
- Consider SGD as a diffusion process where SGD is running a Brownian motion in the parameter space. (interesting...)
- Difference:
> Contrary to these claims, we find that interpolating the loss surface traversed by SGD on a per iteration basis suggests SGD almost never crosses any significant barriers for most of the training.
---
### [Neural Machine Translation and Sequence-to-sequence Models: A Tutorial](https://arxiv.org/abs/1703.01619)
#### Machine Translation
- Formalization:
Source sentence: $F=f_1,\dots,f_J=f_1^{\mid F \mid}$
Target sentence: $E=e_1,\dots,e_I=e_1^{\mid E \mid}$
Any type of translation system can be defined as: $\hat{E}=mt(F)$
- Statistical machine translation: Maximize probability E given $F$, $P(E \mid F;\theta)$
$\hat{E}=argmax_E P(E \mid F; \theta)$
$\theta$: model parameters, learned from parallel corpora.
#### n-gram Language Models
- n-gram Model: the probability of a sentence $E=e_1^T$:
$P(E)=P(|E|=T,e_1^T)=\prod_{t=1}^{T+1}P(e_t \mid e_1^{t-1})$
- Count-based n-gram Language Models: Count the words/strings in training data.
Maximum Likelihood Estimation: $P_{ML}(e_t \mid e_1^{t-1})=\frac{c_{prefix}(e_1^t)}{c_{prefix}(e_1^{t-1})}$,
where $c_{prefix}(\bullet)$ is the count of the number of times this particular word string appeared at the beginning of a sentence in the training data.
Only count in previous $n-1$ window: $P(e_t \mid e_1^{t-1}) \approx P_{ML}(e_t \mid e_{t-n+1}^{t-1})$
The $\theta$ in this model: $\theta_{e_{t-n+1}^t}=P(e_t \mid e_{t-n+1}^{t-1})$
- Dealing with two-word strings that have never appeared in the training corpus:
- smoothing probability
- Standard: [Modified Kneser-Ney smoothing](https://arxiv.org/pdf/cmp-lg/9606011.pdf)
- Handling Unknown Words:
- Assume closed vocabulary
- Interpolate with an unknown words distribution
- Add an \<unk\> word:
- Other Topics:
- Large-scale language modeling
- Language model adaptation
- Longer-distance language count-based models
- Syntax-based language models
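The count-based MLE estimate above (before any smoothing) can be sketched as:

```python
from collections import Counter

def train_ngram(sents, n=2):
    """Count-based MLE for an n-gram LM with <s>/</s> boundary symbols:
    P(e_t | e_{t-n+1}^{t-1}) = c(e_{t-n+1}^t) / c(e_{t-n+1}^{t-1})."""
    num, den = Counter(), Counter()
    for sent in sents:
        toks = ["<s>"] * (n - 1) + sent + ["</s>"]
        for t in range(n - 1, len(toks)):
            ctx = tuple(toks[t - n + 1:t])
            num[ctx + (toks[t],)] += 1
            den[ctx] += 1
    return lambda w, ctx: (num[tuple(ctx) + (w,)] / den[tuple(ctx)]
                           if den[tuple(ctx)] else 0.0)

p = train_ngram([["a", "b"], ["a", "c"]], n=2)
# P(b | a) = 1/2 under MLE; unseen contexts get probability 0 -- exactly the
# problem that smoothing (e.g. Modified Kneser-Ney) addresses.
```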
#### Log-linear Language Models
- Like n-gram, calculating the probability of particular word $e_t$ given a particular context $e_{t-n+1}^{t-1}$, but using the following procedure:
1. Calculating features
2. Calculating scores
3. Calculating probabilities
- Features:
$Context \rightarrow \phi(e_{t-n+1}^{t-1}) \rightarrow \mathbf{x} \in \mathbb{R}^{N}$
where $\mathbf{x}$ is a one-hot feature vector
- Scores:
score $s = W\mathbf{x}+\mathbf{b}=\sum_{j:x_j\neq0}W_{\centerdot,j}\centerdot x_{j}+\mathbf{b}$
where $W=\mathbb{R}^{|V| \times N }$ is a weight matrix, and $\mathbf{b} \in \mathbb{R}^{|V|}$ is a bias vector.
- Probabilities:
$p_{j}=\frac{exp(s_{j})}{\sum_{\tilde{j}}exp(s_{\tilde{j}})} \Rightarrow \mathbf{p}=softmax(\mathbf{s})$
- Model parameters learnt from data:
Loss function: $l(\varepsilon_{train}, \theta)=-\log P(\varepsilon_{train} \mid \theta)=-\sum_{E \in \varepsilon_{train}}\log P(E \mid \theta)$
Per-word loss: $l(e_{t-n+1}^{t}, \theta)=-\log P(e_{t} \mid e_{t-n+1}^{t-1})$
Learning rate $\eta$ and $\theta$: $\theta \leftarrow \theta - \eta \frac{dl(e_{t-n+1}^{t}, \theta)}{d\theta}$
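The three steps (features, scores, probabilities) plus one SGD update can be sketched for a toy vocabulary (a minimal NumPy sketch; sizes and initialization are made up):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Toy log-linear LM: |V| = 3 words, N = 4 context features.
V, N = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N)) * 0.1   # weight matrix  W in R^{|V| x N}
b = np.zeros(V)                     # bias vector    b in R^{|V|}

x = np.zeros(N); x[2] = 1.0         # phi(context): one-hot feature vector
s = W @ x + b                       # scores
p = softmax(s)                      # probabilities
e_t = 1                             # observed next word (index)
loss = -np.log(p[e_t])              # per-word negative log-likelihood

# One SGD step; the cross-entropy gradient w.r.t. s is (p - onehot(e_t)):
eta = 0.1
g = p.copy(); g[e_t] -= 1.0
W -= eta * np.outer(g, x)
b -= eta * g
```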
#### Neural Networks and Feed-forward Language Models
- A tri-gram model example:
![](https://i.imgur.com/SHvbbL8.png)
$\mathbf{m} = concat(M_{., e_{t-2}}, M_{.,e_{t-1}})$
$\mathbf{h}=\tanh(W_{mh}\mathbf{m}+\mathbf{b}_{h})$
$\mathbf{s}=W_{hs}\mathbf{h}+\mathbf{b}_{s}$
$\mathbf{p}=softmax(\mathbf{s})$
<br/>
Loss function: $l = -log(p_{e_{t}})$
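The trigram model equations above map directly to code (a minimal NumPy sketch with made-up sizes; training is omitted):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, D, H = 5, 3, 8                          # vocab, embedding, hidden sizes
M = rng.normal(size=(D, V)) * 0.1          # embedding matrix (one column per word)
W_mh = rng.normal(size=(H, 2 * D)) * 0.1   # embeddings -> hidden
b_h = np.zeros(H)
W_hs = rng.normal(size=(V, H)) * 0.1       # hidden -> scores
b_s = np.zeros(V)

def trigram_nlm(e_prev2, e_prev1):
    """p(e_t | e_{t-2}, e_{t-1}) from the feed-forward trigram model above."""
    m = np.concatenate([M[:, e_prev2], M[:, e_prev1]])  # m = concat(M_{.,e_{t-2}}, M_{.,e_{t-1}})
    h = np.tanh(W_mh @ m + b_h)                         # h = tanh(W_mh m + b_h)
    s = W_hs @ h + b_s                                  # s = W_hs h + b_s
    return softmax(s)                                   # p = softmax(s)

p = trigram_nlm(2, 4)
loss = -np.log(p[1])        # l = -log(p_{e_t}) for observed word e_t = 1
```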
#### Recurrent Neural Network Language Models