# Neural Architecture Search 101 <br> <br> <small>Neil John D. Ortega</small>

---

### Agenda

- Motivation
- Taxonomy
- AutoML.org
- Recap

---

### Motivation

- Can we automate the discovery of **novel** neural architectures?
- Manual design
  - takes time, is prone to errors, is hard to systematize, and introduces human bias

---

### Taxonomy

- Search space
- Search strategy
- Performance estimation strategy

![Neural Architecture Search (NAS) methods](https://i.imgur.com/uqFJZBw.png)
<small><strong>Fig. 1.</strong> Relationships between NAS method categories [1]. Accessed 8 Nov 2020.</small>

---

## Search space

----

### Search space

- What architectures, in principle, can we represent?
  - Layers, operations, connections, etc.
- Search space size vs. human bias

![Neural Architecture Search (NAS) methods](https://i.imgur.com/JBFoXBJ.png)
<small><strong>Fig. 1.</strong> Relationships between NAS method categories [1]. Accessed 8 Nov 2020.</small>

----

### Chain-structured networks

- Number of layers (could be unbounded)
- Type of operation for each layer (e.g. pooling, convolution, etc.)
- Hyperparameters for each operation (e.g. number of filters and kernel size for a convolutional layer)

----

### Chain-structured networks

![Chain-structured architecture](https://i.imgur.com/o3xduqL.png =119x318)
<small><strong>Fig. 2.</strong> A chain-structured network [1]. Accessed 9 Nov 2020.</small>

----

### Multi-branch networks

- Input to layer $i$ is a generic function of the outputs of previous layers $0, \ldots, i-1$
- e.g. ResNets, DenseNets

----

### Multi-branch networks

![Multi-branch architecture](https://i.imgur.com/6daYcxT.png =228x328)
<small><strong>Fig. 3.</strong> A multi-branched neural network [1]. Accessed 9 Nov 2020.</small>

----

### Cell-based representation

- Cells or blocks
  - "Mini-networks" used as building blocks, instead of individual layers
  - The cell architecture is learned
- Pros :+1:
  - Drastically reduced search space size, since cells usually consist of significantly fewer layers
  - Easily adaptable to other data sets
  - Repeated building blocks have proved to be a useful design principle (e.g. LSTM cells, stacked ResNet blocks, etc.)
  - Useful for controlling granularity

----

### Cell-based representation

- Normal cells
  - Preserve the dimensionality of the input
- Reduction cells
  - Reduce the dimensionality of the input

![Stacked normal and reduction cell architectures for CIFAR-10 and ImageNet](https://i.imgur.com/tmebqyr.png =244x333)
<small><strong>Fig. 4.</strong> Stacked normal and reduction cell architectures for CIFAR-10 and ImageNet [3]. Accessed 9 Nov 2020.</small>

----

### Cell-based representation

- Macro- vs. micro-architecture
  - Ideally, both should be learned jointly

----

### Hierarchical structure

- Generalized version of the cell-based representation
- Most work uses a fixed macro-architecture and optimizes the repeated micro-architecture

![Hierarchical structure](https://i.imgur.com/ALhmHib.png =646x287)
<small><strong>Fig. 5.</strong> An example of a three-level hierarchical structure [2]. Accessed 11 Nov 2020.</small>

---

## Search strategy

----

### Search strategy

- How do we explore the search space?
- Exploration vs. exploitation

![Neural Architecture Search (NAS) methods](https://i.imgur.com/MKxLiWe.png)
<small><strong>Fig. 1.</strong> Relationships between NAS method categories [1]. Accessed 8 Nov 2020.</small>

----

### Random search

- The most naïve baseline

----

### Reinforcement learning (RL)

- Action space
  - The list of hyperparameters ("tokens") the controller generates to define a child network
- Reward
  - Validation accuracy of the child network
- Loss
  - Optimize the controller parameters $\theta$ with a policy-gradient method (e.g. REINFORCE)

![RL-based NAS schematic diagram](https://i.imgur.com/nz1edcA.png =564x259)
<small><strong>Fig. 6.</strong> Overview of RL-based NAS [4]. Accessed 11 Nov 2020.</small>

----

### Reinforcement learning (RL)

![RNN controller for RL-based NAS](https://i.imgur.com/lG0A3ty.png)
<small><strong>Fig. 7.</strong> How a controller (RNN) is used to generate convolution layers [4]. Accessed 11 Nov 2020.</small>

----

### Evolutionary algorithms

- "Genes" encode the information needed to create a network (e.g. connection weights, topology, etc.)
- Best results when used to determine the architecture only, not the weights

----

### Evolutionary algorithms

- Grow the population by:
  - Selecting as parents the genes with the highest accuracy in every iteration

----

### Evolutionary algorithms

- Grow the population by:
  - Introducing mutations to the genes (i.e. modifying the weights, connections, etc.)

![Mutations on neural network "genes"](https://i.imgur.com/BwmaSqB.png =564x372)
<small><strong>Fig. 8.</strong> Network architecture mutations in NEAT [5]. Accessed 11 Nov 2020.</small>

----

### Evolutionary algorithms

- Grow the population by:
  - Crossing parent genes to create "offspring"

!["Offspring neural networks"](https://i.imgur.com/8ZjbNzJ.png =312x317)
<small><strong>Fig. 9.</strong> Offspring networks in NEAT [5]. Accessed 11 Nov 2020.</small>

----

### Gradient descent

- Possible, but involves converting the discrete search space into a differentiable one (how do you make "adding a layer" differentiable?)
- Typically done by jointly learning the architecture parameters and the network weights (see the sketch on the next slide)
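
----

### Gradient descent: a sketch

A minimal, hypothetical sketch of the continuous-relaxation idea behind differentiable NAS such as DARTS [6]: each edge computes a softmax-weighted mixture of candidate operations, so the architecture parameters $\alpha$ receive gradients just like ordinary weights. The candidate ops, tensor sizes, and the `MixedOp` name are illustrative assumptions, not the implementation from the paper.

```python
# Sketch of a "mixed operation": the discrete choice "which op goes on this
# edge?" becomes a differentiable softmax over candidate operations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),  # candidate: 3x3 conv
            nn.MaxPool2d(3, stride=1, padding=1),         # candidate: 3x3 max pool
            nn.Identity(),                                # candidate: skip connection
        ])
        # One architecture parameter (alpha) per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# The alphas get gradients like any other parameter, so architecture and
# weights can be optimized jointly (or alternately, in a bilevel scheme).
x = torch.randn(2, 16, 8, 8)
mixed = MixedOp(channels=16)
mixed(x).mean().backward()
print(mixed.alpha.grad)  # gradient w.r.t. the architecture parameters
```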

---

## Performance estimation strategy

----

### Performance estimation strategy

- How do we estimate the predictive performance on test data?
- Standard training and validation gives an accurate performance estimate but is computationally expensive

![Neural Architecture Search (NAS) methods](https://i.imgur.com/3exB1ik.png)
<small><strong>Fig. 1.</strong> Relationships between NAS method categories [1]. Accessed 8 Nov 2020.</small>

----

### Train from scratch

- Train every child network independently (ideally in parallel) from scratch until *convergence*, then measure validation accuracy
- Computationally expensive (~1000 GPU days! :scream:)

----

### Lower-fidelity estimates

- Train on a smaller dataset
- Train for fewer epochs
- Train and evaluate a downsized model during the search stage, etc.

----

### Learning curve extrapolation

1. Train for just a few epochs
2. Model the learning curves of the child models as a time-series regression problem
3. Extrapolate using the model (see the toy sketch on the next slide)
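
----

### Learning curve extrapolation: a sketch

A toy sketch of steps 2 and 3: fit a saturating curve to the first few epochs of validation accuracy and extrapolate it to a later epoch. The functional form and the numbers are illustrative assumptions; published extrapolation methods use richer (often Bayesian) parametric models.

```python
# Fit a saturating curve to early validation accuracy, then predict later epochs.
import numpy as np
from scipy.optimize import curve_fit

def saturating_curve(epoch, a, b, c):
    # Accuracy approaches the asymptote `a` as training progresses.
    return a - b * epoch ** (-c)

# Validation accuracy observed for the first 5 epochs of a child model (made up).
epochs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
val_acc = np.array([0.42, 0.55, 0.61, 0.65, 0.68])

params, _ = curve_fit(saturating_curve, epochs, val_acc,
                      p0=[0.8, 0.4, 1.0], maxfev=10000)

# Use the extrapolated accuracy to rank this architecture without full training.
print(f"Predicted accuracy at epoch 50: {saturating_curve(50.0, *params):.3f}")
```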

----

### Weight inheritance

- Use a parent model as a warm start for new child models
- Saves a lot of GPU compute, especially with even more aggressive weight sharing
- Sampled child models can be viewed as *subgraphs* within the parent *supergraph*

----

### One-shot models

- Only a single model needs to be trained
- Weights are then shared across child networks, which are just subgraphs of the one-shot model
- Uses gradient descent for joint bilevel optimization (optimizing both architecture and weights); a weight-sharing sketch follows after the figure

----

### One-shot models

![One-shot architecture search](https://i.imgur.com/w6VLoZD.png)
<small><strong>Fig. 10.</strong> Simplified overview of one-shot architecture search [1]. Accessed 12 Nov 2020.</small>
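
----

### One-shot models: a sketch

A minimal sketch of weight sharing in a one-shot model: a supergraph holds every candidate operation, and each child architecture is a subgraph that picks one operation per layer slot. Sampling random single-path children here stands in for a learned controller or gradient-based selection; all names, ops, and sizes are illustrative assumptions.

```python
# One-shot supernet sketch: children are subgraphs that reuse shared weights.
import random
import torch
import torch.nn as nn

class OneShotSupernet(nn.Module):
    def __init__(self, channels=8, num_slots=3):
        super().__init__()
        # Shared weights: one set of candidate ops per layer slot.
        self.slots = nn.ModuleList([
            nn.ModuleList([
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.Conv2d(channels, channels, 5, padding=2),
                nn.Identity(),
            ])
            for _ in range(num_slots)
        ])

    def forward(self, x, arch):
        # `arch` selects one op index per slot, i.e. a subgraph of the supernet.
        for ops, op_idx in zip(self.slots, arch):
            x = ops[op_idx](x)
        return x

supernet = OneShotSupernet()
optimizer = torch.optim.SGD(supernet.parameters(), lr=0.01)

# Each step trains a randomly sampled child; its gradients update the shared
# weights, so later children inherit already-trained operations.
for step in range(3):
    arch = [random.randrange(3) for _ in range(3)]
    x = torch.randn(4, 8, 16, 16)
    loss = supernet(x, arch).pow(2).mean()  # dummy loss for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(step, arch, float(loss))
```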

---

### AutoML.org

![AutoML.org](https://i.imgur.com/oAARr6a.png)
<small><strong>Fig. 11.</strong> AutoML.org focuses on the progressive and systematic automation of machine learning [7]. Accessed 12 Nov 2020.</small>

----

### AutoML.org

- Research focuses on optimizing and automating ML
  - Hyperparameter optimization
  - Neural architecture search
  - Meta-learning
- Released several [NAS benchmarks](https://www.automl.org/nas-4/nasbench/) to organize the results seen so far and to help guide future research
- Released *"a very early pre-alpha version"* of [Auto-PyTorch](https://github.com/automl/Auto-PyTorch)
  - Has support for image classification

---

### Recap

<style>
  .reveal ul {font-size: 32px !important;}
</style>

- Neural architecture search aims to automate and systematize the discovery of **novel** neural architectures
- Approaches can be classified according to:
  - Search space
  - Search strategy
  - Performance estimation strategy
- [AutoML.org](https://automl.org) is a good place to start if you want to learn more!

---

# Thank you! :nerd_face:

---

### References

<!-- .slide: data-id="references" -->

<style>
  .reveal p {font-size: 20px !important;}
  .reveal ul, .reveal ol {
    display: block !important;
    font-size: 32px !important;
  }
  section[data-id="references"] p {
    text-align: left !important;
  }
</style>

[1] Elsken, Thomas et al. ["Neural Architecture Search: A Survey."](https://arxiv.org/abs/1808.05377) ArXiv abs/1808.05377 (2019).

[2] Liu, Hanxiao et al. ["Hierarchical Representations for Efficient Architecture Search."](https://arxiv.org/abs/1711.00436) ArXiv abs/1711.00436 (2018).

[3] Zoph, Barret et al. ["Learning Transferable Architectures for Scalable Image Recognition."](https://arxiv.org/abs/1707.07012) 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018).

[4] Zoph, Barret and Quoc V. Le. ["Neural Architecture Search with Reinforcement Learning."](https://arxiv.org/abs/1611.01578) ArXiv abs/1611.01578 (2017).

[5] Stanley, Kenneth and Risto Miikkulainen. ["Evolving Neural Networks through Augmenting Topologies."](http://nn.cs.utexas.edu/downloads/papers/stanley.ec02.pdf) Evolutionary Computation 10(2): 99-127 (2002).

[6] Liu, Hanxiao et al. ["DARTS: Differentiable Architecture Search."](https://arxiv.org/pdf/1806.09055.pdf) ArXiv abs/1806.09055 (2019).

[7] [AutoML.org](https://www.automl.org/)