---
layout: post
title: Rethinking Architecture Selection in Differentiable NAS
tags: [DARTS, NAS, deep-learning]
authors: {TBD}
---
<script
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"
type="text/javascript">
</script>
<!-- content -->
# Rethinking Architecture Selection in Differentiable NAS
### November 2022 | # DARTS # optimization # NAS # deep-learning
### 10-605 Fall 2022, Course Project
### Shiyu He, Liang Li, Wenjia Zhang
This post is our course project for 10-605, Fall 2022. In it, we review Rethinking Architecture Selection in Differentiable NAS (Wang et al., [2021](https://arxiv.org/abs/2108.04392))[[1]](#1), a conference paper at ICLR 2021. We start by describing the problem of architecture selection on top of supernet optimization, then introduce the solution proposed by the paper along with its theoretical and empirical results. Specifically, how well do the values of the architecture parameters reflect operation strength? Does the magnitude of an architecture parameter necessarily indicate the operation's contribution to the supernet's performance? In this blog post, we discuss these questions from multiple perspectives. We will show that the magnitude of the parameters does not necessarily indicate performance contribution, and that architecture selection from the supernet can be significantly improved with an alternative perturbation-based method.
## Table of Contents
1. [Introduction and Background](#sec1)
2. [A new selection method: Measuring operation influence directly](#sec2)
3. [Evaluation against State-of-the-Art Methods](#sec3)
4. [Results and Analysis](#sec4)
#
## <strong> 1 Introduction and Background<a name="sec1"></a> </strong>
Neural Architecture Search (NAS) algorithms have become popular across academia and industry for their potential to automate the process of discovering high-performance architectures. One of the best-known methods, DARTS (Liu et al., [2019](https://arxiv.org/abs/1806.09055))[[2]](#2), enables the search to be performed with a gradient-based optimizer in an end-to-end differentiable manner. In particular, the resulting supernet can be optimized via gradient-based methods, and the operations associated with the largest architecture parameters are selected to form the final architecture.
> <strong>DARTS</strong>
>
> DARTS (Liu et al., [2019](https://arxiv.org/abs/1806.09055))[[2]](#2) is a pioneering work that introduces the general differentiable NAS framework, which we will keep referring to throughout this post. In DARTS, the topology and the operations are searched together. Concretely, at the end of the search, it selects one operation for every edge in the normal (reduction) cell based on the architecture parameter α. Then it selects two input edges for each node in the cell by comparing the largest α of every input edge. The final architecture consists of one operation on each of the eight edges in the normal cell. The operation on an edge is either the "none" op or selected from a pool of seven candidates: skip connection, avg pool 3x3, max pool 3x3, sep conv 3x3, sep conv 5x5, dil conv 3x3, and dil conv 5x5.
>
<center><img src="https://i.imgur.com/4FyyHHE.png" width="500" height="250"/>
###### [Figure 1](https://drive.google.com/file/d/1s6vb_KJnjDtqJXHmCFCR-dMWEVEWqx2D/view): An Overview of DARTS in NAS Framework (Borrowed from lecture slides)
<img src="https://i.imgur.com/nHXcF1W.png" width="500" height="250"/>
###### [Figure 2](https://paperswithcode.com/method/darts). DARTS Mechanism Overview: (a) Operations on each edge are initially unknown. (b) Continuous relaxation of the search space by placing a mixture of candidate operations on each edge. (c) Joint optimization of the mixing probabilities and the network weights by solving a bilevel optimization problem. (d) Inducing the final architecture from the learned mixing probabilities.
</center>
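To make the continuous relaxation of Figure 2(b) concrete, here is a minimal PyTorch-style sketch of a single supernet edge: every candidate operation is applied to the input and the outputs are combined with softmax-normalized architecture parameters. This is our own illustration, not the official DARTS code; the `MixedOp`/`Zero` classes and the `magnitude_select` helper are named by us, and `magnitude_select` implements the magnitude-based selection rule that this post later challenges.

```python
# A minimal sketch (not the official DARTS code) of one supernet edge:
# a softmax(alpha)-weighted mixture over candidate operations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Zero(nn.Module):
    """The 'none' operation: outputs zeros of the same shape as its input."""
    def forward(self, x):
        return torch.zeros_like(x)

class MixedOp(nn.Module):
    """One edge of the supernet, holding all candidate operations and its alpha."""
    def __init__(self, ops: nn.ModuleDict):
        super().__init__()
        self.ops = ops                                            # name -> module
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(ops)))   # architecture parameters

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=-1)                   # mixing probabilities
        return sum(w * op(x) for w, op in zip(weights, self.ops.values()))

    def magnitude_select(self):
        """Magnitude-based selection: keep the non-'none' op with the largest alpha."""
        names = list(self.ops.keys())
        order = sorted(range(len(names)), key=lambda i: float(self.alpha[i]), reverse=True)
        return next(names[i] for i in order if names[i] != "none")

# Toy edge with channel-preserving placeholder ops (real DARTS uses its 8-op pool).
edge = MixedOp(nn.ModuleDict({
    "none": Zero(),
    "skip_connect": nn.Identity(),
    "sep_conv_3x3": nn.Conv2d(16, 16, 3, padding=1),  # stand-in for a separable conv
}))
out = edge(torch.randn(2, 16, 8, 8))
print(edge.magnitude_select())
```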
#
Despite its simplicity, the effectiveness of DARTS has been questioned. For example, a simple randomized search (Li & Talwalkar, [2019](https://arxiv.org/abs/1902.07638))[[3]](#3) outperforms the original DARTS, and several new NAS baselines have since been built on top of its performance. In addition, reproducibility issues with published NAS results make further investigation difficult. The robustness of DARTS has also been called into question: in one experiment (Zela et al. [2020](https://arxiv.org/abs/1909.09656))[[4]](#4), DARTS was tested on four different search spaces, and its performance degenerated significantly. It has also been observed that DARTS degenerates to networks filled with parameter-free operations such as skip connections or even random noise, leading to poor performance of the selected architecture.
Current analyses of DARTS failures mostly concentrate on the supernet optimization procedure, based on the implicit assumption that the values of the architecture parameters reflect the strength of the operations. But is this assumption reliable when we perform architecture selection? In this post, we raise the question: <em>is this underlying assumption true in most cases?</em> We test the assumption and show that, in many cases, the architecture parameters do not really indicate operation importance in a supernet: the operation with the larger architecture parameter does not necessarily yield higher validation accuracy, nor is the parameter a reliable indicator for selecting the best operation. Instead, the <strong><em>strength</em></strong> of each operation should be evaluated based on its contribution to the supernet's performance. Building on this idea, a perturbation-based method, which selects the operation on an edge by how much removing it perturbs the supernet's accuracy, can be used to derive the final architecture from the supernet.
[[click here to go to top]](#sec1) [[click here to go to reference]](#ref)
## <strong> 2 A new selection method: Measuring operation influence directly <a name="sec2"></a></strong>
### <strong> 2.1 Challenge the Implicit Assumption </strong>
Now it's time to take a close look at the relationship between the architecture parameter, $\alpha$, and operation strength, in order to finally answer the question: does $\alpha$ necessarily represent the strength of the underlying operation?
From here on we use $\alpha$ to denote the DARTS architecture parameters. DARTS jointly optimizes $\alpha$ and the model weights $w$ with the following bilevel objective via alternating gradient updates:
####
$$\min_{\alpha}\ \mathcal{L}_{val}(w^*, \alpha) \quad \text{s.t.} \quad w^* = \arg\min_{w}\ \mathcal{L}_{train}(w, \alpha) \tag{1}$$
Based on this objective, $\alpha$ is used both to weight the output of each operation on an edge and to derive the final architecture from the supernet, under the implicit assumption that a higher $\alpha$ value represents a stronger underlying operation.
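In practice, DARTS approximates the bilevel problem in Eq. (1) with alternating gradient steps on $\alpha$ (using validation data) and $w$ (using training data). The sketch below shows the first-order version of this alternation; the `weight_parameters()`/`arch_parameters()` accessors and the hyperparameters are our assumptions rather than the exact DARTS implementation.

```python
# Sketch of the alternating (first-order) updates that approximate Eq. (1).
# `supernet.weight_parameters()` and `supernet.arch_parameters()` are assumed
# accessors for w and alpha; the original DARTS also offers a second-order variant.
import torch

def search(supernet, train_loader, val_loader, criterion, epochs=50):
    w_opt = torch.optim.SGD(supernet.weight_parameters(), lr=0.025, momentum=0.9)
    a_opt = torch.optim.Adam(supernet.arch_parameters(), lr=3e-4, weight_decay=1e-3)
    for _ in range(epochs):
        for (x_tr, y_tr), (x_val, y_val) in zip(train_loader, val_loader):
            # Outer step: update alpha to reduce the validation loss.
            a_opt.zero_grad()
            criterion(supernet(x_val), y_val).backward()
            a_opt.step()
            # Inner step: update the weights w to reduce the training loss.
            w_opt.zero_grad()
            criterion(supernet(x_tr), y_tr).backward()
            w_opt.step()
    return supernet
```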
Let's start by giving a precise definition of the <strong><em>strength</em></strong> of an operation. For a single edge on a pretrained supernet, the strength of an operation on that edge can be naturally defined as the supernet accuracy after we discretize the edge to this operation. The operation that achieves the best discretization accuracy can be considered the best operation for the given edge (with sufficient training and fine-tuning to guarantee convergence, of course). <em>Figure 3</em> below illustrates the relationship between $\alpha$ and operation strength on randomly selected edges of a DARTS supernet. As we can see, the magnitude of $\alpha$ does not necessarily (and sometimes even inversely!) agree with the strength measured by discretization accuracy at convergence.

###### [Figure 3](https://arxiv.org/abs/2108.04392): $\alpha$ vs discretization accuracy at convergence of all operations on 2 randomly selected edges from a pretrained DARTS supernet (one subplot per edge).
#
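To make the definition of strength above concrete, here is a rough sketch of how one would measure the strength of a single operation: discretize the chosen edge to that operation, fine-tune the supernet until it re-converges, and report the validation accuracy. The `discretize` method and the `finetune`/`evaluate` callables are hypothetical placeholders we introduce for illustration only; the key observation is that every such measurement requires its own round of fine-tuning, which is what makes strength expensive to compute directly.

```python
# Hypothetical sketch of measuring operation strength (discretization accuracy
# at convergence). `candidate.discretize`, `finetune` and `evaluate` are assumed
# interfaces, not functions from the paper's code release.
import copy

def operation_strength(supernet, edge_id, op_name, finetune, evaluate,
                       finetune_epochs=20):
    """finetune(model, epochs=...) trains the weights; evaluate(model) -> val accuracy."""
    candidate = copy.deepcopy(supernet)           # leave the shared supernet untouched
    candidate.discretize(edge_id, op_name)        # keep only op_name on this edge
    finetune(candidate, epochs=finetune_epochs)   # recover from the discretization
    return evaluate(candidate)                    # = discretization accuracy
```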
Looking at the supernet's selections in <em>Figure 4</em>, we can see that based on $\alpha$ <em>(4(a))</em>, $\alpha_{skip\_connect}$ > $\alpha_{sep\_conv\_3x3}$ on 12 of the 14 edges, while based on strength <em>(4(b))</em>, the supernet benefits more from discretizing to $sep\_conv\_3x3$ than to $skip\_connect$ on half of the edges. Consequently, the child architecture derived from $\alpha$ values lacks representation ability and performs poorly due to too many skip connections.
<center><img src="https://i.imgur.com/6sLZE3S.png" width="600" height="300"/></center>
###### [Figure 4(a)](https://arxiv.org/abs/2108.04392): Operation strength on each edge of Search space 2 ($skip\_connect, sep\_conv\_3x3$). Edges in red are associated with the largest $\alpha$.
<center><img src="https://i.imgur.com/aT6iSpw.png" width="600" height="300"/></center>
###### [Figure 4(b)](https://arxiv.org/abs/2108.04392): Operation strength on each edge of Search space 2 ($skip\_connect, sep\_conv\_3x3$). Operations that result in the highest discretization validation accuracy at convergence. Parameterized operations are marked red.
#
So far we have demonstrated the "skip connection domination" issue caused by selecting architectures based on $\alpha$ values. Several other works [[5]](#5)[[6]](#6) have also verified this issue, pointing out that DARTS tends to assign large $\alpha$ to skip connections, resulting in shallow architectures with poor generalizability. Other works [[1]](#1)[[7]](#7) expand on this topic and argue that the phenomenon itself is a reasonable outcome of DARTS refining its estimation of the optimal feature map, but that it inevitably renders $\alpha_{skip}$ ineffective for architecture selection. To address this, an alternative indicator, <em>operation strength</em>, was proposed, which directly evaluates each operation's contribution to the supernet's performance.
### <strong> 2.2 An Alternative: Perturbation-Based Architecture Selection</strong>
In the previous section, we re-examined the assumption that the magnitude of $\alpha$ is an indication of operation strength. In this section, we introduce the new selection method, called Perturbation-based Architecture Selection. As the name suggests, this method shifts the focus away from $\alpha$: instead, it looks for a direct relationship between operation strength and the supernet's performance, with the aim of selecting better-performing architectures.
Before we dive into the details, let us build some intuition for this approach. The major difference between perturbation-based architecture selection and the original DARTS (Liu et al., 2019) is its opposite stance on the assumption that $\alpha$ indicates operation importance in a supernet. As discussed above, a natural measure of how much an operation contributes to the supernet's performance is its discretization accuracy, where discretization means committing an edge to a single operation and removing the other candidates. The disturbance this causes to the supernet can hurt its validation accuracy [[1]](#1), so additional fine-tuning is needed for the supernet to converge again. As a result, computing the validation accuracy of the discretized supernet is computationally expensive, since we have to carefully tune the supernet to obtain the discretization accuracy of every operation on every edge.
The newly proposed perturbation-based process attempts to reduce this computation cost while retaining the same, or even better, validation accuracy for the selected architecture. The algorithm proceeds edge by edge: it randomly selects an edge and, for each candidate operation on that edge, evaluates the supernet's validation accuracy with that operation masked out. With all of these validation accuracies in hand, it identifies the best operation for the edge as the one whose removal leads to the largest drop in validation accuracy. Note that this is similar in spirit to decision tree pruning: to avoid overfitting and reduce the complexity of a decision tree, a popular method is reduced error pruning (Elomaa et al., 2001) [[8]](#8), which removes a subtree whenever doing so does not decrease prediction accuracy on a validation set.
The benefit of the perturbation-based approach is that it preserves the operation with the biggest influence on the supernet's validation accuracy, while avoiding the expensive re-computation needed to bring the supernet back to convergence for every candidate discretization.
<center><img src="https://i.imgur.com/9T8FLAI.png"/>
###### [Figure 5](https://i.imgur.com/9T8FLAI.png): Pseudo-Code of the Perturbation-based Algorithm </center>
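As a rough Python sketch of the per-edge step in Figure 5 (again with assumed interfaces, here `mask_operation`/`unmask_operation` and an `evaluate` callable, rather than the paper's actual code): each candidate operation is temporarily removed from the edge, the supernet's validation accuracy is re-measured without any retraining, and the operation whose removal hurts accuracy the most is selected.

```python
# Sketch of perturbation-based selection for a single edge (cf. Figure 5).
# `mask_operation`/`unmask_operation` and `evaluate` are assumed interfaces.

def select_operation(supernet, edge_id, candidate_ops, evaluate):
    """evaluate(model) -> validation accuracy; no fine-tuning happens here."""
    base_acc = evaluate(supernet)
    drops = {}
    for op in candidate_ops:
        supernet.mask_operation(edge_id, op)        # remove only this op from the mixture
        drops[op] = base_acc - evaluate(supernet)   # accuracy drop caused by its removal
        supernet.unmask_operation(edge_id, op)      # restore the full mixture
    # The best operation is the one whose removal perturbs the supernet the most.
    return max(drops, key=drops.get)
```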
### <strong> 2.3 Implementation on DARTS </strong>
This method is implemented and operated as a layer on top of DARTS, replacing the one-shot discretization of the entire supernet. Whenever the best operation for an edge is selected, that edge is immediately discretized to it. To recover the validation accuracy lost to discretization, the supernet is then tuned for a few epochs before the next selection. In the implementation examples below, we limit this number to 5.
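Putting the pieces together, the outer loop might look like the sketch below (reusing `select_operation` from the previous sketch; `tune` and the `discretize` interface are again our assumptions): pick a random undecided edge, select its operation by perturbation, discretize that edge, tune the supernet for 5 epochs, and repeat until every edge is decided.

```python
# Sketch of PT layered on top of a trained DARTS supernet (Section 2.3).
# Reuses select_operation() from the previous sketch; `tune` and
# `supernet.discretize` are assumed interfaces.
import random

def perturbation_based_selection(supernet, edges, candidate_ops, evaluate, tune,
                                 tune_epochs=5):
    architecture = {}
    undecided = list(edges)
    while undecided:
        edge_id = random.choice(undecided)
        undecided.remove(edge_id)
        best_op = select_operation(supernet, edge_id, candidate_ops, evaluate)
        supernet.discretize(edge_id, best_op)   # commit this edge to best_op
        tune(supernet, epochs=tune_epochs)      # recover accuracy before the next edge
        architecture[edge_id] = best_op
    return architecture
```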
## <strong> 3 Evaluation against State-of-the-Art Methods <a name="sec3"></a></strong>
In this section, we evaluate the proposed method against state-of-the-art baselines, namely DARTS (Liu et al., 2019) [[2]](#2) and the NAS-Bench-201 benchmark (Dong & Yang, 2020) [[9]](#9), as well as other variants of DARTS. In the paragraphs below, we refer to the proposed perturbation-based architecture selection as PT, and to its combination with DARTS as DARTS+PT. We also demonstrate how PT can be incorporated into other current state-of-the-art methods.
### <strong> 3.1 Performance with DARTS in the CNN Search Space </strong>
Based on the algorithm skeleton discussed in Figure 5, we incorporate PT into the architecture selection step of DARTS, while keeping the rest of the search and training procedure unchanged.
The empirical results show that DARTS+PT performs significantly better than the original DARTS, reducing the test error from 3.00% to 2.61%. We can also compare performance across various DARTS variants, including SDARTS (Smooth DARTS) (Chen & Hsieh, 2020) [[10]](#10) and SGAS (Sequential Greedy Architecture Search) (Li et al., 2020) [[11]](#11). SGAS, by design, performs progressive search-space shrinking rather than a single final selection, so when comparing with these methods we only substitute the magnitude-based operation selection with PT. In the end, the best run of DARTS+PT reaches a test error of 2.48%, which ranks among the top of the DARTS variants.
This suggests that architecture selection remains a critical component of DARTS, and that with perturbation-based selection added, DARTS+PT is a more competitive method than the original DARTS.
### <strong> 3.2 Performance on the NAS-Bench-201 Search Space </strong>
To further test the applicability of PT, we compare the traditional magnitude-based method and the PT approach on the NAS-Bench-201 search space. NAS-Bench-201 follows a similar, cell-based architecture design to DARTS (Dong & Yang, 2020) [[9]](#9). The comparison is conducted over 3 datasets: CIFAR-10, CIFAR-100, and ImageNet16-120. CIFAR-10 consists of 60,000 32x32 colour images in 10 classes, and CIFAR-100 consists of 100 classes with 600 images each (Krizhevsky, 2009) [[12]](#12). As the test-accuracy plots in Figure 6 show, the PT-based approach outperforms magnitude-based selection and is more stable on all three datasets.
<center><img src="https://i.imgur.com/Xe7YM9R.png"/>
###### [Figure 6](https://i.imgur.com/Xe7YM9R.png): Test accuracy of the magnitude-based method and PT on the NAS-Bench-201 search space over 3 datasets </center>
### <strong> 3.3 Conclusion on Performances </strong>
Figure 7 shows the complete comparison of PT with its magnitude-based counterpart for each method. The results inside the red box significantly outperform the corresponding magnitude-based selection across a range of state-of-the-art methods, on both the DARTS and NAS-Bench-201 search spaces.
One key takeaway is that methods equipped with PT tend to find better-performing architectures. The other highlight of the table is that replacing magnitude-based selection increases the search cost for an equivalent number of parameters: for example, searching with DARTS+PT takes about 0.8 GPU days on average, whereas DARTS takes half of that, 0.4 GPU days. This raises the question of the trade-off between the performance gain and the extra search cost.
<center><img src="https://i.imgur.com/3KjGjQU.png" width="550" height="520"/>
###### [Figure 7]: Test error (%) comparison with state-of-the-art architectures on CIFAR-10 </center>
## <strong> 4 Results and Analysis <a name="sec4"></a></strong>
* Robustness of DARTS
According to Zela et al. (2020)[[4]](#4), DARTS can be unstable and yield poorly performing architectures. To examine this robustness issue, the comparison experiments below show that DARTS tends to generate architectures with many skip connections and high test error. However, when perturbation-based architecture selection is combined with DARTS, test errors improve significantly across different search spaces and datasets, and meaningful architectures are generated. This experiment suggests that the DARTS supernet itself is useful; the problem lies with the original magnitude-based architecture selection.
<center><img src="https://i.imgur.com/wxQQ7Mw.png" width="500" height="250"/>
###### [Figure 8]: Test error (%) of several differentiable NAS methods across 4 search spaces and 3 datasets </center>
<center><img src="https://i.imgur.com/X0j3YJG.png" width="700" height="200"/>
###### [Figure 9]: Comparison of normal cells found on S2 and S4. </center>
* Progressive tuning
As mentioned before, at each round of operation selection the supernet is tuned after an edge is discretized, which helps it regain some accuracy. To rule out the conjecture that the higher performance of perturbation-based selection comes from this tuning process, an ablation study compares the original (magnitude-based) DARTS with tuning against the new (perturbation-based) DARTS with tuning. As the results show, perturbation-based DARTS with tuning wins most of the time, indicating that the operation selection criterion, not the tuning, is the critical factor behind the improvement.
<center><img src="https://i.imgur.com/qfR2FrQ.png" width="400" height="300"/>
###### [Figure 10]: validation accuracy (%) in the operation selection phase on S2. </center>
* Fixing $\alpha$
Unlike the magnitude-based method, perturbation-based architecture selection does not rely on the magnitude of $\alpha$, which makes us rethink whether optimizing $\alpha$ is necessary at all. Comparison experiments reveal the surprising result that perturbation-based DARTS with $\alpha$ held fixed performs at least as well as perturbation-based DARTS with $\alpha$ optimized. This indicates that, when using the proposed perturbation-based architecture selection, optimizing $\alpha$ may be unnecessary, and a strong architecture search method is obtained all the same.
<center><img src="https://i.imgur.com/1Uaim8H.png" width="450" height="70"/>
###### [Figure 11]: DARTS+PT vs. DARTS+PT (fixed $\alpha$) test error (%) on CIFAR-10. </center>
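For intuition, the fixed-$\alpha$ variant can be as simple as freezing the architecture parameters at their uniform initial value so that only the supernet weights are trained, and then running the same perturbation-based selection. The sketch below assumes the `arch_parameters()` accessor from our earlier sketches; it is an illustration of the idea, not the paper's code.

```python
# Sketch of the fixed-alpha variant: keep alpha uniform and untrained, so only
# the supernet weights w are optimized before perturbation-based selection.
# `supernet.arch_parameters()` is an assumed accessor, as in earlier sketches.
import torch

def freeze_alpha(supernet):
    for alpha in supernet.arch_parameters():
        with torch.no_grad():
            alpha.zero_()              # softmax of all-zeros = uniform mixing weights
        alpha.requires_grad_(False)    # exclude alpha from the optimizer's updates
```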
## <strong> Conclusion </strong>
This work investigates the magnitude-based architecture selection process of DARTS and shows that a high magnitude of $\alpha$ does not necessarily correspond to high operation strength. As an alternative, the perturbation-based architecture selection method is introduced, which measures each operation's contribution to the supernet's validation accuracy on each edge and selects accordingly. A variety of experiments confirm its much improved performance over DARTS. This new method gives researchers another way to derive a better final architecture, as well as more flexibility to train supernets without having to optimize $\alpha$. The significance of $\alpha$ in differentiable NAS methods deserves to be reconsidered.
[[click here to go to top]](#sec1) [[click here to go to reference]](#ref)
<h2 id="ref">References</h2>
<a id="1">[1]</a> Ruochen Wang, Minhao Cheng, Xiangning Chen, Xiaocheng Tang, Cho-Jui Hsieh. Rethinking Architecture Selection in Differentiable NAS, [2021](https://arxiv.org/abs/2108.04392)
<a id="2">[2]</a> Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, [2019](https://arxiv.org/abs/1806.09055).
<a id="3">[3]</a> Liam Li and Ameet Talwalkar. Random search and reproducibility for neural architecture search, [2019](https://arxiv.org/abs/1902.07638).
<a id="4">[4]</a> Arber Zela, Thomas Elsken, Tonmoy Saikia, Yassine Marrakchi, Thomas Brox, and Frank Hutter. Understanding and robustifying differentiable architecture search. In International Conference on Learning Representations, [2020](https://arxiv.org/abs/1909.09656).
<a id="5">[5]</a> Kaifeng Bi, Changping Hu, Lingxi Xie, Xin Chen, Longhui Wei, and Qi Tian. Stabilizing darts with amended gradient estimation on architectural parameters, [2019](https://arxiv.org/abs/1910.11831).
<a id="6">[6]</a> Hanwen Liang, Shifeng Zhang, Jiacheng Sun, Xingqiu He, Weiran Huang, Kechen Zhuang, and Zhenguo Li. Darts+: Improved differentiable architecture search with early stopping, [2019](https://arxiv.org/abs/1909.06035).
<a id="7">[7]</a> Klaus Greff, Rupesh K. Srivastava, and Jürgen Schmidhuber. Highway and residual networks learn unrolled iterative estimation. In International Conference on Learning Representations (ICLR), [2017](https://arxiv.org/abs/1612.07771).
<a id="8">[8]</a> Tapio Elomaa, Matti Kääriäinen. An Analysis of Reduced Error Pruning. Journal of Artificial Intelligence Research, [2001](https://arxiv.org/abs/1106.0668).
<a id="9">[9]</a> Xuanyi Dong, Yi Yang. NAS-Bench-201: Extending the Scope of Reproducible Neural Architecture Search. In International Conference on Learning Representations, [2020](https://arxiv.org/abs/2001.00326v2).
<a id="10">[10]</a> Xiangning Chen, Cho-Jui Hsieh. Stabilizing Differentiable Architecture Search via Perturbation-based Regularization. In International Conference on Machine Learning, [2020](https://arxiv.org/abs/2002.05283).
<a id="11">[11]</a> Guohao Li, Guocheng Qian, Itzel C. Delgadillo, Matthias Muller, Ali Thabet, Bernard Ghanem. SGAS: Sequential Greedy Architecture Search. In Conference on Computer Vision and Pattern Recognition (CVPR), [2020](https://www.deepgcns.org/auto/sgas).
<a id="12">[12]</a> Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images, Chapter 3: Object Classification Experiments, [2009](https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf).