Neural architecture search (NAS) is quickly becoming the standard methodology for designing deep neural networks (DNNs). NAS replaces human-designed DNNs by automatically searching for a DNN topology and size that maximize performance. This is especially important at Samsung where we customize DNNs for each device. For example, a DNN used in the portrait mode feature for Galaxy phones has to be both accurate to produce the best image quality, and efficient to save battery life. This typically means that a different DNN may have to be designed for each Galaxy phone to enhance the user’s experience.
Figure 1. In NAS, DNNs are iteratively proposed and evaluated until a “good” DNN is found that meets the accuracy (and optionally speed) requirements. However, the evaluation step is time consuming as it often involves full or partial training of the DNN which could take hours, days or even weeks of computation time.
Even though NAS can design tailored DNNs effectively, current NAS methodologies still face a major challenge: they are very compute-intensive and time-consuming. Conventional NAS approaches simply try out different DNN topologies and measure accuracy (and optionally efficiency) as shown in Figure 1. By evaluating multiple DNNs and receiving feedback, the controller iteratively improves at constructing a DNN topology. There is much research into what kind of controller works best; in this blog post, however, we focus on the evaluator. Existing evaluators are very slow because the DNN needs to be fully or partially trained. Our goal is to reduce that evaluation time as much as possible so that we can search for customized DNNs much more quickly.
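To make the roles of the controller and evaluator concrete, here is a minimal sketch of the search loop from Figure 1. The `controller` and `evaluator` interfaces are hypothetical placeholders, not the exact implementations used in our experiments.

```python
# Minimal sketch of the NAS loop in Figure 1; the controller/evaluator
# interfaces are hypothetical, not the exact ones used in our experiments.
def neural_architecture_search(controller, evaluator, budget):
    best_arch, best_score = None, float("-inf")
    for _ in range(budget):
        arch = controller.propose()        # controller proposes a DNN topology
        score = evaluator.evaluate(arch)   # expensive step: full or partial training
        controller.update(arch, score)     # feedback improves future proposals
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch
```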
In our paper published at the International Conference on Learning Representations (ICLR), we start by scrutinizing standard evaluators. Because full training takes a long time, a reduced form of training is often used when running NAS. There are four main ways to reduce training time, as described in previous work [1] and sketched in code after the list:
1. Reduce the number of epochs.
2. Reduce DNN model size by decreasing the number of channels.
3. Reduce the training data resolution, for example, by resizing the input images.
4. Reduce the number of training points by subsampling the training data.
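A reduced-training proxy simply combines these knobs. The sketch below shows illustrative proxy settings; the specific values are hypothetical and not the ones used by EcoNAS.

```python
# Illustrative reduced-training proxy settings (all values are hypothetical).
proxy_config = {
    "epochs": 10,            # 1. fewer training epochs than full training
    "channel_scale": 0.25,   # 2. shrink the number of channels in each layer
    "input_resolution": 16,  # 3. downsize input images (e.g., 32x32 -> 16x16)
    "train_subsample": 0.1,  # 4. train on a 10% subset of the training data
}
```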
This reduced training is often referred to as a "proxy" because it does not capture the exact full-training accuracy; instead, it approximates it with fewer computations and a shorter runtime. Reduced-training proxies are often taken for granted, so our first endeavor was to demystify how well these proxies represent the final task. To do this, we evaluated a common reduced-training proxy (EcoNAS) on all 15,625 models in a popular NAS benchmark called NAS-Bench-201 [2]. The results were underwhelming: EcoNAS training had a rank correlation coefficient of only 0.6 when compared to final accuracy, as shown in Table 1. Additionally, the actual speedup of this proxy was 50x smaller than its theoretical FLOPs reduction would suggest because the GPU is vastly underutilized.
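The rank correlation coefficient used above measures whether a proxy orders models the same way their final accuracy does. A common choice is Spearman's rank correlation, which can be computed as follows; the scores and accuracies below are made-up numbers purely for illustration.

```python
from scipy.stats import spearmanr

# Hypothetical example: proxy scores and final accuracies for five models.
proxy_scores     = [0.12, 0.45, 0.33, 0.80, 0.27]
final_accuracies = [61.0, 70.5, 68.2, 73.9, 66.1]

rho, _ = spearmanr(proxy_scores, final_accuracies)
print(f"Rank correlation: {rho:.2f}")  # 1.0 means the proxy ranks models perfectly
```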
We realized that existing proxies for NAS were both inaccurate and slow, so we decided to study proxies at the extreme edge of the accuracy-runtime spectrum: proxies that incur almost zero computation overhead. We were inspired by the pruning-at-initialization literature, in which a single minibatch of data is used to assess the importance of the parameters in a neural network [3-7]. We adapted these pruning-at-initialization metrics to evaluate a DNN: each metric assigns a saliency value to every parameter in the network, and we sum these saliencies over the entire network. This gives us a "zero-cost" score that estimates the performance of a DNN. Our next step was to empirically measure how well our proxies correlate with final training accuracy. For that, we evaluated our proxies on a plethora of benchmarks and datasets spanning vision, speech and natural language processing, as presented in our ICLR paper.
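As an illustration, here is a minimal PyTorch sketch of one such zero-cost score in the style of the SNIP saliency [3]: sum |weight x gradient| over all parameters after a single forward/backward pass on one minibatch. This is a simplified sketch under those assumptions, not the exact implementation from our repository.

```python
import torch
import torch.nn as nn

def zero_cost_score(model: nn.Module, inputs: torch.Tensor, targets: torch.Tensor) -> float:
    """SNIP-style zero-cost score: sum of |weight * grad| over all parameters,
    computed from a single forward/backward pass on one minibatch."""
    model.zero_grad()
    loss = nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    score = 0.0
    for p in model.parameters():
        if p.grad is not None:
            score += (p * p.grad).abs().sum().item()
    return score

# Hypothetical usage with a tiny untrained model and one CIFAR-sized minibatch.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x, y = torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))
print(zero_cost_score(model, x, y))
```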
Figure 2 shows that our zero-cost proxies can outperform reduced-training proxies on NAS-Bench-201. The rank correlation coefficient is 0.8 with our best zero-cost proxies, significantly better than the 0.6 achieved by EcoNAS. Additionally, we do this with a single minibatch of data and a single forward/backward pass through the DNN, making it 1000x faster than EcoNAS!
Figure 2. Correlation of validation accuracy with final test accuracy on the NAS-Bench-201 search space. We show both the baseline EcoNAS proxy and our proposed zero-cost proxies. Many of the proxies we study substantially outperform EcoNAS in terms of correlation. All zero-cost proxies are 1000x faster than EcoNAS.
It is important to note that our proxy does not perform consistently across all benchmarks and datasets. Additionally, we provide no theoretical guarantees, only empirical evidence. This motivated the need for a low-risk integration of zero-cost proxies within existing NAS methods. To do that, we developed two techniques (sketched in code after the list):
1. Zero-cost warmup: Many NAS algorithms have a warm-up phase in which a pool of DNNs is collected to initialize the NAS controller. For example, genetic algorithms, which are often used in NAS, typically have an initial "evolution pool" from which DNNs are repeatedly mutated. With zero-cost warmup, we choose the DNNs in that initial pool deliberately based on their zero-cost scores. For example, if the pool size is 64, we would compute zero-cost scores for 1000 models and select the top 64 to initialize the pool.
2. Zero-cost move proposal: We guide the controller's selection of the next proposed DNN using zero-cost proxies. Instead of selecting just one model, the controller proposes many candidates; we quickly evaluate their zero-cost scores and fully train only the best one.
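The sketch below illustrates both techniques on top of the generic search loop shown earlier. Here `sample_random_arch`, `mutate` and `score_fn` are hypothetical helpers standing in for the actual search-space and zero-cost proxy code.

```python
def zero_cost_warmup(sample_random_arch, score_fn, pool_size=64, num_candidates=1000):
    """Score many random architectures with the zero-cost proxy and keep only
    the best ones to initialize the controller's pool (e.g., an evolution pool)."""
    candidates = [sample_random_arch() for _ in range(num_candidates)]
    candidates.sort(key=score_fn, reverse=True)
    return candidates[:pool_size]

def zero_cost_move(parent_arch, mutate, score_fn, num_moves=100):
    """Propose many mutations of the parent and return the one with the highest
    zero-cost score; only that model is then trained fully."""
    moves = [mutate(parent_arch) for _ in range(num_moves)]
    return max(moves, key=score_fn)
```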
To evaluate our zero-cost NAS system thoroughly, we add our warmup and move-proposal methods to four popular NAS algorithms: random search, reinforcement learning, aging evolution and predictor-based search. In Figure 3, we plot the "search curves" for these four NAS algorithms. Each search curve plots on the y-axis the best accuracy found after incurring the cost of training the number of models plotted on the x-axis. As the plots show, by adding either the zero-cost warmup or move proposal method, we are able to find better models much faster than before.
Both warmup and move proposal improve upon the baseline search algorithms, and these improvements are realized across all four search algorithms we evaluated.
Figure 3 only shows this for NAS-Bench-201, in which an image classification task is performed on the CIFAR image dataset. Does our method work on other tasks, datasets and search spaces? We repeated the experiments for a speech recognition task on NAS-Bench-ASR, a natural language processing task on NAS-Bench-NLP and another image classification task on NAS-Bench-101. Our improvements are consistent; in fact, on NAS-Bench-101 we achieved a 4x search speedup over the best previous method. You can find more details in our paper presented at the ICLR 2021 conference.
Figure 3: Search speedup with zero-cost proxies on the NAS-Bench-201 CIFAR-100 dataset. We show the impact of augmenting existing NAS algorithms with zero-cost warmup and move proposal. In all cases, we realize a significant speedup in search time to arrive at the same accuracy. The baseline algorithms are plotted: random search (RAND), reinforcement learning (RL), aging evolution (AE) and predictor-based search (BP). For each plot, we augment the search algorithms with zero-cost move proposal (denoted with “+ move”) or zero-cost warmup (denoted with “+ warmup”). The numbers in the parentheses represent the number of models used in either warmup or move proposal.
Mohamed S. Abdelfattah, Abhinav Mehrotra, Lukasz Dudziak, Nicholas D. Lane: Zero-Cost Proxies for Lightweight NAS. International Conference on Learning Representations (ICLR) 2021. https://arxiv.org/abs/2101.08134
Code is available at: https://github.com/SamsungLabs/zero-cost-nas
The author would like to thank his team and collaborators: Lukasz Dudziak, Abhinav Mehrotra, Thomas Chau, Hongkai Wen and Nicholas Lane.
[1] Dongzhan Zhou et al.: EcoNAS: Finding Proxies for Economical Neural Architecture Search. Computer Vision and Pattern Recognition (CVPR) 2020.
[2] Xuanyi Dong, Yi Yang.: NAS-Bench-201: Extending the Scope of Reproducible Neural Architecture Search. International Conference on Learning Representations (ICLR) 2020.
[3] Namhoon Lee, Thalaiyasingam Ajanthan, Philip Torr: SNIP: Single-shot Network Pruning based on Connection Sensitivity. International Conference on Learning Representations (ICLR) 2019.
[4] Chaoqi Wang, Guodong Zhang, Roger Grosse: Picking Winning Tickets Before Training by Preserving Gradient Flow. International Conference on Learning Representations (ICLR) 2020.
[5] Hidenori Tanaka et al.: Pruning Neural Networks Without Any Data by Iteratively Conserving Synaptic Flow. Neural Information Processing Systems (NeurIPS) 2020.
[6] Jack Turner et al.: BlockSwap: Fisher-guided Block Substitution for Network Compression on a Budget. International Conference on Learning Representations (ICLR) 2020.
[7] Joseph Mellor et al.: Neural Architecture Search Without Training. International Conference on Machine Learning (ICML) 2021.