
Smart at what cost? Characterising Mobile DNNs in the wild

By Stefanos Laskaridis, Samsung AI Center - Cambridge

Introduction

Smartphones are all around us today, available in various tiers and form factors. A large part of what makes them ‘smart’ is (i) their integration of sensors, which lets them ‘sense’ their environment, and (ii) their ability to run ML models, which lets them ‘understand’ it. Deep Neural Networks (DNNs) are a central aspect of mobile intelligence, with use-cases ranging from vision tasks, such as computational photography and background separation, to speech, with voice assistants, and NLP, with text or handwriting understanding.

However, DNN inference remains a computationally heavy task, especially as models get deeper and wider in pursuit of higher accuracy. While devices have also been getting more capable* and there has been substantial research on mobile-specific DNN optimisations [1-4], the heterogeneity of deployed devices makes uniform optimisation a tall order. This raises the question of how DNN inference correlates with computational and energy efficiency across different deployment targets.

gaugeNN: Benchmarking DNN models in the wild

To answer this question, we performed the first holistic study of DNN usage in the wild, tracking deployed models and measuring how they run on widely used devices. In our paper, Smart at what cost? Characterising Mobile Deep Neural Networks in the wild [5], accepted at ACM IMC 2021, we analyse over 16K of the most popular apps in the Google Play Store to characterise their DNN usage and performance across devices of different capabilities, in terms of both tiers and generations. Simultaneously, we measure the models’ energy footprint as a core cost dimension of any mobile deployment. To streamline the process, we developed gaugeNN, a tool that automates the deployment, measurement and analysis of DNNs on devices, with support for different frameworks and platforms. The general workflow and architecture of our system are depicted in Figures 1 and 2 below.

Figure 1. Workflow of gaugeNN

Figure 2. gaugeNN benchmarking platform
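
To give a flavour of the extraction step, the sketch below scans an APK (which is a zip archive) for files whose suffixes are associated with popular mobile ML frameworks. This is a minimal illustration assuming suffix-based matching; gaugeNN’s actual discovery rules, which also handle less obvious packaging, are detailed in the paper.

```python
import pathlib
import zipfile

# An illustrative subset of model-file suffixes used by mobile ML frameworks
# (TFLite, TensorFlow, NCNN, Caffe, ONNX). Not gaugeNN's exact rule set.
MODEL_SUFFIXES = (".tflite", ".lite", ".pb", ".param", ".caffemodel", ".onnx")

def extract_models(apk_path, out_dir):
    """Scan an APK (a zip archive) and extract files that look like DNN models."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    found = []
    with zipfile.ZipFile(apk_path) as apk:
        for name in apk.namelist():
            if name.lower().endswith(MODEL_SUFFIXES):
                (out / pathlib.Path(name).name).write_bytes(apk.read(name))
                found.append(name)
    return found
```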

In total, we crawled more than 33K applications across two snapshots spanning a year (Table 1) and ran the extracted models on three Samsung smartphones and three Qualcomm development boards (Table 2), whose open design allows us to measure energy consumption (Figure 2).

Table 1. Dataset snapshot details

Table 2. Device specifications

Results and insights

Our in-the-wild evaluation yielded several interesting insights for the community.

DNNs in AI-powered applications

First and foremost, and in line with previous studies [6,7], we witnessed considerable growth in the number of DNN models in AI-powered applications over the past three years, from 176 in 2018 to 1,666 in April 2021. These results demonstrate how the proliferation of mobile ML frameworks, the availability of pre-trained models and the constant improvement of mobile hardware have driven this growth, and they highlight the need to keep up with this ever-increasing adoption.

Q: Are all these models unique?
A natural question to ask is whether all these models are different. To answer it, we analysed model uniqueness with weight checksums, both at the model level (to detect shared pre-trained models) and at the layer level (to detect fine-tuning). Our results suggest that it is common for developers to take a widely available pre-trained model and pay the much smaller cost of retraining only a subset of its last layers offline. While online on-device training is a prominent future avenue, be it through fine-tuning or federated learning, current support in mobile frameworks is limited, as are such deployments.
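
A minimal sketch of this kind of analysis is shown below: hashing the whole model file flags apps that ship an identical pre-trained network, while per-layer hashes of the weight tensors flag models that share a backbone but diverge in their final layers. The helper names here are hypothetical; the paper describes the exact procedure.

```python
import hashlib
import numpy as np

def model_checksum(path):
    # Whole-file hash: identical digests across apps indicate the same
    # pre-trained model being reused verbatim.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def layer_checksums(weights_by_layer):
    # Per-layer hashes: two models that share most digests but differ in
    # the last few layers point to fine-tuning of a common backbone.
    return {name: hashlib.sha256(np.ascontiguousarray(w).tobytes()).hexdigest()
            for name, w in weights_by_layer.items()}
```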

DNN operations

Q: Where are DNN models most frequently deployed?
By analysing the app and task categories that models are used for, we found that vision models are the most prevalent, with a focus on object and face detection as well as text recognition, deployed mostly in communication, photography and beauty apps. This correlates with the most frequent type of DNN operation, the convolution (Figure 3). Not only are convolutions prevalent in vision models, they also map well to mobile hardware for efficient execution, compared to, for example, recurrent layers [8]. While depth-wise convolutions can significantly reduce computation, they appear more rarely, as they can impact model accuracy.

Figure 3. Model layer composition per input modality for TFLite, NCNN and Caffe
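
To see why depth-wise separable convolutions are attractive in the first place, a back-of-the-envelope operation count helps; the sketch below compares the two for an illustrative layer shape (the numbers are our own example, not from the paper).

```python
def conv_macs(h, w, c_in, c_out, k):
    # Standard convolution: every output channel sees every input channel.
    return h * w * c_in * c_out * k * k

def depthwise_separable_macs(h, w, c_in, c_out, k):
    # Depth-wise (one k x k filter per channel) + point-wise (1x1 channel mixing).
    return h * w * c_in * k * k + h * w * c_in * c_out

# Example: a 3x3 convolution on a 56x56 feature map, 128 channels in and out.
std = conv_macs(56, 56, 128, 128, 3)                 # ~462M multiply-accumulates
dws = depthwise_separable_macs(56, 56, 128, 128, 3)  # ~55M, roughly 8x fewer
```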

On-device performance

Q: How do these models perform on different devices?
We found huge variance in FLOPs and parameter counts (four orders of magnitude) across the traced models (Figure 4). This might be attributed to the granularity of the task behind a single inference. In fact, we observe wide variability in inference latency across devices, even for models with similar FLOP counts, which is in line with our previous research [9] and reaffirms the need for on-device benchmarking.

Figure 4. Observed relationship between latency and FLOPs across different devices
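
In practice, such on-device benchmarking boils down to repeatedly invoking the model and recording latencies. A minimal sketch for a TFLite model is given below; the warm-up and run counts are illustrative defaults, not gaugeNN’s exact configuration.

```python
import time
import numpy as np
import tensorflow as tf

def benchmark_latency_ms(model_path, warmup=10, runs=100):
    """Median latency of a TFLite model on random inputs, in milliseconds."""
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    data = np.random.random_sample(inp["shape"]).astype(inp["dtype"])

    for _ in range(warmup):  # warm-up: caches, DVFS, delegate initialisation
        interpreter.set_tensor(inp["index"], data)
        interpreter.invoke()

    latencies = []
    for _ in range(runs):
        interpreter.set_tensor(inp["index"], data)
        start = time.perf_counter()
        interpreter.invoke()
        latencies.append(time.perf_counter() - start)
    return 1e3 * float(np.median(latencies))
```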

Devices of different tiers and generations offer widely varying performance, with lower-tier devices falling significantly behind (Figure 5). Even devices integrating the same SoC can perform differently due to vendor-specific configurations, the installed apps and drivers, or different thermal characteristics. Given this heterogeneity, it is hard for developers to accurately predict their users' experience without testing their models on a large sample of devices.

Figure 5. Inference latency ECDF per device

Energy consumption is also a major concern on mobile, and intelligence comes at a cost to battery life. Unlike latency, however, which visibly improves with each new generation of devices, energy consumption seems to depend predominantly on the model architecture. Even though newer hardware improves in power efficiency (Figure 6), the gains are much less pronounced than the corresponding latency improvements, and they are dwarfed by differences across model architectures. This suggests that it is the AI developers who can do the most for battery life, unlike plain latency, which can be improved at multiple levels of the stack, including by manufacturers.

Figure 6. Distributions of inference energy, power and efficiency of the collected models when run across 3 generations of Qualcomm SoCs.
The lines represent kernel density estimations.
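
On the open development boards, per-inference energy can be derived by integrating a sampled power trace over the inference window and subtracting the idle baseline. The sketch below illustrates the arithmetic with made-up numbers; the actual measurement hardware and sampling rates are described in the paper.

```python
import numpy as np

def energy_joules(power_watts, sample_rate_hz):
    # Rectangle-rule integration of a uniformly sampled power trace:
    # E = sum(P_i * dt), with dt = 1 / sample_rate.
    return float(np.sum(power_watts)) / sample_rate_hz

# Hypothetical 1-second trace at 5 kHz: idle, a 0.6 s inference burst, idle.
trace = np.concatenate([np.full(1000, 0.8),   # ~0.8 W idle baseline
                        np.full(3000, 3.5),   # ~3.5 W during inference
                        np.full(1000, 0.8)])
total = energy_joules(trace, 5000)            # 2.42 J over the window
inference_only = total - 0.8 * 1.0            # 1.62 J attributable to inference
```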

Per-device specialisation & optimisations

Q: Can we optimise DNN model deployment?
With respect to how DNN models are distributed to devices, we found no evidence of per-device customisation. While Google Play Services allow for models in Android applications to be distributed post-installation (e.g. through OBBs [10] or Asset Delivery [11]), this functionality appears underutilised in the realm of mobile ML; developers do not seem to specialise their models per device SoC or model. Specialising the distribution per target device can benefit performance and energy, but it requires offline, vendor-specific customisation of the model. Evidently, app developers prefer generality in their deployment solutions, in line with [12], and defer optimisation to middleware in the stack, such as NNAPI [13] drivers or specific hardware delegates [14]. Even under this premise, however, results from our experiments tell a mixed story about hardware- and framework-specific optimisations: while they can yield noticeably better performance across models, this is not always the case, due to driver implementations or other low-level confounding factors. The dilemma of target generality vs. hardware-specific optimisation ultimately lies in the hands of developers and the resources they have at their disposal to extract every bit of performance from the hardware.

Very often the alternative is to offload ML computation to a remote endpoint altogether, thus avoiding on-device bottlenecks and heterogeneity [15], but at the price of added inference latency. Our results indicate that cloud APIs from Google and Amazon are gaining popularity, as they allow developers to quickly deploy AI capabilities without specialised ML expertise or costly training infrastructure. Moreover, developers do not need to maintain training data on-premise, and the resulting apps can be supported by heterogeneous devices with similar QoE.

Figure 7. #apps invoking cloud-based ML APIs. Categories with <10 apps are excluded.
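
On the on-device side, the generality-vs-specialisation choice discussed above often comes down to which execution backend the interpreter is handed. A minimal TFLite sketch of the two modes follows; the delegate library name is a placeholder and vendor-specific, and on Android the equivalent is typically done through the Java/Kotlin delegate or NNAPI APIs.

```python
import tensorflow as tf

# Generality: the plain CPU interpreter that most apps we analysed ship with.
cpu_interpreter = tf.lite.Interpreter(model_path="model.tflite")

# Specialisation: route execution through a vendor-provided delegate library.
# "libvendor_npu_delegate.so" is hypothetical, not a real library name.
delegate = tf.lite.experimental.load_delegate("libvendor_npu_delegate.so")
npu_interpreter = tf.lite.Interpreter(
    model_path="model.tflite",
    experimental_delegates=[delegate],
)
```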

Conclusion - implications and future trends

Based on the results of our analysis, we draw certain conclusions about the state of mobile AI today and conjecture about the trends and challenges of its future.

Proliferation of mobile AI. Our results indicate that both on-device and cloud-supported DNN applications are increasing rapidly (doubled within a year). This is mostly driven by the availability of pre-trained models and easy-to-use cloud-based APIs, offering turnkey solutions for ML.

DNNs and mobile hardware resources. We witness that most applications do not take advantage of SoC-specific accelerators to speed up inference, but rather target generality, either by shipping vanilla CPU-only execution or by integrating framework-specific middleware (NNAPI). Simultaneously, offloading inference [2,16] to the cloud offers consistent QoE that does not depend on the target device, at the expense of privacy and monetary cost. This behaviour is a consequence of the fragmentation of the Android ecosystem in terms of hardware capabilities and software support (e.g. vendor-specific NNAPI drivers). Consequently, we anticipate the need for automated solutions for the optimised development and deployment of ML in mobile apps, abstracting away the efficiency and heterogeneity complexities of the ecosystem.

Energy as a bottleneck. While Deep Learning adoption is undisputed, with an accelerating trajectory ahead, manufacturers are turning to specialised hardware for faster and more efficient ML (e.g. NPUs). The same cannot be said for battery technology and capacity, which remain relatively stagnant. We anticipate that energy will sooner or later become a bottleneck in DNN deployment, requiring novel solutions to support mobile intelligence on the go.

DNN co-habitation. With more and more applications shipping DNN-powered features, we also anticipate the co-existence and parallel execution of multiple DNNs on a device. Researchers will need to tackle this emerging problem and support such workloads efficiently, by means of OS- or hardware-level solutions.

On-device learning and personalisation. With users becoming increasingly privacy-aware, and with legislation discouraging the storage of user data without legitimate interest, on-device training and federated learning [4] are becoming more prevalent [17,18]. Moreover, with the proliferation of on-device data, on-device personalisation [19] is also gaining traction. These tasks create a different kind of on-device workload, which current and future tools will need to optimise for and support.

Publication

Our paper appeared at the ACM Internet Measurement Conference (IMC), 2021.

- Paper: https://dl.acm.org/doi/abs/10.1145/3487552.3487863
- Preprint: https://arxiv.org/abs/2109.13963

Acknowledgements: The author would like to thank his team and collaborators: Mario Almeida, Abhinav Mehrotra, Lukasz Dudziak, Ilias Leontiadis and Nicholas Lane.


Footnotes

*Integrating more transistors than ever before, as well as specialised accelerators on the SoC for such workloads (e.g. GPUs, NPUs).



References

[1] Stefanos Laskaridis, Stylianos I. Venieris, Hyeji Kim, and Nicholas D. Lane. 2020. HAPI: hardware-aware progressive inference. In 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD), pp. 1-9.

[2] Stefanos Laskaridis, Stylianos I. Venieris, Mario Almeida, Ilias Leontiadis, and Nicholas D. Lane. 2020. SPINN: Synergistic Progressive Inference of Neural Networks over Device and Cloud. In The 26th Annual International Conference on Mobile Computing and Networking (MobiCom).

[3] Alexandros Kouris, Stylianos I. Venieris, Stefanos Laskaridis, and Nicholas D. Lane. 2021. Multi-Exit Semantic Segmentation Networks. arXiv preprint arXiv:2106.03527.

[4] Samuel Horvath, Stefanos Laskaridis, Mario Almeida, Ilias Leontiadis, Stylianos I. Venieris, and Nicholas D. Lane. 2021. FjORD: Fair and Accurate Federated Learning under heterogeneous targets with Ordered Dropout. In NeurIPS 2021.

[5] Mario Almeida, Stefanos Laskaridis, Abhinav Mehrotra, Lukasz Dudziak, Ilias Leontiadis, and Nicholas D. Lane. 2021. Smart at what cost? Characterising Mobile Deep Neural Networks in the wild. In Proceedings of the 21st ACM Internet Measurement Conference (IMC '21). Association for Computing Machinery, New York, NY, USA, 658–672. DOI: https://doi.org/10.1145/3487552.3487863

[6] Mengwei Xu, Jiawei Liu, Yuanqiang Liu, Felix Xiaozhu Lin, Yunxin Liu, and Xuanzhe Liu. 2019. A first look at deep learning apps on smartphones. In The World Wide Web Conference. 2125–2136.

[7] Zhichuang Sun, Ruimin Sun, Long Lu, and Alan Mislove. 2021. Mind Your Weight(s): A Large-scale Study on Insufficient Machine Learning Model Protection in Mobile Apps. In 30th USENIX Security Symposium (USENIX Security '21). USENIX Association. https://www.usenix.org/conference/usenixsecurity21/presentation/sun-zhichuang

[8] Xingyao Zhang, Chenhao Xie, Jing Wang, Weidong Zhang, and Xin Fu. 2018. Towards Memory Friendly Long-Short Term Memory Networks (LSTMs) on Mobile GPUs. In Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-51). IEEE Press, 162–174. https://doi.org/10.1109/MICRO.2018.00022

[9] Mario Almeida, Stefanos Laskaridis, Ilias Leontiadis, Stylianos I Venieris, and Nicholas D Lane. 2019. EmBench: Quantifying Performance Variations of Deep Neural Networks across Modern Commodity Devices. In The 3rd International Workshop on Deep Learning for Mobile Systems and Applications (EMDL). 1–6.

[10] APK Expansion Files - https://developer.android.com/google/play/expansion-files

[11] Play Asset Delivery - https://developer.android.com/guide/playcore/asset-delivery

[12] Carole-Jean Wu et al. 2019. Machine Learning at Facebook: Understanding Inference at the Edge. In IEEE International Symposium on High Performance Computer Architecture (HPCA).

[13] Neural Networks API - https://developer.android.com/ndk/guides/neuralnetworks

[14] TensorFlow Lite Delegates - https://www.tensorflow.org/lite/performance/delegates

[15] Andrey Ignatov, Radu Timofte, Andrei Kulik, Seungsoo Yang, Ke Wang, Felix Baum, Max Wu, Lirong Xu, and Luc Van Gool. 2019. AI Benchmark: All About Deep Learning on Smartphones in 2019. In International Conference on Computer Vision (ICCV) Workshops.

[16] Mario Almeida, Stefanos Laskaridis, Stylianos I. Venieris, Ilias Leontiadis, and Nicholas D. Lane. 2021. DynO: Dynamic Onloading of Deep Neural Networks from Cloud to Device. To appear in the Special Issue on Accelerating AI on the Edge, ACM Transactions on Embedded Computing Systems (TECS), 2022. https://arxiv.org/abs/cs.DC/2104.09949

[17] Matthias Paulik, Matt Seigel, Henry Mason, Dominic Telaar, Joris Kluivers, Rogier van Dalen, Chi Wai Lau, Luke Carlson, Filip Granqvist, Chris Vandevelde, et al. 2021. Federated Evaluation and Tuning for On-Device Personalization: System Design & Applications. arXiv preprint arXiv:2102.08503 (2021).

[18] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Konečný, Stefano Mazzocchi, Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. 2019. Towards Federated Learning at Scale: System Design. In Proceedings of Machine Learning and Systems, A. Talwalkar, V. Smith, and M. Zaharia (Eds.), Vol. 1. 374–388. https://proceedings.mlsys.org/paper/2019/file/bd686fd640be98efaae0091fa301e613-Paper.pdf

[19] Ilias Leontiadis, Stefanos Laskaridis, Stylianos I. Venieris, and Nicholas D. Lane. 2021. It’s Always Personal: Using Early Exits for Efficient On-Device CNN Personalisation. In Proceedings of the 22nd International Workshop on Mobile Computing Systems and Applications (HotMobile ’21). Association for Computing Machinery, New York, NY, USA, 15–21. https://doi.org/10.1145/3446382.3448359