FjORD: Fair and Accurate Federated Learning under heterogeneous targets with Ordered Dropout

By Stefanos Laskaridis, Samsung AI Center - Cambridge

Abstract

Federated Learning (FL) has been gaining significant traction across different ML tasks, ranging from vision to keyboard predictions. In large-scale deployments, client heterogeneity is a fact and constitutes a primary problem for fairness, training performance and accuracy. Although significant efforts have been made into tackling statistical data heterogeneity, the diversity in the processing capabilities and network bandwidth of clients, termed as system heterogeneity, has remained largely unexplored. Current solutions either disregard a large portion of available devices or set a uniform limit on the model's capacity, restricted by the least capable participants. In this work, we introduce Ordered Dropout, a mechanism that achieves an ordered, nested representation of knowledge in deep neural networks (DNNs) and enables the extraction of lower footprint submodels without the need of retraining. We further show that for linear maps our Ordered Dropout is equivalent to SVD. We employ this technique, along with a self-distillation methodology, in the realm of FL in a framework called FjORD. FjORD alleviates the problem of client system heterogeneity by tailoring the model width to the client's capabilities. Extensive evaluation on both CNNs and RNNs across diverse modalities shows that FjORD consistently leads to significant performance gains over state-of-the-art baselines, while maintaining its nested structure.

Federated Learning Setup

Over the last few years, more and more applications have been leveraging Deep Learning to provide intelligent feedback to the user, be it in the form of in-camera optimisations [1], intelligent assistants [2] or smart keyboards [3]. Until recently, the norm has been that such models are trained in large data centers over centralised datasets and then deployed for on-device inference (e.g. on smartphones) [4,5]. However, privacy concerns are becoming far more prevalent, and gathering and centralising user data is no longer a viable solution [6]. Simultaneously, the computational capabilities of devices have increased, with many now integrating specialised accelerators for fast and efficient ML execution [7,8]. Against this backdrop, Federated Learning was proposed as an alternative training paradigm that never directly touches user data [9]. Devices participating in the training process are handed a global model, which they train on their local datasets, sending only the resulting updates/gradients upstream to the coordinating server. The server, in turn, aggregates these updates (e.g. by averaging them in the case of FedAvg) and updates the global model, which is handed to the next set of sampled devices in the upcoming training round. This setup is commonly referred to as cross-device Federated Learning.
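To make the aggregation step concrete, here is a minimal sketch of FedAvg-style averaging in plain Python. The function name fedavg, the flat-list weight representation and the toy client values are illustrative assumptions, not part of the FjORD codebase; the weighting of each client by its local sample count follows the FedAvg formulation of [9].

```python
from typing import Dict, List, Tuple

Weights = Dict[str, List[float]]  # parameter name -> flat list of values


def fedavg(updates: List[Tuple[Weights, int]]) -> Weights:
    """Average client weights, weighting each client by its local sample count."""
    total = sum(n for _, n in updates)
    averaged: Weights = {}
    for name in updates[0][0].keys():
        size = len(updates[0][0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in updates) / total
            for i in range(size)
        ]
    return averaged


# Toy round with two clients: (locally trained weights, number of local samples).
clients = [
    ({"layer.weight": [1.0, 2.0]}, 10),
    ({"layer.weight": [3.0, 6.0]}, 30),
]
global_weights = fedavg(clients)
```

The client holding more data pulls the average towards its own weights, which is the intended behaviour when local dataset sizes differ.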

Heterogeneity in Federated Learning

In such a setup, however, two factors make federated learning more difficult. On the one hand there is data heterogeneity, which manifests as non-Independent and Identically Distributed (non-IID) datasets, i.e. different users hold different data, potentially lacking representation from all available classes. On the other hand there is device heterogeneity: devices in the wild vary in their computational, memory and networking capabilities, ranging from low-end to flagship devices and, of course, devices of previous generations. The former type of heterogeneity can pose convergence and accuracy issues, as different users might be solving slightly different tasks from one another [10]. In contrast, the latter can cause stragglers [11], i.e. devices that take longer to return their updates. Waiting for them on the server side can cause severe delays, whereas dropping them after an arbitrary deadline can consistently exclude a specific demographic and prevent the integration of its data.

Figure 1. FjORD employs Ordered Dropout (OD) to tailor the amount of computation to the capabilities of each participating device.

FjORD

FjORD tackles the issue of system heterogeneity in Federated Learning by means of Ordered Dropout, an importance-based pruning technique that creates nested submodels of the global model with a smaller footprint. These smaller variants are associated with device clusters, which are defined by the capabilities of their constituents (Figure 1). This way, FjORD leads to:
- Downstream communication gains: Each participating device is handed the largest model it can handle, which is smaller than or equal to the global model in size.
- Computational and memory gains: Each device then samples and trains submodels of the model it was handed.
- Upstream communication gains: Finally, each device uploads only the updates of the largest submodel it has trained.
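The downstream gain can be illustrated with a small sketch of how a server might cut a layer down to a cluster's maximum width before sending it. The cluster names, the widths in CLUSTER_MAX_WIDTH and the helper extract_submodel are hypothetical; only the idea of keeping the first ⌈p·K⌉ neurons comes from Ordered Dropout itself.

```python
import math
from typing import List

Matrix = List[List[float]]

# Hypothetical mapping from device cluster to the maximum width it can handle.
CLUSTER_MAX_WIDTH = {"low_end": 0.25, "mid_tier": 0.5, "flagship": 1.0}


def extract_submodel(weight: Matrix, p_out: float, p_in: float = 1.0) -> Matrix:
    """Keep the first ceil(p_out * rows) neurons and the first ceil(p_in * cols) inputs."""
    rows = math.ceil(p_out * len(weight))
    cols = math.ceil(p_in * len(weight[0]))
    return [row[:cols] for row in weight[:rows]]


# A toy hidden layer of the global model: 4 neurons, 3 input features each.
global_layer = [[float(r * 3 + c) for c in range(3)] for r in range(4)]

# What each cluster would download, and how many parameters that is.
payloads = {
    cluster: extract_submodel(global_layer, p_max)
    for cluster, p_max in CLUSTER_MAX_WIDTH.items()
}
sizes = {cluster: len(w) * len(w[0]) for cluster, w in payloads.items()}
```

In this toy setting the low-end cluster downloads a quarter of the layer's parameters, and the same slicing applies to what it uploads after training.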

Ordered Dropout

Ordered Dropout (OD), as mentioned above, is an importance-based pruning mechanism for the easy extraction of sub-networks. It is parametrised by:
- a set of candidate submodel widths: P = {p1, p2, ..., pk}
- a discrete distribution over widths: p ∼ D_P.
During training, we sample p ∼ D_P at each step and train the sub-network obtained by keeping only the hidden neurons indexed {0, 1, ..., ⌈p·K_l⌉ − 1}, where K_l is the number of neurons in layer l. Optionally, the p-submodel can distill knowledge from the widest model. For inference, we can select any width from P, based on the capabilities of the device at hand. Unlike Random Dropout, which acts only as a training-time regulariser, Ordered Dropout also serves as a dynamic network pruning technique whose submodels can be deployed without fine-tuning (Figure 2).
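The sampling and masking steps above can be sketched in a few lines of plain Python. The function names, the uniform choice of D_P and the toy activations are assumptions made for illustration; a real implementation would slice weight tensors rather than zero out a list.

```python
import math
import random
from typing import List, Sequence


def sample_width(widths: Sequence[float], probs: Sequence[float], rng: random.Random) -> float:
    """Draw p ~ D_P from the discrete distribution over candidate widths."""
    return rng.choices(widths, weights=probs, k=1)[0]


def ordered_dropout(activations: List[float], p: float) -> List[float]:
    """Keep neurons 0 .. ceil(p*K)-1 and zero out the rest (K = layer size)."""
    keep = math.ceil(p * len(activations))
    return [a if i < keep else 0.0 for i, a in enumerate(activations)]


P = [0.25, 0.5, 0.75, 1.0]       # candidate widths
D_P = [0.25, 0.25, 0.25, 0.25]   # uniform distribution over P (an assumption)
rng = random.Random(0)

hidden = [0.5, -1.2, 0.3, 2.0, 0.7, -0.4, 1.1, 0.9]  # K = 8 neurons, all non-zero
p = sample_width(P, D_P, rng)     # one draw per training step
masked = ordered_dropout(hidden, p)
```

Because the kept indices always form a prefix, the neurons used at width 0.25 are also used at every larger width, which is what makes the submodels nested.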

Figure 2. Ordered vs. Random Dropout. In this example, the left-most features are used by more devices during training, creating a natural ordering to the importance of these features.

In our paper we also illustrate how Ordered Dropout recovers the Singular Value Decomposition (SVD) formulation, i.e. the best rank-k approximation, in the case of a linear model, showing both theoretically and practically the importance-based nature of OD.
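For reference, the "best rank-k approximation" property of the SVD is the standard Eckart-Young result, stated below in the Frobenius norm. The symbols A, B, r, σ_i, u_i and v_i are notation introduced here for illustration and do not appear in the text above; the precise statement and proof of the OD equivalence are in the paper.

```latex
A = \sum_{i=1}^{r} \sigma_i u_i v_i^{\top}, \qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0,
\qquad
A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^{\top}
\quad\Longrightarrow\quad
\lVert A - A_k \rVert_F = \min_{\operatorname{rank}(B) \le k} \lVert A - B \rVert_F .
```

Intuitively, keeping the first k hidden neurons of an OD-trained linear map plays the same role as keeping the k leading singular components: the most important directions come first.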

Implementation and Evaluation Results

We implemented FjORD on top of PyTorch and flwr [12]. Below is the pseudocode of FjORD's runtime.
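As a rough stand-in, the server-side aggregation of one round could be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm verbatim: each client returns only the first rows (neurons) of a single layer, every neuron is averaged without sample-count weighting over the clients whose submodel contained it, and the name aggregate_nested and all values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def aggregate_nested(global_layer: Matrix, client_layers: List[Matrix]) -> Matrix:
    """Average each neuron (row) over the clients whose submodel contains it.

    Rows that no sampled client trained keep their previous global value.
    """
    new_layer: Matrix = []
    for r, old_row in enumerate(global_layer):
        owners = [layer[r] for layer in client_layers if r < len(layer)]
        if not owners:
            new_layer.append(list(old_row))
            continue
        new_layer.append([sum(col) / len(owners) for col in zip(*owners)])
    return new_layer


global_layer = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [9.0, 9.0]]
client_layers = [
    [[1.0, 1.0]],                          # narrow client: first neuron only
    [[3.0, 3.0], [2.0, 2.0]],              # medium client: first two neurons
    [[5.0, 5.0], [4.0, 4.0], [6.0, 6.0]],  # wider client: first three neurons
]
updated = aggregate_nested(global_layer, client_layers)
```

The first neuron is updated by all three clients and the third by only one, mirroring how the left-most features in Ordered Dropout are trained by the most devices.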


Our experiments targeted three datasets, namely CIFAR-10, FEMNIST and Shakespeare, thus including IID and non-IID settings (Table 1).

Table 1. Datasets and models used in the evaluation of FjORD.

We show that FjORD performs better than the baseline of extended Federated Dropout (eFD) - an extension of Federated Dropout [13] that adapts the dropout rate to the device capabilities (Figure 3).

Figure 3. Ordered Dropout with KD vs. eFD baselines. Performance vs. dropout rate p across different networks and datasets. We use five uniform device clusters here.

Moreover, we show our system's scalability to more device clusters (Figure 4) and adaptability potential to alternative device distributions (Figure 5).

Figure 4. Demonstration of FjORD’s scalability with respect to the number of device clusters.

Figure 5. Demonstration of FjORD’s adaptability across different device distributions.

Conclusion

In our work, we have devised Ordered Dropout and used it to address system heterogeneity in Federated Learning. Our results show that FjORD’s performance in both the local and the federated setting exceeds that of competing techniques, while maintaining flexibility across different environment setups. Future work includes the long-term deployment of FjORD and its extension to future-generation devices and models, in a life-long manner. Moreover, we plan to investigate the interplay between system and data heterogeneity for OD-based personalisation and to explore alternative dynamic network techniques [14,15] in lieu of Ordered Dropout.

Publication

Our paper appeared as a spotlight presentation at the 35th Conference on Neural Information Processing Systems (NeurIPS), 2021.

- Paper: https://proceedings.neurips.cc/paper/2021/hash/6aed000af86a084f9cb0264161e29dd3-Abstract.html
- Poster: https://cdn.gather.town/storage.googleapis.com/gather-town.appspot.com/uploads/dvJbP2PIrIHmxhmk/nZbgRwM3fA39xABVqwjHAD
- Preprint: https://arxiv.org/abs/2102.13451

Acknowledgements: The author would like to thank his team and collaborators: Samuel Horvath, Mario Almeida, Ilias Leontiadis, Stylianos Venieris and Nicholas Lane.

References

[1] Behind the Snapshot: How the Galaxy S21’s AI Improves Your Photos in the Blink of an Eye - https://news.samsung.com/global/behind-the-snapshot-how-the-galaxy-s21s-ai-improves-your-photos-in-the-blink-of-an-eye

[2] William Chan, Navdeep Jaitly, Quoc V Le, and Oriol Vinyals. Listen, attend and spell. arXiv preprint arXiv:1508.01211, 2015.

[3] Andrew Hard, Kanishka Rao, Rajiv Mathews, Swaroop Ramaswamy, et al. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018.

[4] Kim Hazelwood et al. Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018.

[5] Carole-Jean Wu et al. Machine Learning at Facebook: Understanding Inference at the Edge. In IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019.

[6] European Commission. GDPR: 2018 Reform of EU Data Protection Rules.

[7] Mario Almeida, Stefanos Laskaridis, Ilias Leontiadis, Stylianos I. Venieris, and Nicholas D. Lane. EmBench: Quantifying Performance Variations of Deep Neural Networks Across Modern Commodity Devices. In The 3rd International Workshop on Deep Learning for Mobile Systems and Applications (EMDL), 2019.

[8] Andrey Ignatov, Radu Timofte, Andrei Kulik, Seungsoo Yang, Ke Wang, Felix Baum, Max Wu, Lirong Xu, and Luc Van Gool. AI Benchmark: All About Deep Learning on Smartphones in 2019. In International Conference on Computer Vision Workshops (ICCVW), 2019.

[9] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

[10] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. Federated Multi-Task Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

[11] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Processing Magazine, 2020.

[12] Daniel J Beutel, Taner Topal, Akhil Mathur, Xinchi Qiu, Titouan Parcollet, and Nicholas D Lane. Flower: A Friendly Federated Learning Research Framework. arXiv preprint arXiv:2007.14390, 2020.

[13] Sebastian Caldas, Jakub Konečný, Brendan McMahan, and Ameet Talwalkar. Expanding the Reach of Federated Learning by Reducing Client Resource Requirements. In NeurIPS Workshop on Federated Learning for Data Privacy and Confidentiality, 2018.

[14] Stefanos Laskaridis, Alexandros Kouris, and Nicholas D. Lane. Adaptive Inference through Early-Exit Networks: Design, Challenges and Directions. In Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning (EMDL), 2021.

[15] Stefanos Laskaridis, Stylianos I. Venieris, Hyeji Kim, and Nicholas D. Lane. HAPI: Hardware-Aware Progressive Inference. In International Conference on Computer-Aided Design (ICCAD), 2020.