Security & Privacy
Machine learning (ML) is transforming the world we live in and the way we operate our businesses. We are experiencing mass adoption of ML at scale that directly impacts our day-to-day lives. As technologies around ML evolve, data (especially personal data) plays a key role in advancing ML. As with any technology that impacts our lives, we need to develop and advance ML responsibly. Handling data (especially personal data) responsibly is imperative for any organization adopting and advancing ML. At Samsung Research, we believe responsibly advancing technology is important, and we look for ways to help industry adopt privacy-enhancing technologies. There have been several related works and research papers on enhancing the privacy of ML data and models, and we are actively contributing to this effort too. This article shares one particular work in the area of differential privacy and federated learning.
In this article, we outline the challenges that arise from applying differential privacy in a federated learning system. We discuss how we believe these challenges can be solved and lead to improving the privacy posture of the overall system.
Differential Privacy is a technique that enables public use of information derived from personal data while withholding information about individuals. Its primary application is preserving privacy when aggregated data is used instead of individual data. For example, to analyze the relationship between annual income and spending trends, an aggregated average salary (of Alice's group) can be provided instead of the salary of each individual (Alice). However, simply using aggregated data cannot prevent exposure of personal data. For example, by comparing the average salary of the group Alice belongs to with the average salary of the same group without Alice, Alice's salary can be easily inferred. Therefore, additional privacy-enhancing techniques need to be applied to aggregated results, and Differential Privacy (DP) plays an important role here.
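The inference described above is simple arithmetic, as the following minimal sketch shows. The names and salary figures are hypothetical and invented purely for illustration.

```python
def average(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


def infer_missing_value(avg_with, group_size, avg_without):
    """Recover one member's value from two published averages."""
    total_with = avg_with * group_size
    total_without = avg_without * (group_size - 1)
    return total_with - total_without


# Hypothetical salaries, invented for illustration only.
salaries = {"Alice": 72000, "Bob": 58000, "Carol": 64000, "Dave": 50000}

avg_with_alice = average(list(salaries.values()))
avg_without_alice = average(
    [salary for name, salary in salaries.items() if name != "Alice"]
)

# Only the two averages and the group size are used here, never Alice's record.
recovered = infer_missing_value(avg_with_alice, len(salaries), avg_without_alice)
print(round(recovered))
```

The attacker never sees an individual record, yet the two averages together pin down Alice's salary up to floating-point rounding.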
 
		Figure 1. Informal definition of DP
DP provides a formal way to state privacy guarantees for an individual's data in algorithms that run on aggregate datasets. Informally, an algorithm is said to be differentially private if the inclusion of a single individual's record in the dataset does not have a statistically significant effect on the output of the algorithm. Figure 1 illustrates the differential privacy guarantee. M represents a differentially private algorithm, which is often called a mechanism.
In a more formal definition of DP, we introduce a privacy parameter $\varepsilon$ to express the degree of indistinguishability. We also call this privacy parameter $\varepsilon$ the privacy budget or privacy loss. That is, for input datasets $D_1$ and $D_2$, which differ by one record, and for a set of outputs $S$, the upper bound of the ratio between the probabilities that the output of mechanism $M$ on each input is in $S$ is defined using $\varepsilon$. The most simplified expression of it is shown below. We can see that a smaller $\varepsilon$ guarantees a higher level of privacy protection.
 
		Figure 2. (Simplified) Formal definition of DP
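For readers without the figure, the textbook $\varepsilon$-differential privacy condition from the literature [1] can be written as below. This is a sketch of the standard definition and not necessarily the exact form shown in Figure 2; $D_1$, $D_2$, $S$, and $M$ are as described in the text.

```latex
\Pr[M(D_1) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D_2) \in S]
```

The condition must hold for every pair of datasets that differ by one record and for every output set $S$. When $\varepsilon$ is close to 0, $e^{\varepsilon}$ is close to 1 and the two output distributions are nearly indistinguishable.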
The standard approach to achieving DP is to add noise whose size is proportional to the sensitivity of the output. The sensitivity is the maximum difference in output between two datasets that differ in only one record. For example, the Gaussian mechanism is defined as follows.
 
		Figure 3. Definition of Gaussian mechanism
In the Gaussian distribution, the average size of a sample is proportional to the standard deviation, so the average size of the noise added in the Gaussian mechanism is proportional to the sensitivity. In addition, it is well known that the magnitude of noise in DP using the Gaussian mechanism is inversely proportional to $\varepsilon$ [1]. Therefore, the average size of the noise is proportional to $\mathrm{sen}_f/\varepsilon$, where $\mathrm{sen}_f$ denotes the sensitivity. When we apply DP to a dataset, $\varepsilon$ is set to meet the target privacy level, and noise sampled from the Gaussian distribution with standard deviation proportional to $\mathrm{sen}_f/\varepsilon$ is added.
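A minimal sketch of this noise calibration is shown below. It assumes the classic calibration $\sigma = \sqrt{2\ln(1.25/\delta)}\cdot\mathrm{sen}_f/\varepsilon$ from Dwork and Roth [1], which is stated there for $0 < \varepsilon < 1$. The extra parameter `delta`, the function names, and the example query are assumptions introduced here for illustration and do not appear in the text above.

```python
import math
import random


def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale for the classic (epsilon, delta) Gaussian mechanism.

    Assumption: the calibration from Dwork and Roth [1], valid for 0 < epsilon < 1.
    The scale grows with the sensitivity and shrinks as epsilon grows.
    """
    return math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / epsilon


def gaussian_mechanism(true_value, sensitivity, epsilon, delta, rng):
    """Return the true value perturbed by zero-mean Gaussian noise."""
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    return true_value + rng.gauss(0.0, sigma)


rng = random.Random(42)
# Hypothetical query: an average whose sensitivity is 0.01.
noisy = gaussian_mechanism(0.5, sensitivity=0.01, epsilon=0.5, delta=1e-5, rng=rng)
print(noisy)
```

Halving `epsilon` or doubling `sensitivity` doubles the standard deviation, which is the proportionality described above.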
With respect to improving the privacy posture of an ML system, there is a growing effort to advance Federated Learning (FL). FL is a machine learning technique that uses multiple client devices to train a shared ML model without sharing data. In centralized ML systems, data is collected and managed centrally and used in the training process. In a Federated Learning system, however, the local data of each client is managed in a decentralized manner, limiting the exposure and exchange of data. The ability to limit exposure and exchange of local data is a key driver for adopting federated learning to enhance privacy in industry and academia.
Federated averaging [2] is a commonly used algorithm for FL. It trains the global model by averaging the update vectors obtained from local updates performed by each client. When performing a local model update, the client splits its local data into batches and repeatedly performs gradient descent and clipping for each batch. Gradient descent is an optimization algorithm commonly used to train machine learning models and neural networks. It computes the gradient of the loss function and updates the parameters until the loss function converges to a local minimum. Clipping is a technique for handling exploding gradients: we control the gradient size by keeping it below the clipping bound. We perform a clipping step after gradient descent so that the contribution of each client is bounded, which the privacy guarantees rely on.
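The loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation from [2]: the model is a plain list of floats, the update is clipped by its L2 norm after every gradient step, and `mse_grad`, the learning rate, the clipping bound, and the toy client data are all hypothetical.

```python
import math


def clip(vec, bound):
    """Scale vec down so that its L2 norm is at most bound."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, bound / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]


def local_update(global_w, batches, grad_fn, lr, bound):
    """Run gradient descent over the batches, clipping the update after each step."""
    w = list(global_w)
    for batch in batches:
        grad = grad_fn(w, batch)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]  # gradient descent step
        delta = clip([wi - g0 for wi, g0 in zip(w, global_w)], bound)
        w = [g0 + d for g0, d in zip(global_w, delta)]  # keep update within bound
    return [wi - g0 for wi, g0 in zip(w, global_w)]  # update vector sent to server


def federated_average(global_w, updates):
    """Server step: add the average of the client update vectors to the model."""
    n = len(updates)
    mean_update = [sum(u[j] for u in updates) / n for j in range(len(global_w))]
    return [g0 + d for g0, d in zip(global_w, mean_update)]


def mse_grad(w, batch):
    """Hypothetical loss: mean squared distance between w[0] and the batch values."""
    return [2.0 * sum(w[0] - x for x in batch) / len(batch)]


# Two hypothetical clients, each holding two local batches.
clients = [[[1.0, 2.0], [3.0]], [[10.0], [12.0, 14.0]]]
global_w = [0.0]
updates = [local_update(global_w, b, mse_grad, lr=0.1, bound=1.0) for b in clients]
new_w = federated_average(global_w, updates)
print(updates, new_w)
```

Because every update vector has norm at most `bound`, replacing one client's data can move the average of n updates by a bounded amount, which is what later lets the noise be calibrated to the number of clients.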
While Federated Learning prevents the exchange of local data, several challenges still need to be addressed to ensure privacy in an FL system. First, the update vectors of a client are themselves sensitive because they can be used to derive or infer the local data used for training. Second, the server that maintains the output model of an FL system can also infer or derive the local training data of each client. In fact, there have been several studies on these threats and their solutions. This raises the importance of applying differential privacy in FL systems.
To achieve DP in FL systems, one needs to set a suitable target $\varepsilon$ value tied to a privacy budget. Then, noise sampled from the noise distribution corresponding to $\varepsilon$ is added to the final output model. An important consideration here is who generates this noise. In the standard centralized setting, the server generates DP noise and adds it to the trained model to make it differentially private. However, DP noise addition by the server is not preferred because it requires the users to trust the server. Distributed DP-noise generation alleviates the need for each client to trust a central noise source. In this case, the noise distribution of each user is set so that the distribution of the noise added to the final output corresponds to the target privacy budget $\varepsilon$.
 
		Figure 4. Comparison of classic ML, FL, and FL with DP
We summarize the privacy risks and countermeasures we have discussed in this article so far as follows.
√ Direct exposure of local client data (personal data)
⇒ Send update vectors in place of local data enabled using a Federated Learning System
√ Inferring local client data from local update vectors in Federated Learning Systems
⇒ Generate and apply noise that perturbs client update vectors in a distributed manner (Distributed Noise Generation)
√ Inferring local client data from aggregated data in Federated Learning Systems
⇒ Ensure final outputs are differentially private and correspond to target privacy level
Now, the important thing to consider is which noise distribution each client should sample noise from in order to achieve the target privacy level. Assuming the distribution of noise corresponding to the target privacy level is $N(0, \sigma^2)$, the way to achieve this in the distributed setting is to ensure that each client locally samples noise independently so that their average follows $N(0, \sigma^2)$. A noise vector can be simply sampled from $N(0, \sigma^2)$ in a distributed manner because the sum of independent normal random variables is also normally distributed. More precisely, because the variance of the sum of the noises sampled from each client's normal distribution is equal to the sum of the variances of each distribution, when each of the $n$ clients samples noise from $N(0, n\sigma^2)$, the average of the noise follows $N(0, \sigma^2)$.
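The variance argument above can be checked empirically. The sketch below assumes $n$ clients that each draw scalar noise from a normal distribution with variance $n\sigma^2$; the function names, the seed, and the values of `sigma` and `n` are arbitrary choices made for illustration.

```python
import random
import statistics


def client_noise(sigma, n_clients, rng):
    """One client's local draw from N(0, n * sigma^2)."""
    return rng.gauss(0.0, sigma * n_clients ** 0.5)


def averaged_noise(sigma, n_clients, rng):
    """Average of the independent draws made by all clients."""
    draws = [client_noise(sigma, n_clients, rng) for _ in range(n_clients)]
    return sum(draws) / n_clients


rng = random.Random(0)
sigma, n = 2.0, 20
samples = [averaged_noise(sigma, n, rng) for _ in range(20000)]

# The empirical standard deviation of the average should be close to sigma.
empirical_std = statistics.pstdev(samples)
print(empirical_std)
```

No single party ever samples the final noise: it only appears once the server averages the clients' contributions.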
However, does this approach actually achieve the target privacy level in a real FL environment? By threat modeling existing privacy-enhanced FL systems, we have discovered a new privacy risk (threat). The threat is related to the need for FL systems to operate in a network of multiple clients.
An FL system typically consists of a large number of clients, and each client requires real-time communication with a server. It is therefore inevitable that some devices in an FL system fail to complete the training within a given time or leave the network during the training process. Bonawitz et al. [3] tested their FL system with a few hundred devices and reported that the dropout rate was 6% to 10% on average and up to 30%, due to computation errors, network failures, or changes in eligibility. Client dropouts are typical behavior that we should expect in an FL system, and the resulting change in the number of clients participating in federated averaging changes the impact that a single client's data has on the final output of federated averaging.
We argue that client dropouts impact privacy. For the average function $f$, the sensitivity $\mathrm{sen}_f$ varies depending on the size of the dataset. For example, for a record $x$ in dataset $D$, if $|D| = 10$, the effect of $x$ is bounded by $|x|/10$, but if $|D| = 10000$, the effect of $x$ is bounded by $|x|/10000$. Because the size of the noise we add to the final output is proportional to $\mathrm{sen}_f/\varepsilon$, for the same $\varepsilon$ (that is, the same privacy level) it is proportional only to $\mathrm{sen}_f$ and thus inversely proportional to the number of clients. The table below shows that if the number of clients decreases from $n$ to $m$ due to client dropouts, the size of the final noise as well as the size of the noise generated by each client needs to increase to compensate.
 
		Table 1. Change of variances of noise according to the number of clients
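The scaling described above can be sketched in a few lines. This is not a reproduction of the values in Table 1; it only assumes, as argued in the text, that the required standard deviation of the final noise is inversely proportional to the number of clients. The names `sigma_n` and `sigma_m` (the required standard deviations for $n$ and $m$ clients) and the numbers used are hypothetical.

```python
def average_sensitivity(clip_bound, num_clients):
    """Largest change one client can cause in an average of bounded values."""
    return clip_bound / num_clients


def noise_variances(sigma_n, n, m):
    """Variances needed with n clients and after dropping to m clients."""
    sigma_m = sigma_n * n / m  # assumption: sigma scales as 1 / (number of clients)
    return {
        "final_n": sigma_n ** 2,           # variance of the averaged noise, n clients
        "final_m": sigma_m ** 2,           # variance of the averaged noise, m clients
        "per_client_n": n * sigma_n ** 2,  # variance each of n clients must sample
        "per_client_m": m * sigma_m ** 2,  # variance each of m clients must sample
    }


print(average_sensitivity(1.0, 10), average_sensitivity(1.0, 10000))
table = noise_variances(sigma_n=1.0, n=100, m=50)
print(table)
```

In this model, both the final noise variance and the per-client noise variance grow when clients drop out, which is the direction of change the paragraph above describes.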
However, each client generates its noise before client dropouts occur. The noise therefore ends up too small for the remaining clients, which decreases privacy and overspends the privacy budget, so the expected privacy protection by DP is not achieved unless the number of rounds is cut. The table below summarizes the change in $\varepsilon$ and the maximum number of learning rounds in which a client can participate for a constant privacy budget and client dropout rate. When client dropouts occur, a larger privacy budget $\varepsilon$ is needed for the same number of rounds, which directly leads to a degradation in the privacy level. Viewed differently, the number of rounds that can be trained at a constant privacy level decreases as dropouts increase.
 
		Table 2. Privacy for different dropout rates. The first block shows that client dropouts increase the privacy budget. The second block shows that this privacy degradation reduces the maximum number of rounds that can be trained without exceeding the privacy budget.
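The shortfall can be made concrete with a short calculation. This is a sketch under assumptions drawn from the discussion above, not a derivation of the values in Table 2: each client has already sampled noise $e_i$ from $N(0, n\sigma_n^2)$ (calibrated for $n$ clients), only $m < n$ clients remain, and the required standard deviation scales inversely with the number of clients, so $\sigma_m = (n/m)\,\sigma_n$. The symbols $\sigma_n$ and $\sigma_m$ are notation assumed here for illustration.

```latex
\begin{aligned}
\operatorname{Var}\!\left(\frac{1}{m}\sum_{i=1}^{m} e_i\right)
  &= \frac{1}{m^{2}} \cdot m \cdot n\sigma_n^{2}
   = \frac{n}{m}\,\sigma_n^{2}, \\
\sigma_m^{2} &= \left(\frac{n}{m}\right)^{2} \sigma_n^{2}, \\
\frac{(n/m)\,\sigma_n^{2}}{\sigma_m^{2}} &= \frac{m}{n} < 1 .
\end{aligned}
```

With a 30% dropout rate ($m = 0.7n$), for instance, the averaged noise carries only 70% of the variance that the target privacy level calls for in this simplified model.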
Our approach to addressing this risk (threat) is to add a noise calibration process to the algorithm. Figure 5 illustrates the modified algorithm with the noise calibration process added.
We assume that $n$ clients initially participate in the FL system and the server coordinates FL clients to train the model. The server shares the common initial model and the hyper-parameters for the training performed by each client. In addition, each client $i$ locally samples a noise vector $e_i$ from $N(0, n\sigma_n^2)$ so that the average of the noise follows $N(0, \sigma_n^2)$, where $\sigma_n$ is the standard deviation for a desired privacy level with $n$ clients, i.e., $\sigma_n$ is proportional to $\mathrm{sen}_f/\varepsilon$ for the average over $n$ clients. Each client adds this noise to the update vector and sends it to the server, and the server aggregates them.
 
		Figure 5. FL algorithm with noise calibration
We perform the noise calibration process to refresh the aggregated noise vector when the number of clients decreases from $n$ to $m$ due to client dropouts. The server checks and announces which clients have dropped out. Each active client knows its initial noise vector $e_i$ and can sample a new noise vector $e'_i$ from $N(0, m\sigma_m^2 - n\sigma_n^2)$, where $\sigma_m$ is the standard deviation for the desired privacy level with $m$ clients. Because $e_i + e'_i$ follows $N(0, m\sigma_m^2)$, the noise obtained by adding the $e'_i$ to the noise of the previously aggregated vector follows $N(0, \sigma_m^2)$. Subsequently, each client sends its noise vector $e'_i$ to the server, and the server applies it to the final output. The noise calibration process thereby reduces the increased privacy loss. Figure 6 shows the experimental results for accuracy and privacy budgets on MNIST data. As the results show, client dropouts cause privacy degradation, but the proposed algorithm achieves the target privacy level without significant accuracy loss.
 
		Figure 6. Effect of client dropouts and noise calibration
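The effect of the calibration step can be illustrated with a small simulation. This is a sketch under assumptions, not the experiment behind Figure 6: noise is scalar, the required standard deviation is taken to scale as $\sigma_m = (n/m)\,\sigma_n$, and the function name, seed, and parameter values are hypothetical.

```python
import random
import statistics


def simulate_round(sigma_n, n, m, rng):
    """Return the averaged noise before and after calibration for m survivors."""
    sigma_m = sigma_n * n / m  # assumption: sigma scales as 1 / (number of clients)

    # Initial noise e_i, sampled for n clients; only m clients' draws survive.
    initial = [rng.gauss(0.0, (n ** 0.5) * sigma_n) for _ in range(m)]

    # Calibration noise e'_i tops each client's variance up to m * sigma_m^2.
    extra_std = (m * sigma_m ** 2 - n * sigma_n ** 2) ** 0.5
    extra = [rng.gauss(0.0, extra_std) for _ in range(m)]

    before = sum(initial) / m
    after = before + sum(extra) / m
    return before, after


rng = random.Random(1)
sigma_n, n, m = 1.0, 10, 7
runs = [simulate_round(sigma_n, n, m, rng) for _ in range(20000)]

std_before = statistics.pstdev([r[0] for r in runs])
std_after = statistics.pstdev([r[1] for r in runs])
target_std = sigma_n * n / m
print(std_before, std_after, target_std)
```

Without calibration the averaged noise is smaller than the target for `m` clients; after the surviving clients add their second draw, its standard deviation should land close to `target_std`.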
With the growth of ML systems at large, Differential Privacy and Federated Learning systems are gaining traction as ways to enhance privacy within an ML system. Our work extends prior work from the research community to further improve the privacy posture. We observed a new privacy risk for FL systems triggered by client dropouts. We proposed and developed a DP mechanism that is robust to client dropouts by dynamically calibrating noise according to the dropout rate. In our experiments, the proposed mechanism increased the level of privacy protection by 15% and 50% for the 10% and 30% dropout cases, respectively, over existing FL mechanisms with DP. We believe that our algorithm keeps privacy levels stable without large accuracy loss where client dropouts are frequent. We are investigating whether our approach can improve the privacy posture in other applications. Details of our research can be found in the full paper.
[1] C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 2014.
[2] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang. Learning Differentially Private Recurrent Language Models. In ICLR, 2018.
[3] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecný, S. Mazzocchi, B. McMahan, T. V. Overveldt, D. Petrou, D. Ramage, and J. Roselander. Towards Federated Learning at Scale: System Design. In SysML, 2019.