Security & Privacy

Benchmarking Local Differential Privacy for Mobile Devices

By Soohyung Kim Samsung Research
By Anderson Morais Samsung R&D Institute Brazil
By Luiz Miguel Cerqueira Samsung R&D Institute Brazil

Introduction : Why Differential Privacy is Important for Data Gathering?

One of the concerns of big data analytics is the secure and privacy-preserving collection of end-user data. Many legislatures have been catching up with those ideas by means of, for instance, acts like the Health Insurance Portability and Accountability Act (HIPAA) in United States and General Data Protection Regulation (GDPR) in European Union, which set ground rules and legal sanctions for failures on how to handle private and sensitive data. However, those acts do not provide enough ideas on how to deal with possible data leakage points and unwanted actions against user privacy, requirements that must be addressed from the kick-off of any product development. Moreover, telemetry systems and data collection by operating systems, applications, and services present a challenge for the product development and operations management regarding data security

In this work, we analyze and evaluate differential privacy, an approach that relies on injecting controlled stochastic elements in the processing algorithms. Client devices produce data which are collected as raw data into a central server, and a set of algorithms can output: aggregated data, tabulated data, or models as illustrated in Figure 1. Note that everything from the raw data and onwards is under business control, while the other components are located "in the wild". Stochastic components are added to the algorithms, so that different runs of the procedure will produce slightly different outputs or noisy outputs, reducing precision on the outputs. Call the output of a particular instance of this procedure as seen in Figure 1a. In case we randomly remove a single client device from the input and run the procedure again and call the new output , as shown in Figure 1b. The stochastic algorithm is considered differentially private if the probability of and being equal are controlled by a parameter of the algorithm, usually called privacy budget .

(a) With all clients (b) One client is arbitrarily removed

Figure 1.  Output from different executions of the algorithms. Output has random elements due to stochastic elements of the algorithm.

In mathematical terms, let be the complete set of devices, be the set of devices with one device arbitrarily removed, and be an execution of the algorithms with input . We then have that and We want algorithms where:

This equation states, in basic terms, that smaller the privacy budget greater the likelihood of , making the outputs more likely to be similar. If the equation holds, is said to be differentially private.

The stochastic noise level of the algorithm is inversely proportional to the privacy budget . A large budget means the algorithm applies little noise but also has high tolerance for risk and a small budget means the algorithm applies much noise and has small tolerance for risk. This control means that the effect of removing a single client device from the input and the noise inserted by the stochastic parts of the algorithm are indistinguishable, i.e., an outsider cannot determine if the changes on the outputs are because the removal of a target individual or because of the added noise. Fine control of the privacy budget is necessary because the level of noise must be acceptable for an analyst using the aggregated data, tabulated data, and models.

Local Differential Privacy (LDP) has emerged as a comprehensive privacy-preserving model, being resilient to privacy threats in any part of the data collection and data analysis by adding random noise in the data that leaves the client device, combined with data encodings that allows for noise reduction in the data aggregated on the server-side. LDP requires a large amount of client data to work with a reasonable precision and privacy guarantee. Google’s most basic LDP system [1] requires 100.000 unique client reports and 14 million client reports to show results, while Apple’s implementation [2] uses more than 100 million reports and Samsung Research’s implementation [4] uses between 2 and 67 million reports. The reason is that since each user must add noise to their own data, the total amount of noise is much larger. To mitigate this problem, practical LDP applications often use high values of privacy budget .

Our goal is to simulate a realistic environment for data collection in the user device without violation of privacy protection guidelines. This work provides simulations of LDP algorithms RAPPOR [1] and Hadamard [2] comparing their performance with respect to processing time and accuracy using different differential privacy setups for the heavy hitters discovery task. In the context of this task, heavy hitters are strings of interest commonly used by some device configuration or application and the main goal is to identify them and estimate their absolute frequency. Suppose the devices choose their strings from a data dictionary, e.g., a list of font sizes restricted to the options “small”, “medium”, and “large”, or the device model number from a list of existing device models. Two scenarios can be considered: in the first scenario, the server has full knowledge of the dictionary before the analysis begins, and in th second scenario, a completely unknown dictionary has to be inferred from data collected from the devices as presented in Figure 2. Although there exist other LDP algorithms and other estimation tasks [3, 5], our purpose is to evaluate and compare the performance of the most widely-used LDP features for industry. For example, Google has deployed RAPPOR and Apple has applied Hadamard to collect data from users.

(a) Known dictionary (b) Unknown dictionary

Figure 2.  Data dictionary scenarios. In (a), the string sent by a device is selected from an existing list that is known to the decoding algorithms. In (b), each device generates a string from its data, without any commitment to an existing list, the decoding algorithms must infer a list of possible strings.

Benchmark Experiment Setting

     We split the elements of Figure 1 in three benchmark categories: client benchmark, server benchmark, and statistical analysis benchmark. The simulation testbed used to conduct the experiments comprises two computers connected to a LAN, one simulates the clients by executing a Pixel emulator and the other one runs the server software. Both computers equipped with Intel Core i9-9900K CPU 3.60GHz and 64 GB of RAM, the client runs on Ubuntu 20.04 LTS OS and the server runs on Windows 10 OS. In the context of LDP, a report is a data point sent by a client device, and the clients can send more than one report over their lifetime but, for simplicity, each simulated client sends a single encoded report via LAN and it is terminated. The server accumulates reports until it receives a command to export the results.

Figure 3.  Simulation flow where Pixel emulator sends single encoded reports via LAN. The server accumulates reports until it receives a command to export results.

Table 1.  List of experiment configurations

Experiments are executed by following the combination of parameters shown in Table 1. We generate 1 million and 10 million client reports for both Hadamard and RAPPOR using both known and unknown data dictionary setups, plus one run with 100 thousand client reports for the unknown dictionary RAPPOR setup and one run with 100 million clients for the unknown dictionary Hadamard setup. Larger runs for the unknown dictionary RAPPOR setup were prohibitive in terms of server processing time and were not considered. The Hadamard and RAPPOR implementations used are Android versions of the systems presented in [4] for unknown dictionary setups, i.e., the Circular Chain Encoding with fingerprint (CCE) algorithm developed by Samsung Research to address the computational complexity of RAPPOR and its need for large privacy budgets to keep an usable accuracy. A 100-word dictionary is built for every configuration by doing uniform sampling of the Debian’s dictionary located at /usr/share/dict. Input data is then generated by sampling from this 100-word dictionary using Zipf’s distribution.

Computing Time Results

In client side encoding, unknown dictionary setups to take more time to run because they have extra steps, i.e., parts of the user device data dictionary (the n-grams) are encoded and sent in individual reports along with the usual report containing the device’s data point. The RAPPOR encoding with unknown dictionary has an execution time considerably larger than any other scenario, which might be explained by a larger RAM memory usage as shown in Figure 4. We can notice that dispersion for the Hadamard with known dictionary setup is particularly larger, but the median of processing time is lower compared to RAPPOR encoding.

In server side decoding, we can notice that Hadamard decoding with unknown dictionary setup has a processing time similar to known dictionary setups, and the RAPPOR decoding with unknown dictionary has an execution time considerably larger as observed in the client-side encoding scenario. The dispersion for the Hadamard with known dictionary setup is particularly larger but the median is lower than the other setups.

(a) Client side (b) Server side

Figure 4.  Processing time. In (a), the Hadamard runs have a noticeably larger dispersion, but median of processing time is lower than RAPPOR encoding. In (b), it can be noticed the larger dispersion for the known Hadamard runs, but the median of processing time is lower than the other setups.

Concerning statistics production at server side, it can be observed that for unknown dictionary setups, it takes more time to calculate the statistics and the statistics calculation time for Hadamard encoding with unknown dictionary is considerably larger. We can observe that dispersion for the Hadamard setup with known dictionary is particularly larger and the median for the Hadamard setup with unknown dictionary has the highest value. The processing time to export statistics for the RAPPOR encoding is sensitive to the value of the privacy budget ϵ, as shown in Figure 5, but it is much lower than the statistics exportation time for the Hadamard encoding.

(a) Server side (b) Server side for RAPPOR

Figure 5.  Processing time to export statistics. In (a), we can notice the larger dispersion for the known Hadamard runs and the higher median of processing time for unknown Hadamard runs.

Accuracy Results

Three main measures of accuracy were selected for the two types of dictionary setup: (i) True Positive Rate (TPR), also known as recall, and True Negative Rate (TNR), also known as specificity, for known dictionary setups; (ii) Recall for unknown dictionary setups; and (iii) Mean Absolute Percentage Error (MAPE) for both setups.
Let be the number of words reported, let be the real frequency of the w-th word, and the estimated frequency of the w-th word. The MAPE is defined by the formula:

when , we compute the element of the sum as zero. The MAPE gives an intuitive expectation of relative error for word frequency estimates.

In a known dictionary setup, the main challenge of the server is to find out which strings in the dictionary were used by the client devices. If a client device reports a word, that word is a positive, otherwise it means no client reported it then it is a negative. If the server estimates that a word was used and it is a positive, then we have a true positive. The recall is computed as the number of true positives divided by the number of positives and it represents the probability of the server identifying a word actively used by a client device. The second challenge in a known dictionary setup is to find out which strings in the dictionary were not used by any client device. Negative and true negative are defined in an analogous way to the positive and true positive, and the specificity represents the probability of the server to correctly mark a word as not used by any client device. In unknown dictionary setups, there is no fixed dictionary and no words to mark as not used or negatives, so the specificity has no meaning. This analysis focuses on the recall.

For known dictionary setups, the recall is near perfect for all cases, meaning the privacy budget is enough to handle a small size dictionary in all cases as seen in Table 2. If the data had been sampled from a uniform distribution, i.e., the worst-case scenario for LDP, instead of a Zipf distribution, the results might have been different. The specificity varied between 0.39 (for runs with RAPPOR encoding, small , and 1 million reports) and 0.68 (for RAPPOR encoding with =4 and 10 million reports). For the known dictionary setups, the specificity considers the privacy budget and the number of reports, but recall is similar for all runs, i.e., > 0.99. For the parameters used in this scenario, the algorithms executed on the server overidentify words, for example, they assume some noise injected by the privacy mechanisms is an actual report of a given word, then failing to identify this word as a negative. This behavior affects the specificity, making it considerably smaller than the recall.

Regarding unknown dictionary setups, the recall varied between 0.034 (for runs with Hadamard encoding, =2, and 1 million reports) and 1 (for RAPPOR encoding with =8 and 10 million reports) as shown in Table 2. The Hadamard encoding trades accuracy for simplicity in the encoding step, presenting a simpler implementation and low memory footprint for the server, which is reflected in the number of reports required to achieve a usable recall. Recall of the unknown dictionary Hadamard setup exceeds 20% only when using 100 million reports, identifying one in five words used by the client devices. Comparatively, the RAPPOR encoding, that is complex and intensive, achieved similar results with 1 million reports and privacy budget of 4 or 8.

When the number of reports is below 100 million, applying a privacy budget smaller than 4 is unfeasible, as can be noted the recall varies between 3% and 8% for =1, and between 3% and 17% for =2. For =4, the RAPPOR encoding becomes usable when using more than 1 million reports, producing a recall of 23%. For =8, the RAPPOR encoding generates excellent results even when using only 100 thousand reports, as can be seen the recall exceeds 52%. For all values of , the Hadamard encoding presented poor results when using fewer than 100 million reports.

Table 2.  Accuracy for known and unknown dictionary setups. For known dictionary setups, the algorithms identify an extensive number of words as positives, inflating the recall while lowering specificity – the recall is higher than 0.99 for all cases.

Figure 6 shows the MAPE for Hadamard and RAPPOR encodings as a function of the number of reports (one million and ten million) for different values of the privacy budget for known dictionary setups. As expected, the error decreases when the number of reports grows and the lines are ordered by the value of the privacy budget, i.e., the lower the budget, the higher the error. RAPPOR presents lower MAPE than Hadamard except for the case where =1 and the number of available reports is 1 million. The highest error is lower than 20% (the higher value is when using RAPPOR encoding with =1 and the number of reports is 1 million), and the smallest error is lower than 2.5% (the lower value is when using RAPPOR encoding with >2 and 10 million reports is available).

Figure 6.  MAPE of frequency estimates for known dictionary setups.

For unknown dictionary setups, the MAPE for Hadamard encoding as a function of the number of reports (1 million, 10 million, and 100 million) for different values of the privacy budget is shown in Figure 7.a. We can see an expected decrease in the error when the number of reports grows, and the privacy budget does no impact the error when it reaches ϵ=2 as the lines for the values 2, 4, and 8 are superimposed. The MAPE value is higher than 90% for 1 million reports. This is expected because of the low rate of identification as shown in Table 2. And the MAPE is lower than 50% for 100 million reports and privacy budget ≥ 2. Figure 7.b shows the MAPE for RAPPOR encoding as a function of the number of reports (100 thousand and 1 million) for different values of the privacy budget . We can also observe an expected decrease in the error when the number of reports grows, except for =1 due to stochastic fluctuations. The error is greatly reduced when the privacy budget is increased. The MAPE is closer to 100% for 1 million reports with <8, this is expected due to the low rate of identification as shown in Table 2. And the MAPE is closer to 0 for =8 and 1 million reports.

(a) Hadamard encoding (b) RAPPOR encoding

Figure 7.   MAPE of frequency estimates for Hadamard and RAPPOR for unknown dictionary setups.


LDP can be considered a very strong privacy-preserving model and a major technology to protect user’s privacy when collecting and analyzing users’ data. As expected, unknown data dictionary scenarios, which is the most interesting LDP application, is significantly harder to analyze than known data dictionary scenarios, taking more time to process the strings and having reduced accuracy. One of the main concerns regarding LDP setups is the trade-off between processing time and accuracy, the process of decoding the received reports is noisy due to the stochastic nature of the encoding schemes. While RAPPOR encoding is much slower, it presents a better accuracy than Hadamard encoding when the number of reports is more than 10 million. With respect to known dictionary setups, both encodings presented near perfect detection of strings used by the clients, i.e., the true positives. The difference between the encodings in terms of accuracy concerns exclusion of unused strings, i.e., the true negative rate, which varies between 39% (for RAPPOR with =1 and 1 million reports) and 68% (for RAPPOR with =4 and 10 million reports). Regarding unknown dictionary setups, the main accuracy concern is the amount of used strings that is detected, i.e., the true positive rate, which varied between 3.4% (for Hadamard with =2 and 1 million reports) and 100% (for RAPPOR with =8 and 1 million reports).

Link to the paper


[1] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, 1054–1067, 2014.

[2] Apple Inc. Differential Privacy Team. Learning with privacy at scale. Apple Machine Learning Journal, 1(8), 1–25, 2017.

[3] Mengmeng Yang, Lingjuan Lyu, Jun Zhao, Tianqing Zhu, Kwok-Yan Lam. Local Differential Privacy and Its Applications: A Comprehensive Survey. ArXiv abs/2008.03686, 2020.

[4] Sungwook Kim, Hyejin Shin, Chunghun Baek, Soohyung Kim, and Junbum Shin. Learning new words from keystroke data with local differential privacy. IEEE Transactions on Knowledge and Data Engineering, 32(3), 479–491, 2018.

[5] Le, Ba Dung, and Tanveer Zia. Discrete Distribution Estimation with Local Differential Privacy: A Comparative Analysis. 2021 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops), 2021