Workshop

Theory and Practice of Differential Privacy

Rachel Cummings, Gautam Kamath

Abstract:

Differential privacy is a promising approach to privacy-preserving data analysis. It has been the subject of a decade of intense scientific study, and has now been deployed in products at government agencies such as the U.S. Census Bureau and companies like Microsoft, Apple, and Google. MIT Technology Review named differential privacy one of 10 breakthrough technologies of 2020.
Since data privacy is a pervasive concern, differential privacy has been studied by researchers from many distinct communities, including machine learning, statistics, algorithms, computer security, cryptography, databases, data mining, programming languages, social sciences, and law. We believe that this combined effort across a broad spectrum of computer science is essential for differential privacy to realize its full potential. To this end, our workshop will stimulate discussion among participants about both the state-of-the-art in differential privacy and the future challenges that must be addressed to make differential privacy more practical.



Schedule

Fri 7:00 a.m. - 7:05 a.m.
Opening Remarks   
Gautam Kamath, Rachel Cummings
Fri 7:05 a.m. - 7:45 a.m.
  

Sequences of adaptively chosen queries issued against a fixed dataset are at the core of machine learning and data-driven sciences, but adaptivity can quickly lead to overfitting. In recent years, differential privacy---interpreted as a notion of stability---has emerged as a tool for (theoretically) protecting against such adaptivity by preventing query answers from encoding too much information about the dataset. In this talk, we'll explore how differential privacy achieves this, and begin to examine whether differential privacy is overkill for protecting against adaptivity.

Katrina Ligett
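To make the stability view concrete, here is a minimal sketch (ours, not from the talk; the function names and noise scale are illustrative) of answering adaptively chosen statistical queries with Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def answer_query(dataset, query, sigma=0.05):
    """Answer a statistical query (the mean of a [0,1]-bounded function)
    with Gaussian noise; the noise scale controls privacy/stability."""
    true_answer = np.mean([query(x) for x in dataset])
    return true_answer + rng.normal(0.0, sigma)

# Adaptivity: the second query is chosen using the first noisy answer.
data = rng.random(1000)
a1 = answer_query(data, lambda x: x)               # noisy mean
a2 = answer_query(data, lambda x: float(x > a1))   # depends on a1
```

Because each answer is noisy, queries chosen as a function of earlier answers cannot lock onto the sample's idiosyncrasies as easily as with exact answers, which is the stability property the talk examines.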
Fri 7:45 a.m. - 8:30 a.m.
Poster Session 1
Fri 8:30 a.m. - 8:55 a.m.
Break
Fri 8:55 a.m. - 9:40 a.m.
  

Featuring three talks:

The Distributed Discrete Gaussian Mechanism for Federated Learning with Secure Aggregation (Peter Kairouz, Ziyu Liu, Thomas Steinke)

Differentially Private Sampling from Distributions (Sofya Raskhodnikova, Satchit Sivakumar, Adam Smith, Marika Swanberg)

Differentially Private Algorithms for 2020 Decennial Census Detailed DHC Race & Ethnicity (Samuel Haney, William Sexton, Ashwin Machanavajjhala, Michael Hay, Gerome Miklau)

Marika Swanberg, Samuel Haney, Peter Kairouz
Fri 9:40 a.m. - 10:40 a.m.
Mentoring Session

Sign up for the mentoring session here: https://docs.google.com/spreadsheets/d/115UeFid4FTbjvLUomXwSvRii6Bd7N51csRaQuv1HZWg/edit#gid=0

The mentoring session will be at the following location: https://eventhosts.gather.town/g8UqjSjgPhpiOX0q/tpdp-lounge

Fri 10:40 a.m. - 11:25 a.m.
  

Featuring three talks:

Hiding Among the Clones: A Simple and Nearly Optimal Analysis of Privacy Amplification by Shuffling (Vitaly Feldman, Audra McMillan, Kunal Talwar)

Adaptive Machine Unlearning (Varun Gupta, Christopher Jung, Seth Neel, Aaron Roth, Saeed Sharifi-Malvajerdi, Chris Waites)

A Practitioners Guide to Differentially Private Convex Optimization (Ryan McKenna, Hristo Paskov, Kunal Talwar)

Saeed Sharifi-Malvajerdi, Audra McMillan, Ryan McKenna
Fri 11:25 a.m. - 12:10 p.m.
Poster Session 2
Fri 12:10 p.m. - 12:35 p.m.
Break
Fri 12:35 p.m. - 1:15 p.m.
  

Member privacy is a priority at LinkedIn and a key consideration in product design. We aim to protect member privacy by using innovative privacy engineering techniques such as differential privacy to provide valuable insights to members and policy makers. The differential privacy team at LinkedIn has developed new algorithms that work with existing real time data analytics systems, allowing us to develop systems at scale while protecting member privacy. This talk will cover several deployments of LinkedIn products that incorporate differential privacy.

Ryan Rogers
Fri 1:15 p.m. - 2:00 p.m.
Poster Session 3
Fri 2:00 p.m. - 3:00 p.m.
Social Hour
-

We present an asymptotically optimal (ε, δ)-private mechanism for answering multiple, adaptively asked, Δ-sensitive queries, settling the conjecture of Steinke and Ullman [2020]. Our algorithm adds independent noise of bounded magnitude to each query, while prior solutions relied on unbounded noise such as the Laplace and Gaussian mechanisms.

Yuval Dagan, Gil Kur
-

We consider training models on private data that is distributed across user devices. To ensure privacy, we add on-device noise and use secure aggregation so that only the noisy sum is revealed to the server. We present a comprehensive end-to-end system, which appropriately discretizes the data and adds discrete Gaussian noise before performing secure aggregation. We provide a novel privacy analysis for sums of discrete Gaussians. We also analyze the effect of rounding the input data and the modular summation arithmetic. Our theoretical guarantees highlight the complex tension between communication, privacy, and accuracy. Our extensive experimental results demonstrate that our solution essentially matches the accuracy of central differential privacy with less than 16 bits of precision per value.

Peter Kairouz, Ziyu Liu, Thomas Steinke
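A toy version of the client-side pipeline described above, with illustrative scale, noise, and modulus parameters (a real deployment would use an exact discrete Gaussian sampler and an actual secure-aggregation protocol):

```python
import numpy as np

rng = np.random.default_rng(1)

def discrete_gaussian(sigma, bound=200):
    # Toy sampler: exact probabilities on a truncated integer support.
    # A production system would use an exact discrete Gaussian sampler.
    support = np.arange(-bound, bound + 1)
    probs = np.exp(-support.astype(float) ** 2 / (2 * sigma ** 2))
    return int(rng.choice(support, p=probs / probs.sum()))

def client_message(x, scale=100, sigma=10.0, modulus=2 ** 16):
    # Quantize a value in [0, 1], add integer noise, reduce mod q.
    return (round(x * scale) + discrete_gaussian(sigma)) % modulus

# Secure aggregation would reveal only this modular sum to the server.
clients = [0.2, 0.4, 0.6, 0.8]
agg = sum(client_message(x) for x in clients) % 2 ** 16
estimate = ((agg + 2 ** 15) % 2 ** 16 - 2 ** 15) / 100  # centered decode
```

Because each client reduces modulo the same q, the server can decode the noisy sum even when individual noisy messages wrap around, which is the modular-arithmetic effect the abstract analyzes.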
-
In this work, we consider a randomized-response mechanism called edgeFlip, which releases a sanitized graph satisfying $\epsilon$-differential privacy. We show that for Random Dot Product Graphs (RDPGs), which have emerged as a powerful paradigm for statistical analysis of graphs, the output of edgeFlip is also an RDPG. Then, using tools from the burgeoning area of Topological Data Analysis (TDA), we show that when the structure underlying a RDPG in the latent space is supported on a lower-dimensional manifold $\mathcal{M}$, then the $\epsilon$-edge DP synthetic graph obtained via edgeFlip is supported on a manifold $\mathcal M'$ with identical topological features. Additionally, for the privacy budget $\epsilon = 0$, the manifold $\mathcal M'$ retracts to a single point, making the RDPG equivalent to an Erdős-Rényi graph. In essence, for $\epsilon > 0$, the privacy mechanism warps the original manifold $\mathcal M$ such that the subtle topological features are still preserved. Furthermore, we assess the quality of the spectral embedding of the RDPG using persistence diagrams. Asymptotically, we show that even though the limiting persistence diagram obtained via edgeFlip is different from that of the original graph, the shift-invariant bottleneck distance (a variant of the bottleneck distance which identifies the same input metric space measured in two different units) between the two limiting persistence diagrams converges to zero. We illustrate the advantages of employing edgeFlip in comparison to other alternatives. Lastly, we highlight the benefit of the topological perspective by employing ToMaTO---a topologically-motivated clustering algorithm---as an alternative to the k-Means algorithm for spectral clustering. To the best of our knowledge, our work is the first to examine the structure of a differential-privacy mechanism through the lens of algebraic topology and statistical inference.
Siddharth Vishwanath, Jonathan Hehir
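A minimal sketch of the edgeFlip step on a symmetric adjacency matrix (our illustration; the parameter names are not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

def edge_flip(adj, eps):
    # Randomized response on each potential edge: keep it with probability
    # e^eps / (1 + e^eps), flip it otherwise (eps-edge differential privacy).
    n = adj.shape[0]
    keep_prob = np.exp(eps) / (1 + np.exp(eps))
    flips = rng.random((n, n)) >= keep_prob
    flips = np.triu(flips, 1)        # decide the upper triangle only
    flips = flips | flips.T          # mirror so the output stays symmetric
    return adj ^ flips

A = np.zeros((4, 4), dtype=bool)
A[0, 1] = A[1, 0] = True             # a single edge
noisy = edge_flip(A, eps=1.0)
```

Flipping each edge independently with probability 1/(1 + e^ε) is exactly binary randomized response applied edge-wise, which is why the sanitized graph retains (a warped version of) the original edge-probability structure.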
-

We consider how to release personalized privacy losses using per-instance differential privacy (pDP), focusing on private empirical risk minimization over the class of generalized linear models. Standard differential privacy (DP) gives us a worst-case bound that might be orders of magnitude larger than the privacy loss to a particular individual relative to a fixed dataset. The pDP framework provides a more fine-grained analysis of the privacy guarantee to a target individual, but the per-instance privacy loss itself might be a function of sensitive data. In this paper, we analyze the per-instance privacy loss of releasing a private empirical risk minimizer learned via objective perturbation, and propose a group of methods to privately and accurately publish the pDP losses at little to no additional privacy cost.

Rachel Redberg, Yu-Xiang Wang
-

Differential Privacy (DP) has become a gold standard in privacy-preserving data analysis. While it provides one of the most rigorous notions of privacy, there are many settings where its applicability is limited. Our main contribution is in augmenting differential privacy with {\em Flexible Accuracy}, which allows small distortions in the input (e.g., dropping outliers) before measuring accuracy of the output, allowing one to extend DP mechanisms to high-sensitivity functions. We present mechanisms that can help in achieving this notion for functions that had no meaningful differentially private mechanisms previously. In particular, we illustrate an application to differentially private histograms, which in turn yields mechanisms for revealing the support of a dataset or the extremal values in the data. Analyses of our constructions exploit new versatile composition theorems that facilitate modular design. All the above extensions use our new definitional framework, which is in terms of ``lossy Wasserstein distance'' -- a 2-parameter error measure for distributions. This may be of independent interest.

Aman Bansal, Rahul Chunduru, Deepesh Data, Manoj Prabhakaran
-
Representing a sparse histogram, or more generally a sparse vector, is a fundamental task in differential privacy. An ideal solution would use space close to information-theoretical lower bounds, have an error distribution that depends optimally on the desired privacy level, and allow fast random access to entries in the vector. However, existing approaches have only achieved two of these three goals. In this paper, we introduce the Approximate Laplace Projection (ALP) mechanism for approximating $k$-sparse vectors. This mechanism is shown to simultaneously have information-theoretically optimal space (up to constant factors), fast access to vector entries, and error of the same magnitude as the Laplace mechanism applied to dense vectors. A key new technique is a \emph{unary} representation of small integers, which is shown to be robust against ``randomized response'' noise. This representation is combined with hashing, in the spirit of Bloom filters, to obtain a space-efficient, differentially private representation. Our theoretical performance bounds are complemented by simulations which show that the constant factors on the main performance parameters are quite small, suggesting practicality of the technique.
Martin Aumüller, Christian Lebeda, Rasmus Pagh
-

Many problems in machine learning rely on multi-task learning (MTL), in which the goal is to solve multiple related machine learning tasks simultaneously. MTL is particularly relevant for privacy-sensitive applications in areas such as healthcare, finance, and IoT computing, where sensitive data from multiple, varied sources are shared for the purpose of learning. In this work, we formalize notions of task-level privacy for MTL via joint differential privacy (JDP), a relaxation of differential privacy for mechanism design and distributed optimization. We then propose an algorithm for mean-regularized MTL, an objective commonly used for applications in personalized federated learning, subject to JDP. We analyze our objective and solver, providing certifiable guarantees on both privacy and utility. Empirically, our method allows for improved privacy/utility trade-offs relative to global baselines across common federated learning benchmarks.

Shengyuan Hu, Steven Wu, Virginia Smith
-

Reproducibility is vital to ensuring scientific conclusions are reliable, and researchers have an obligation to ensure that their results are replicable. However, many scientific fields are suffering from a "reproducibility crisis," a term coined circa 2010 to refer to the failure of results from a variety of scientific disciplines to replicate. Within the subfields of machine learning and data science, there are similar concerns about the reliability of published findings. The performance of models produced by machine learning algorithms may be affected by the values of random seeds or hyperparameters chosen during training, and performance may be brittle to deviations from the values disseminated in published results.

In this work, we aim to initiate the study of reproducibility as a property of learning algorithms. We define a new notion of reproducibility, which informally says that a randomized algorithm is reproducible if two distinct runs of the algorithm on two samples drawn from the same distribution, with internal randomness fixed between both runs, produce the same output with high probability. We show that it is possible to efficiently simulate any statistical query algorithm reproducibly, but that the converse does not hold. We show that reproducibility implies differential privacy, and that reproducible algorithms permit data reuse under adaptive queries.

Russell Impagliazzo, Rex Lei, Jessica Sorrell
-

In analyzing data from multiple participants, such as the US Census, the privacy of sensitive information should be protected and the analysis needs to be robust against malicious participants injecting corrupted data. To this end, we study a fundamental statistical problem of estimating the covariance from i.i.d. samples of a zero-mean Gaussian distribution. We provide the first efficient algorithm that achieves both robustness and privacy. We provide statistical guarantees showing that we pay an extra factor of $d$ in the sample complexity compared to the minimax optimal sample complexity.

Logan Gnanapragasam, Jonathan Hayase, Sewoong Oh
-
Differentially private (DP) stochastic convex optimization (SCO) is a fundamental problem, where the goal is to approximately minimize the population risk with respect to a convex loss function, given a dataset of i.i.d. samples from a distribution, while satisfying differential privacy with respect to the dataset. Most of the existing works in the literature of private convex optimization focus on the Euclidean (i.e., $\ell_2$) setting, where the loss is assumed to be Lipschitz (and possibly smooth) w.r.t.~the $\ell_2$ norm over a constraint set with bounded $\ell_2$ diameter. Algorithms based on noisy stochastic gradient descent (SGD) are known to attain the optimal excess risk in this setting. In this work, we conduct a systematic study of DP-SCO for $\ell_p$-setups. For $p=1$, under a standard smoothness assumption, we give a new algorithm with nearly optimal excess risk. This result also extends to general polyhedral norms and feasible sets. For $p\in(1, 2)$, we give two new algorithms, for which a central building block is a novel privacy mechanism, which generalizes the Gaussian mechanism. Moreover, we establish a lower bound on the excess risk for this range of $p$, showing a necessary dependence on $\sqrt{d}$, where $d$ is the dimension of the space. Our lower bound implies a sudden transition of the excess risk at $p=1$, where the dependence on $d$ changes from logarithmic to polynomial, resolving an open question in prior work [TTZ15]. For $p\in (2, \infty)$, noisy SGD attains optimal excess risk in the low-dimensional regime; in particular, this proves the optimality of noisy SGD for $p=\infty$. Our work draws upon concepts from the geometry of normed spaces, such as the notions of regularity, uniform convexity, and uniform smoothness.
Raef Bassily, Cristobal Guzman, Anupama Nandi
-

Locally Differentially Private (LDP) reports are commonly used for collection of statistics and machine learning in the federated setting. In many cases the best known LDP algorithms require sending prohibitively large messages from the client device to the server (such as when constructing histograms over a large domain or learning a high-dimensional model). This has led to significant efforts on reducing the communication cost of LDP algorithms.

At the same time, LDP reports are known to have relatively little information about the user's data due to randomization. Several schemes are known that exploit this fact to design low-communication versions of LDP algorithms, but all of them do so at the expense of a significant loss in utility. Here we demonstrate a general approach that, under standard cryptographic assumptions, compresses every efficient LDP algorithm with negligible loss in privacy and utility guarantees. The practical implication of our result is that in typical applications the message can be compressed to the size of the server's pseudo-random generator seed. More generally, we relate the properties of an LDP randomizer to the power of a pseudo-random generator that suffices for compressing the LDP randomizer. From this general approach we derive low-communication algorithms for the problems of frequency estimation and high-dimensional mean estimation. Our algorithms are simpler and more accurate than existing low-communication LDP algorithms for these well-studied problems.

Vitaly Feldman, Kunal Talwar
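For context, the classical k-ary randomized response baseline that such compression schemes start from can be sketched as follows (our illustration, with illustrative parameter choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def ldp_report(value, k, eps):
    # k-ary randomized response: report the truth w.p. e^eps/(e^eps + k - 1),
    # otherwise a uniformly random other item.
    p_true = np.exp(eps) / (np.exp(eps) + k - 1)
    if rng.random() < p_true:
        return value
    other = int(rng.integers(k - 1))
    return other if other < value else other + 1

def estimate_frequencies(reports, k, eps):
    # Debias the raw report histogram into unbiased frequency estimates.
    p = np.exp(eps) / (np.exp(eps) + k - 1)
    q = (1 - p) / (k - 1)
    counts = np.bincount(reports, minlength=k) / len(reports)
    return (counts - q) / (p - q)

data = rng.integers(0, 4, size=20000)           # uniform over 4 items
reports = np.array([ldp_report(int(v), 4, 2.0) for v in data])
freqs = estimate_frequencies(reports, 4, 2.0)
```

Each report carries only log k bits here; the abstract's point is that even richer LDP randomizers convey so little information that, under cryptographic assumptions, their messages can be compressed to roughly the size of a PRG seed.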
-

A common pain point in differentially private machine learning is the significant runtime overhead incurred when executing Differentially Private Stochastic Gradient Descent (DPSGD), which may be as large as two orders of magnitude. We thoroughly demonstrate that by exploiting powerful language primitives, including vectorization, just-in-time compilation, and static graph optimization, one can dramatically reduce these overheads, in many cases nearly matching the best non-private running times. These gains are realized in two frameworks: JAX, which provides rich support for these primitives through the XLA compiler, and TensorFlow Privacy, where we rebuild core parts to integrate more effective vectorization as well as XLA compilation, granting significant memory and runtime improvements over previous release versions. Our proposed approaches allow us to achieve up to 50x speedups compared to the best alternatives.

Pranav Subramani, Nicholas Vadivelu, Gautam Kamath
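The vectorization idea can be shown in plain NumPy for a squared-loss model (the paper itself works in JAX and TensorFlow Privacy with XLA; this toy example is ours): the per-example gradient loop that dominates DPSGD's overhead collapses into a single batched matrix operation.

```python
import numpy as np

rng = np.random.default_rng(7)

def per_example_grads_loop(X, y, w):
    # Naive DPSGD inner loop: one gradient per example (slow).
    return np.stack([(X[i] @ w - y[i]) * X[i] for i in range(len(X))])

def per_example_grads_vectorized(X, y, w):
    # Same per-example gradients for squared loss, as one matrix op.
    return (X @ w - y)[:, None] * X

X = rng.normal(size=(256, 16))
y = rng.normal(size=256)
w = rng.normal(size=16)
```

The two functions return identical gradients; only the second exploits the batched primitives (vmap-style vectorization in JAX) that the paper identifies as the source of its speedups.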
-
We develop algorithms for private stochastic convex optimization that adapt to the hardness of the specific function we wish to optimize. While previous work provides worst-case bounds for arbitrary convex functions, it is often the case that the function at hand belongs to a smaller class that enjoys faster rates. Concretely, we show that for functions exhibiting $\kappa$-growth around the optimum, i.e., $f(x) \ge f(x^\star) + \lambda \kappa^{-1} \|x-x^\star\|_2^\kappa$ for $\kappa > 1$, our algorithms improve upon the standard ${\sqrt{d}}/{n\varepsilon}$ privacy rate to the faster $({\sqrt{d}}/{n\varepsilon})^{\tfrac{\kappa}{\kappa - 1}}$. Crucially, they achieve these rates without knowledge of the growth constant $\kappa$ of the function. Our algorithms build upon the inverse sensitivity mechanism, which adapts to instance difficulty [2], and recent localization techniques in private optimization [23]. We complement our algorithms with matching lower bounds for these function classes and demonstrate that our adaptive algorithm is simultaneously (minimax) optimal over all $\kappa \ge 1+c$ whenever $c = \Theta(1)$.
Hilal Asi, Daniel A Levy, John Duchi
-

The Randomized Response algorithm (RR) [Warner] is a classical technique to improve robustness in survey aggregation, and has been widely adopted in applications with differential privacy guarantees. We propose a novel algorithm, \emph{Randomized Response with Prior} (RRP), which can provide more accurate results while maintaining the same level of privacy guaranteed by RR. We then apply RRP to learn neural networks with \emph{label} differential privacy (LDP), and show that when only the label needs to be protected, the model performance can be significantly improved over the previous state-of-the-art private baselines. Moreover, we study different ways to obtain priors, which when used with RRP can additionally improve the model performance, further reducing the accuracy gap between private and non-private models. We complement the empirical results with theoretical analysis showing that LDP is provably easier than protecting both the inputs and labels.

Badih Ghazi, Noah Golowich, Ravi Kumar, Pasin Manurangsi, Chiyuan Zhang
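One way a prior can help, sketched under our own assumptions (this illustrates the RRP idea, not necessarily the authors' exact algorithm): restrict randomized response to the top-k most probable labels, choosing k to maximize the chance of reporting the true label.

```python
import numpy as np

rng = np.random.default_rng(4)

def kary_rr(label, choices, eps):
    # Plain k-ary randomized response over a candidate set.
    k = len(choices)
    if k == 1 or rng.random() < np.exp(eps) / (np.exp(eps) + k - 1):
        return int(label)
    return int(rng.choice([c for c in choices if c != label]))

def rr_with_prior(label, prior, eps):
    # Pick the top-k label set maximizing the probability of reporting the
    # true label, then run k-ary RR on that set. If the true label falls
    # outside it, answer uniformly over the set (preserving eps-label-DP).
    order = np.argsort(prior)[::-1]
    ks = np.arange(1, len(prior) + 1)
    p_true = np.exp(eps) / (np.exp(eps) + ks - 1)
    k = int(ks[np.argmax(p_true * np.cumsum(prior[order]))])
    topk = order[:k]
    if label in topk:
        return kary_rr(label, list(topk), eps)
    return int(rng.choice(topk))

prior = np.array([0.7, 0.2, 0.05, 0.05])
out = rr_with_prior(0, prior, eps=1.0)
```

With a concentrated prior the chosen set can be small (here k = 1), so the true label survives far more often than under plain k-ary RR at the same ε, which is the accuracy gain the abstract describes.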
-

Computing the noisy sum of real-valued vectors is an important primitive in differentially private learning and statistics. In private federated learning applications, these vectors are held by client devices, leading to a distributed summation problem. Standard Secure Multiparty Computation protocols for this problem are susceptible to poisoning attacks, where a client may have a large influence on the sum, without being detected.

In this work, we propose a poisoning-robust private summation protocol in the multiple-server setting, recently studied in PRIO. We present a protocol for vector summation that verifies that the Euclidean norm of each contribution is approximately bounded. We show that by relaxing the security constraint in SMC to a differential privacy like guarantee, one can improve over PRIO in terms of communication requirements as well as the client-side computation. Unlike SMC algorithms that inevitably cast integers to elements of a large finite field, our algorithms work over integers/reals, which may allow for additional efficiencies.

Kunal Talwar
-
We study the problem of unlearning datapoints from a learnt model. The learner first receives a dataset $S$ drawn i.i.d. from an unknown distribution, and outputs a model $w$ that performs well on unseen samples from the same distribution. However, at some point in the future, any training datapoint $z \in S$ can request to be unlearned, thus prompting the learner to modify its output model while still ensuring the same accuracy guarantees. We initiate a rigorous study of generalization in machine unlearning, where the goal is to perform well on previously unseen datapoints. Our focus is on both computational and storage complexity. For the setting of convex losses, we provide an unlearning algorithm that can unlearn up to $O(n/d^{1/4})$ samples, where $d$ is the problem dimension. In comparison, in general, differentially private learning (which implies unlearning) only guarantees deletion of $O(n/d^{1/2})$ samples. This demonstrates a novel separation between differential privacy and machine unlearning.
Ayush Sekhari, Jayadev Acharya, Gautam Kamath, Ananda Theertha Suresh
-

Bayesian neural networks (BNNs) allow for uncertainty quantification in prediction, offering an advantage over regular neural networks that has not been explored in the differential privacy (DP) framework. We fill this important gap by leveraging recent developments in Bayesian deep learning and privacy accounting to offer a more precise analysis of the trade-off between privacy and accuracy in BNNs. We propose three DP-BNNs that characterize the weight uncertainty for the same network architecture in distinct ways, namely DP-SGLD (via the noisy gradient method), DP-BBP (via changing the parameters of interest) and DP-MC Dropout (via the model architecture). Interestingly, we show a new equivalence between DP-SGD and DP-SGLD, implying that some non-Bayesian DP training naturally allows for uncertainty quantification. However, hyperparameters such as learning rate and batch size can have different or even opposite effects in DP and non-DP training.

Extensive experiments show that the sampling method (DP-SGLD) significantly outperforms optimization methods (DP-BBP and DP-MC Dropout) in terms of privacy guarantee, prediction accuracy, uncertainty quantification, computation speed, and generalizability. When compared to non-DP and non-Bayesian approaches, DP-SGLD loses remarkably little performance, demonstrating the great potential of DP-BNN in real tasks.

Woody Bu, Qiyiwen Zhang, Kan Chen, Qi Long
-

We study private synthetic data generation for query release, where the goal is to construct a sanitized version of a sensitive dataset, subject to differential privacy, that approximately preserves the answers to a large collection of statistical queries. We first present an algorithmic framework that unifies a long line of iterative algorithms in the literature. Under this framework, we propose two new methods. The first method, private entropy projection (PEP), can be viewed as an advanced variant of MWEM that adaptively reuses past query measurements to boost accuracy. Our second method, generative networks with the exponential mechanism (GEM), circumvents computational bottlenecks in algorithms such as MWEM and PEP by optimizing over generative models parameterized by neural networks, which capture a rich family of distributions while enabling fast gradient-based optimization. We demonstrate that PEP and GEM empirically outperform existing algorithms. Furthermore, we show that GEM nicely incorporates prior information from public data while overcoming limitations of PMW^Pub, the existing state-of-the-art method that also leverages public data.

Terrance Liu, Giuseppe Vietri, Steven Wu
-

We initiate a study of the composition properties of interactive differentially private mechanisms. An interactive differentially private mechanism is an algorithm that allows an analyst to adaptively ask queries about a sensitive dataset, with the property that an adversarial analyst's view of the interaction is approximately the same regardless of whether or not any individual's data is in the dataset. Previous studies of composition of differential privacy have focused on non-interactive algorithms, but interactive mechanisms are needed to capture many of the intended applications of differential privacy and a number of the important differentially private primitives. We focus on concurrent composition, where an adversary can arbitrarily interleave its queries to several differentially private mechanisms, as may arise when differentially private query systems are deployed in practice.
We prove that when the interactive mechanisms being composed are pure differentially private, their concurrent composition achieves privacy parameters (with respect to pure or approximate differential privacy) that match the (optimal) composition theorem for noninteractive differential privacy. We also prove a composition theorem for interactive mechanisms that satisfy approximate differential privacy. That bound is weaker than even the basic (suboptimal) composition theorem for noninteractive differential privacy, and we leave closing the gap as a direction for future research, along with understanding concurrent composition for other variants of differential privacy.

Salil Vadhan, Tianhao Wang
-
Most works in learning with differential privacy (DP) have focused on the setting where each user has a single sample. In this work, we consider the setting where each user holds $m$ samples and the privacy protection is enforced at the level of each user's data. We show that, in this setting, we can learn with far fewer users. Specifically, we show that, as long as each user receives sufficiently many samples, we can learn any privately learnable class via an $(\epsilon, \delta)$-DP algorithm using only $O(\log(1/\delta)/\epsilon)$ users. For $\epsilon$-DP algorithms, we show that we can learn using only $O_{\epsilon}(d)$ users even in the local model, where $d$ is the probabilistic representation dimension. In both cases, we show a nearly-matching lower bound on the number of users required. A crucial component of our results is a generalization of \emph{global stability} [Bun et al., 2020] that allows the use of public randomness. Under this relaxed notion, we employ a correlated sampling strategy to show that the global stability can be boosted to be arbitrarily close to one, at a polynomial expense in the number of samples.
Badih Ghazi, Ravi Kumar, Pasin Manurangsi
-

Providing privacy guarantees has been one of the primary motivations of Federated Learning (FL). However, to guarantee client-level differential privacy (DP) in FL algorithms, the clients' transmitted model updates have to be clipped before adding privacy noise. This clipping operation is substantially different from its counterpart in centralized differentially private SGD and has not been well understood. In this paper, we first empirically demonstrate that the clipped FedAvg can perform surprisingly well even with substantial data heterogeneity when training neural networks, which is partly because the clients' updates become similar for several popular deep architectures. Based on this key observation, we provide the convergence analysis of a DP FedAvg algorithm and highlight the relationship between clipping bias and the distribution of the clients' updates. To the best of our knowledge, this is the first work that rigorously investigates theoretical and empirical issues regarding the clipping operation in FL algorithms.

Xinwei Zhang, Xiangyi Chen, Steven Wu, Mingyi Hong
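The clipping operation under study can be sketched as follows (a simplified one-round, server-side view with illustrative parameters, not the paper's full algorithm):

```python
import numpy as np

rng = np.random.default_rng(5)

def clip_update(update, clip_norm):
    # Scale the client's update so its L2 norm is at most clip_norm.
    norm = np.linalg.norm(update)
    return update * min(1.0, clip_norm / max(norm, 1e-12))

def private_average(updates, clip_norm=1.0, noise_mult=1.0):
    # Sum clipped client updates, add Gaussian noise calibrated to the
    # clipping bound, and average (one DP-FedAvg-style round).
    clipped = [clip_update(u, clip_norm) for u in updates]
    noise = rng.normal(0.0, noise_mult * clip_norm, size=clipped[0].shape)
    return (np.sum(clipped, axis=0) + noise) / len(updates)

updates = [rng.normal(0.0, 1.0, size=10) for _ in range(100)]
avg = private_average(updates)
```

The bias the paper analyzes comes from `clip_update`: when clients' updates differ widely in norm, clipping distorts their relative contributions, whereas similar updates (as observed for several deep architectures) are distorted almost uniformly.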
-
Characterizing the privacy degradation over compositions, i.e., privacy accounting, is a fundamental topic in differential privacy (DP) with many applications to differentially private machine learning and federated learning. We propose a unification of recent advances (Renyi DP, privacy profiles, $f$-DP and the PLD formalism) via the \emph{characteristic function} ($\phi$-function) of a certain ``worst-case'' privacy loss random variable. We show that our approach allows \emph{natural} adaptive composition like Renyi DP, provides \emph{exactly tight} privacy accounting like PLD, and can often be \emph{losslessly} converted to a privacy profile and $f$-DP, thus providing $(\epsilon,\delta)$-DP guarantees and interpretable tradeoff functions. Algorithmically, we propose an \emph{analytical Fourier accountant} that represents the \emph{complex} logarithm of $\phi$-functions symbolically and uses Gaussian quadrature for numerical computation. On several popular DP mechanisms and their subsampled counterparts, we demonstrate the flexibility and tightness of our approach in theory and experiments.
Yuqing Zhu, Jinshuo Dong, Yu-Xiang Wang
-
Estimating the quantiles of a large dataset is a fundamental problem in both the streaming algorithms literature and the differential privacy literature. However, all existing private mechanisms for distribution-independent quantile computation require space polynomial in the input size $n$. In this work, we devise a differentially private algorithm for the quantile estimation problem with strongly sublinear space complexity. Our main mechanism estimates any $\alpha n$-approximate quantile of a length-$n$ stream over a data universe $\mathcal{X}$ with probability $1-\beta$ using $\tilde{O}\left( \frac{\log |\mathcal{X}| \log n}{\alpha \epsilon \beta}\right)$ space while satisfying $\epsilon$-differential privacy. Our mechanisms build upon the deterministic streaming algorithm of Greenwald and Khanna (2001) for non-private quantile estimation, instantiating the exponential mechanism using a utility function defined on sketch items, while (privately) sampling from intervals defined by the sketch. We also present another algorithm based on histograms that is especially suited to the multiple quantiles case. We implement our algorithms and experimentally evaluate them on synthetic and real-world datasets.
Daniel Alabi, Omri Ben-Eliezer, Anamay Chaturvedi
-
[ Visit Poster at Spot B3 in Virtual World ]
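For intuition about instantiating the exponential mechanism over intervals, here is a simpler offline (non-streaming) version of private quantile estimation; a hypothetical sketch in the style of Smith's classic construction, not the paper's sublinear-space algorithm:

```python
import math
import random

def private_quantile(data, q, eps, lo, hi):
    """Exponential mechanism for the q-quantile: interval i between
    consecutive sorted points has utility -|i - q*n|, weighted by its
    length; sample an interval, then a uniform point inside it."""
    x = sorted(min(max(v, lo), hi) for v in data)
    n = len(x)
    edges = [lo] + x + [hi]
    target = q * n
    weights = []
    for i in range(n + 1):
        length = edges[i + 1] - edges[i]
        weights.append(length * math.exp(-(eps / 2.0) * abs(i - target)))
    r = random.random() * sum(weights)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return random.uniform(edges[i], edges[i + 1])
    return edges[-1]
```

The streaming setting replaces the sorted data with a Greenwald-Khanna sketch, which is what makes the sublinear space bound possible.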

Differential privacy is the gold standard when it comes to facilitating data analysis with strong privacy guarantees. The past decade has witnessed the design of numerous algorithms for differentially private graph analysis. These algorithms offer different privacy guarantees, and require different parameters and configurations for their usage. Some of these algorithms also suffer from poor scalability and are not equipped to handle large datasets. The combination of these factors makes determining the optimal algorithmic choice for a given scenario a non-trivial affair.

In this paper, we examine the trade-offs between the accuracy and performance of various classes of differentially private graph analysis algorithms by benchmarking them on real-world datasets. Our evaluation demonstrates that the optimal choice of algorithm is highly dependent on the properties of the underlying dataset as well as the performance and privacy requirements of a given user. This may explain why, despite a wealth of differentially private algorithms being available for graph analysis, their usage is not yet prevalent. We therefore urge the adoption of a standardized benchmarking platform, to facilitate the design and implementation of differentially private graph algorithms.

Huiyi Ning, Sreeharsha Udayashankar, Sara Qunaibi, Karl Knopf, Xi He
-
[ Visit Poster at Spot B1 in Virtual World ]
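As a baseline for what such benchmarks compare, the simplest edge-DP graph statistic is a Laplace-noised edge count; a minimal sketch (our illustration, not one of the benchmarked algorithms):

```python
import math
import random

def sample_laplace(scale, rng=random):
    # Inverse-CDF sampling of Laplace(0, scale).
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def private_edge_count(edges, eps, rng=random):
    """Edge-DP edge count: adding or removing a single edge changes
    the count by exactly 1, so Laplace(1/eps) noise suffices."""
    return len(edges) + sample_laplace(1.0 / eps, rng)
```

Richer statistics (degree distributions, subgraph counts) have much higher sensitivity, which is one source of the accuracy/performance trade-offs the benchmark studies.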

The sequential hypothesis testing problem is a class of statistical analyses where the sample size is not fixed in advance, and the analyst must make real-time decisions until a stopping criterion is reached. In this work, we study the sequential hypothesis testing problem under the constraint of Renyi differential privacy for the sample. Unlike previous work in private hypothesis testing, which focuses on the classical fixed-sample setting, our results may allow a conclusion to be reached much earlier, and thus save the cost of collecting additional samples. We present a new private algorithm based on Wald's Sequential Probability Ratio Test (SPRT) that gives strong theoretical privacy guarantees. We provide theoretical analysis of statistical performance, measured by Type I and Type II error as well as the expected sample size, and we also provide empirical validation of our results.

Wanrong Zhang, Yajun Mei, Rachel Cummings
-
[ Visit Poster at Spot C1 in Virtual World ]

A large body of research has shown that complex machine learning models are vulnerable to membership inference attacks. Research on membership inference has so far focused on the case of a single standalone model, while real machine learning pipelines typically update models over time, giving the attacker more information. We show that attackers can exploit this information to carry out more powerful membership inference attacks than they could if they only had access to a single model. Our main contributions are to formalize membership inference attacks in the setting of model updates, suggest new attack strategies to exploit model updates, and validate these strategies both theoretically and empirically.

Matthew Jagielski, Stanley Wu, Alina Oprea, Jonathan Ullman, Roxana Geambasu
-
[ Visit Poster at Spot B3 in Virtual World ]

When differentially private statistics are used for decision-making for numerous units (e.g., jurisdictions), some units can be effectively randomly assigned to interventions, even in the presence of an ostensibly deterministic policy. Here we develop methods for using this randomness for causal inference about the effects of those interventions. In particular, we characterize deterministic policies applied to differentially private data as corresponding to stochastic policies applied to the original private data. We develop inverse-probability-weighted estimators, which make use of the probability with which a unit is assigned to treatments by the privacy mechanism, and versions of these estimators that are themselves differentially private. These estimators can have substantial advantages over other methods, such as regression discontinuity analyses, that avoid using these probabilities of assignment. Using the U.S. Census Bureau's American Community Survey, we illustrate a potential application to how translated voter information affects voter turnout.

Leon Yao, Naoise Holohan, David Arbour, Dean Eckles
-
[ Visit Poster at Spot D3 in Virtual World ]
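The key observation, that a deterministic threshold applied to Laplace-noised counts is really a stochastic policy with known assignment probabilities, can be sketched as follows (a simplified illustration with our own function names, not the paper's estimators):

```python
import math

def laplace_sf(x, scale):
    """P(Laplace(0, scale) >= x), in closed form."""
    if x >= 0:
        return 0.5 * math.exp(-x / scale)
    return 1.0 - 0.5 * math.exp(x / scale)

def ipw_treated_mean(counts, treated, outcomes, threshold, eps):
    """Units are treated when their Laplace(1/eps)-noised count crosses
    the threshold. Because the noise distribution is public, each unit's
    assignment probability is known exactly and can be inverted for
    Horvitz-Thompson (inverse-probability) weighting."""
    scale = 1.0 / eps
    total = 0.0
    for c, z, y in zip(counts, treated, outcomes):
        p = laplace_sf(threshold - c, scale)
        if z:
            total += y / p
    return total / len(counts)
```

A regression-discontinuity analysis would ignore these probabilities; using them directly is what gives the weighted estimators their advantage.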
The Wasserstein distance, rooted in optimal transport (OT) theory, is a popular discrepancy measure between probability distributions with various applications to statistics and machine learning. Despite their rich structure and demonstrated utility, Wasserstein distances are sensitive to outliers in the considered distributions, which hinders applicability in practice. This motivates us to propose a new outlier-robust Wasserstein distance, denoted $W^{\delta}_p$, that identifies probability measures within total variation distance $\delta$ of each other. We conduct a thorough theoretical study of $W^{\delta}_p$, encompassing characterization of optimal perturbations, regularity, duality, and statistical results. In particular, we derive a remarkably simple dual form for $W^{\delta}_p$ that lends itself to implementation via an elementary modification to standard, duality-based OT solvers. We also reveal connections between robust OT and Pufferfish privacy (PP), a generalization of differential privacy, demonstrating that $W^{\delta}_{\infty}$ naturally generalizes the relation between $W_{\infty}$ and $(\epsilon,0)$-PP to the $(\epsilon,\delta)$-privacy regime.
Sloan Nietert, Rachel Cummings, Ziv Goldfeld
-
[ Visit Poster at Spot D4 in Virtual World ]

Hyperparameter optimization is a ubiquitous challenge in machine learning, and the performance of a trained model depends crucially on the hyperparameters selected. While a rich set of tools exists for this purpose, there are currently no practical hyperparameter selection methods under the constraint of differential privacy (DP). We study honest hyperparameter selection for differentially private machine learning, in which the process of hyperparameter tuning is accounted for in the overall privacy budget. To this end, we i) show that standard composition tools outperform more advanced techniques in many settings, ii) empirically and theoretically demonstrate an intrinsic connection between the learning rate and clipping norm, iii) show that adaptive optimizers like DPAdam enjoy a significant advantage in the process of honest hyperparameter tuning, and iv) draw upon novel limiting behaviour of Adam in the DP setting to design a new and more efficient optimizer.

Shubhankar Mohapatra, Sajin Sasy, Gautam Kamath, Xi He, Om Thakkar
-
[ Visit Poster at Spot B5 in Virtual World ]
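The clip-and-noise step at the heart of DP-SGD, whose interaction with the learning rate the paper studies, can be sketched as follows (a minimal illustration, not the paper's optimizer):

```python
import numpy as np

def dp_sgd_step(params, per_sample_grads, clip_norm, noise_mult, lr, rng):
    """One DP-SGD update: clip each per-sample gradient to L2 norm
    clip_norm, sum, add Gaussian noise scaled to clip_norm, average."""
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    n = len(clipped)
    noise = rng.normal(0.0, noise_mult * clip_norm, size=params.shape)
    grad = (np.sum(clipped, axis=0) + noise) / n
    return params - lr * grad
```

One way to see the coupling the abstract mentions: once most gradients are clipped, both signal and noise scale with clip_norm, so the update magnitude is governed by the product lr * clip_norm rather than by either hyperparameter alone.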

Differential privacy is a restriction on data processing algorithms that provides strong confidentiality guarantees for individual records in the data. However, research on proper statistical inference, that is, research on properly quantifying the uncertainty of the (noisy) sample estimate regarding the true value in the population, is currently still limited. This paper proposes and evaluates several strategies to compute valid differentially private confidence intervals for the median. Instead of computing a differentially private point estimate and deriving its uncertainty, we directly estimate the interval bounds and discuss why this approach is superior if ensuring privacy is important. We also illustrate that addressing both sources of uncertainty--the error from sampling and the error from protecting the output--simultaneously should be preferred over simpler approaches that incorporate the uncertainty in a sequential fashion. We evaluate the performance of the different algorithms under various parameter settings in extensive simulation studies and demonstrate how the findings could be applied in practical settings using data from the 1940 Decennial Census.

Joerg Drechsler, Ira Globus-Harris, Audra McMillan, Adam Smith, Jayshree Sarathy
-
[ Visit Poster at Spot A4 in Virtual World ]

In deep learning with differential privacy (DP), privacy usually comes at the cost of slower convergence (and thus lower performance) than the non-private counterpart. This work gives the first convergence analysis of DP deep learning, through the lens of training dynamics and the neural tangent kernel (NTK). Our convergence theory successfully characterizes the effects of two key components in DP training: the per-sample clipping (flat or layerwise) and the noise addition. Our analysis not only initiates a general principled framework to understand DP deep learning with any network architecture and loss function, but also motivates a new clipping method -- the \textit{global clipping} -- that significantly improves the convergence while preserving the same privacy guarantee as the existing \textit{local clipping}.

Woody Bu, Hua Wang, Qi Long, Weijie Su
-
[ Visit Poster at Spot A3 in Virtual World ]

We study the Renyi Differential Privacy (RDP) for general discrete local mechanisms in the {\em shuffle} privacy model, where clients randomize their responses using a local differentially private (LDP) mechanism and the untrusted server only receives a random permutation (shuffle) of the client responses without association to each client. The principal result in this paper is the \emph{first} non-trivial RDP guarantee for general discrete local randomization mechanisms in the shuffled privacy model, and we develop new analysis techniques for deriving our results which could be of independent interest. In applications, such an RDP guarantee is most useful when we use it for composing several private interactions. We numerically demonstrate that, for important regimes, with composition our bound yields a significant improvement in privacy guarantee over the state-of-the-art approximate Differential Privacy (DP) guarantee (with standard composition) for shuffled models.

Antonious M Girgis, Deepesh Data, Suhas Diggavi, Ananda Theertha Suresh, Peter Kairouz
-
Recent work of Erlingsson, Feldman, Mironov, Raghunathan, Talwar, and Thakurta [EFMRTT19] demonstrates that random shuffling amplifies differential privacy guarantees of locally randomized data. Such amplification implies substantially stronger privacy guarantees for systems in which data is contributed anonymously [BEMMRLRKTS17] and has led to significant interest in the shuffle model of privacy [CSUZZ19; EFMRTT19]. We show that random shuffling of $n$ data records that are input to $\epsilon_0$-differentially private local randomizers results in an $(O((1-e^{-\epsilon_0})\sqrt{\frac{e^{\epsilon_0}\log(1/\delta)}{n}}), \delta)$-differentially private algorithm. This significantly improves over previous work and achieves the asymptotically optimal dependence on $\epsilon_0$. Our result is based on a new approach that is simpler than previous work and extends to approximate differential privacy with nearly the same guarantees. Our work also yields an empirical method to derive tighter bounds on the resulting $\epsilon$, and we show that it gets to within a small constant factor of the optimal bound. It also naturally extends to an empirical method to provide bounds for Renyi differential privacy in the shuffle model.
Vitaly Feldman, Audra McMillan, Kunal Talwar
-
[ Visit Poster at Spot B5 in Virtual World ]
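The asymptotic shape of the amplification bound above is easy to evaluate numerically; the constant factor is set to 1 here purely for illustration (the theorem carries explicit constants and validity conditions):

```python
from math import exp, log, sqrt

def shuffle_amplified_eps(eps0, n, delta):
    """Asymptotic form of the amplification-by-shuffling bound:
    (1 - e^{-eps0}) * sqrt(e^{eps0} * log(1/delta) / n).
    Illustrative only: constants and small-n conditions omitted."""
    return (1.0 - exp(-eps0)) * sqrt(exp(eps0) * log(1.0 / delta) / n)
```

The takeaway is the rate: a fixed local guarantee $\epsilon_0$ shrinks roughly as $1/\sqrt{n}$ once the reports are shuffled among $n$ users.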

In this work, we leverage recent advances in time-uniform supermartingale concentration to provide a unified analysis for advanced privacy composition, privacy filters, and privacy odometers. As a consequence, we are able to construct improved privacy filters and odometers for fully adaptive private data analysis.

Justin Whitehouse, Aaditya Ramdas, Ryan Rogers, Steven Wu
-
[ Visit Poster at Spot A5 in Virtual World ]
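For context, the fixed-budget bounds that privacy filters and odometers extend to the fully adaptive setting are the basic and advanced composition theorems; a minimal sketch:

```python
from math import exp, log, sqrt

def basic_composition(eps0, delta0, k):
    """k mechanisms, each (eps0, delta0)-DP, compose to (k*eps0, k*delta0)-DP."""
    return k * eps0, k * delta0

def advanced_composition(eps0, delta0, k, delta_slack):
    """Dwork-Rothblum-Vadhan advanced composition: k mechanisms, each
    (eps0, delta0)-DP, compose to (eps, k*delta0 + delta_slack)-DP with
    eps growing as sqrt(k) for small eps0."""
    eps = sqrt(2 * k * log(1.0 / delta_slack)) * eps0 \
        + k * eps0 * (exp(eps0) - 1.0)
    return eps, k * delta0 + delta_slack
```

Both theorems assume the number of rounds $k$ and the per-round budgets are fixed in advance; filters and odometers are exactly the tools that lift this restriction.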

A key challenge for data analysis in the federated setting is that user data is heterogeneous, i.e., it cannot be assumed to be sampled from the same distribution. Further, in practice, different users may possess vastly different numbers of samples. In this work we propose a simple model of heterogeneous user data that differs in both distribution and quantity of data, and we provide a method for estimating the population-level mean while preserving user-level differential privacy. We demonstrate asymptotic optimality of our estimator within a natural class of private estimators and also prove general lower bounds on the error achievable in our problem. In particular, while the optimal non-private estimator can be shown to be linear, we show that privacy constrains us to use a non-linear estimator.

Rachel Cummings, Vitaly Feldman, Audra McMillan, Kunal Talwar
-
[ Visit Poster at Spot B1 in Virtual World ]
In this work, we design differentially private hypothesis tests for the following problems in the general linear model: testing a linear relationship and testing for the presence of mixtures. The majority of our hypothesis tests are based on differentially private versions of the $F$-statistic for the general linear model framework, which are uniformly most powerful unbiased in the non-private setting. We also present another test for testing mixtures, based on the differentially private nonparametric tests of Couch, Kazan, Shi, Bray, and Groce (CCS 2019), which is especially suited for the small dataset regime. Through a suite of Monte Carlo based experiments, we show that our tests achieve desired \textit{significance levels} and have a high \textit{power} that approaches the power of the non-private tests as we increase sample sizes or the privacy-loss parameter.
Daniel Alabi, Salil Vadhan
-
[ Visit Poster at Spot B0 in Virtual World ]

We present a method for producing unbiased parameter estimates and valid confidence sets under the constraints of differential privacy. Prior work in this area is limited in that it is tailored to calculating confidence intervals for specific statistical procedures, such as mean estimation or simple linear regression. While other recent work can produce confidence intervals for more general sets of procedures, they either yield only approximately unbiased estimates, are designed for one-dimensional outputs, or assume significant user knowledge about the data-generating distribution. In contrast, our method uses the CoinPress algorithm in tandem with the Sample-Aggregate framework to produce estimates from general high-dimensional estimators that are, with high probability, unbiased and have valid confidence sets. These theoretical guarantees hold provided that the estimator, when applied over subsets of a random partition of the original data, produces estimates following a multivariate Gaussian distribution. We also propose improvements to the existing CoinPress algorithm which we find lead to more accurate estimates in practice.

Christian Covington, Xi He, James Honaker, Gautam Kamath
-
[ Visit Poster at Spot A4 in Virtual World ]
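The Sample-Aggregate idea can be illustrated with a stripped-down private aggregator (our simplification: CoinPress's multi-step refinement replaces the naive clipped mean used here, and the Gaussian-noise scale below is stated in units of the sensitivity):

```python
import numpy as np

def sample_and_aggregate_mean(data, estimator, num_blocks, sigma, clip, rng):
    """Run an arbitrary estimator on disjoint blocks of the data, then
    privately average the block estimates. Each individual sits in one
    block, so after clipping each estimate to [-clip, clip] the average
    has sensitivity 2*clip/num_blocks; Gaussian noise is calibrated
    to that sensitivity."""
    blocks = np.array_split(np.asarray(data), num_blocks)
    ests = np.clip([estimator(b) for b in blocks], -clip, clip)
    noise = rng.normal(0.0, sigma * 2.0 * clip / num_blocks)
    return float(np.mean(ests)) + noise
```

The paper's Gaussianity requirement on the block estimates is what lets this kind of aggregate come with valid, unbiased confidence sets rather than just a point estimate.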

Privacy issues arise in a number of computational tasks, notably in learning problems. In this context, differentially private mechanisms ensure that the final output does not depend too heavily on any one input or specific training example. The problem of private learning has been extensively studied. In particular, a striking equivalence between local differentially private algorithms and statistical query algorithms has been shown in [Kas+11]. In addition, the statistical query model has been recently extended to quantum computation in [AGY20]. In this work, we give a formal definition of quantum local differential privacy and we extend the aforementioned result of [Kas+11], addressing an open problem posed in [AQS21].

Armando Angrisani, Elham Kashefi
-
[ Visit Poster at Spot D1 in Virtual World ]

We consider training models with differential privacy (DP) using mini-batch gradients. The existing state-of-the-art, Differentially Private Stochastic Gradient Descent (DP-SGD), requires \emph{privacy amplification by sampling or shuffling} to obtain the best privacy/accuracy/computation trade-offs. Unfortunately, the precise requirements on exact sampling and shuffling can be hard to obtain in important practical scenarios, particularly federated learning (FL). We design and analyze a DP variant of Follow-The-Regularized-Leader (DP-FTRL) that compares favorably (both theoretically and empirically) to amplified DP-SGD, while allowing for much more flexible data access patterns. DP-FTRL does not use any form of privacy amplification.

Peter Kairouz, Hugh B McMahan, Shuang Song, Om Thakkar, Abhradeep Guha Thakurta, Zheng Xu
-
[ Visit Poster at Spot C5 in Virtual World ]
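The data-access pattern that lets DP-FTRL dispense with amplification is binary-tree aggregation of gradient prefix sums; a toy scalar version (our simplification, with one Gaussian draw per dyadic interval):

```python
import numpy as np

def tree_prefix_sums(values, sigma, rng):
    """Tree aggregation for private prefix sums: each dyadic interval
    gets a single Gaussian noise draw, reused by every prefix that
    contains it, so the prefix sum at step t combines only the
    O(log t) intervals covering [0, t)."""
    noise = {}

    def node_noise(start, size):
        key = (start, size)
        if key not in noise:
            noise[key] = rng.normal(0.0, sigma)
        return noise[key]

    out = []
    for t in range(1, len(values) + 1):
        total, start, rem = 0.0, 0, t
        while rem > 0:
            size = 1 << (rem.bit_length() - 1)  # largest dyadic block left
            total += sum(values[start:start + size]) + node_noise(start, size)
            start += size
            rem -= size
        out.append(total)
    return out
```

Each prefix touches at most about $\log_2 T$ noise draws, so per-step error grows polylogarithmically in $T$; this is what gives favorable trade-offs without any sampling or shuffling assumptions.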
As part of the 2020 decennial census, the US Census Bureau has developed a new approach to disclosure avoidance, based on differential privacy, called the TopDown Algorithm. The first results to use this new approach will be the Public Law 94-171 redistricting file, and the Census Bureau recently released a demonstration of their algorithm on data from the 2010 census using a privacy loss budget of $\epsilon=12.2$. We conducted a simulation study to investigate the risk of re-identification in data like this. We reconstructed a microdata file based on the 2010 decennial census and used this to simulate a commercial database and a test database. We used exact record linkage between the simulated commercial database and the demonstration data (as well as simulated demonstration data with larger and smaller privacy loss budgets) and measured the putative and confirmed re-identifications to investigate how much protection a range of values of $\epsilon$ might provide. In our simulation, we found 40.5 million putative re-identifications among the total US population, of which 38.6 million were confirmed. Among individuals who are not of the majority race/ethnicity of their census tract, we found 2.5 million putative re-identifications, of which 0.9 million were confirmed. Balancing the trade-off between privacy and accuracy is a policy decision, and we hope that this analysis can provide some additional evidence to inform stakeholders.
Abraham Flaxman
-
[ Visit Poster at Spot C3 in Virtual World ]
We show a generic reduction from multiclass differentially private PAC learning to binary private PAC learning. We apply this transformation to a recently proposed binary private PAC learner to obtain a private multiclass learner with sample complexity that has a polynomial dependence on the multiclass Littlestone dimension and a poly-logarithmic dependence on the number of classes. This yields an exponential improvement in the dependence on both parameters over learners from previous work. Our proof extends the notion of $\Psi$-dimension defined in work of Ben-David et al. \cite{Ben} to the online setting and explores its general properties.
Satchit Sivakumar, Mark Bun, Marco Gaboradi
-
[ Visit Poster at Spot B5 in Virtual World ]

Balancing privacy and accuracy is a major challenge in designing differentially private machine learning algorithms. One way to improve this tradeoff for free is to leverage the noise in common data operations that already use randomness. Such operations include noisy SGD and data subsampling. The additional noise in these operations may amplify the privacy guarantee of the overall algorithm, a phenomenon known as {\em{privacy amplification}}. In this paper, we analyze the privacy amplification of sampling from a multidimensional Bernoulli distribution family given the parameter from a private algorithm. This setup has applications to Bayesian inference and to data compression. We provide an algorithm to compute the amplification factor, and we establish upper and lower bounds on this factor.

Jacob Imola, Kamalika Chaudhuri
-
[ Visit Poster at Spot A1 in Virtual World ]

We present a new mechanism for label differential privacy, a relaxation of differentially private machine learning that only protects the privacy of the labels in the training set. Our mechanism clusters the examples in the training set using their (non-private) feature vectors, randomly re-samples each label from examples in the same cluster, and outputs a training set with noisy labels as well as a modified version of the true loss function. We prove that when the clusters are both large and high-quality, the model that minimizes the modified loss on the noisy training set converges to small excess risk at a rate that is comparable to the rate for non-private learning. Our experiments show that randomizing the labels within each cluster significantly improves the privacy vs. accuracy trade-off compared to applying uniform randomized response to the labels, and also compared to learning a model via DP-SGD.

Hossein Esfandiari, Vahab Mirrokni, Umar Syed, Sergei Vassilvitskii
-
[ Visit Poster at Spot D3 in Virtual World ]
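The uniform randomized-response baseline that the clustering mechanism is compared against is simple to state; a sketch of that baseline (the paper's mechanism instead re-samples each label from examples in the same feature cluster):

```python
import math
import random

def randomize_label(label, num_classes, eps, rng=random):
    """k-ary randomized response on labels: keep the true label with
    probability e^eps / (e^eps + k - 1), otherwise output one of the
    other k - 1 labels uniformly. The likelihood ratio between any two
    true labels is exactly e^eps, so this satisfies eps-label-DP."""
    k = num_classes
    p_keep = math.exp(eps) / (math.exp(eps) + k - 1)
    if rng.random() < p_keep:
        return label
    other = rng.randrange(k - 1)
    return other if other < label else other + 1
```

Resampling within a cluster keeps the same privacy guarantee while making the noisy labels far more correlated with the true ones when clusters are large and pure, which is the source of the improved trade-off.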
We propose and analyze algorithms to solve a range of learning tasks under user-level differential privacy constraints. Rather than guaranteeing only the privacy of individual samples, user-level DP protects a user's entire contribution ($m \ge 1$ samples), providing more stringent but more realistic protection against information leaks. We show that for high-dimensional mean estimation, empirical risk minimization with smooth losses, stochastic convex optimization, and learning hypothesis classes with finite metric entropy, the privacy cost decreases as $O(1/\sqrt{m})$ as users provide more samples. In contrast, when increasing the number of users $n$, the privacy cost decreases at a faster $O(1/n)$ rate. We complement these results with lower bounds showing the minimax optimality of our algorithms for mean estimation and stochastic convex optimization. Our algorithms rely on novel techniques for private mean estimation in arbitrary dimension with error scaling as the concentration radius $\tau$ of the distribution rather than the entire range.
Daniel A Levy, Ziteng Sun, Kareem Amin, Satyen Kale, Alex Kulesza, Mehryar Mohri, Ananda Theertha Suresh
-
[ Visit Poster at Spot C0 in Virtual World ]

In this paper, we focus on differentially private point and confidence interval estimators for simple linear regression. Motivated by recent work that highlights the strong empirical performance of an algorithm based on robust statistics, DPTheilSen, we provide a theoretical analysis of its privacy and accuracy properties, offer guidance on setting hyperparameters, and show how to produce non-parametric, differentially private confidence intervals to accompany its point estimates.

Jayshree Sarathy, Salil Vadhan
-
[ Visit Poster at Spot D6 in Virtual World ]

Local differential privacy (LDP) deployments are vulnerable to inference attacks, as an adversary can link the noisy responses to their identity and, subsequently, to auxiliary information using the \textit{order} of the data. An alternative model, shuffle DP, prevents this by shuffling the noisy responses uniformly at random. However, this limits the data learnability -- only symmetric functions (agnostic to the input order) can be learned. In this paper, we strike a balance and propose a generalized shuffling framework that interpolates between the two deployment models. We show that systematic shuffling of the noisy responses can thwart specific inference attacks while retaining some meaningful data learnability. To this end, we propose a novel privacy guarantee, $d_\sigma$-privacy, that captures the privacy of the order of a data sequence. $d_\sigma$-privacy allows tuning the granularity at which the ordinal information is maintained, which formalizes the degree of resistance to inference attacks while trading it off with data learnability. Additionally, we propose a novel shuffling mechanism that can achieve $d_\sigma$-privacy and demonstrate the practicality of our mechanism via evaluation on real-world datasets.

Casey M Meehan, Amrita Roy Chowdhury, Kamalika Chaudhuri, Somesh Jha
-
[ Visit Poster at Spot C4 in Virtual World ]

Protecting the privacy of people whose data is used by machine learning algorithms is important. Differential privacy is the appropriate mathematical framework for formal guarantees of privacy, and boosted decision trees are a popular machine learning technique. We therefore propose and test a practical algorithm for boosting decision trees that guarantees differential privacy. Privacy is enforced because our booster never puts too much weight on any one example; this ensures that each individual's data never influences a single tree ``too much.'' Experiments show that this boosting algorithm can produce better model sparsity and accuracy than other differentially private ensemble classifiers.

Marco Carmosino, Vahid Reza Asadi, Mohammad Mahdi Jahanara, Akbar Rafiey, Bahar Salamatian
-
[ Visit Poster at Spot A2 in Virtual World ]

Keystroke dynamics is a behavioural biometric form of authentication based on the inherent typing behaviour of an individual. While this technique is gaining traction, protecting the privacy of the users is of the utmost importance. Fully Homomorphic Encryption (FHE) is a technique that allows performing computation on encrypted data, which enables the processing of sensitive data. FHE is also considered ``future-proof'' since it is a lattice-based cryptosystem that is regarded as quantum-safe. It has seen significant performance improvements over the years, along with increasingly developer-friendly tooling. We propose a neural network for keystroke analysis, trained using differential privacy to speed up training while preserving privacy, that predicts on encrypted data using FHE to keep the users' privacy intact while maintaining sufficient usability.

Jatan Loya, Tejas Bana
-
[ Visit Poster at Spot A3 in Virtual World ]

Joint distribution estimation of a dataset under differential privacy is a fundamental problem for many privacy-focused applications, such as query answering, machine learning tasks and synthetic data generation. In this work, we examine the joint distribution estimation problem given two inputs: 1) differentially private answers to a workload computed over private data, and 2) a prior empirical distribution from a public dataset. Our goal is to find a new distribution such that estimating the workload using this distribution is as accurate as the differentially private answer, and the relative entropy, or KL divergence, of this distribution is minimized with respect to the prior distribution. We propose an approach based on iterative optimization for solving this problem. An application of our solution won second place in the NIST 2020 Differential Privacy Temporal Map Challenge, Sprint 2.

Yuchao Tao, Johes Bater, Ashwin Machanavajjhala
-
[ Visit Poster at Spot B0 in Virtual World ]
We study personalization of supervised learning with user-level differential privacy. Consider a setting with many users, each of whom has a training data set drawn from their own distribution $P_i$. Assuming some shared structure among the problems $P_i$, can users collectively learn the shared structure---and solve their tasks better than they could individually---while preserving the privacy of their data? We formulate this question using joint, \emph{user-level} differential privacy---that is, we control what is leaked about each user's entire data set. We provide algorithms that exploit popular non-private approaches in this domain like the Almost-No-Inner-Loop (ANIL) method, and give strong user-level privacy guarantees for our general approach. When the problems $P_i$ are linear regression problems with each user's regression vector lying in a common, unknown low-dimensional subspace, we show that our efficient algorithms satisfy nearly optimal estimation error guarantees. We also establish a general, information-theoretic upper bound via an exponential mechanism-based algorithm.
Prateek Jain, J K Rush, Adam Smith, Shuang Song, Abhradeep Guha Thakurta
-
[ Visit Poster at Spot C3 in Virtual World ]
We revisit the problem of counting the number of distinct elements in a data stream $D$ over a domain $[u]$. We propose an $(\epsilon,\delta)$-differentially private algorithm that approximates the distinct count within a factor of $(1\pm\gamma)$, and with additive error of $O(\sqrt{\ln(1/\delta)}/\epsilon)$, using space $O(\ln(\ln(u)/\gamma)/\gamma^2)$. We improve on the prior work at least quadratically and up to exponentially, in terms of both space and additive error. Our additive error guarantee is optimal up to a factor of $O(\sqrt{\ln(1/\delta)})$, and the space bound is optimal up to a factor of $O\left(\min\left\{\ln\left(\frac{\ln(u)}{\gamma}\right), \frac{1}{\gamma^2}\right\}\right)$. We assume the existence of an ideal uniform random hash function, and ignore the space required to store it. We later relax this requirement by assuming pseudorandom functions and appealing to a computational variant of differential privacy, SIM-CDP. Our algorithm is built on top of the celebrated Flajolet-Martin (FM) sketch. We show that the FM sketch is differentially private as is, as long as there are $\approx \sqrt{\ln(1/\delta)}/(\epsilon\gamma)$ distinct elements in the data set. Along the way, we prove a structural result showing that the maximum of $k$ i.i.d. random variables is statistically close (in the sense of $\epsilon$-differential privacy) to the maximum of $(k+1)$ i.i.d. samples from the same distribution, as long as $k=\Omega\left(\frac{1}{\epsilon}\right)$. Finally, experiments show that our algorithm introduces error within an order of magnitude of the non-private analogues for streams with thousands of distinct elements, even while providing a strong privacy guarantee ($\epsilon\leq 1$).
Adam Smith, Shuang Song, Abhradeep Guha Thakurta
-
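For intuition, here is a toy (and non-private, unoptimized) Flajolet-Martin-style sketch; the paper's point is that the output of the real FM sketch is already differentially private once the stream contains enough distinct elements:

```python
import random

def fm_estimate(stream, num_sketches=64, seed=0):
    """Toy FM-style distinct-count sketch: each sketch tracks the max
    of a geometric statistic (position of the lowest set bit of a
    salted hash) over the items seen; 2^max estimates the distinct
    count, averaged in the exponent across sketches. The estimate is
    biased up by a small constant factor; production FM/HLL variants
    apply bias corrections."""
    rnd = random.Random(seed)
    salts = [rnd.getrandbits(64) for _ in range(num_sketches)]
    maxes = [0] * num_sketches
    for x in stream:
        for j, salt in enumerate(salts):
            h = hash((salt, x)) & 0xFFFFFFFF
            # Trailing-bit statistic: P(z = k) = 2^{-k}.
            z = (h & -h).bit_length() if h else 32
            if z > maxes[j]:
                maxes[j] = z
    return 2.0 ** (sum(maxes) / num_sketches)
```

Because the sketch only remembers a maximum of i.i.d. geometric draws, one individual's presence barely moves its distribution when many distinct elements are present, which is the intuition behind the structural result above.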

Data deletion algorithms aim to remove the influence of deleted data points from trained models at a cheaper computational cost than fully retraining those models. However, for sequences of deletions, most prior work in the non-convex setting gives valid guarantees only for sequences that are chosen independently of the models that are published. If people choose to delete their data as a function of the published models (because they don't like what the models reveal about them, for example), then the update sequence is adaptive. In this paper, we give a general reduction from deletion guarantees against adaptive sequences to deletion guarantees against non-adaptive sequences, using differential privacy and its connection to max information. Combined with ideas from prior work which give guarantees for non-adaptive deletion sequences, this leads to extremely flexible algorithms able to handle arbitrary model classes and training methodologies, giving strong provable deletion guarantees for adaptive deletion sequences. We show in theory how prior work for non-convex models fails against adaptive deletion sequences, and use this intuition to design a practical attack against the SISA algorithm of Bourtoule et al. [2021] on CIFAR-10, MNIST, and Fashion-MNIST.

Varun Gupta, Christopher Jung, Seth Neel, Aaron Roth, Saeed Sharifi-Malvajerdi, Chris Waites
-
[ Visit Poster at Spot C4 in Virtual World ]

Modern machine learning models are complex and frequently encode surprising amounts of information about individual inputs. In extreme cases, complex models appear to memorize entire input examples, including seemingly irrelevant information (social security numbers from text, for example). In this paper, we aim to understand whether this sort of memorization is necessary for accurate learning. We describe natural prediction problems in which every sufficiently accurate training algorithm must encode, in the prediction model, essentially all the information about a large subset of its training examples. This remains true even when the examples are high-dimensional and have entropy much higher than the sample size, and even when most of that information is ultimately irrelevant to the task at hand. Further, our results do not depend on the training algorithm or the class of models used for learning.

Our problems are simple and fairly natural variants of the next-symbol prediction and the cluster labeling tasks. These tasks can be seen as abstractions of text- and image-related prediction problems. To establish our results, we reduce from a family of one-way communication problems for which we prove new information complexity lower bounds.

Gavin Brown, Mark Bun, Vitaly Feldman, Adam Smith, Kunal Talwar
-
We initiate an investigation of private sampling from distributions. Given a dataset with $n$ independent observations from an unknown distribution $P$, a sampling algorithm must output a single observation from a distribution that is close in total variation distance to $P$ while satisfying differential privacy. Sampling abstracts the goal of generating small amounts of realistic-looking data. We provide upper and lower bounds for the dataset size needed for this task for two natural families of distributions: arbitrary distributions on $\{1,\ldots ,k\}$ and product distributions on $\{0,1\}^d$. We demonstrate that, in some parameter regimes, private sampling requires asymptotically fewer observations than learning a description of $P$ nonprivately; in other regimes, however, sampling proves to be as difficult as private learning.
Satchit Sivakumar, Marika Swanberg, Sofya Raskhodnikova, Adam Smith
-
The shuffle model of differential privacy is a novel distributed privacy model based on a combination of local privacy mechanisms and a trusted shuffler. It has been shown that the additional randomisation provided by the shuffler improves privacy bounds compared to purely local mechanisms. Computing tight bounds, especially for multi-message protocols, is complicated by the randomness introduced by the shuffler. The recently proposed Fourier Accountant for evaluating $(\varepsilon,\delta)$-differential privacy guarantees has been shown to give tighter bounds than commonly used methods for non-adaptive compositions of various complex mechanisms. In this work we show how to compute tight privacy bounds using the Fourier Accountant for multi-message versions of several ubiquitous mechanisms in the shuffle model, and demonstrate the looseness of the existing bounds in the literature.
Antti Koskela, Mikko A Heikkilä, Antti Honkela
-

We present DP-HMC, a variant of Hamiltonian Monte Carlo (HMC) that is differentially private (DP). We use the penalty algorithm of Yildirim et al. to make the acceptance test private, and add Gaussian noise to the gradients of the target distribution to make the HMC proposal private. Our main contribution is showing that DP-HMC has the correct invariant distribution, and is ergodic. We also compare DP-HMC with the existing penalty algorithm, as well as DP-SGLD and DP-SGNHT.

Ossi Räisä, Antti Koskela, Antti Honkela
-
Classification is one of the most important tasks involving data. Traditional approaches for classification use the framework of empirical risk minimization on a convex surrogate loss. Due to the importance of this problem, differentially private empirical risk minimization has been the subject of intense research effort dating back to the influential work of Chaudhuri et al. In this short paper, we propose an alternate approach to differentially private classification, based on the $0-1$ loss function. In theory, this method offers a very appealing alternative under differential privacy requirements. We argue that differentially private classification is no harder than the canonical ``most common medical condition'' problem, which is easily solved by the exponential mechanism. In practice, this approach works remarkably well, although there are challenges to implementing it exactly for high-dimensional data.
Ryan McKenna
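The ``most common medical condition'' reduction above is an instance of the exponential mechanism over labels. A minimal sketch of that building block (not the paper's implementation; the function name and NumPy interface are illustrative assumptions):

```python
import numpy as np

def private_mode(counts, epsilon, rng=None):
    """Privately select the most frequent label via the exponential mechanism.

    counts[i] is the number of records with label i. The utility of label i
    is counts[i], with sensitivity 1 (changing one record moves one count by
    one), so label i is chosen with probability proportional to
    exp(epsilon * counts[i] / 2).
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = epsilon * np.asarray(counts, dtype=float) / 2.0
    probs = np.exp(scores - scores.max())   # subtract max for stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

With a clear mode, the correct label is returned with probability overwhelming in the count gap, even for small epsilon.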
-
We study stochastic convex optimization with heavy-tailed data under the constraint of differential privacy. Most prior work on this problem is restricted to the case where the loss function is Lipschitz. Instead, as introduced by Wang, Xiao, Devadas, and Xu~\cite{WangXDX20}, we study general convex loss functions with the assumption that the distribution of gradients has bounded $k$-th moments. We provide improved upper bounds on the excess population risk under approximate differential privacy of $\tilde{O}\left(\sqrt{\frac{d}{n}}+\left(\frac{d}{\epsilon n}\right)^{\frac{k-1}{k}}\right)$ and $\tilde{O}\left(\frac{d}{n}+\left(\frac{d}{\epsilon n}\right)^{\frac{2k-2}{k}}\right)$ for convex and strongly convex loss functions, respectively. We also prove nearly-matching lower bounds under the constraint of pure differential privacy, giving strong evidence that our bounds are tight.
Gautam Kamath, Xingtu Liu, Huanyu Zhang
-

N/A

Ryan McKenna, Hristo Paskov, Kunal Talwar
-

Private data analysis suffers a costly curse of dimensionality. However, the data often has an underlying low-dimensional structure. For example, when optimizing via gradient descent, the gradients often lie in or near a low-dimensional subspace. If that low-dimensional structure can be identified, then we can avoid paying (in terms of privacy or accuracy) for the high ambient dimension.

We present differentially private algorithms that take input data sampled from a low-dimensional linear subspace (possibly with a small amount of error) and output that subspace (or an approximation to it). These algorithms can serve as a pre-processing step for other procedures.
Vikrant Singhal, Thomas Steinke
-
Quantiles are often used for summarizing and understanding data. If that data is sensitive, it may be necessary to compute quantiles in a way that is differentially private, providing theoretical guarantees that the result does not reveal private information. However, in the common case where multiple quantiles are needed, existing differentially private algorithms scale poorly: they compute each quantile individually, splitting their privacy budget and thus decreasing accuracy. In this work we propose an instance of the exponential mechanism that simultaneously estimates $m$ quantiles from $n$ data points while guaranteeing differential privacy. The utility function is carefully structured to allow for an efficient implementation that returns estimates of all $m$ quantiles in time $O(mn\log(n) + m^2n)$. Experiments show that our method significantly outperforms the current state of the art on both real and synthetic data while remaining efficient enough to be practical.
Jennifer Gillenwater, Matthew Joseph, Alex Kulesza
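As background, the single-quantile exponential mechanism that this work extends to $m$ quantiles can be sketched as follows (in the spirit of Smith's private quantile estimator; the interface, clipping range, and log-space sampling are illustrative assumptions, not the authors' code):

```python
import numpy as np

def private_quantile(data, q, epsilon, lo, hi, rng=None):
    """Estimate a single q-quantile with the exponential mechanism.

    Each gap between consecutive (clipped, sorted) data points is scored by
    how close its rank is to the target rank q*n; a gap is sampled with
    probability proportional to exp(epsilon * score / 2) * gap_length, and
    a uniform point inside the chosen gap is returned.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    x = np.concatenate([[lo], np.clip(np.sort(data), lo, hi), [hi]])
    gaps = np.diff(x)                              # n + 1 gaps
    scores = -np.abs(np.arange(n + 1) - q * n)     # utility: rank closeness
    logw = epsilon * scores / 2.0 + np.log(np.maximum(gaps, 1e-12))
    probs = np.exp(logw - logw.max())
    probs /= probs.sum()
    i = rng.choice(n + 1, p=probs)
    return float(rng.uniform(x[i], x[i + 1]))
```

Running this $m$ times would split the privacy budget $m$ ways; the joint utility function in the paper is what avoids that cost.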
-
There has been much recent work in the shuffle model of differential privacy, particularly for approximate $d$-bin histograms. While these protocols achieve low error, the number of messages sent by each user---the message complexity---has so far scaled with $d$ or the privacy parameters. The message complexity is an informative predictor of a shuffle protocol's resource consumption. We present a protocol whose message complexity is two when there are sufficiently many users. The protocol essentially pairs each row in the dataset with a fake row and performs a simple randomization on all rows. We show that the error introduced by the protocol is small, using rigorous analysis as well as experiments on real-world data. We also prove that corrupt users have a relatively low impact on our protocol's estimates.
Albert Cheu
-
In shuffle privacy, each user sends a collection of randomized messages to a trusted shuffler, the shuffler randomly permutes these messages, and the resulting shuffled collection of messages must satisfy differential privacy. Prior work in this model has largely focused on protocols that use a single round of communication to compute algorithmic primitives like means, histograms, and counts. In this work, we present interactive shuffle protocols for stochastic convex optimization. Our optimization protocols rely on a new noninteractive protocol for summing vectors of bounded $\ell_2$ norm. By combining this sum subroutine with techniques including mini-batch stochastic gradient descent, accelerated gradient descent, and Nesterov's smoothing method, we obtain loss guarantees for a variety of convex loss functions that significantly improve on those of the local model and sometimes match those of the central model.
Albert Cheu, Matthew Joseph, Jieming Mao, Binghui Peng
-
We present two sample-efficient differentially private mean estimators for $d$-dimensional (sub)Gaussian distributions with unknown covariance. Informally, given $n \gtrsim d/\alpha^2$ samples from such a distribution with mean $\mu$ and covariance $\Sigma$, our estimators output $\tilde\mu$ such that $\| \tilde\mu - \mu \|_{\Sigma} \leq \alpha$, where $\| \cdot \|_{\Sigma}$ is the \emph{Mahalanobis distance}. All previous estimators with the same guarantee either require strong a priori bounds on the covariance matrix or require $\Omega(d^{3/2})$ samples. Each of our estimators is based on a simple, general approach to designing differentially private mechanisms, but with novel technical steps to make the estimator private and sample-efficient. Our first estimator samples a point with approximately maximum Tukey depth using the exponential mechanism, but restricted to the set of points of large Tukey depth. Proving that this mechanism is private requires a novel analysis. Our second estimator perturbs the empirical mean of the data set with noise calibrated to the empirical covariance. Only the mean is released, however; the covariance is only used internally. Its sample complexity guarantees hold more generally for subgaussian distributions, albeit with a slightly worse dependence on the privacy parameter. For both estimators, careful preprocessing of the data is required to satisfy differential privacy.
Gavin Brown, Marco Gaboardi, Adam Smith, Jonathan Ullman, Lydia Zakynthinou
-

There is growing evidence that complex neural networks memorize their training data, and that privacy attacks (e.g. membership inference) allow an adversary to recover that training data from the model. Differential privacy provides a defense against these attacks, but reduces accuracy. We present a new defense against privacy attacks in deep learning, inspired by differential privacy, that scales the noise added to gradients using immediate sensitivity---a novel approximation of the local sensitivity of a gradient calculation. Our empirical evaluation suggests that our approach produces higher accuracy for a desired level of privacy than gradient-clipping-based differentially private training.

Tim Stevens, David Darais, Ben U Gelman, David Slater, Joseph Near
-

We study the problem of differentially private (DP) matrix completion under user-level privacy. We design a joint differentially private variant of the popular Alternating-Least-Squares (ALS) method that achieves: i) (nearly) optimal sample complexity for matrix completion (in terms of number of items, users), and ii) the best known privacy/utility trade-off both theoretically and on benchmark data sets. In particular, we provide the first global convergence analysis of ALS with noise introduced to ensure DP, and show that, in comparison to the best known alternative (the Private Frank-Wolfe algorithm by Jain et al. (2018)), our error bounds scale significantly better with respect to the number of items and users, which is critical in practical problems. Extensive validation on standard benchmarks demonstrates that the algorithm, in combination with carefully designed sampling procedures, is significantly more accurate than existing techniques, thus promising to be the first practical DP embedding model.

Steve Chien, Prateek Jain, Walid Krichene, Steffen Rendle, Shuang Song, Abhradeep Guha Thakurta, Li Zhang
-

We formalize a notion of a privacy wrapper, defined as an algorithm that can take an arbitrary and untrusted script and produce an output with differential privacy guarantees. Our novel privacy wrapper design leverages subsets of the data, together with the option to halt, to overcome the issue of unknowable sensitivity.

Nitin Kohli, Paul Laskowski
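A classical baseline for handling unknowable sensitivity is subsample-and-aggregate: run the untrusted script on disjoint chunks of the data and privately aggregate its clamped outputs. The sketch below illustrates only this baseline, not the authors' wrapper (which additionally halts); the function name and clamping range are illustrative assumptions:

```python
import numpy as np

def subsample_and_aggregate(script, data, epsilon, chunks=10,
                            lo=0.0, hi=1.0, rng=None):
    """Run an arbitrary, untrusted `script` on disjoint subsets of the data
    and release a noisy mean of its clamped outputs.

    Clamping each chunk's output to [lo, hi] bounds the sensitivity of the
    mean at (hi - lo) / chunks no matter what `script` computes, so the
    Laplace noise below yields epsilon-DP for the released value.
    """
    rng = np.random.default_rng() if rng is None else rng
    outs = [float(script(c)) for c in np.array_split(np.asarray(data), chunks)]
    outs = np.clip(outs, lo, hi)
    sensitivity = (hi - lo) / chunks
    return float(outs.mean() + rng.laplace(0.0, sensitivity / epsilon))
```

The privacy guarantee holds even if `script` is adversarial, because a single individual affects only one chunk's (clamped) output.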
-

Gaussian processes (GPs) are non-parametric Bayesian models that are widely used for diverse prediction tasks. Previous work in adding strong privacy protection to GPs via differential privacy (DP) has been limited to protecting only the privacy of the prediction targets (model outputs) but not inputs. We break this limitation by introducing GPs with DP protection for both model inputs and outputs. We achieve this by using sparse GP methodology and publishing a private variational approximation on known inducing points. The approximation covariance is adjusted to approximately account for the added uncertainty from DP noise. The approximation can be used to compute arbitrary predictions using standard sparse GP techniques. We propose a method for hyperparameter learning using a private selection protocol applied to validation set log-likelihood. Our experiments demonstrate that given a sufficient amount of data, the method can produce accurate models under strong privacy protection.

Antti Honkela
-

In recent years, formal methods of privacy protection such as differential privacy (DP), capable of deployment to data-driven tasks such as machine learning (ML), have emerged. Reconciling large-scale ML with the closed-form reasoning required for the principled analysis of individual privacy loss requires the introduction of new tools for automatic sensitivity analysis and for tracking an individual's data and their features through the flow of computation. For this purpose, we introduce a novel hybrid automatic differentiation (AD) system which combines the efficiency of reverse-mode AD with an ability to obtain a closed-form expression for any given quantity in the computational graph. This enables modelling the sensitivity of arbitrary differentiable function compositions, such as the training of neural networks on private data. We demonstrate our approach by analysing the individual DP guarantees of statistical database queries. Moreover, we investigate the application of our technique to the training of DP neural networks. Our approach can enable the principled reasoning about privacy loss in the setting of data processing, and further the development of automatic sensitivity analysis and privacy budgeting systems.

Alexander Ziller, Dmitrii Usynin, Moritz Knolle, Kritika Prakash, Andrew Trask, Marcus Makowski, Rickmer Braren, Daniel Rueckert, Georgios Kaissis
-
The stochastic block model (SBM) and degree-corrected block model (DCBM) are network models often selected as the fundamental setting in which to analyze the theoretical properties of community detection methods. We consider the problem of spectral clustering of SBM and DCBM networks under a local form of edge differential privacy. Using a randomized response privacy mechanism called the edge-flip mechanism, we develop theoretical guarantees for differentially private community detection, demonstrating conditions under which this strong privacy guarantee can be upheld while achieving spectral clustering convergence rates that match the known rates without privacy. We prove the strongest theoretical results are achievable for dense networks (those with node degree linear in the number of nodes), while weak consistency is achievable under mild sparsity (node degree greater than $n^{1/2}$).
Jonathan Hehir, Aleksandra Slavković, Xiaoyue Niu
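The edge-flip mechanism referenced above is randomized response applied independently to each potential edge. A minimal sketch for a dense adjacency matrix (the interface is an assumption for illustration, not the authors' code):

```python
import numpy as np

def edge_flip(adj, epsilon, rng=None):
    """Randomized response on each potential edge of an undirected graph.

    Each upper-triangular entry is kept with probability
    e^epsilon / (e^epsilon + 1) and flipped otherwise, which satisfies
    epsilon-edge local differential privacy.
    """
    rng = np.random.default_rng() if rng is None else rng
    keep = 1.0 / (1.0 + np.exp(-epsilon))   # e^eps / (e^eps + 1), stably
    r, c = np.triu_indices(adj.shape[0], k=1)
    flips = rng.random(len(r)) >= keep
    out = adj.copy()
    out[r, c] = np.where(flips, 1 - adj[r, c], adj[r, c])
    out[c, r] = out[r, c]                   # keep the graph undirected
    np.fill_diagonal(out, 0)
    return out
```

Because the flip probability is known, downstream estimators (such as the spectral methods in the paper) can debias statistics computed from the privatized graph.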
-

All current approaches for statically enforcing differential privacy in higher order languages make use of either linear or relational refinement types. A barrier to adoption for these approaches is the lack of support for expressing these “fancy types” in mainstream programming languages. For example, no mainstream language supports relational refinement types, and although Rust and modern versions of Haskell both employ some linear typing techniques, they are inadequate for embedding enforcement of differential privacy, which requires “full” linear types à la Girard/Reynolds. We propose a new type system that enforces differential privacy, avoids the use of linear and relational refinement types, and can be easily embedded in mainstream richly typed programming languages such as Scala, OCaml and Haskell.

Chike Abuah, David Darais, Joseph Near
-

This article describes the proposed differentially private (DP) algorithms that the US Census Bureau will use to release the Detailed Demographic and Housing Characteristics (DHC) Race & Ethnicity tabulations as part of the 2020 Decennial Census. The tabulations contain statistics (counts) of demographic and housing characteristics of the entire population of the US crossed with detailed races and tribes at varying levels of geography. We describe two differentially private algorithmic strategies, one based on adding noise drawn from a two-sided Geometric distribution that satisfies "pure"-DP, and another based on adding noise from a Discrete Gaussian distribution that satisfies a well-studied variant of differential privacy, called Zero Concentrated Differential Privacy (zCDP). We analytically estimate the privacy loss parameters ensured by the two algorithms for comparable levels of error introduced in the statistics.

Samuel Haney, William Sexton, Ashwin Machanavajjhala, Michael Hay, Gerome Miklau
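The first ("pure"-DP) strategy above can be sketched with two-sided geometric noise, sampled here as the difference of two geometric variables (a standard construction; the function name and interface are illustrative assumptions, not the Bureau's implementation):

```python
import numpy as np

def two_sided_geometric(counts, epsilon, rng=None):
    """Add two-sided geometric noise to integer counts, giving pure
    epsilon-DP for sensitivity-1 count queries.

    The noise Z satisfies P(Z = z) proportional to exp(-epsilon * |z|);
    it is sampled as the difference of two i.i.d. geometric variables
    with success probability 1 - exp(-epsilon), so the released values
    stay integer-valued.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts)
    p = 1.0 - np.exp(-epsilon)
    noise = (rng.geometric(p, size=counts.shape)
             - rng.geometric(p, size=counts.shape))
    return counts + noise
```

The discrete Gaussian strategy for zCDP replaces this noise distribution with one proportional to $\exp(-z^2/2\sigma^2)$ over the integers.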
-

Many agencies release datasets and statistics about groups of individuals that are used as input to a number of critical decision processes. To conform with privacy requirements, these agencies are often required to release privacy-preserving versions of the data. This paper studies the release of differentially private datasets and analyzes their impact on some critical resource allocation tasks under a fairness perspective. The paper shows that, when the decisions take as input differentially private data, the noise added to achieve privacy disproportionately impacts some groups over others. The paper analyzes the reasons for these disproportionate impacts and proposes guidelines to mitigate these effects in the full version of this work. The proposed approaches are evaluated on critical decision problems that use differentially private census data.

Cuong Tran, Ferdinando Fioretto
-
The recently proposed Fast Fourier Transform (FFT)-based accountant for evaluating $(\varepsilon,\delta)$-differential privacy guarantees using the privacy loss distribution formalism has been shown to give tighter bounds than commonly used methods such as Rényi accountants when applied to homogeneous compositions, i.e., to compositions of identical mechanisms. In this work, we extend this approach to heterogeneous compositions. We carry out a full error analysis that allows choosing the parameters of the algorithm such that a desired accuracy is obtained. The analysis also extends previous results by taking into account all the parameters of the algorithm. We also show how to speed up the evaluation of tight privacy guarantees using the Plancherel theorem at the cost of increased pre-computation and memory usage.
Antti Koskela, Antti Honkela
-

We propose a new framework of synthesizing data using deep generative models in a differentially private manner. Within our framework, sensitive data are sanitized with rigorous privacy guarantees in a one-shot fashion, such that the training of deep generative models is possible without re-using the original data. Hence, no extra privacy costs or model constraints are incurred, in contrast to popular approaches such as Differentially Private Stochastic Gradient Descent (DP-SGD), which, among other issues, causes degradation in privacy guarantees as the training iteration increases. We demonstrate a realization of our framework by making use of the characteristic function and an adversarial re-weighting objective, which are of independent interest as well. Our proposal has theoretical guarantees of performance, and empirical evaluations on multiple datasets show that our approach outperforms other methods at reasonable levels of privacy.

Seng Pei Liew, Tsubasa Takahashi, Michihiko Ueno
-

We show that differentially private stochastic gradient descent (DP-SGD) can yield poorly calibrated, overconfident deep learning models. This represents a serious issue for safety-critical applications, e.g. in medical diagnosis. We highlight and exploit parallels between stochastic gradient Langevin dynamics, a scalable Bayesian inference technique for training deep neural networks, and DP-SGD, in order to train differentially private, Bayesian neural networks with minor adjustments to the original (DP-SGD) algorithm. Our approach provides considerably more reliable uncertainty estimates than DP-SGD, as demonstrated empirically by a reduction in expected calibration error (MNIST: ∼5-fold, Pediatric Pneumonia Dataset: ∼2-fold).

Moritz Knolle, Alexander Ziller, Dmitrii Usynin, Rickmer Braren, Marcus Makowski, Daniel Rueckert, Georgios Kaissis
-
We provide a lower bound on the sample complexity of distribution-free parity learning in the realizable case in the shuffle model of differential privacy. Namely, we show that the sample complexity of learning $d$-bit parity functions is $\Omega(2^{d/2})$. Our result extends a recent similar lower bound on the sample complexity of private agnostic learning of parity functions in the shuffle model by Cheu and Ullman. We also sketch a simple shuffle model protocol demonstrating that our results are tight up to $\mbox{poly}(d)$ factors.
Kobbi Nissim, Chao Yan
-
We consider the problem of learning multivariate Gaussians under the constraint of approximate differential privacy. We prove that $\widetilde{O}(k^2 d \log^{3/2}(1/\delta) / \alpha^2 \varepsilon)$ samples are sufficient to learn a mixture of $k$ axis-aligned Gaussians in $\mathbb{R}^d$ to within total variation distance $\alpha$ while satisfying $(\varepsilon, \delta)$-differential privacy. This is the first result for privately learning mixtures of unbounded axis-aligned (or even unbounded univariate) Gaussians. If the covariance matrix of each of the Gaussians is the identity matrix, we show that $\widetilde{O}(kd/\alpha^2 + kd \log(1/\delta) / \alpha \varepsilon)$ samples are sufficient. To prove our results, we design a new technique for privately learning mixture distributions. A class of distributions $\mathcal{F}$ is said to be list-decodable if there is an algorithm that, given "heavily corrupted" samples from $f \in \mathcal{F}$, outputs a list of distributions, $\widehat{\mathcal{F}}$, such that one of the distributions in $\widehat{\mathcal{F}}$ approximates $f$. We show that if $\mathcal{F}$ is privately list-decodable then we can learn mixtures of distributions in $\mathcal{F}$. Finally, we show axis-aligned Gaussian distributions are privately list-decodable, thereby proving mixtures of such distributions are privately learnable.
Ishaq Aden-Ali, Hassan Ashtiani, Christopher Liaw
-

Releasing temporal aggregate signals with rigorous privacy guarantees remains tricky despite their wide applications and societal impact. The main difficulty lies in their sensitivity, which scales linearly with the signal length. We show that one can reduce the sensitivity by subsampling in the time domain under reasonable assumptions. Based on this analysis, we propose a differentially private algorithm that utilizes signal subsampling and filtering. We demonstrate the utility gain of our algorithm empirically on real and synthetic signals.

Tatsuki Koga, Casey M Meehan, Kamalika Chaudhuri
-

Property inference attacks reveal statistical properties of a training set, but they are difficult to distinguish from the primary purpose of statistical machine learning, which is to produce models that capture statistical properties of a distribution. Motivated by Yeom et al.'s membership inference framework, we propose a formal and generic definition of property inference attacks. The proposed notion describes attacks that can distinguish between possible training distributions, extending beyond previous property inference attacks that infer the ratio of a particular type of data in the training data set. In this paper, we show how our definition captures previous property inference attacks as well as a new attack that reveals the average degree of the nodes of a training graph, and we report on experiments giving insight into the potential risks of property inference attacks.

Anshuman Suri, David Evans
-
We generalize the continuous observation privacy setting from Dwork et al. '10 and Chan et al. '11 by allowing each event in a stream to be a subset of some (possibly unknown) universe of items. We design differentially private (DP) algorithms for histograms in several settings, including top-$k$ selection, with privacy loss that scales with polylog$(T)$, where $T$ is the maximum length of the input stream. We present a meta-algorithm that can use existing one-shot top-$k$ private algorithms as a subroutine to continuously release DP histograms from a stream. Further, we present more practical DP algorithms for two settings: 1) continuously releasing the top-$k$ counts from a histogram over a known domain when an event can consist of an arbitrary number of items, and 2) continuously releasing histograms over an unknown domain when an event has a limited number of items.
Adrian Rivera Cardoso, Ryan Rogers
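As background for the continual-observation setting, here is a compact sketch of the binary-tree counter of Dwork et al. '10 and Chan et al. '11, the single-count primitive that the histogram algorithms above generalize (the class interface and budget split are illustrative assumptions):

```python
import numpy as np

class BinaryMechanism:
    """Continual counting over a 0/1 stream of length at most T with
    additive error polylogarithmic in T under epsilon-DP.

    The running count is maintained as a sum of dyadic partial sums
    ("p-sums"); each stream item touches at most `levels` p-sums, so
    each p-sum receives epsilon / levels of the budget.
    """

    def __init__(self, T, epsilon, rng=None):
        self.rng = np.random.default_rng() if rng is None else rng
        self.levels = max(1, int(np.ceil(np.log2(T + 1))))
        self.scale = self.levels / epsilon   # Laplace scale per p-sum
        self.alpha = np.zeros(self.levels)   # exact p-sums in progress
        self.noisy = np.zeros(self.levels)   # released noisy p-sums
        self.t = 0

    def step(self, x):
        """Consume one item x in {0, 1}; return the noisy count so far."""
        self.t += 1
        i = (self.t & -self.t).bit_length() - 1    # lowest set bit of t
        self.alpha[i] = self.alpha[:i].sum() + x   # close p-sums below level i
        self.alpha[:i] = 0.0
        self.noisy[i] = self.alpha[i] + self.rng.laplace(0.0, self.scale)
        # The count at time t is the sum of noisy p-sums on the set bits of t.
        return float(sum(self.noisy[j] for j in range(self.levels)
                         if (self.t >> j) & 1))
```

Each released prefix count combines at most $\log T$ noisy p-sums, which is where the polylog$(T)$ error scaling comes from.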
-
Despite recent widespread deployment of differential privacy, relatively little is known about what users think of differential privacy. In this work, we seek to explore users' privacy expectations related to differential privacy. Specifically, we investigate (1) whether users care about the protections afforded by differential privacy, and (2) whether they are therefore more willing to share their data with differentially private systems. Further, we attempt to understand (3) users' privacy expectations of the differentially private systems they may encounter in practice and (4) their willingness to share data in such systems. To answer these questions, we use a series of rigorously conducted surveys ($n=2424$). We find that users care about the kinds of information leaks against which differential privacy protects and are more willing to share their private information when the risks of these leaks are less likely to happen. Additionally, we find that the ways in which differential privacy is described in the wild haphazardly set users' privacy expectations, which can be misleading depending on the deployment. We synthesize our results into a framework for understanding a user's willingness to share information with differentially private systems, which takes into account the interaction between the user's prior privacy concerns and how differential privacy is described to them.
Gabriel Kaptchuk, Rachel Cummings, Elissa M Redmiles
-

Ensuring the privacy of users whose data are used to train Natural Language Processing (NLP) models is necessary to build and maintain customer trust. Differential Privacy (DP) has emerged as the most successful method to protect the privacy of individuals.

However, applying DP to the NLP domain comes with unique challenges. The most successful previous methods use a generalization of DP for metric spaces, and apply the privatization by adding noise to inputs in the metric space of word embeddings. However, these methods assume that one specific distance measure is being used, ignore the density of the space around the input, and assume the embeddings used have been trained on non-sensitive data.

In this work we propose the Truncated Exponential Mechanism (TEM), a general method that allows the privatization of words using any distance metric, on embeddings that can be trained on sensitive data. Our method makes use of the exponential mechanism to turn the privatization step into a selection problem. This allows the noise applied to be calibrated to the density of the embedding space around the input, and makes domain adaptation possible for the embeddings. In our experiments, we demonstrate that our method significantly outperforms the state-of-the-art in terms of utility for the same level of privacy, while providing more flexibility in the metric selection.

Ricardo Silva Carvalho, Theodore Vasiloudis, Seyi Feyisetan
-
We study differentially private graph algorithms under continual observation, i.e., for dynamic graph problems. The input is a sequence of graphs that results from node or edge updates, that is, insertions or deletions. We consider both non-local problems and local problems: problems are non-local if their value cannot be derived from the frequency histogram of constant-size subgraphs, and local otherwise. The only prior work on differentially private dynamic algorithms is an algorithm by Song et al.~\cite{song18} for various graph problems such as counting high-degree nodes, triangles and other constant-size subgraphs. We present an algorithm for these local problems that improves the additive error by a factor of $\sqrt{T}/\log^{3/2}T$, where $T$ is the length of the graph sequence. Furthermore, we introduce the first differentially private algorithms for non-local graph problems in the partially dynamic setting, i.e., where only either insertions or deletions are allowed. These problems include the global and $s,t$-minimum cut, the cost of a minimum spanning tree and the density of the densest subgraph. We complement our upper bounds for these problems by giving some lower bounds on the additive error of any differentially private dynamic graph algorithm. By a reduction from differentially private counting we can show lower bounds of $\Omega(W \log T)$, resp. $\Omega(\log T)$, for the problems we study, where $W$ is the maximum edge weight. All these results apply to the event-level, where adjacent graph sequences differ in a single insertion or deletion. For the more challenging notion of user-level privacy, where adjacent graph sequences differ in any number of updates to a single edge or node, we show strong lower bounds that are linear in the maximum function value.
Hendrik Fichtenberger, Monika Henzinger, Wolfgang Ost
-

Despite intense interest and considerable effort, the current generation of neural networks suffers a significant loss of accuracy under most practically relevant privacy-preserving training regimes. One particularly challenging class of neural networks is wide models, such as those deployed for NLP or recommender systems.

Observing that these models share something in common---an embedding layer that reduces the dimensionality of the input---we focus on developing a general approach towards training these models that takes advantage of the sparsity of the gradients. We propose a novel algorithm for privately training neural networks. Furthermore, we provide an empirical study of a DP wide neural network on a real-world dataset, which has rarely been explored in previous work.

Huanyu Zhang, Ilya Mironov, Meisam Hejazinia
-
[ Visit Poster at Spot C3 in Virtual World ]
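As background for the abstract above: the standard DP-SGD aggregation step clips each example's gradient, sums, and adds Gaussian noise. The sketch below is the generic dense recipe, not the authors' sparsity-aware algorithm; the function name and parameters are illustrative. For embedding layers, each example's gradient touches only a few rows, which is the sparsity the paper exploits.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier):
    """One DP-SGD aggregation step: clip each per-example gradient to
    L2 norm <= clip_norm, sum, and add Gaussian noise with standard
    deviation clip_norm * noise_multiplier."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(scale=clip_norm * noise_multiplier,
                       size=total.shape)
    return (total + noise) / len(per_example_grads)
```

For a wide embedding layer the dense noise vector dominates the cost, which motivates methods that account for gradient sparsity instead of noising every coordinate naively.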

We describe the privatization method used in reporting labor market insights from LinkedIn's Economic Graph, including the differentially private algorithms used to protect members' privacy. The reports show the top employers, as well as the top jobs and skills, in a given country/region and industry. We hope this data will help governments and citizens track labor market trends during the COVID-19 pandemic while also protecting the privacy of our members.

Ryan Rogers, Adrian Rivera Cardoso, Koray Mancuhan, Akash Kaura, Nikhil Gahlawat, Neha Jain, Paul Ko, Parvez Ahammad
-
[ Visit Poster at Spot A0 in Virtual World ]
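A "top employers / top skills" report of the kind described above can be illustrated with a generic noisy top-k sketch: perturb every count with Laplace noise and release the k largest items. This is a simplified illustration only; the function name is assumed, the noise scale here follows a simple composition argument, and LinkedIn's production calibration and analysis differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_top_k(counts, k, eps):
    """Release the k items with the largest Laplace-noised counts.

    Assumes each member changes any item's count by at most one;
    Laplace(k / eps) noise per count with a simple composition
    argument gives an eps-DP release of the k selected items."""
    noisy = {item: c + rng.laplace(scale=k / eps)
             for item, c in counts.items()}
    return sorted(noisy, key=noisy.get, reverse=True)[:k]
```

When the true counts are well separated, the noisy ranking matches the true ranking with high probability, so utility degrades only for near-ties.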

CDC WONDER is a web-based tool for the dissemination of epidemiologic data such as the number of deaths stratified by geographic region, year, demographic variables, and specific causes of death. Motivated by the lack of formal privacy protections on CDC WONDER, recent work has proposed the creation of a Synthetic CDC WONDER based on a differentially private Poisson-gamma modeling framework, which samples values from the posterior predictive distribution obtained by modeling event count data with a Poisson likelihood and a gamma prior on the underlying event rate. Unlike more traditional approaches for generating differentially private synthetic data, the Poisson-gamma framework incorporates (and relies on) publicly available information, such as estimates of the underlying population sizes and event rates, to improve its utility, and protects the sensitive data by increasing the informativeness of the prior distribution. While the Poisson-gamma framework has been shown to produce synthetic data with high utility, its performance has yet to be compared to more conventional techniques for satisfying differential privacy, such as those that rely on output perturbation. The goal of this work is to compare the Poisson-gamma framework and the Laplace mechanism for generating a synthetic dataset comprising the 26,000 cancer-related deaths in Pennsylvania counties from 1980. We demonstrate that while the Poisson-gamma framework preserves inference on the urban/rural disparity in death rates, the Laplace mechanism, when forced to produce non-negative values, produces synthetic data that fail to preserve both the magnitude of the disparity and its direction.

Harrison Quick, Kyle Chen, David DeLara
-
[ Visit Poster at Spot B6 in Virtual World ]

Differential privacy provides strong privacy guarantees for machine learning applications. Much recent work has focused on developing differentially private models; however, there has been a gap in other stages of the machine learning pipeline, in particular the preprocessing phase. Our contributions are twofold: we adapt a privacy-violation detection framework based on statistical methods to empirically measure the privacy levels of machine learning pipelines, and we apply the new framework to show that resampling techniques used to handle imbalanced datasets cause the resulting model to leak more privacy. These results highlight the need for private preprocessing techniques.

Ashly Lau, Jonathan Passerat-Palmbach
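The leakage mechanism measured above can be seen in miniature: naive random oversampling duplicates real minority records, so a single record appears multiple times in the training set and its influence on the trained model is multiplied. The helper below is a hypothetical illustration, not one of the techniques evaluated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y, minority_label):
    """Naive random oversampling: duplicate minority examples (with
    replacement) until the classes are balanced. Returns the resampled
    data and the maximum multiplicity of any original record, which is
    what inflates a per-record privacy analysis."""
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    extra = rng.choice(minority, size=len(majority) - len(minority))
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx], np.bincount(idx).max()
```

Under a sensitivity-style argument, a record duplicated $m$ times behaves like a group of $m$ records, so any per-record privacy guarantee degrades by roughly that factor, consistent with the empirical leakage the paper reports.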