Workshop

Economics of privacy and data labor

Nikolaos Vasiloglou, Rachel Cummings, Glen Weyl, Paris Koutris, Meg Young, Ruoxi Jia, David Dao, Bo Waggoner

Keywords: Economics, Market Design, Auctions, Privacy, Data Labor, Data Pricing, Data Valuation, Data Markets, Data Exchanges

Abstract:

Although data is considered the "new oil," it is very hard to price. Raw data has been invaluable in sectors such as advertising and healthcare, but often at the cost of people's privacy. Labeled data has also been extremely valuable for training machine learning models (for example, in the driverless-car industry), as indicated by the growth of annotation companies such as Figure Eight and Scale AI, especially in the image space. Yet it is not clear what the right pricing is for the data workers who annotate data, or for the individuals who contribute their personal data while using digital services. In the latter case, it is especially unclear how the value of the services offered compares to that of the private data exchanged. While the first data marketplaces have appeared, such as AWS Data Exchange, Narrative.io, and nitrogen.ai, they suffer from a lack of good pricing models. They also fail to preserve the right of data owners to define how their own data will be used. There have been numerous proposals for sharing data while maintaining privacy, such as training generative models that preserve the statistics of the original data.

Schedule

Sat 7:00 a.m. - 7:15 a.m.

We study differentially private mean estimation in a high-dimensional setting. Existing differential privacy techniques applied to large dimensions lead to computationally intractable problems or estimators with excessive privacy loss. Recent work in high-dimensional robust statistics has identified computationally tractable mean estimation algorithms with asymptotic dimension-independent error guarantees. We incorporate these results to develop a strict bound on the global sensitivity of the robust mean estimator. This yields a computationally tractable algorithm for differentially private mean estimation in high dimensions with dimension-independent privacy loss. Finally, we show on synthetic data that our algorithm significantly outperforms classic differential privacy methods, overcoming barriers to high-dimensional differential privacy.
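The baseline this talk improves on can be made concrete: the classic Gaussian mechanism clips each record's L2 norm and adds isotropic noise calibrated to the clipped sensitivity, so the L2 error of the estimate grows with the dimension. A minimal sketch, assuming standard (ε, δ)-DP calibration; the function name and parameters are illustrative, not from the paper:

```python
import numpy as np

def dp_mean_gaussian(X, clip_norm, epsilon, delta):
    """Classic (epsilon, delta)-DP mean estimate: L2 clipping + Gaussian noise.

    Each row of X is one individual's record. Clipping bounds the L2
    sensitivity of the mean under replace-one neighbors by 2*clip_norm/n.
    """
    n, d = X.shape
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    clipped = X * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    sensitivity = 2.0 * clip_norm / n
    # Standard Gaussian-mechanism noise scale for (epsilon, delta)-DP.
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return clipped.mean(axis=0) + np.random.normal(0.0, sigma, size=d)
```

The per-coordinate noise scale is dimension-free, but the total L2 error of the noise vector grows like sqrt(d), which is exactly the dimension dependence the robust-statistics approach above removes.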

Sat 7:15 a.m. - 7:30 a.m.

Recent advances in generating synthetic data that allow principled privacy protections -- such as differential privacy -- to be added are a crucial step toward sharing statistical information in a privacy-preserving way. But while the focus has been on privacy guarantees, the resulting private synthetic data is only useful if it still carries statistical information from the original data. To further optimise the inherent trade-off between data privacy and data quality, it is necessary to think closely about the latter. What is it that data analysts want? Acknowledging that data quality is a subjective concept, we develop a framework to evaluate the quality of differentially private synthetic data from an applied researcher's perspective. Data quality can be measured along two dimensions. First, the quality of synthetic data can be evaluated against the training data or against an underlying population. Second, the quality of synthetic data depends on general similarity of distributions or on performance for specific tasks such as inference or prediction. It is clear that accommodating all goals at once is a formidable challenge. We invite the academic community to jointly advance the privacy-quality frontier.
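The second dimension above can be illustrated with two toy metrics: a general distribution-similarity check and a task-specific "train on synthetic, test on real" utility gap. This is a minimal sketch with illustrative stand-in metrics, not the paper's actual framework:

```python
import numpy as np

def fidelity_gap(real, synth):
    """General distributional similarity: a crude per-feature
    comparison of means and standard deviations (0 = identical)."""
    return float(np.abs(real.mean(0) - synth.mean(0)).max()
                 + np.abs(real.std(0) - synth.std(0)).max())

def utility_gap(real_X, real_y, synth_X, synth_y):
    """Task-specific quality: fit a least-squares model on synthetic data,
    evaluate on real data, and compare with training on the real data."""
    def fit(X, y):
        return np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)[0]
    def mse(w, X, y):
        return float(((np.c_[X, np.ones(len(X))] @ w - y) ** 2).mean())
    return (mse(fit(synth_X, synth_y), real_X, real_y)
            - mse(fit(real_X, real_y), real_X, real_y))
```

Good synthetic data can score well on one metric and badly on the other, which is why the two dimensions have to be evaluated separately.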

Sat 7:30 a.m. - 7:45 a.m.

Sat 7:45 a.m. - 8:00 a.m.
Break
Sat 8:00 a.m. - 9:00 a.m.
Buying data over time by Nicole Immorlica (Invited Talk)
Sat 9:00 a.m. - 9:15 a.m.

We study the secure stochastic convex optimization problem: a learner aims to find the optimal point of a convex function by sequentially querying a (stochastic) gradient oracle, while an adversary aims to free-ride and infer the learner's outcome by observing the learner's queries. The adversary observes only the points of the queries, not the feedback from the oracle. The goal of the learner is to optimize accuracy, i.e., obtain an accurate estimate of the optimal point, while securing her privacy, i.e., making it difficult for the adversary to infer the optimal point. We formally quantify this tradeoff between the learner's accuracy and privacy and characterize lower and upper bounds on the learner's query complexity as a function of the desired levels of accuracy and privacy. For the lower bounds, we provide a general template based on information-theoretic analysis and then tailor it to several families of problems, including stochastic convex optimization and (noisy) binary search. We also present a generic secure learning protocol that achieves the matching upper bound up to logarithmic factors.
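The basic currency of this tradeoff -- extra queries buy ambiguity about the learner's trajectory -- can be illustrated with a toy scheme. This is NOT the talk's protocol (a real protocol must also make decoy and real queries statistically indistinguishable); it is only a sketch of why the adversary's view, which contains query points but no oracle feedback, becomes less informative as the learner spends more queries:

```python
import random

def secure_gd(grad_oracle, x0, lr=0.1, steps=100, n_decoys=4, seed=0):
    """Toy sketch: interleave real gradient-descent queries with decoy
    trajectories taking random-walk steps of comparable size. The adversary
    sees every queried point but not the oracle feedback, so each of the
    n_decoys + 1 interleaved trajectories could a priori be the real one."""
    rng = random.Random(seed)
    real_idx = rng.randrange(n_decoys + 1)
    points = [x0 + rng.uniform(-5.0, 5.0) for _ in range(n_decoys + 1)]
    points[real_idx] = x0
    queries = []                      # the adversary's entire view
    for _ in range(steps):
        for i in range(n_decoys + 1):
            queries.append(points[i])
            if i == real_idx:
                points[i] -= lr * grad_oracle(points[i])   # true descent step
            else:
                points[i] -= lr * rng.gauss(0.0, 1.0)      # decoy step
    return points[real_idx], queries
```

Accuracy is unchanged (the real trajectory is ordinary gradient descent), but the query count is multiplied by n_decoys + 1 -- a crude version of the accuracy-privacy-query-complexity tradeoff the talk characterizes tightly.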

Sat 9:15 a.m. - 9:30 a.m.

Recommender systems are an essential part of any e-commerce platform. Recommendations are typically generated by aggregating large amounts of user data. A malicious actor may be motivated to sway the output of such recommender systems by injecting malicious datapoints to leverage the system for financial gain. In this work, we propose a semi-supervised attack detection algorithm to identify the malicious datapoints. We do this by leveraging a portion of the dataset that has a lower chance of being polluted to learn the distribution of genuine datapoints. Our proposed approach modifies the Generative Adversarial Network architecture to take into account the contextual information from user activity. This allows the model to distinguish legitimate datapoints from the injected ones.
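The detection pipeline can be sketched generically: fit a model of genuine behavior on the trusted (low-pollution) subset, then flag datapoints that score as unlikely under it. The paper's model is a modified, context-aware GAN; the diagonal Gaussian below is only a simple stand-in for that learned density, and all names are illustrative:

```python
import numpy as np

def fit_genuine_model(trusted):
    """Learn the distribution of genuine datapoints from the low-pollution
    subset. (Stand-in for the paper's context-aware GAN: a diagonal Gaussian.)"""
    return trusted.mean(axis=0), trusted.std(axis=0) + 1e-9

def anomaly_scores(model, X):
    """Score each datapoint; a high score means unlike the genuine data."""
    mu, sd = model
    return (((X - mu) / sd) ** 2).sum(axis=1)

def flag_injected(trusted, X, quantile=0.99):
    """Flag datapoints scoring above the trusted subset's score quantile."""
    model = fit_genuine_model(trusted)
    threshold = np.quantile(anomaly_scores(model, trusted), quantile)
    return anomaly_scores(model, X) > threshold
```

The semi-supervised assumption does the work here: the threshold is calibrated only on data with a low chance of being polluted, so injected datapoints never influence what "genuine" looks like.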

Sat 9:30 a.m. - 9:45 a.m.

While many solutions for privacy-preserving convex empirical risk minimization (ERM) have been developed, privacy-preserving nonconvex ERM remains a challenge. We study nonconvex ERM, which takes the form of minimizing a finite-sum of nonconvex loss functions over a training set. We propose a new differentially private stochastic gradient descent algorithm for nonconvex ERM that achieves strong privacy guarantees efficiently, and provide a tight analysis of its privacy and utility guarantees, as well as its gradient complexity. Our algorithm substantially reduces gradient complexity while matching the best previous utility guarantee given by Wang et al.\ (NeurIPS 2017). Our experiments on benchmark nonconvex ERM problems demonstrate superior performance in terms of both training cost and utility gains compared with previous differentially private methods using the same privacy budgets.
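The common skeleton of differentially private SGD -- per-example gradient clipping followed by Gaussian noise on the batch sum -- can be sketched as follows. This is the generic Abadi-et-al.-style recipe, not the specific algorithm of the talk, and the hyperparameters are illustrative; the privacy budget actually spent depends on the noise multiplier, sampling rate, and step count via a moments-accountant analysis:

```python
import numpy as np

def dp_sgd(per_example_grad, w0, n, clip=1.0, noise_mult=1.1,
           lr=0.1, steps=200, batch=32, seed=0):
    """DP-SGD sketch: clip each per-example gradient to L2 norm `clip`,
    then add Gaussian noise of scale noise_mult * clip to the batch sum."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        idx = rng.choice(n, size=batch, replace=False)
        g = per_example_grad(w, idx)                       # shape (batch, d)
        norms = np.linalg.norm(g, axis=1, keepdims=True)
        g = g * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
        noisy_sum = g.sum(axis=0) + rng.normal(0.0, noise_mult * clip,
                                               size=w.shape)
        w -= lr * noisy_sum / batch
    return w
```

Clipping bounds each individual's influence on the update, which is what makes the Gaussian noise sufficient for a differential privacy guarantee even when the loss is nonconvex.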

Sat 9:45 a.m. - 10:30 a.m.
Break
Sat 10:30 a.m. - 10:45 a.m.

We demonstrate how privacy law interacts with competition and trade policy in the context of the European General Data Protection Regulation (GDPR). We follow more than 110,000 websites for 18 months to show that websites reduced their connections to web technology providers after GDPR became effective, especially regarding requests involving personal data. This also holds for websites catering to non-EU audiences and therefore not bound by GDPR. We further document an increase in market concentration in web technology services after the introduction of GDPR. While most firms lose market share, the leading firm, Google, significantly increases market share.

Sat 10:45 a.m. - 11:00 a.m.

Prediction APIs offered for a fee are a fast-growing industry and an important part of machine learning as a service. While many such services are available, the heterogeneity in their price and performance makes it challenging for users to decide which API or combination of APIs to use for their own data and budget. In this paper, we take a first step towards addressing this challenge by proposing FrugalML, a principled framework that jointly learns the strengths and weaknesses of each API on different data, and performs an efficient optimization to automatically identify the best sequential strategy to adaptively use the available APIs within a budget constraint. Preliminary experiments using ML APIs from Google, Microsoft and Face++ for a facial emotion recognition task show that FrugalML typically leads to more than 50% cost reduction while matching the accuracy of the best single API.
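The core of a sequential strategy like this is simple to sketch: call a cheap API first and pay for a stronger one only when the cheap prediction is unconvincing. In the paper, the fallback threshold is label-dependent and chosen by optimizing accuracy under the budget constraint; the version below is a heavily simplified sketch with stand-in API names:

```python
def frugal_predict(x, cheap_api, strong_api, threshold=0.8):
    """Sequential API strategy (simplified FrugalML-style): each API returns
    (label, confidence, cost); fall back to the stronger, costlier API only
    when the cheap prediction's confidence is below the threshold."""
    label, confidence, cost = cheap_api(x)
    if confidence >= threshold:
        return label, cost
    strong_label, _, strong_cost = strong_api(x)
    return strong_label, cost + strong_cost
```

When most inputs are easy, most queries stop at the cheap API, which is where the reported cost reductions come from.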

Sat 11:00 a.m. - 11:15 a.m.

Collection and sale of personal data is a common and economically rewarding activity. However, the contractual model of notice and consent that governs this activity under U.S. law relies on an assumption that personal data can and does function as a market good. This paper presents experimental evidence of a conflict between the market nature of personal data assumed by many legal frameworks and the conceptual categorization of personal data transactions by the ordinary people putatively protected by notice and consent legal frameworks. I present two online vignette studies that repurpose designs from the taboo trade-offs literature and suggest that protection of personal data rises to the level of a sacred value.

Sat 11:15 a.m. - 11:30 a.m.
Break
Sat 11:30 a.m. - 12:30 p.m.

Data are interpersonally relational rather than atomistically personal or universally objective. Yet the group of people to which a datum pertains differs across all the data pertaining to any one person, so every person sits at the intersection of a diversity of data collectives. A data structure that represents this, as well as interpersonal relationships of trust, has the potential to add a trust layer to internet-type structures, to allow verification, at higher levels of trust, of a far wider range of data than current data structures permit, and eventually to enable political economies far more sophisticated than even those currently considered innovative (such as those advocated by organizations like RadicalxChange).
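One possible way to make the idea concrete is a structure in which each datum carries the group of people it pertains to, alongside a graph of pairwise trust. This is only an illustrative sketch of the abstract's premise -- the class and rule below (verification bounded by the weakest trust link) are hypothetical, not from the talk:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Datum:
    """A datum pertains to a group of people, not to a single atomic owner."""
    content: str
    pertains_to: frozenset

@dataclass
class TrustGraph:
    """Directed pairwise trust levels in [0, 1] between people."""
    trust: dict = field(default_factory=dict)

    def add_trust(self, a, b, level):
        self.trust[(a, b)] = level

    def verification_level(self, verifier, datum):
        # Hypothetical rule: a datum is verifiable only as strongly as the
        # verifier's weakest trust link to the people it pertains to.
        return min((self.trust.get((verifier, p), 0.0)
                    for p in datum.pertains_to), default=0.0)
```

Because each datum names its own collective, every person ends up appearing in many overlapping `pertains_to` sets -- the "intersection of data collectives" described above.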