Workshop

Machine Learning for Data: Automated Creation, Privacy, Bias

Zhiting Hu, Li Erran Li, Willie Neiswanger, Benedikt Boecking, Yi Xu, Belinda Zeng

Abstract:

As the use of machine learning (ML) becomes ubiquitous, there is a growing understanding and appreciation of the role that data plays in building successful ML solutions. Classical ML research has primarily focused on learning algorithms and their guarantees. Recent progress has shown that data plays an increasingly central role in creating ML solutions, such as the massive text data used for training powerful language models, (semi-)automatic engineering of weak supervision data that enables applications in few-label settings, and various data augmentation and manipulation techniques that lead to performance boosts on many real-world tasks. On the other hand, data is one of the main sources of security, privacy, and bias issues in deploying ML solutions in the real world. This workshop will focus on the new perspective of machine learning for data, specifically how ML techniques can be used to facilitate and automate a range of data operations (e.g. ML-assisted labeling, synthesis, selection, augmentation), and the associated challenges of quality, security, privacy and fairness, for which ML techniques can also enable solutions.

Schedule

Fri 8:00 a.m. - 8:10 a.m.
Opening Remarks (opening)   
Fri 8:10 a.m. - 8:50 a.m.
Invited Talk: David Alvarez-Melis. Comparing, Transforming, and Optimizing Datasets with Optimal Transport. (Invited Talk)   
David Alvarez-Melis
Fri 8:50 a.m. - 9:30 a.m.
Invited Talk: Lora Aroyo (Invited Talk)   
Lora Aroyo
Fri 9:30 a.m. - 9:45 a.m.
Spotlight: SNoB: Social Norm Bias of “Fair” Algorithms (Spotlight)   
Myra Cheng
Fri 9:45 a.m. - 10:00 a.m.
Spotlight: CDCGen: Cross-Domain Conditional Generation via Normalizing Flows and Adversarial Training (Spotlight)   
Hari Prasanna Das
Fri 10:20 a.m. - 11:00 a.m.
  

Speaker email: epxing@cs.cmu.edu

Eric Xing
Fri 11:00 a.m. - 11:40 a.m.
Invited Talk: Kamalika Chaudhuri (Invited Talk)   
Kamalika Chaudhuri
Fri 11:40 a.m. - 12:30 p.m.
[ Visit Poster at Spot C3 in Virtual World ]

https://eventhosts.gather.town/0j0So7wYRCaUvO7F/icml2021ml4data

Fri 1:30 p.m. - 2:10 p.m.
Invited Talk: Hoifung Poon. Task-Specific Self-Supervised Learning for Precision Medicine. (Invited Talk)   
Hoifung Poon
Fri 2:10 p.m. - 2:50 p.m.
Invited Talk: Dawn Song. Towards building a responsible data economy. (Invited Talk)   
Dawn Song
Fri 2:50 p.m. - 3:05 p.m.
Spotlight: An Analysis of the Deployment of Models Trained on Private Tabular Synthetic Data: Unexpected Surprises (Spotlight)   
Mayana Wanderley Pereira
Fri 3:20 p.m. - 4:00 p.m.
Invited Talk: Alex Ratner. Programmatic weak supervision for data-centric AI. (Invited Talk)   
Alex Ratner
Fri 4:00 p.m. - 4:40 p.m.
Invited Talk: Kumar Chellapilla. Machine Learning with Humans-in-the-loop (HITL) (Invited Talk)   
Kumar Chellapilla
Fri 4:40 p.m. - 5:20 p.m.
Panel Discussion with Hoifung Poon, Kamalika Chaudhuri, Paroma Varma, and Kumar Chellapilla (Panel Discussion)
-
[ Visit Poster at Spot A4 in Virtual World ]

Understanding the performance of machine learning models across diverse data distributions is critically important for reliable applications. Motivated by this, there is a growing focus on curating benchmark datasets that capture distribution shifts. While valuable, the existing benchmarks are limited in that many of them contain only a small number of shifts and lack systematic annotation of what differs across shifts. We present MetaDataset, a collection of 12,868 sets of natural images across 410 classes, to address this challenge. We leverage the natural heterogeneity of Visual Genome and its annotations to construct MetaDataset. The key construction idea is to cluster images using their metadata, which provides context for each image (e.g. “cats with cars” or “cats in bathroom”) and thereby represents distinct data distributions. MetaDataset has two important benefits. First, it contains orders of magnitude more natural data shifts than previously available. Second, it provides explicit explanations of what is unique about each of its data sets and a distance score that measures the amount of distribution shift between any two of its data sets. We demonstrate the utility of MetaDataset in benchmarking several recent proposals for training models to be robust to data shifts. We find that simple empirical risk minimization performs best when shifts are moderate, and that no method has a systematic advantage for large shifts. We also show how MetaDataset can help to visualize conflicts between data subsets during model training.

Weixin Liang, James Zou
-
[ Visit Poster at Spot A0 in Virtual World ]

It is fundamentally challenging for machine learning models to generalize to out-of-distribution data, in part due to spurious correlations. We first give a principled analysis by bounding the generalization risk on any unseen domain. Drawing inspiration from this risk upper bound, we propose a novel Disentangled representation learning method for Domain Generalization (DDG). In contrast to traditional approaches based on domain adversarial training and domain labels, DDG jointly learns semantic and variation encoders for disentanglement while employing strong regularizations from minimizing domain divergence and promoting semantic invariance. Our method is able to effectively disentangle semantic and variation factors. Such a disentanglement enables us to easily manipulate and augment the training data. Leveraging the augmented training data, DDG learns intrinsic representations of semantic concepts that are invariant to nuisance factors and generalize across different domains. Comprehensive experiments on a number of benchmarks show that DDG can achieve state-of-the-art performance on the task of domain generalization and uncover interpretable salient structure within data.

Hanlin Zhang, Yi-Fan Zhang, Weiyang Liu, Adrian Weller, Bernhard Schölkopf, Eric Xing
-
[ Visit Poster at Spot A5 in Virtual World ]
Recent advances in deep learning have drastically improved performance on many Natural Language Understanding (NLU) tasks. However, the data used to train NLU models may contain private information such as addresses or phone numbers, particularly when drawn from human subjects. It is desirable that underlying models do not expose private information contained in the training data. Differentially Private Stochastic Gradient Descent (DP-SGD) has been proposed as a mechanism to build privacy-preserving models. However, DP-SGD can be prohibitively slow to train. In this work, we propose a more efficient DP-SGD for training on GPU infrastructure and apply it to fine-tuning models based on LSTM and transformer architectures. We report faster training times alongside accuracy, theoretical privacy guarantees, and the success of membership inference attacks for our models. We observe that fine-tuning with the proposed variant of DP-SGD can yield competitive models without significant degradation in training time and with improved privacy protection. We also observe that looser theoretical $(\epsilon, \delta)$ guarantees can translate into significant practical privacy gains.
Christophe Dupuy, Radhika Arava, Rahul Gupta, Anna Rumshisky
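For readers unfamiliar with the baseline mentioned in the abstract above, the following is a minimal sketch of one standard DP-SGD step (per-example gradient clipping followed by Gaussian noise) on a toy logistic regression. It is not the efficient GPU variant proposed by the authors, and the function name `dp_sgd_step` and all parameter values are illustrative assumptions.

```python
import numpy as np


def dp_sgd_step(w, X, y, lr=0.1, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """One illustrative DP-SGD step for logistic regression (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng

    # Per-example gradients of the logistic loss: (sigmoid(x . w) - y) * x
    preds = 1.0 / (1.0 + np.exp(-(X @ w)))
    per_example_grads = (preds - y)[:, None] * X

    # Clip each example's gradient so its L2 norm is at most clip_norm
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale

    # Sum the clipped gradients, add Gaussian noise scaled to the clipping
    # norm, then average over the batch
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    noisy_mean_grad = (clipped.sum(axis=0) + noise) / len(X)

    return w - lr * noisy_mean_grad, clipped
```

The per-example clipping is the step that typically makes DP-SGD slow, since gradients cannot simply be averaged over the batch before clipping. The resulting $(\epsilon, \delta)$ guarantee depends on the noise multiplier, sampling rate, and number of steps, which this sketch does not track.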
-
[ Visit Poster at Spot C1 in Virtual World ]

How to generate conditional synthetic data for a domain without utilizing information about its labels/attributes? Our work presents a solution to this question. We propose a transfer learning-based framework utilizing normalizing flows, coupled with both maximum-likelihood and adversarial training. We model a source domain (labels available) and a target domain (labels unavailable) with individual normalizing flows, and perform domain alignment to a common latent space using adversarial discriminators. Due to the invertible property of flow models, the mapping has exact cycle consistency. We also learn the joint distribution of the data samples and attributes in the source domain by employing an encoder to map attributes to the latent space via adversarial training. During the synthesis phase, given any combination of attributes, our method can generate synthetic samples conditioned on them in the target domain. Empirical studies confirm the effectiveness of our method on benchmark datasets. We envision our method to be particularly useful for synthetic data generation in label-scarce systems by generating non-trivial augmentations via attribute transformations. These synthetic samples will introduce more entropy into the label-scarce domain than their geometric and photometric transformation counterparts, which is helpful for robust downstream tasks.

Hari Prasanna Das, Ryan Tran, Japjot Singh, Yu Wen Lin, Costas J. Spanos
-
[ Visit Poster at Spot A4 in Virtual World ]

Deep generative models have made much progress in improving training stability and the quality of generated data. Recently there has been increased interest in the fairness of deep-generated data. Fairness is important in many applications, e.g. law enforcement. Central to fair data generation are the fairness metrics used to assess and evaluate different generative models. In this paper, we first review fairness metrics proposed in previous works and highlight potential weaknesses. We then discuss a performance benchmark framework along with the assessment of alternative metrics.

Chris Teo, Ngai-Man Cheung
-
[ Visit Poster at Spot B3 in Virtual World ]

With the use of personal devices connected to the Internet for tasks such as searches and shopping becoming ubiquitous, ensuring the privacy of the users of such services has become a requirement in order to build and maintain customer trust. While text privatization methods exist, they require the existence of a trusted party that collects user data before applying a privatization method to preserve users' privacy.

In this work we propose an efficient mechanism to provide metric differential privacy for text data on-device. With our solution, sensitive data never leaves the device and service providers only have access to privatized data to train models on and analyze.

We compare our algorithm to the state-of-the-art for text privatization, showing similar or better utility for the same privacy guarantees, while reducing the storage costs by orders of magnitude, enabling on-device text privatization.

Ricardo Silva Carvalho, Theodore Vasiloudis, Seyi Feyisetan
-
[ Visit Poster at Spot A3 in Virtual World ]

Mix-up has been shown to be effective in improving a model's generalization ability, and multiple extensions of the original mix-up have been introduced in recent years. However, these techniques mainly focus on the data instead of the neural network's performance. In this paper, we propose a new method to automatically learn the mix-up strategy using gradient information and a reinforcement learning module. The mix-up strategy is controlled by a neural network trained with reinforcement learning to maximize the expected accuracy of the classifier on the validation set. Initial results show a faster convergence rate compared to other mix-up methods.

Long Luu, Zeyi Huang, Haohan Wang
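As background for the abstract above, here is a minimal sketch of the original mix-up operation, in which pairs of inputs and their one-hot labels are blended with a coefficient drawn from a Beta distribution. It does not implement the learned, reinforcement-learning-controlled strategy the authors propose; the function name `mixup_batch` and the default `alpha` are illustrative assumptions.

```python
import numpy as np


def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Blend each example with a randomly chosen partner (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng

    # Mixing coefficient lambda ~ Beta(alpha, alpha), shared across the batch
    lam = rng.beta(alpha, alpha)

    # Pair every example with another example from the same batch
    perm = rng.permutation(len(x))

    # Convex combination of inputs and of their one-hot labels
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mixed, y_mixed, lam
```

In this fixed-strategy version, `alpha` is a hand-set hyperparameter. A learned strategy such as the one described in the abstract would presumably replace that fixed sampling rule with a policy tuned against validation accuracy.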
-
[ Visit Poster at Spot C6 in Virtual World ]

Supervised machine learning algorithms fail to perform well in the presence of endogeneity in the explanatory variables. In this paper, we borrow from the literature on partial identification to propose deep causal inequalities (DeepCI) that overcome this issue. Instead of relying on observed labels, the DeepCI estimator uses inequalities inferred from the observed behavior of agents in the data. By construction, this allows us to circumvent the issue of endogenous explanatory variables in many cases. We provide theoretical guarantees for our estimator and demonstrate that it is consistent under very mild conditions. We demonstrate through extensive simulations that our estimator outperforms standard supervised machine learning algorithms and existing partial identification methods.

Edvard Bakhitov, Aman Singh, Jiding Zhang
-
[ Visit Poster at Spot C5 in Virtual World ]

Regularization is a well-established technique in machine learning (ML) that facilitates an optimal bias-variance trade-off and consequently reduces model complexity and enhances explainability. In this article, we provide a reinterpretation of the regularization hyper-parameter, and argue that the lack of quantification of the costs and risks of false alarms in the loss function undermines the measurability of the economic value of using ML to the extent that might make it practically useless.

Nima Safaei, Pooria Assadi
-
[ Visit Poster at Spot A2 in Virtual World ]

Machine learning algorithms are increasingly used to inform critical decisions. There is a growing concern about bias, that algorithms may produce uneven outcomes for individuals in different demographic groups. In this work, we measure bias as the difference between mean prediction errors across groups. We show that even with unbiased input data, when a model is mis-specified: (1) population-level mean prediction error can still be negligible, but group-level mean prediction errors can be large; (2) such errors are not equal across groups; and (3) the difference between errors, i.e., bias, can take the worst-case realization. That is, when there are two groups of the same size, mean prediction errors for these two groups have the same magnitude but opposite signs. In closed form, we show such errors and bias are functions of the first and second moments of the joint distribution of features (for linear and probit regressions). We also conduct numerical experiments to show similar results in more general settings. Our work provides a first step for decoupling the impact of different causes of bias.

Yangfan Liang, Peter Zhang
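The equal-and-opposite group errors described in the abstract above can be reproduced in a small toy simulation. The sketch below assumes a quadratic ground truth shared by both groups, a misspecified linear least-squares fit with an intercept, and two equal-size groups with different feature distributions; all of these choices are illustrative assumptions, not the authors' experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000  # examples per group (equal group sizes)

# Two groups whose feature distributions differ (hypothetical toy setup)
x = np.concatenate([rng.normal(0.0, 1.0, n), rng.normal(2.0, 2.0, n)])
group = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

# The true relationship is quadratic and identical for both groups,
# so the data itself carries no group-specific bias
y = x ** 2 + rng.normal(0.0, 0.1, 2 * n)

# Misspecified model: least squares with only an intercept and a linear term
design = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
errors = design @ coef - y

mean_error_all = errors.mean()              # population-level mean error
mean_error_g0 = errors[group == 0].mean()   # group-level mean errors
mean_error_g1 = errors[group == 1].mean()
bias = mean_error_g0 - mean_error_g1        # bias as defined in the abstract

print(mean_error_all, mean_error_g0, mean_error_g1, bias)
```

Because least squares with an intercept forces the residuals to sum to zero, the population-level mean error vanishes while the two group-level errors cancel each other, which is the pattern the abstract describes for equal-size groups.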
-
[ Visit Poster at Spot B2 in Virtual World ]

We study private synthetic data generation for query release, where the goal is to construct a sanitized version of a sensitive dataset, subject to differential privacy, that approximately preserves the answers to a large collection of statistical queries. We first present an algorithmic framework that unifies a long line of iterative algorithms in the literature. Under this framework, we propose two new methods. The first method, private entropy projection (PEP), can be viewed as an advanced variant of MWEM that adaptively reuses past query measurements to boost accuracy. Our second method, generative networks with the exponential mechanism (GEM), circumvents computational bottlenecks in algorithms such as MWEM and PEP by optimizing over generative models parameterized by neural networks, which capture a rich family of distributions while enabling fast gradient-based optimization. We demonstrate that PEP and GEM empirically outperform existing algorithms. Furthermore, we show that GEM nicely incorporates prior information from public data while overcoming limitations of PMW^Pub, the existing state-of-the-art method that also leverages public data.

Terrance Liu, Giuseppe Vietri, Steven Wu
-
[ Visit Poster at Spot B6 in Virtual World ]

Training machine learning models with the sole goal of maximizing accuracy could result in learning biases from data, making the learned model discriminatory towards certain groups. One approach to mitigate this problem is to find a representation that is more likely to yield fair outcomes, using fair representation learning. In this paper, we propose a new fair representation learning approach that leverages different levels of representation of the data to tighten the fairness bounds of the learned representation. Our results show that stacking different autoencoders and enforcing fairness at different latent spaces results in an improvement in fairness compared to other existing approaches.

Patrik Joslin Kenfack, Adil Khan, Rasheed Hussain
-
[ Visit Poster at Spot C0 in Virtual World ]

Differentially private (DP) synthetic datasets are a powerful approach for training machine learning models while respecting the privacy of individual data providers. The effect of DP on the fairness of the resulting trained models is not yet well understood. In this contribution, we systematically study the effects of differentially private synthetic data generation on classification. We analyze disparities in model utility and bias caused by the synthetic dataset, measured through algorithmic fairness metrics. Our first set of results shows that although there seems to be a clear negative correlation between privacy and utility (the more private, the less accurate) across all data synthesizers we evaluated, more privacy does not necessarily imply more bias. Additionally, we assess the effects of utilizing synthetic datasets for model training and model evaluation. We show that results obtained on synthetic data can misestimate the actual model performance when it is deployed on real data. We hence advocate for defining proper testing protocols in scenarios where differentially private synthetic datasets are utilized for model training and evaluation.

Mayana Wanderley Pereira, Rahul Dodhia, Juan Lavista Ferres
-
[ Visit Poster at Spot A6 in Virtual World ]

While recent automatic data augmentation works lead to state-of-the-art results, their design spaces and the derived data augmentation strategies still incorporate strong human priors. In this work, instead of selecting a set of hand-picked default augmentations alongside the searched data augmentations, we propose a fully automated approach for data augmentation search called Deep AutoAugment (DAA). We propose a search strategy that matches the directions of the validation gradients and the training gradients averaged over all possible augmentations. Our experiments show that DAA achieves strong performance on CIFAR-10/100 and SVHN with much less search cost compared to state-of-the-art data augmentation search methods.

Yu Zheng, Zhi Zhang, Shen Yan, Mi Zhang
-
[ Visit Poster at Spot B5 in Virtual World ]

We introduce Social Norm Bias (SNoB), a subtle but consequential type of discrimination that may be exhibited by machine learning classification algorithms, even when these systems achieve group fairness objectives. This work illuminates the gap between definitions of algorithmic group fairness and concerns of harm based on adherence to social norms. We study this issue through the lens of gender bias in occupation classification from online biographies. We quantify SNoB by measuring how an algorithm's predictions are associated with masculine and feminine gender norms. This framework reveals that for classification tasks related to male-dominated occupations, fairness-aware classifiers favor biographies whose language aligns with masculine gender norms. We compare SNoB across fairness intervention techniques, finding that post-processing interventions do not mitigate this bias at all.

Myra Cheng, Maria De-Arteaga, Lester Mackey, Adam Tauman Kalai
-
[ Visit Poster at Spot A1 in Virtual World ]

Data is central to the machine learning (ML) pipeline. While most existing works in the literature focus on challenges regarding the data used as inputs for model training, this work places emphasis on the data generated during model training and evaluation. Useful for robust evaluation and model benchmarking, we refer to this type of data as “benchmarking metadata”. As ML has become ubiquitous across domains and deployment settings, there is interest amongst various communities (e.g. industry practitioners) to benchmark models across tasks and objectives of personal value. However, this personalized benchmarking necessitates a framework that enables multi-objective evaluation (by collecting benchmarking metadata like performance metrics and training statistics) and ensures fair model comparisons (by controlling for confounding variables). To address these needs, we introduce the open-source Ludwig Benchmarking Toolkit (LBT), a system that enables the standardized and personalized collection of benchmarking metadata, with automated methods to remove confounding factors. We demonstrate how LBT can be used to create personalized benchmark studies with a large-scale comparative analysis for text classification across 7 models and 9 datasets. Using the benchmarking metadata generated by LBT, we explore trade-offs between inference latency and performance, relationships between dataset attributes and performance, and the effects of pretraining on convergence and robustness.

Avanika Narayan, Piero Molino, Karan Goel, Christopher Re
-
[ Visit Poster at Spot B4 in Virtual World ]

We describe a Bayesian approach to weakly supervised regression. Our proposed framework propagates uncertainty from the weak supervision to an aggregated predictive distribution. We use a generalized Bayes procedure to account for the supervision being weak and therefore likely misspecified.

Putra Manggala, Holger Hoos, Eric Nalisnick
-
[ Visit Poster at Spot C4 in Virtual World ]

Supply chain network data is a valuable asset for businesses wishing to understand their ethical profile, security of supply, and efficiency. Possession of a dataset alone, however, is not a sufficient enabler of actionable decisions, because dependency link information is often incomplete. In this paper, we present a graph representation learning approach to uncover hidden dependency links. To the best of our knowledge, our work is the first to represent a supply chain as a heterogeneous knowledge graph with learnable embeddings. We demonstrate that our representation facilitates state-of-the-art performance on link prediction of a global automotive supply chain network using a relational graph convolutional network. It is anticipated that our method will be directly applicable to businesses wishing to sever links with nefarious entities and mitigate risk of supply failure. More abstractly, it is anticipated that our method will be useful to inform representation learning of supply chain networks for downstream tasks beyond link prediction.

Edward Kosasih, Ryan-Rhys Griffiths, Alexandra Brintrup, Ajmal Aziz
-
[ Visit Poster at Spot B1 in Virtual World ]

Recent advances in differentially private deep learning have demonstrated that the application of differential privacy, specifically the DP-SGD algorithm, has a disparate impact on different sub-groups in the population, leading to a significantly larger drop in model utility for under-represented sub-populations (minorities) than for well-represented ones. In this work, we compare PATE, another mechanism for training deep learning models using differential privacy, with DP-SGD in terms of fairness. We show that PATE also has a disparate impact; however, it is much less severe than that of DP-SGD. We draw insights from this observation on what might be promising directions for achieving better fairness-privacy trade-offs.

Archit Uniyal, Rakshit Naidu, Sasikanth Kotti, Patrik Joslin Kenfack, Sahib Singh, FatemehSadat Mireshghallah
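One simple way to quantify the disparate impact discussed in the abstract above is to compare, per sub-group, the accuracy of a non-private baseline with that of a privately trained model. The sketch below assumes that predictions from both models are already available as arrays; the helper names `accuracy_by_group` and `utility_drop_by_group` are hypothetical, and the authors may measure utility differently.

```python
import numpy as np


def accuracy_by_group(y_true, y_pred, groups):
    """Accuracy computed separately for each group label (hypothetical helper)."""
    return {
        g: float(np.mean(y_pred[groups == g] == y_true[groups == g]))
        for g in np.unique(groups).tolist()
    }


def utility_drop_by_group(y_true, pred_baseline, pred_private, groups):
    """Per-group accuracy lost when moving from the baseline to the private model."""
    base = accuracy_by_group(y_true, pred_baseline, groups)
    priv = accuracy_by_group(y_true, pred_private, groups)
    return {g: base[g] - priv[g] for g in base}
```

A larger drop for an under-represented group than for a well-represented one would indicate the kind of disparate impact the abstract refers to. Comparing these per-group drops between two private training mechanisms gives a rough fairness comparison under this particular metric.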
-
[ Visit Poster at Spot B0 in Virtual World ]

Depression is a serious medical illness that can have adverse effects on how one feels, thinks, and acts, which can lead to emotional and physical problems. Natural Language Processing (NLP) techniques can be applied to help with the diagnosis of such illnesses, using people's utterances and writings. Due to the sensitive nature of such data, privacy measures need to be taken when handling the data and training models. In this work, we study the effects that Differential Privacy (DP) and Federated Learning (FL) have on training contextualized language models (BERT, ALBERT, RoBERTa and DistilBERT), and offer insights on how to privately train NLP models. We envisage this work being used in the healthcare/mental health industry to keep medical history private. Hence, we provide an open-source implementation of this work. To see the behavior of the privacy implementations on different datasets, the work is also implemented on a Sexual Harassment Twitter dataset.

Priyam Basu, Rakshit Naidu, Zumrut Muftuoglu, Sahib Singh, FatemehSadat Mireshghallah