Timezone: »
Poster
Data Shapley: Equitable Valuation of Data for Machine Learning
Amirata Ghorbani · James Zou
As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions.
For example, in healthcare and consumer markets, it has been suggested that individuals should be compensated for the data that they generate, but it is not clear what is an equitable valuation for individual data.
In this work, we develop a principled framework to address data valuation in the context of supervised machine learning. Given a learning algorithm trained on $n$ data points to produce a predictor, we propose data Shapley as a metric to quantify the value of each training datum to the predictor performance. Data Shapley uniquely satisfies several natural properties of equitable data valuation. We develop Monte Carlo and gradient-based methods to efficiently estimate data Shapley values in practical settings where complex learning algorithms, including neural networks, are trained on large datasets. In addition to being equitable, extensive experiments across biomedical, image and synthetic data demonstrate that data Shapley has several other benefits: 1) it is more powerful than the popular leave-one-out or leverage score in providing insight on what data is more valuable for a given learning task; 2) low Shapley value data effectively capture outliers and corruptions; 3) high Shapley value data inform what type of new data to acquire to improve the predictor.
Author Information
Amirata Ghorbani (Stanford)
James Zou (Stanford University)
Related Events (a corresponding poster, oral, or spotlight)
-
2019 Oral: Data Shapley: Equitable Valuation of Data for Machine Learning »
Tue. Jun 11th 06:00 -- 06:20 PM Room Seaside Ballroom
More from the Same Authors
-
2021 : Meaningfully Explaining a Model's Mistakes »
· Abubakar Abid · James Zou -
2021 : Meaningfully Explaining a Model's Mistakes »
Abubakar Abid · James Zou -
2021 : MetaDataset: A Dataset of Datasets for Evaluating Distribution Shifts and Training Conflicts »
Weixin Liang · James Zou · Weixin Liang -
2021 : Have the Cake and Eat It Too? Higher Accuracy and Less Expense when Using Multi-label ML APIs Online »
Lingjiao Chen · James Zou · Matei Zaharia -
2021 : Machine Learning API Shift Assessments: Change is Coming! »
Lingjiao Chen · James Zou · Matei Zaharia -
2021 : Do Humans Trust Advice More if it Comes from AI? An Analysis of Human-AI Interactions »
Kailas Vodrahalli · James Zou -
2022 : On the nonlinear correlation of ML performance across data subpopulations »
Weixin Liang · Yining Mao · Yongchan Kwon · Xinyu Yang · James Zou -
2023 : Improve Model Inference Cost with Image Gridding »
Shreyas Krishnaswamy · Lisa Dunlap · Lingjiao Chen · Matei Zaharia · James Zou · Joseph Gonzalez -
2023 : Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value »
Yongchan Kwon · James Zou -
2022 : GSCLIP : A Framework for Explaining Distribution Shifts in Natural Language »
Zhiying Zhu · Weixin Liang · James Zou -
2022 : Evaluation of ML in Health/Science »
James Zou -
2022 : Data Sculpting: Interpretable Algorithm for End-to-End Cohort Selection »
Ruishan Liu · James Zou -
2022 : Data Budgeting for Machine Learning »
Weixin Liang · James Zou -
2022 Poster: When and How Mixup Improves Calibration »
Linjun Zhang · Zhun Deng · Kenji Kawaguchi · James Zou -
2022 Poster: Efficient Online ML API Selection for Multi-Label Classification Tasks »
Lingjiao Chen · Matei Zaharia · James Zou -
2022 Poster: Improving Out-of-Distribution Robustness via Selective Augmentation »
Huaxiu Yao · Yu Wang · Sai Li · Linjun Zhang · Weixin Liang · James Zou · Chelsea Finn -
2022 Spotlight: Efficient Online ML API Selection for Multi-Label Classification Tasks »
Lingjiao Chen · Matei Zaharia · James Zou -
2022 Spotlight: Improving Out-of-Distribution Robustness via Selective Augmentation »
Huaxiu Yao · Yu Wang · Sai Li · Linjun Zhang · Weixin Liang · James Zou · Chelsea Finn -
2022 Spotlight: When and How Mixup Improves Calibration »
Linjun Zhang · Zhun Deng · Kenji Kawaguchi · James Zou -
2021 Poster: Improving Generalization in Meta-learning via Task Augmentation »
Huaxiu Yao · Long-Kai Huang · Linjun Zhang · Ying WEI · Li Tian · James Zou · Junzhou Huang · Zhenhui (Jessie) Li -
2021 Spotlight: Improving Generalization in Meta-learning via Task Augmentation »
Huaxiu Yao · Long-Kai Huang · Linjun Zhang · Ying WEI · Li Tian · James Zou · Junzhou Huang · Zhenhui (Jessie) Li -
2021 Poster: How to Learn when Data Reacts to Your Model: Performative Gradient Descent »
Zachary Izzo · Lexing Ying · James Zou -
2021 Spotlight: How to Learn when Data Reacts to Your Model: Performative Gradient Descent »
Zachary Izzo · Lexing Ying · James Zou -
2020 Poster: A Distributional Framework For Data Valuation »
Amirata Ghorbani · Michael Kim · James Zou -
2019 Poster: Concrete Autoencoders: Differentiable Feature Selection and Reconstruction »
Muhammed Fatih Balın · Abubakar Abid · James Zou -
2019 Poster: Discovering Conditionally Salient Features with Statistical Guarantees »
Jaime Roquero Gimenez · James Zou -
2019 Oral: Discovering Conditionally Salient Features with Statistical Guarantees »
Jaime Roquero Gimenez · James Zou -
2019 Oral: Concrete Autoencoders: Differentiable Feature Selection and Reconstruction »
Muhammed Fatih Balın · Abubakar Abid · James Zou -
2018 Poster: CoVeR: Learning Covariate-Specific Vector Representations with Tensor Decompositions »
Kevin Tian · Teng Zhang · James Zou -
2018 Oral: CoVeR: Learning Covariate-Specific Vector Representations with Tensor Decompositions »
Kevin Tian · Teng Zhang · James Zou