Timezone: »
Poster
Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value
Yongchan Kwon · James Zou
Data valuation is a powerful framework for providing statistical insights into which data are beneficial or detrimental to model training. Many Shapley-based data valuation methods have shown promising results in various downstream tasks, however, they are well known to be computationally challenging as it requires training a large number of models. As a result, it has been recognized as infeasible to apply to large datasets. To address this issue, we propose Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate. The proposed method is computationally efficient and can scale to millions of data by reusing trained weak learners. Specifically, Data-OOB takes less than $2.25$ hours on a single CPU processor when there are $10^6$ samples to evaluate and the input dimension is $100$. Furthermore, Data-OOB has solid theoretical interpretations in that it identifies the same important data point as the infinitesimal jackknife influence function when two different points are compared. We conduct comprehensive experiments using 12 classification datasets, each with thousands of sample sizes. We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points, highlighting the potential for applying data values in real-world applications.
Author Information
Yongchan Kwon (Columbia University)
James Zou (Stanford)
More from the Same Authors
-
2021 : Stateful Performative Gradient Descent »
Zachary Izzo · James Zou · Lexing Ying -
2022 : On the nonlinear correlation of ML performance across data subpopulations »
Weixin Liang · Yining Mao · Yongchan Kwon · Xinyu Yang · James Zou -
2022 : MetaShift: A Dataset of Datasets for Evaluating Contextual Distribution Shifts »
Weixin Liang · Xinyu Yang · James Zou -
2022 : Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning »
Weixin Liang · Yuhui Zhang · Yongchan Kwon · Serena Yeung · James Zou -
2023 : Last-Layer Fairness Fine-tuning is Simple and Effective for Neural Networks »
Yuzhen Mao · Zhun Deng · Huaxiu Yao · Ting Ye · Kenji Kawaguchi · James Zou -
2023 : Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value »
Yongchan Kwon · James Zou -
2023 : Prospectors: Leveraging Short Contexts to Mine Salient Objects in High-dimensional Imagery »
Gautam Machiraju · Arjun Desai · James Zou · Christopher Re · Parag Mallick -
2023 : Beyond Confidence: Reliable Models Should Also Consider Atypicality »
Mert Yuksekgonul · Linjun Zhang · James Zou · Carlos Guestrin -
2023 : Less is More: Using Multiple LLMs for Applications with Lower Costs »
Lingjiao Chen · Matei Zaharia · James Zou -
2023 Poster: Data-Driven Subgroup Identification for Linear Regression »
Zachary Izzo · Ruishan Liu · James Zou -
2023 Poster: Accuracy on the Curve: On the Nonlinear Correlation of ML Performance Between Data Subpopulations »
Weixin Liang · Yining Mao · Yongchan Kwon · Xinyu Yang · James Zou -
2023 Poster: Discover and Cure: Concept-aware Mitigation of Spurious Correlation »
Shirley Wu · Mert Yuksekgonul · Linjun Zhang · James Zou -
2022 : Invited talk #2 James Zou (Title: Machine learning to make clinical trials more efficient and diverse) »
James Zou -
2022 : 7-UP: generating in silico CODEX from a small set of immunofluorescence markers »
James Zou -
2022 : Contributed Talk 2: MetaShift: A Dataset of Datasets for Evaluating Contextual Distribution Shifts »
Weixin Liang · Xinyu Yang · James Zou -
2022 Poster: Meaningfully debugging model mistakes using conceptual counterfactual explanations »
Abubakar Abid · Mert Yuksekgonul · James Zou -
2022 Spotlight: Meaningfully debugging model mistakes using conceptual counterfactual explanations »
Abubakar Abid · Mert Yuksekgonul · James Zou -
2019 Poster: Adaptive Monte Carlo Multiple Testing via Multi-Armed Bandits »
Martin Zhang · James Zou · David Tse -
2019 Oral: Adaptive Monte Carlo Multiple Testing via Multi-Armed Bandits »
Martin Zhang · James Zou · David Tse -
2017 Poster: Estimating the unseen from multiple populations »
Aditi Raghunathan · Greg Valiant · James Zou -
2017 Poster: Learning Latent Space Models with Angular Constraints »
Pengtao Xie · Yuntian Deng · Yi Zhou · Abhimanu Kumar · Yaoliang Yu · James Zou · Eric Xing -
2017 Talk: Learning Latent Space Models with Angular Constraints »
Pengtao Xie · Yuntian Deng · Yi Zhou · Abhimanu Kumar · Yaoliang Yu · James Zou · Eric Xing -
2017 Talk: Estimating the unseen from multiple populations »
Aditi Raghunathan · Greg Valiant · James Zou