Poster in Workshop: DMLR Workshop: Data-centric Machine Learning Research
Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value
Yongchan Kwon · James Zou
Abstract:
Data valuation is a powerful framework for providing statistical insights into which data are beneficial or detrimental to model training. Many Shapley-based data valuation methods have shown promising results in various downstream tasks; however, they are well known to be computationally challenging because they require training a large number of models. To address this issue, we propose Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate. The proposed method is computationally efficient: Data-OOB takes less than $2.25$ hours on a single CPU processor when there are $10^6$ samples to evaluate and the input dimension is $100$. We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data, highlighting the potential for applying data values in real-world applications.
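The idea of valuing data with an out-of-bag estimate can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a point's value is the average correctness of the bootstrap models that did not see that point during training, and it uses a hypothetical nearest-centroid weak learner to stay dependency-free. The names `fit_centroids`, `predict_centroids`, and `oob_data_values`, as well as the parameters `n_bags` and `seed`, are illustrative choices, not names from the paper.

```python
import numpy as np


def fit_centroids(X, y, n_classes):
    """Hypothetical weak learner: store one mean vector per class."""
    centroids = np.zeros((n_classes, X.shape[1]))
    present = np.zeros(n_classes, dtype=bool)
    for c in range(n_classes):
        mask = y == c
        if mask.any():
            centroids[c] = X[mask].mean(axis=0)
            present[c] = True
    return centroids, present


def predict_centroids(model, X):
    """Predict the class whose centroid is closest in squared distance."""
    centroids, present = model
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    dist[:, ~present] = np.inf  # ignore classes missing from the bootstrap sample
    return dist.argmin(axis=1)


def oob_data_values(X, y, n_classes, n_bags=200, seed=0):
    """Average out-of-bag correctness per training point (assumed scoring rule)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    score_sum = np.zeros(n)
    oob_count = np.zeros(n)
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)  # bootstrap sample, drawn with replacement
        oob = np.ones(n, dtype=bool)
        oob[idx] = False                  # points never drawn are out-of-bag
        if not oob.any():
            continue
        model = fit_centroids(X[idx], y[idx], n_classes)
        pred = predict_centroids(model, X[oob])
        score_sum[oob] += (pred == y[oob])
        oob_count[oob] += 1
    values = np.full(n, np.nan)           # NaN marks points that were never out-of-bag
    seen = oob_count > 0
    values[seen] = score_sum[seen] / oob_count[seen]
    return values
```

Under these assumptions, a point whose label disagrees with most models that never saw it receives a low value, which is why low values can be used to flag candidate mislabeled points. Only `n_bags` weak learners are trained, regardless of how many points are being valued.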