Poster in DMLR Workshop: Data-centric Machine Learning Research
Data Banzhaf: A Robust Data Valuation Framework for Machine Learning
Jiachen Wang · Ruoxi Jia
Data valuation, a growing field dedicated to measuring the usefulness of individual data sources in training machine learning (ML) models, plays a critical role in data-centric ML research; it has wide-ranging applications from improving data quality to incentivizing data sharing. This paper studies the robustness of data valuation techniques to noisy model performance scores. In particular, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we introduce the concept of safety margin, which measures the robustness of a data value notion. We show that the Banzhaf value, a well-known value notion originating in the cooperative game theory literature, achieves the largest safety margin among a large class of value notions. Our evaluation demonstrates that the Banzhaf value outperforms existing semivalue-based data value notions on several ML tasks, such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the Shapley value, given its computational advantage and its ability to robustly differentiate data quality.
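To make the Banzhaf value concrete: for a dataset of n points, the Banzhaf value of point i is its average marginal contribution U(S ∪ {i}) − U(S) over all subsets S of the remaining points, each subset weighted equally (equivalently, each other point is included independently with probability 1/2). The sketch below is a minimal Monte Carlo estimator of this quantity; it is not the paper's implementation, and the `utility` function here is a stand-in for a model-performance score the reader would supply.

```python
import random

def banzhaf_values(n, utility, num_samples=2000, seed=0):
    """Monte Carlo estimate of the Banzhaf value of each of n data points.

    For each point i, we draw random subsets S of the other points
    (each included independently with probability 1/2) and average the
    marginal contribution utility(S | {i}) - utility(S).
    """
    rng = random.Random(seed)
    values = [0.0] * n
    for i in range(n):
        total = 0.0
        for _ in range(num_samples):
            # Sample S uniformly from all subsets of the other n-1 points.
            S = frozenset(j for j in range(n) if j != i and rng.random() < 0.5)
            total += utility(S | {i}) - utility(S)
        values[i] = total / num_samples
    return values

# Illustrative (hypothetical) utility: an additive score where each point
# contributes a fixed amount. Under an additive utility, the marginal
# contribution of point i is constant, so its Banzhaf value equals its weight.
weights = [0.1, 0.5, 0.0]
additive_utility = lambda S: sum(weights[j] for j in S)
print(banzhaf_values(3, additive_utility, num_samples=100))
```

In practice, `utility` would retrain the model on the subset S and return a validation score; the additive toy utility is used here only so the estimator's output can be checked against a known answer.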