On the Reproducibility of Data Valuation under Learning Stochasticity
Jiachen Wang · Feiyang Kang · Chiyuan Zhang · Ruoxi Jia · Prateek Mittal

Data valuation, which quantifies how individual data points contribute to machine learning (ML) model training, is an important question in data-centric ML research and has empowered a broad variety of applications. Popular data value notions such as the Shapley value are computed from the performance scores of models trained on different data subsets. Recent studies, however, reveal that stochasticity in neural network training algorithms can adversely affect the consistency of data value rankings. Yet, how to effectively mitigate the impact of the actual perturbation arising from model training remains an open question. This work introduces TinyMV, a new data value notion developed for improved reproducibility against stochasticity stemming from stochastic gradient descent (SGD) or its variants. TinyMV is inspired by a surprising yet consistent pattern of learning stochasticity from SGD: the signal-to-noise ratio (SNR) of a model’s performance change caused by the addition of a training point is maximized on very small datasets (e.g., <=15 data points for CIFAR10). Our experiments demonstrate that TinyMV exhibits state-of-the-art reproducibility and surpasses existing data valuation techniques across a broad range of applications.
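The SNR quantity behind TinyMV can be illustrated with a minimal sketch: across independent training runs (seeds), measure the performance change caused by adding a candidate point, then take the ratio of the mean change (signal) to its seed-to-seed standard deviation (noise). The function name and the toy numbers below are hypothetical, not from the paper.

```python
import random
import statistics

def snr_of_perf_change(perf_with, perf_without):
    """SNR of the performance change from adding one training point.

    perf_with / perf_without: lists of model performance scores over
    independent training runs (different SGD seeds), with and without
    the candidate point in the training set.
    """
    deltas = [w - wo for w, wo in zip(perf_with, perf_without)]
    mean = statistics.fmean(deltas)   # signal: average performance change
    std = statistics.stdev(deltas)    # noise: run-to-run variation
    return abs(mean) / std if std > 0 else float("inf")

# Toy illustration (synthetic numbers): on a tiny training set one added
# point shifts accuracy a lot relative to seed noise, so the SNR is high;
# on a large set the shift is buried in the noise, so the SNR is low.
random.seed(0)
tiny_with = [0.60 + random.gauss(0, 0.02) for _ in range(20)]
tiny_without = [0.50 + random.gauss(0, 0.02) for _ in range(20)]
big_with = [0.901 + random.gauss(0, 0.02) for _ in range(20)]
big_without = [0.900 + random.gauss(0, 0.02) for _ in range(20)]

print("tiny-set SNR:", snr_of_perf_change(tiny_with, tiny_without))
print("large-set SNR:", snr_of_perf_change(big_with, big_without))
```

A high SNR means the point's measured contribution is reproducible across retraining runs, which is why restricting valuation to very small subsets can stabilize data value rankings.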

Author Information

Jiachen Wang (Princeton University)
Feiyang Kang (Virginia Tech)
Chiyuan Zhang (MIT)
Ruoxi Jia (Virginia Tech)
Prateek Mittal (Princeton University)