Poster in DMLR Workshop: Data-centric Machine Learning Research
Training with Low-Label-Quality Data: Rank Pruning and Multi-Review
Yue Xing · Ashutosh Pandey · David Yan · Fei Wu · Michael Fronda · Pamela Bhattacharya
Inaccurate labels in training data are a common problem in machine learning. Algorithms have been proposed to prune samples with label noise (i.e., samples that are far from the decision boundary yet carry an inaccurate label), since training models on such samples can degrade performance. However, in many real applications there are samples near the decision boundary that are inherently difficult to label, leading to label error. Such samples are important for model training because of their high learning value. Existing pruning algorithms do not differentiate between samples with label noise and samples with label error, and therefore prune both kinds. This paper improves an existing pruning algorithm in two ways: it (a) prunes noisy samples as well as high-confidence samples (which have low learning value), and (b) preserves samples with potential label error that have high learning value and obtains accurate labels for them through multiple reviews. Our evaluation on publicly available and Meta-internal de-identified and aggregated data sets shows that the combination of these ideas improves the baseline pruning algorithm.
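The abstract describes a three-way partition of training samples based on model confidence in the given label. The sketch below illustrates that idea in Python; it is not the authors' implementation, and the function name and thresholds (`noise_thresh`, `easy_thresh`) are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def partition_samples(pred_probs, labels, noise_thresh=0.1, easy_thresh=0.95):
    """Partition samples by the predicted probability of their given label.

    Illustrative sketch of the abstract's idea (thresholds are assumptions):
    - label noise: model is confident the given label is wrong
      (far from the boundary, low probability on the label) -> prune
    - high confidence: model strongly agrees with the given label
      (low learning value) -> prune
    - otherwise: near the decision boundary, potentially label error
      with high learning value -> keep and route to multi-review relabeling
    """
    p_label = pred_probs[np.arange(len(labels)), labels]
    prune_noise = p_label < noise_thresh
    prune_easy = p_label > easy_thresh
    multi_review = ~(prune_noise | prune_easy)
    return prune_noise, prune_easy, multi_review

# Example: four binary-classification samples
probs = np.array([[0.98, 0.02],   # confident class 0
                  [0.05, 0.95],   # confident class 1
                  [0.55, 0.45],   # near the boundary
                  [0.90, 0.10]])  # fairly confident class 0
labels = np.array([0, 0, 1, 0])
noise, easy, review = partition_samples(probs, labels)
print(noise)   # [False  True False False] -> pruned as label noise
print(easy)    # [ True False False False] -> pruned, low learning value
print(review)  # [False False  True  True] -> kept for multi-review
```

In practice the predicted probabilities would come from cross-validated out-of-sample predictions, and the kept samples would be sent to multiple human reviewers whose votes are aggregated into a cleaner label.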