
Workshop on Socially Responsible Machine Learning

An Empirical Investigation of Learning from Biased Toxicity Labels

Neel Nanda · Jonathan Uesato · Sven Gowal


Collecting annotations from human raters often involves a trade-off between the quantity of labels one wishes to gather and the quality of those labels. Consequently, only a small number of high-quality labels can typically be gathered. In this paper, we study how different training strategies can leverage a small dataset of human-annotated labels and a large but noisy dataset of synthetically generated labels (which exhibit bias against identity groups) for predicting the toxicity of online comments. We evaluate the accuracy and fairness properties of these approaches, and whether there is a trade-off between them. While we find that pre-training on all of the data and fine-tuning on clean data produces the most accurate models, we could not identify a single strategy that was best across all fairness metrics considered.
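The best-performing strategy reported in the abstract, pre-training on the full (mostly noisy) dataset and then fine-tuning on the small clean subset, can be sketched on a toy problem. The sketch below is illustrative only: it uses synthetic two-dimensional data and plain logistic regression rather than the paper's toxicity data and models, and all names (`make_data`, `train`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise=0.0):
    """Toy binary task: label is the sign of x0 + x1, with optional label noise
    standing in for the biased synthetic labels described in the abstract."""
    x = rng.normal(size=(n, 2))
    y = (x[:, 0] + x[:, 1] > 0).astype(float)
    flip = rng.random(n) < noise
    y[flip] = 1 - y[flip]
    return x, y

x_noisy, y_noisy = make_data(5000, noise=0.3)  # large, noisy "synthetic" labels
x_clean, y_clean = make_data(200, noise=0.0)   # small, clean "human" labels

def train(x, y, w=None, epochs=200, lr=0.1):
    """Logistic regression fit by full-batch gradient descent; a passed-in w
    makes the second call a fine-tuning step rather than training from scratch."""
    if w is None:
        w = np.zeros(x.shape[1])
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-x @ w))
        w -= lr * x.T @ (p - y) / len(y)
    return w

# Pre-train on all data (noisy + clean), then fine-tune on clean data only.
w = train(np.vstack([x_noisy, x_clean]), np.concatenate([y_noisy, y_clean]))
w = train(x_clean, y_clean, w=w)

x_test, y_test = make_data(1000)
acc = ((x_test @ w > 0) == y_test).mean()
```

On this toy task the fine-tuned model recovers the true decision boundary despite the noisy pre-training labels; the abstract's finding is that an analogous strategy maximizes accuracy on the toxicity task, though no single strategy dominated on fairness metrics.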
