Helpful or Safe? UltraFeedback's Binarized Labels Encode a Value Tradeoff
Abstract
Open chatbots like Zephyr-7B and AllenAI's Tülu 3 are trained on preference data: pairs of model answers labeled "chosen" and "rejected." UltraFeedback (UF) is a widely used open preference dataset. Its raw labels score four axes (helpfulness, honesty, truthfulness, instruction-following), but the binarized version common in DPO-style training collapses them into one chosen-vs-rejected verdict per prompt. When helpfulness conflicts with safety, that collapse forces a value choice: the binarized label has to pick a winner. We ask whether UF's binarized label reflects one universal notion of response quality, or one specific value resolution among several reasonable ones. We re-score 500 random UF pairs with three independent reward models. Two general-purpose models (Skywork-Reward-V2-8B, trained without UF; ArmoRM-8B with its default gating scalar) reproduce UF's chosen-vs-rejected ordering on 73.0% and 74.6% of pairs, respectively. A third scorer, ArmoRM's BeaverTails-safety attribute head, trained on a non-UF safety corpus, agrees on only 45.2% of pairs, which is below the 50% chance level (p=0.033). The disagreement is rubric-specific: under helpful-leaning scorers the labels reproduce; under a safety-leaning scorer they do not. This is consistent with the labels encoding a particular helpful-vs-safe tradeoff rather than a universal quality signal, in line with Sorensen et al.'s argument that RLHF flattens disagreeing values into a single scalar. Four supporting analyses on UF and HelpSteer2 characterize what is encoded.
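The agreement statistic above can be illustrated with a minimal sketch. It assumes that "agreement" means the scorer assigns a strictly higher score to the chosen response than to the rejected one, and that "below chance" is assessed with an exact two-sided binomial test against 0.5; both are assumptions, as are the helper names `agreement_rate` and `two_sided_binomial_p`. The count 226 is inferred from 45.2% of 500 pairs, not taken from the paper's data.

```python
from math import comb


def agreement_rate(chosen_scores, rejected_scores):
    """Fraction of pairs where the scorer ranks the 'chosen' response strictly higher.

    Hypothetical helper: the strict '>' tie rule is an assumption.
    """
    pairs = list(zip(chosen_scores, rejected_scores))
    return sum(c > r for c, r in pairs) / len(pairs)


def two_sided_binomial_p(k, n):
    """Exact two-sided binomial test of k successes in n trials against p = 0.5.

    Under p = 0.5 the distribution is symmetric, so the two-sided p-value
    is twice the smaller tail probability, capped at 1.
    """
    lower = sum(comb(n, i) for i in range(0, k + 1))  # P(X <= k) * 2**n
    upper = sum(comb(n, i) for i in range(k, n + 1))  # P(X >= k) * 2**n
    return min(1.0, 2 * min(lower, upper) / 2**n)


# Toy scores (made up for illustration): the scorer agrees on 2 of 4 pairs.
toy_rate = agreement_rate([0.9, 0.2, 0.7, 0.1], [0.1, 0.8, 0.3, 0.5])

# Count implied by the abstract: 45.2% of 500 pairs (inferred, an assumption).
n_pairs = 500
n_agree = round(0.452 * n_pairs)
p_value = two_sided_binomial_p(n_agree, n_pairs)
print(f"toy agreement={toy_rate:.2f}, n_agree={n_agree}, p={p_value:.4f}")
```

For the implied count, this test gives a p-value below 0.05, though it need not match the reported p=0.033 exactly, since the specific test used is not stated in the abstract.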