Right in the Right Way: Combining Verifiable Rewards with Human Demonstrations
Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for training language models on tasks with verifiable signals, e.g., code generation and mathematical reasoning. However, RLVR optimizes only what can be objectively scored, and often neglects subjective, non-verifiable aspects of human-like outputs, such as style and structure. This limitation leads to well-documented failure modes, including diversity collapse, unnatural-sounding responses, and over-optimization of proxy metrics. We propose an adversarial generator-discriminator framework that augments verifiable rewards with a learned signal from human demonstrations. A generator model is trained with RL to maximize both task accuracy and an adversarial reward derived from a discriminator. The discriminator, trained alongside the generator policy, learns to distinguish human-written outputs from model-generated ones. It thereby serves as a learned proxy for the human output distribution, providing feedback on aspects of generation that are difficult to formalize as scalar rewards. Across diverse domains, including bug fixing and open-ended generation, our approach consistently improves non-verifiable properties while preserving the accuracy gains of RLVR. In bug fixing, our method produces solutions with significantly lower edit distance to human fixes than RLVR baselines while matching their end performance. In story generation, our method significantly improves win rate while producing stories that are diverse and more human-like. Together, these results show that our approach bridges RL and supervised fine-tuning (SFT), offering a scalable path towards jointly optimizing the verifiable and non-verifiable properties of a task.
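The interplay between the two reward sources can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the method itself: the additive combination, the weight `adv_weight`, the log-odds form of the adversarial reward, and the function names `combined_reward` and `discriminator_loss` are all hypothetical.

```python
import math


def _clip(p: float, eps: float) -> float:
    """Keep a probability strictly inside (0, 1) so the logs stay finite."""
    return min(max(p, eps), 1.0 - eps)


def combined_reward(verifiable_reward: float,
                    disc_prob_human: float,
                    adv_weight: float = 0.5,
                    eps: float = 1e-6) -> float:
    """Hypothetical generator reward: task score plus a weighted adversarial term.

    verifiable_reward: objective score (e.g. 1.0 if tests pass, else 0.0).
    disc_prob_human:   discriminator's probability that the output is human-written.
    adv_weight:        assumed trade-off coefficient (not specified in the text).
    """
    p = _clip(disc_prob_human, eps)
    # Assumption: use the discriminator's log-odds as the adversarial reward.
    adversarial_reward = math.log(p) - math.log(1.0 - p)
    return verifiable_reward + adv_weight * adversarial_reward


def discriminator_loss(p_human_on_human: list[float],
                       p_human_on_model: list[float],
                       eps: float = 1e-6) -> float:
    """Binary cross-entropy with human samples labelled 1 and model samples labelled 0."""
    terms = [-math.log(_clip(p, eps)) for p in p_human_on_human]
    terms += [-math.log(1.0 - _clip(p, eps)) for p in p_human_on_model]
    return sum(terms) / len(terms)


# A correct output that also looks human-written earns more than an equally
# correct output that the discriminator flags as model-generated.
human_like = combined_reward(1.0, disc_prob_human=0.9)
model_like = combined_reward(1.0, disc_prob_human=0.1)
```

Under these assumptions, two outputs with the same verifiable score are ranked by how human-like the discriminator finds them, and the discriminator is updated on fresh generator samples so that this signal tracks the current policy.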