Reinforcement Learning for Non-Verifiable Problems
Gurusha Juneja ⋅ Shubham Phal ⋅ Jennifer She ⋅ Lisa Wang ⋅ Dorsa Sadigh ⋅ Anca Dragan ⋅ William Wang
Abstract
Many real-world tasks are non-verifiable: there is no objective ground truth, and quality must be judged subjectively, which makes reward design for RL difficult. Existing approaches based on scalar rubric scores or single pairwise comparisons are often noisy, poorly calibrated, or provide sparse learning signals. We introduce Tournament Style RL (TSRL), which constructs rewards from rubric-guided pairwise judgments against a fixed set of anchor responses, using the win-rate as the reward for policy optimization. Aggregating comparisons against fixed anchors stabilizes the reference frame, reducing reward variance and yielding a signal that is more robust to judge noise. We evaluate on four non-verifiable tasks and two backbone LLMs and find that TSRL improves average win-rate by $+43.8$ points over the base model and $+22.8$ points over the strongest baseline. TSRL scales with the number of anchors, remains robust under weak or partially corrupted judges, and its results are corroborated by blinded human preference studies.
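To make the reward construction concrete, here is a minimal sketch of the win-rate reward the abstract describes. The `judge_prefers` callable is a hypothetical stand-in for the rubric-guided LLM judge; the abstract does not specify its interface, so this is an illustrative assumption rather than the authors' implementation.

```python
from typing import Callable, List


def tsrl_reward(
    response: str,
    anchors: List[str],
    judge_prefers: Callable[[str, str], bool],
) -> float:
    """Reward = fraction of fixed anchor responses the policy response beats.

    Aggregating pairwise judgments over a fixed anchor set stabilizes the
    reference frame, so individual judge errors average out and the reward
    has lower variance than a single pairwise comparison.
    """
    if not anchors:
        raise ValueError("TSRL requires at least one anchor response.")
    # Count wins against every anchor; each call is one rubric-guided
    # pairwise judgment (judge_prefers is hypothetical).
    wins = sum(judge_prefers(response, anchor) for anchor in anchors)
    return wins / len(anchors)


if __name__ == "__main__":
    # Toy length-based "judge" purely for illustration.
    toy_judge = lambda a, b: len(a) > len(b)
    print(tsrl_reward("a detailed answer", ["ok", "short"], toy_judge))  # 1.0
```

Under this reading, the reward lies in $[0, 1]$ and becomes smoother as the anchor set grows, which is consistent with the reported scaling in the number of anchors.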