Rubric Curriculum RL: Exploiting the Generation-Verification Gap in Creative Writing
Abstract
Reinforcement learning with verifiable rewards (RLVR) on foundation models has led to significant improvements in math and code generation. Extending these gains to open-ended domains remains challenging: ground-truth verification is unavailable, human annotation is expensive, and learned reward models are prone to reward hacking. We introduce Rubric Curriculum RL (RcRL), a self-improvement method for creative short-fiction writing that requires no new data, human annotations, or stronger teacher models. RcRL exploits the generation-verification gap: it is easier to judge whether a piece of writing is creative than to produce one. While this gap exists across open-ended domains, exploiting it for RL is difficult precisely because of reward hacking. During training, we score completions via pairwise preferences against a curriculum of rubric criteria: pairwise comparison provides a more stable signal than absolute scoring, and the shifting curriculum mitigates the reward hacking that arises against a stationary objective. Whereas baseline methods plateau or collapse within a few dozen steps, our approach preserves output entropy and continues to improve over 1000+ training steps. In human evaluations, RcRL-trained models achieve a 70.5% win rate and show consistent gains across multiple creative writing benchmarks and judges.
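As a rough illustration of the training signal described above, the sketch below shows one way a pairwise rubric reward with a staged curriculum might be computed. The criterion names, the stage schedule, the `pairwise_rubric_reward` and `current_criterion` helpers, and the toy judge are all illustrative assumptions, not the paper's actual implementation; in practice the judge would be an LLM prompted to pick which of two texts better satisfies the criterion.

```python
import random
from typing import Callable, Sequence

# Hypothetical rubric curriculum: criteria introduced in stages
# (names are illustrative, not drawn from the paper).
RUBRIC_CURRICULUM: list[str] = [
    "uses concrete sensory detail",
    "maintains a consistent narrative voice",
    "ends with an earned, non-cliche resolution",
]


def pairwise_rubric_reward(
    candidate: str,
    references: Sequence[str],
    criterion: str,
    judge: Callable[[str, str, str], bool],
) -> float:
    """Reward = fraction of reference completions the candidate beats
    under a single rubric criterion. Pairwise comparison replaces
    absolute scoring, since relative judgments tend to be more stable."""
    wins = sum(judge(candidate, ref, criterion) for ref in references)
    return wins / len(references)


def current_criterion(step: int, steps_per_stage: int = 400) -> str:
    """Advance the curriculum: each stage focuses on one criterion, so the
    objective shifts over training rather than staying stationary."""
    stage = min(step // steps_per_stage, len(RUBRIC_CURRICULUM) - 1)
    return RUBRIC_CURRICULUM[stage]


def toy_judge(a: str, b: str, criterion: str) -> bool:
    """Deterministic placeholder standing in for an LLM judge."""
    rng = random.Random(hash((a, b, criterion)))
    return rng.random() < 0.5


if __name__ == "__main__":
    refs = ["a reference story...", "another reference story..."]
    reward = pairwise_rubric_reward(
        "a candidate story...", refs, current_criterion(step=450), toy_judge
    )
    print(f"reward: {reward:.2f}")
```

Under this framing, the policy is trained to maximize its pairwise win fraction on whichever criterion is active, so a degenerate strategy that games one criterion stops paying off once the curriculum advances.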