DRIVE: Best Data Scheduling Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation
Abstract
The recent success of large reasoning models (such as OpenAI o1 and DeepSeek R1) has spurred a resurgence of interest in reinforcement learning with verifiable rewards (RLVR). However, progress is still largely driven by RL algorithm design, while data scheduling -- the data-side decisions that determine what the model trains on over time -- is critical but remains underexplored. This paper therefore focuses on data scheduling: how to curate data for supervised fine-tuning (SFT), and how to select prompts and collect rollouts for reinforcement learning (RL). We introduce a pipeline with carefully designed data scheduling, consisting of hardness-prioritized SFT and two-stage RL. Specifically, we first fine-tune the base model on supervision data curated to prioritize difficult examples, with difficulty assessed via both arena learning and classification. We then introduce two-stage RL: the first stage uses a reduced maximum sequence length during rollout to increase entropy and reduce repetition, while the second stage adopts a large number of rollouts per prompt together with a curriculum design to encourage exploration on challenging problems. We implement this pipeline on Qwen2.5-32B and an internal 389B MoE model, and evaluate them on a wide range of benchmarks, including challenging LeetCode and Codeforces weekly contests. The results not only indicate the effectiveness and scalability of our pipeline but also demonstrate that our model achieves state-of-the-art performance among 32B models in competitive code generation.
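To make the two-stage rollout scheduling described above concrete, the following is a minimal, hypothetical sketch of the data-side knobs it varies. The stage layout, field names, and numeric values are illustrative assumptions, not the settings reported in this paper.

```python
from dataclasses import dataclass

@dataclass
class RolloutSchedule:
    """Data-side settings for one RL stage (illustrative; values are assumptions)."""
    max_seq_len: int          # cap on rollout length during generation
    rollouts_per_prompt: int  # number of sampled completions per prompt
    min_difficulty: float     # curriculum threshold on prompt difficulty in [0, 1]

# Stage 1: shorter rollouts to raise entropy and curb repetition.
# Stage 2: more rollouts per prompt and a curriculum that keeps only harder prompts.
STAGES = [
    RolloutSchedule(max_seq_len=8_192,  rollouts_per_prompt=8,  min_difficulty=0.0),
    RolloutSchedule(max_seq_len=32_768, rollouts_per_prompt=64, min_difficulty=0.6),
]

def select_prompts(prompts, stage: RolloutSchedule):
    """Keep prompts at or above the stage's difficulty threshold (curriculum step)."""
    return [p for p in prompts if p["difficulty"] >= stage.min_difficulty]

if __name__ == "__main__":
    pool = [{"id": i, "difficulty": i / 10} for i in range(10)]
    for k, stage in enumerate(STAGES, start=1):
        kept = select_prompts(pool, stage)
        print(f"stage {k}: max_len={stage.max_seq_len}, "
              f"n_rollouts={stage.rollouts_per_prompt}, prompts kept={len(kept)}")
```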